– Diversity Workbench –
The Geographical Thesaurus DiversityGazetteer 2.1

by G. Hagedorn
Preliminary documentation, version 0.9, 6. Mar. 2002

Table of contents
 
  Introduction
  Using the place name selection dialog
  Installing the Analysis Supplement
  The information model
  The object model (interface)
  Coverage of the DiversityGazetteer
  Proposed citation

Introduction

The DiversityGazetteer is an attempt to georeference geographic names that are used in other components of the DiversityWorkbench. It provides a unique Geographical Name ID that can be stored together with the client object (e.g. a specimen record in DiversityCollection) and analyzed at a later point. The DiversityGazetteer itself supports basic analyses of the geographical data stored with it and, through the geographical coordinates it provides, links to an external GIS (geographical information system). It is, however, not intended to replace a full hierarchical gazetteer. Instead, it can be used as a gateway into complex geographical gazetteers (like the Getty Thesaurus of Geographic Names or the German GN250) to perform advanced geographical analyses. See the chapter 'The information model' for further information.

The gazetteer is used primarily from other Diversity Workbench components. If a collection unit, e.g. a herbarium specimen, is recorded in the DiversityCollection component, the geographical name can be entered exactly as written on the specimen label. Such raw, unprocessed information may in fact be very valuable. However, even after spelling errors have been corrected and standard information like the political country has been added, many geographical names remain ambiguous. Storing not only a name but also an associated ID number provided by a gazetteer disambiguates geographical names. It also allows capturing information that the data editor has but that may not be present on the specimen label. For example, the scientist may have access to additional information provided by field books or journey itineraries. The ID number can then be used as a link to other information, like point coordinates or geographical hierarchy information present in the gazetteer.
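
The idea of storing a name together with an ID can be sketched as follows. This is only an illustration in Python (the Workbench itself is a Microsoft Access application); the record layout, the field names, the ID values, and the hierarchy strings are all hypothetical and are not taken from the DiversityGazetteer.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical excerpt of a gazetteer lookup table: NameID -> (name, hierarchy).
# The ID values and hierarchy strings are invented for illustration.
GAZETTEER = {
    1001: ("Frankfurt", "Europe, Germany, Hessen"),
    1002: ("Frankfurt", "Europe, Germany, Brandenburg"),
}

@dataclass
class SpecimenRecord:
    label_text: str          # place name exactly as written on the label
    name_id: Optional[int]   # gazetteer ID chosen by the data editor, if any

def resolve(record):
    """Return the hierarchy for a record, or None if no ID was stored."""
    if record.name_id is None:
        return None
    return GAZETTEER[record.name_id][1]

# Two labels with identical text that refer to different places.
a = SpecimenRecord("Frankfurt", 1001)
b = SpecimenRecord("Frankfurt", 1002)
print(resolve(a))
print(resolve(b))
```

The label text is kept unchanged in both records; only the stored ID tells the two places apart.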

Definition of 'Gazetteer':  A book or section of a book that lists and describes places; a geographical index or dictionary


Using the place name selection dialog

(Note: To test the place name selection dialog box, open the testing application DiversityGazetteerTest.mdb in Microsoft Access.)

The primary interface to the thesaurus for the purpose of recording geographical names is the following dialog box:

Fig. 1: User interface to select place names from the DiversityGazetteer.

First, select the region (Europe, Asia, Africa, etc.) to which the place name belongs. Europe is set as the default. Selecting a region first provides faster access to the names and avoids confronting the user with unnecessary choices between homonymous names. For example, Berlin occurs only once in Europe, but 22 times in North America.

Then start typing the name of the place you are looking for. Before you can open the pick list (by clicking on the drop-down button), you must have typed at least two characters. Once you have typed four characters, the list will open automatically:

Fig. 2: Selection list with name, preferred name and place type, and place hierarchy.

The three columns show the place name, the preferred place name together with an indication of the place type (settlement, reservoir, mountain, etc.), and finally a hierarchical description of the place. Often different places have the same name and can be distinguished only by means of the hierarchy description. On the other hand, identical names may refer to the same place, but with different geographical boundaries and extent. In the example above, the inhabited place 'Serrai' gives a much better indication of the location than the entire department 'Serrai'. However, if the geographical information on a specimen label refers only to the department, the department should be selected to indicate the lower geographical precision.
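
The behaviour of the pick list described above can be sketched as follows. This is a simplified illustration in Python, not the code of the dialog box: the sample rows, the hierarchy strings, and the function names are hypothetical, and case-insensitive prefix matching is an assumption.

```python
# Hypothetical pick-list rows: (name, preferred name and place type, hierarchy).
# All rows, including the hierarchy strings, are invented for illustration.
NAMES_BY_REGION = {
    "Europe": [
        ("Serrai", "Serrai (inhabited place)", "Greece, Serrai"),
        ("Serrai", "Serrai (department)", "Greece"),
        ("Berlin", "Berlin (inhabited place)", "Germany"),
    ],
}

MIN_CHARS = 2   # the pick list cannot be opened with fewer typed characters
AUTO_OPEN = 4   # the pick list opens by itself once this many are typed

def pick_list(region, typed):
    """Return the rows of a region whose name starts with the typed text."""
    if len(typed) < MIN_CHARS:
        raise ValueError("type at least %d characters first" % MIN_CHARS)
    prefix = typed.lower()
    return [row for row in NAMES_BY_REGION.get(region, [])
            if row[0].lower().startswith(prefix)]

def opens_automatically(typed):
    """True once enough characters have been typed for the list to open itself."""
    return len(typed) >= AUTO_OPEN

# Both 'Serrai' entries are listed; only columns two and three tell them apart.
for row in pick_list("Europe", "Se"):
    print(" | ".join(row))
```

Restricting the search to one region and to a typed prefix keeps the candidate list short, which is the reason given above for selecting the region first.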

Note that the preferred place name is not necessarily the better choice. Selecting the preferred name instead of the name written on the label could lead to confusion during later proofreading. If the place name on the label is a historical one, it is much more appropriate to select the historical name. The selected name should also match the language on the label, as far as names in that language are available.

If you click on the 'information' label, you get a short introduction text that is intended as a rudimentary online help. The following text will be displayed:

 

Introduction to the geographical thesaurus ('Diversity Gazetteer, Version 2.1')
     
The purpose of this tool is to be able to uniquely identify place names during analysis and data retrieval. It is often difficult to know whether a place name is unique in a given country, or whether synonymous names exist and the place name must be therefore qualified with the region. With large cities, this may be common knowledge (e.g. 'Frankfurt a. Main', 'Frankfurt a. d. Oder'), but for smaller cities or villages this requires prior knowledge of all relevant names. A further problem is that spelling errors in place names entered as free text are difficult to catch. A gazetteer (= geographical thesaurus) allows only to enter names known to the thesaurus. The name is saved together with a unique identifier, which allows to distinguish between identical names.
      If you have a name which does not appear in the thesaurus, you must enter a higher name, e.g. the name of the province. If you find multiple identical place names and you are uncertain about the correct one, but you think that you can make a good guess, you should add a note describing your decision. Note that it would be very desirable to obtain additional thesauri/gazetteers specialized on Europe. Please contact us if you know about data sources that we can legally integrate into an open concept!
      Currently the place names for the world are derived from the Getty Thesaurus of Geographical Names (TGN) in the version released 7/2000 and from the German list of Geographical names (GN250), under license to the BBA in Germany. The Getty Thesaurus itself can be licensed without cost, but a written license agreement must be obtained before the full hierarchical thesaurus can be used. Therefore we cannot provide you with the full thesaurus itself.
      The DiversityGazetteer contains over 1 million names in several languages. Accessing and choosing from this number of names is relatively cumbersome, both in terms of the user interface (e.g. selection between a dozen different "Berlin") and in terms of database performance. The thesaurus has therefore been broken down by continent, with Oceania (Australia and Pacific) and Antarctica combined, but Europe separated from Asia. The number of names available for each part differs drastically, with North America yielding the vast majority of names: Europe ca. 112 000 names (ca. 65000 from TGN, 47000 from GN250), Asia ca. 65000 names, Africa ca. 19000, Oceania and Antarctica ca. 8350 names, North and Central America: ca. 894 000 names, South America: ca. 11500 names.
      To use the thesaurus you must select a name from the pick list. However, you can open the pick list only after at least 2 characters have been typed. This operation is necessary because an unrestricted pick list containing all possible names would be either too large for your computer to handle, or would react very slowly. If you type more than 2 characters before you open the pick list, the list will be more restricted and open faster. Note that opening a pick list with the mouse (clicking on the drop-down) can be relatively slow. You can also open any pick list in the DiversityWorkbench applications using the keyboard by pressing the function key F4, or the Alt-Cursor-Down key combination.

G. Hagedorn, 16.1.2002


Installing the Analysis Supplement

The base DiversityGazetteer (containing only the place names and the associated IDs, descriptions, etc.) already has a size of ca. 82 MB (compressed size ca. 30 MB). Including the analysis data (links to the source ID in the Getty TGN or the German GN250, and geographical coordinates) more than doubles the size of the file. The base DiversityGazetteer is sufficient to record place name information, so many users will not need the analysis data. The analysis data are therefore provided separately as the DiversityGazetteer Analysis Supplement.

Download the compressed file for the analysis supplement and execute it (it is a self-extracting archive). Extract the file DiversityGazetteer_AnalysisSupplement.mdb into the same folder in which you have installed all other DiversityWorkbench applications. Open the file by double-clicking on it. A dialog box will appear that allows you to install the supplement into the DiversityGazetteer installed in the same folder.

Note that although the analysis supplement has a size of approximately 58 MB (compressed size ca. 16 MB), importing it will increase the size of the Gazetteer by almost twice this amount of storage space. The difference is due to additional space required for indices and relational foreign keys.
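
The sizes quoted in this chapter give a rough upper bound for the disk space needed. The short calculation below (Python, for illustration only) treats "almost twice" as a factor of two, which is an assumption that slightly overestimates the result.

```python
# Sizes quoted in this chapter, in MB (uncompressed).
base_mb = 82         # base DiversityGazetteer
supplement_mb = 58   # analysis supplement file

# "Almost twice" the supplement size is treated here as an upper bound.
max_growth_mb = 2 * supplement_mb
max_total_mb = base_mb + max_growth_mb

print(max_growth_mb)  # 116
print(max_total_mb)   # 198
```

Until the supplement file has been deleted, its 58 MB are needed in addition to this total.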

Once you have imported the analysis information into the main Gazetteer, you may safely delete the file DiversityGazetteer_AnalysisSupplement.mdb to save disk space.


The information model

This chapter contains in-depth technical information. It may be helpful for users who want to create customized analyses from their data.

The Gazetteer itself is split into several packages that can be used together or independently. The first two packages can be downloaded directly from the Workbench distribution site. The base DiversityGazetteer contains only a lookup table for place names. Its primary application is to provide pick lists of names and name IDs so that geographical places can be recorded unambiguously in other databases (e.g. DiversityCollection or DiversityReferences).

The separate analysis data component ('GazetteerAnalyzer', installable as 'AnalysisSupplement') provides additional information about each place name. Among other attributes, it contains geographical coordinates for each place name. The geographical coordinates are provided as decimal coordinates to simplify the integration of name data into most GIS. For most place names, the coordinates provided are point coordinates without any indication of the spatial extent. Thus even a country like Germany will be rendered as a single point coordinate. To reduce the complexity of the system, the 'GazetteerAnalyzer' is not provided as a truly separate module. Rather, it can be downloaded separately and is then imported into the base DiversityGazetteer (see chapter 'Installing the Analysis Supplement' for further information).

Some information that may be relevant to analysis, like place type and abbreviated hierarchy information, is also necessary to distinguish identical place names that refer to different places (homonyms). These attributes are therefore already stored in the base DiversityGazetteer. Note that both place type and hierarchy are currently stored together with the place name (denormalized to allow fast display together with names), although they are properties of the place, not of the name.

Fig. 3: Overview of the packages of the DiversityGazetteer.

The analysis package also provides information about the source of a name. Most names in the DiversityGazetteer are compiled from other sources, which are documented here. The ID that was used in the source is also provided, enabling advanced analysis using the source gazetteer itself. Thus all the complex information contained in the TGN or the German GN250 can be used to analyze biodiversity data that have been georeferenced with the DiversityGazetteer.

Note that source gazetteers (like TGN or GN250) cannot be provided with the DiversityWorkbench. For these data collections you must obtain licenses directly from the copyright holders. Institutions that have such a license can use the SourceID provided by the analysis module to link directly into their own databases. Alternatively, after providing proof of license, they may download the data in a pre-processed format that is compatible with the other DiversityWorkbench data formats. For example, importing the Getty Thesaurus requires significant work, since the data provided by Getty contain referential integrity errors and use a proprietary coding system to code accented characters (ä, â, etc.) into the US-ASCII range. The TGN database that can be provided has been converted into a fully relational database using Unicode.

The following diagram shows the entity relationship diagram of the DiversityGazetteer:

Fig. 4: Entity relationship diagram for the DiversityGazetteer.

Information for Access users: The relationships are not defined on the tables of the DiversityGazetteer itself, because the database engine would internally create indices for the foreign keys, and these are not required for normal use of the DiversityGazetteer. To minimize the size of the Gazetteer, the information model has instead been set up using additional queries provided for this purpose. If you view the Access relationship window, you will find the tables, with only minimal relations defined, on the left side, and the additional views, with all relations defined, on the right side.

The place name entity has been split for performance reasons into multiple regional tables, which are combined again in a union query. Place name records that belong directly to the entire world have been duplicated in each regional table, so the union query GeoPlaceName_Union contains them as duplicates. Note that queries based on union views are comparatively slow. If you want to work with a single NameID, regardless of region, you should use the stored procedure GeoPlaceName_UnionSingleNameID, which applies the NameID condition within each part of the union query before the parts are combined.
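
The difference between filtering the combined union and filtering each part can be sketched with an in-memory SQLite database in Python. This is not the Access implementation: the regional table names, the two-column layout, and the sample rows are hypothetical, and removing the duplicated world record in the second query is a choice made for this sketch only. It shows the shape of the two queries, not their speed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Two hypothetical regional tables; the real tables have more columns.
for table in ("GeoPlaceName_Europe", "GeoPlaceName_Asia"):
    cur.execute(f"CREATE TABLE {table} (NameID INTEGER, PlaceName TEXT)")

# NameID 1 stands for a record belonging to the entire world,
# which is duplicated in every regional table.
cur.executemany("INSERT INTO GeoPlaceName_Europe VALUES (?, ?)",
                [(1, "World"), (10, "Berlin")])
cur.executemany("INSERT INTO GeoPlaceName_Asia VALUES (?, ?)",
                [(1, "World"), (20, "Tokyo")])

# Union over all regions; UNION ALL keeps the duplicated world record.
cur.execute("""CREATE VIEW GeoPlaceName_Union AS
               SELECT NameID, PlaceName FROM GeoPlaceName_Europe
               UNION ALL
               SELECT NameID, PlaceName FROM GeoPlaceName_Asia""")

def lookup_via_view(name_id):
    """Filter applied to the combined result of the union view."""
    return cur.execute("SELECT NameID, PlaceName FROM GeoPlaceName_Union "
                       "WHERE NameID = ?", (name_id,)).fetchall()

def lookup_single_name_id(name_id):
    """Filter applied inside each part; UNION (without ALL) drops duplicates."""
    return cur.execute("""SELECT NameID, PlaceName FROM GeoPlaceName_Europe
                          WHERE NameID = ?
                          UNION
                          SELECT NameID, PlaceName FROM GeoPlaceName_Asia
                          WHERE NameID = ?""", (name_id, name_id)).fetchall()

print(lookup_via_view(1))        # world record once per regional table
print(lookup_single_name_id(1))  # world record once
```

With the condition inside each part, every regional table can be searched by NameID on its own, and only the few matching rows need to be combined.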

Fig. 5: Distribution of place names to regional tables and combining union query.


Coordinate precision

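In the analysis part of the DiversityGazetteer, geographical coordinates are stored as decimal degrees in double precision floating-point numbers (in the IEEE standard, double precision is a 64-bit format; a 32-bit field would be single precision). Because most decimal fractions of a degree cannot be represented exactly in binary floating point, this can lead to a small rounding error. Coordinate data provided by the TGN are given in degrees and minutes only, without seconds; coordinates for Germany from the GN250 are exact to minutes and seconds.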
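
The size of such rounding errors can be estimated with a short Python sketch. The conversion function and the sample coordinate are hypothetical; the 32-bit case is simulated with the standard struct module, so that the sketch covers the worse of the two floating-point widths regardless of which one is actually used.

```python
import struct

def dms_to_decimal(degrees, minutes, seconds=0.0):
    """Convert degrees/minutes/seconds to decimal degrees (non-negative input)."""
    return degrees + minutes / 60.0 + seconds / 3600.0

def as_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

# Hypothetical coordinate: 48 degrees 8 minutes (minutes only, as in the TGN).
lat = dms_to_decimal(48, 8)

# Error introduced if the value were stored in a 32-bit field.
err_deg = abs(as_float32(lat) - lat)

print(lat)
print(err_deg)
```

For values between 32 and 64 degrees the error of a 32-bit number is at most 2^-19 degrees (about 0.000002 degrees), far below the one-minute resolution (about 0.017 degrees) of the TGN coordinates.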

Most places have a considerable extent. It would be highly desirable to record the extent of objects. However, this information cannot be obtained easily from the sources used. The German GN250 uses the coordinates of the administrative seat for the place. Thus even in the case of a city or village, the coordinate given may not be identical with the coordinate of the geometric center of the place. The coordinates for a state or country are those of the administrative seat in the capital. The coordinates for rivers in the German GN250 are those of the mouth of the river.

The object model (interface)

This chapter contains technical in-depth information. It may be helpful for users who want to develop applications that use the DiversityGazetteer component.

Currently, the DiversityGazetteer can only be accessed from another Microsoft Access application or through the Windows COM interface. We plan to port the Gazetteer to VB.NET and make it available in the form of both a web service and a .NET component.

Starting with version 2.1, the DiversityGazetteer can be accessed through an object oriented interface model. The basic rules that govern all DiversityWorkbench interfaces are defined in a separate document [*ADD LINK!*].

The object model cannot currently be described in detail here (UML documentation is planned for a future version of this document, but we don't have the resources at the moment). However, I will provide some hints on how to explore the potential of the DiversityGazetteer:

The most important objects are the GeoName and GeoPlace objects provided by the DiversityGazetteerInterface object. They can be opened for a given NameID or PlaceID, respectively, using the Open method. In a desktop application, the GeoName object can also be opened directly by calling the provided user interface method.

Once a GeoName or GeoPlace object has been opened, its properties can be read and used for analysis. The GeoName object automatically creates another GeoPlace object to provide the place information (e.g. geographical coordinates) for the place. Note that the GeoPlace object that is provided directly through the DiversityGazetteerInterface is NOT opened automatically when a GeoName object is opened. Only the GeoPlace object that is provided within GeoName is opened (GeoName.GeoPlace). The DiversityGazetteerInterface.GeoPlace object is provided for the direct analysis of PlaceIDs for which no NameID has been stored.

A single geographical place may have multiple names (München, Munich, etc.). The GeoPlace object contains a collection GeoNames of GeoName objects. This collection will list all names that are known for a given place.
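
The relationships between these objects can be modeled in a few lines of Python. This is a conceptual sketch only and not the COM or VBA interface of the DiversityGazetteer: the class layout, the Python-style member names, the ID values, and the in-memory data are all hypothetical, and the coordinates are approximate.

```python
# Hypothetical data; the IDs are invented for illustration.
PLACES = {500: (48.14, 11.58)}                      # PlaceID -> (lat, lon), approximate
NAMES = {1: ("München", 500), 2: ("Munich", 500)}   # NameID -> (name, PlaceID)

class GeoPlace:
    def __init__(self):
        self.is_open = False

    def open(self, place_id):
        self.place_id = place_id
        self.latitude, self.longitude = PLACES[place_id]
        self.is_open = True

    @property
    def geo_names(self):
        """All names known for this place."""
        return [name for name, pid in NAMES.values() if pid == self.place_id]

class GeoName:
    def __init__(self):
        self.geo_place = GeoPlace()   # nested place, opened together with the name

    def open(self, name_id):
        self.name, place_id = NAMES[name_id]
        self.geo_place.open(place_id)

class GazetteerInterface:
    def __init__(self):
        self.geo_name = GeoName()
        self.geo_place = GeoPlace()   # independent; only opened explicitly

gaz = GazetteerInterface()
gaz.geo_name.open(2)                        # open by NameID
print(gaz.geo_name.geo_place.geo_names)     # all names of the same place
print(gaz.geo_place.is_open)                # the top-level GeoPlace stays closed
```

Opening a name gives access to its place and, through the place, to all other names of that place, while the top-level place object remains available for PlaceIDs that were stored without a NameID.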

In your current Microsoft Access application, add a VBA reference to the DiversityGazetteer code component (currently DiversityGazetteer.mdb). You can then use the Visual Basic object browser to view the object model. Helpful hints about the options provided can be found in the testing application DiversityGazetteerTest.mdb, which is provided together with the DiversityGazetteer. This application opens a dialog box to allow the direct lookup of names. It gives an example of how to implement the selection of a geographical name. After the dialog box has been closed, the NameID is evaluated, and the information that can be obtained from the object model is shown. It is hoped that this will be a valuable coding example for your own applications.


Coverage of the DiversityGazetteer

Table 1: Number of place names (global distribution)

                                         Source
Region                                   TGN           GN250
North and Central America                893 556       -
South America                             11 575       -
Europe                                    65 172 [1]   (Germany: 47 120)
Asia                                      65 584       -
Oceania, Australia, and Antarctica         8 356       -
Africa                                    18 864       -

Footnotes: [1] The German GN250 was introduced in Version 2.0 of the DiversityGazetteer. As a result, 6703 names for Germany from the Getty TGN have been removed from the place name pick list and are now available only as a separate table (GeoPlaceName_Germany_TGNDuplicates). Note, however, that the analysis tables still contain the place information for these records, so that records already connected with these names can be analyzed. The place name list for Europe therefore contains only 65172 - 6703 = 58469 records from the TGN.
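
The figures in Table 1 and in the footnote can be added up as follows (Python, for illustration only). The total for the pick list is derived here from the table and the footnote; it is not a number stated by the source.

```python
# Counts from Table 1.
tgn = {
    "North and Central America": 893556,
    "South America": 11575,
    "Europe": 65172,
    "Asia": 65584,
    "Oceania, Australia, and Antarctica": 8356,
    "Africa": 18864,
}
gn250_germany = 47120
tgn_germany_removed = 6703   # German TGN names removed from the pick list

tgn_total = sum(tgn.values())
europe_tgn_pick_list = tgn["Europe"] - tgn_germany_removed
pick_list_total = tgn_total - tgn_germany_removed + gn250_germany

print(tgn_total)             # 1063107
print(europe_tgn_pick_list)  # 58469
print(pick_list_total)       # 1103524
```

The result agrees with the statement in the online help that the DiversityGazetteer contains over 1 million names.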

Table 2: German place names by state in the GN250

German state                                 GN250
Baden-Württemberg                            4 536
Bayern                                       9 924
Berlin                                          37
Brandenburg                                  2 517
Bremen                                          27
Hamburg                                         58
Hessen                                       2 795
Mecklenburg-Vorpommern                       2 299
Niedersachsen                                5 528
Nordrhein-Westfalen                          4 190
Rheinland-Pfalz                              3 063
Saarland                                       382
Sachsen                                      2 568
Sachsen-Anhalt                               2 167
Schleswig-Holstein                           2 079
Thüringen                                    2 286
Not associated with a state (rivers, etc.)   2 664
Sum                                         47 120

Proposed citation

Hagedorn, G. 2002. The Geographical Thesaurus DiversityGazetteer 2.1. URL: www.DiversityCampus.net/Workbench/Gazetteer/Docu10/DiversityGazetteer.html.