Enabling geographic search for scientific papers through text mining and geocoding: How to better find environmental research articles via location

Figure 1. Geographical distribution of place names mentioned in research articles. The figure also shows which areas have been studied more intensively in the analysed scientific journals.

In a joint study by the University of Tartu (Estonia), the Institute of Geological and Nuclear Sciences (New Zealand) and the University of Salzburg (Austria) more than 5,800 scientific articles from three environmental research journals were digitised and analysed. In addition, a geographical search method was developed to identify the location of a studied area.

Many research articles concern environmental processes in certain areas. But until now it has been impossible to find research articles via maps or coordinates of an area of interest.

Our question was: why can’t scientific journals be searched via a map? While earning considerable revenues from publishing and subscription, journal publishers have begun to support interactive web maps via supplemental materials, but there is still no spatial search on those websites.

The current open data movement is in a similar situation. Data repositories support extensive metadata – metadata is data about data, including search fields like authors, keywords, topics, etc. – but websites do not provide interoperable options for using geographical coordinates for an area or region of interest via metadata elements.

There are some positive examples of this type of search: the Estonian Land Board’s Geoportal and the New Zealand Land Information Data Portal. They provide central online metadata catalogues for important national datasets. The catalogues can be explored via a map and, more importantly, can be queried with a database language that supports geographic coordinates based on international standards (ISO/OGC). Many other countries have established similar catalogues.

In this study we investigated how to make it easier to find research articles and reports with search criteria that also included location. To do so, we had to find place names in journal articles and code them into coordinates with the help of a gazetteer. A gazetteer is a dictionary or directory that references place names with their geographic locations and thereby links natural language to geographic locations via place names.
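To make the idea of a gazetteer concrete, here is a minimal sketch in Python. It is not the LINZ gazetteer or the code used in the study: the dictionary `GAZETTEER`, the function `geocode` and the rounded coordinates are illustrative assumptions only.

```python
from typing import Dict, List, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude) in decimal degrees

# Toy gazetteer (assumption): rounded, illustrative coordinates only.
# A name maps to a list because one name can refer to several places.
GAZETTEER: Dict[str, List[Coordinate]] = {
    "Wellington": [(-41.29, 174.78)],
    "Auckland": [(-36.85, 174.76)],
}


def geocode(place_name: str,
            gazetteer: Dict[str, List[Coordinate]] = GAZETTEER) -> List[Coordinate]:
    """Return every known location for a place name, or an empty list."""
    return list(gazetteer.get(place_name, []))


print(geocode("Wellington"))
print(geocode("Atlantis"))
```

The lookup returns a list rather than a single point so that duplicate names can be represented without losing any candidate location.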

Based on a New Zealand case study we analysed 5,800 articles (published since 1967) from three geoscientific journals to ascertain whether there were enough locational references in research articles to apply a geographical search method. We searched titles, abstracts and full texts for place name occurrences that matched the records of the official gazetteer of Land Information New Zealand (LINZ), the government department responsible for location information.
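The matching step can be sketched as follows. This is a simplified illustration, not the algorithm used in the study: the function `find_place_names`, the tokenisation rule and the `max_words` limit are assumptions. It scans the text word by word and prefers the longest gazetteer name that starts at each position, so that multi-word names such as Muddy Creek are found as one unit.

```python
import re
from typing import Iterable, List, Set


def find_place_names(text: str, names: Iterable[str], max_words: int = 3) -> List[str]:
    """Return gazetteer names found in text, longest match first (case-sensitive)."""
    known: Set[str] = set(names)
    tokens = re.findall(r"[A-Za-z']+", text)  # crude word tokenisation
    hits: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in known:
                hits.append(candidate)
                i += n  # skip the words consumed by this match
                break
        else:
            i += 1  # no name starts at this word
    return hits


sample = "Flow in Muddy Creek near Wellington was sampled."
print(find_place_names(sample, {"Muddy Creek", "Wellington"}))
```

Case-sensitive matching keeps a lowercase "rock" from matching a place called Rock, but a capitalised word at the start of a sentence would still match, which is one reason automatic results need review.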

We parallelised the search algorithm computation on a small computing cluster (4 CPUs and 16 GB RAM combined) to test each of the 28.5 million words (the overall word count in the 5,800 articles). The processing took about 17 hours.
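As a rough back-of-the-envelope check, the reported figures can be turned into throughput numbers. The calculation below uses only the values stated above and assumes, for simplicity, that the 17 hours were spent entirely on matching and that the load was spread evenly across the 4 CPUs.

```python
# Figures reported in the article.
TOTAL_WORDS = 28_500_000
ARTICLES = 5_800
HOURS = 17
CPUS = 4


def throughput(total_words: int, hours: float, cpus: int):
    """Return (words per second overall, words per second per CPU)."""
    seconds = hours * 3600
    overall = total_words / seconds
    return overall, overall / cpus


overall, per_cpu = throughput(TOTAL_WORDS, HOURS, CPUS)
words_per_article = TOTAL_WORDS / ARTICLES
seconds_per_article = HOURS * 3600 / ARTICLES

print(f"{overall:.0f} words/s overall, {per_cpu:.0f} words/s per CPU")
print(f"{words_per_article:.0f} words and {seconds_per_article:.1f} s per article on average")
```

Under these assumptions the cluster processed roughly 470 words per second, and an average article of about 4,900 words took around 10 to 11 seconds.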

This was followed by a manual review of around 5% of the articles, in which we evaluated whether the place names had been correctly identified and were relevant to the respective article. On average, 15 place names were mentioned in every automatically georeferenced full text. Many papers had no place names in their titles or abstracts, but where place names did occur there, they were mostly correct. However, most of the place names found in the full texts overall were identified incorrectly.

There were three main sources of error. First, if a place name was correct but had duplicates (i.e. multiple entries in the gazetteer had exactly the same name but referred to different locations, which happens quite often), we assumed that only one of them was correct. For example, there are 14 streams named Muddy Creek in New Zealand; if a study was related to one of them, we concluded that the other 13 Muddy Creeks were probably incorrect. Overall, 978 unique place names (213 of which had duplicates) were mentioned a total of 4,157 times.
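Duplicate names of this kind are easy to detect in a gazetteer, even though deciding which entry is meant is much harder. A minimal sketch, assuming the gazetteer is available as a list of (name, latitude, longitude) tuples; the function `ambiguous_names` and the sample coordinates are made up for illustration:

```python
from collections import Counter
from typing import Dict, List, Tuple

Entry = Tuple[str, float, float]  # (name, latitude, longitude)


def ambiguous_names(entries: List[Entry]) -> Dict[str, int]:
    """Return names that occur more than once, with their number of entries."""
    counts = Counter(name for name, _lat, _lon in entries)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample entries (assumption), not taken from the LINZ gazetteer.
sample_entries = [
    ("Muddy Creek", -45.0, 170.0),
    ("Muddy Creek", -41.0, 173.0),
    ("Wellington", -41.29, 174.78),
]
print(ambiguous_names(sample_entries))
```

A name flagged this way yields several candidate locations per mention, of which at most one is likely to be relevant to a given article.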

In addition, the authors’ names (e.g. Alexandra and Ashley) and the addresses of authors and publishers (e.g. Howick, Auckland and Wellington) contributed to errors.

The third issue stemmed from ambiguous topics in the texts. For example, Rock (a hill in Taranaki district) and Rocks (a hill in Canterbury district and a hill in Marlborough district) caused many errors because the journals dealt with geological topics which often involved, among other things, rocks.

The results showed that using carefully chosen, relevant place names in the title or abstract of a journal article allowed unstructured textual information to be georeferenced more successfully and made it possible to search for articles with location-based queries.

The spatial distribution of correct place names provides an overview of the most thoroughly investigated areas in earth sciences in New Zealand (see Figure 1).

We created metadata records (incl. geographic references) for each article on the basis of text mining and georeferencing results. Subsequently we uploaded them to a standards-based catalogue server. The catalogue can now be queried using the names of authors, titles, keywords, topics and, most importantly, map locations and coordinates.
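The sketch below shows how a geographic reference could be derived from an article's georeferenced place names. It is an illustration only: the `ArticleRecord` fields and the sample values are hypothetical and do not reproduce the ISO/OGC metadata schema or the records created in the study.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


@dataclass
class ArticleRecord:
    """Hypothetical, simplified metadata record for one article."""
    title: str
    authors: List[str]
    keywords: List[str]
    bbox: BBox


def bounding_box(points: List[Tuple[float, float]]) -> BBox:
    """Smallest box containing all (latitude, longitude) points.

    Simplification: does not handle boxes crossing the 180 degree meridian.
    """
    if not points:
        raise ValueError("at least one point is required")
    lats = [lat for lat, _lon in points]
    lons = [lon for _lat, lon in points]
    return (min(lons), min(lats), max(lons), max(lats))


# Illustrative, rounded coordinates (assumption).
places = [(-41.29, 174.78), (-36.85, 174.76)]
record = ArticleRecord(
    title="Example article",
    authors=["A. Author"],
    keywords=["groundwater"],
    bbox=bounding_box(places),
)
print(record.bbox)
```

Storing the extent as a bounding box keeps the record compact and lets a catalogue compare it directly with the area a user is looking at.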

We developed an example web application that can query this and other compatible catalogues. Users can query and retrieve metadata records for journal articles. The application’s map window provides spatial context, so that users need not enter coordinates manually (Figure 2). The map on the left shows the applied spatial bounding box, which can be zoomed and panned to adjust it to the desired spatial context of the search. Queries are sent to the catalogue server and the results are collated in a list.
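The core of such a spatial query is a test of whether an article's bounding box overlaps the visible map area. The following is a local stand-in for what a standards-based catalogue would evaluate on the server; the functions `bbox_intersects` and `search`, the record layout and the sample records are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def bbox_intersects(a: BBox, b: BBox) -> bool:
    """True if the two boxes overlap or touch."""
    a_min_lon, a_min_lat, a_max_lon, a_max_lat = a
    b_min_lon, b_min_lat, b_max_lon, b_max_lat = b
    return (a_min_lon <= b_max_lon and b_min_lon <= a_max_lon
            and a_min_lat <= b_max_lat and b_min_lat <= a_max_lat)


def search(records: List[Dict], map_bbox: BBox,
           keyword: Optional[str] = None) -> List[Dict]:
    """Return records inside the map area, optionally filtered by a title keyword."""
    results = []
    for rec in records:
        if not bbox_intersects(rec["bbox"], map_bbox):
            continue
        if keyword and keyword.lower() not in rec["title"].lower():
            continue
        results.append(rec)
    return results


# Hypothetical records with illustrative extents.
RECORDS = [
    {"title": "Groundwater recharge near Wellington", "bbox": (174.7, -41.4, 174.9, -41.2)},
    {"title": "Volcanic soils around Auckland", "bbox": (174.6, -37.0, 174.9, -36.7)},
]

visible_map = (174.0, -42.0, 175.5, -41.0)  # current map window
for rec in search(RECORDS, visible_map):
    print(rec["title"])
```

Because the map window itself supplies `map_bbox`, panning or zooming the map is enough to change the spatial filter.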

Figure 2. The search form of the implemented web application. In addition to keywords, the visible map area on the left can be used as a geographical search parameter for a catalogue query – users don’t have to enter location coordinates manually.

We highlight the significance of the integrative aspects of the ISO/OGC metadata standard (and the encoding that was adopted) for the overall implementation and application of spatial search. The same protocols are used in the geoportals of Estonia and New Zealand.

The study was published under an open access Creative Commons license (CC-BY 4.0) in the International Journal of Geo-Information with the title “Enhancing Location-Related Hydrogeological Knowledge” (http://www.mdpi.com/2220-9964/7/4/132).

This research was funded by the Mobilitas Pluss postdoctoral researcher grant no. MOBJD233 of the Estonian Research Council, an individual fellowship offered in the framework of Marie Skłodowska-Curie Actions by the Horizon 2020 Programme of the Research Executive Agency (grant agreement no. 660391), the Ernst Jaakson Scholarship from the University of Tartu Foundation and the SMART Aquifer Characterisation Programme of the Ministry of Business, Innovation and Employment of New Zealand (contract no. C05X1102).

Authors: Alexander Kmoch and Evelyn Uuemaa, Department of Geography, University of Tartu

The Estonian version of this article has been published at the Estonian Public Broadcasting science news portal Novaator.
