Database development process

Main authors:	Marc Laurencelle, Nicolas Surdyk, Matjaž Glavan, Birgitte Hansen, Claudia Heidecke, Hyojin Kim, Susanne Klages
FAIRWAYiS Editor:	Jane Brandt
Source document:	»Laurencelle, M. et al 2021. (Short note for the) database containing harmonised datasets, 28 pp. FAIRWAY Project Deliverable 3.3

Data collection was carried out first through surveys sent to the leaders of the 13 FAIRWAY case studies. The case study leaders found data of various types, quality, quantity and formats, and sent us those datasets. An Excel workbook template with detailed instructions was used for this purpose, to propose a uniform data exchange format for all case studies. Yet, many of the datasets we obtained from the case study leaders did not follow the template for various reasons. These “raw” datasets could still be processed and added to the database, in most cases. However, as expected, it was much easier and more time-efficient to integrate datasets prepared by the case study leaders using the template, versus “raw” datasets, which required special handling.

In 2020 and 2021, the case study leaders were contacted again to invite them to send additional data (for the CS that could collect usable data during the first surveys and were already present in the database) or all data (for the case studies that could not send anything earlier).

In parallel, we explored European and national databases online to find relevant data that could be added to the Excel database to enhance a case study data sheet when the case study leaders could not provide that type of information. This allowed us to add some interesting parameters to the database for some of the case studies, notably:

concentrations of nitrate from reports on water quality published on the EEA-website (EEA – Eionet – Central Data Repository: http://cdr.eionet.europa.eu/);
atmospheric deposition of nitrogen from the EMEP MSC-W model (EMEP MSC-W modelled air concentrations and depositions: https://emep.int/mscw/mscw_moddata.html);
national-average application rates of a given pesticide in Northern Ireland (AFBI website's search page: https://www.afbini.gov.uk/search?query=Pesticide+usage+report).

This was done mostly to test the feasibility and relevance of using very large-scale databases; it was not exhaustive. Overall, we found that the content of those databases was often hard to exploit either due to the data formats (too raw, unclear columns or metadata) or to the aggregated nature of the data (often averaged over one to several years and sometimes also over a large area). This is further discussed in »Conclusions.

In the end, the amounts of data that could be added to the database vary considerably between the 13 case studies (from <100 to >100,000 data rows: see Table 1. One of the main limiting factors to the effective availability of data for these research purposes is protecting personal data enforced by the new (May 2019) General Data Protection Regulation (GDPR) of the EU. In some case studies, the GDPR indeed made it difficult to get access to data, and more especially to plot-scale data (see »Evaluating agri-drinking water quality indicators in three case studies and »Conclusions: definition of boundaries).

We thus received, explored, and processed the data provided by case study leaders (along with the data from other sources) to gather it all in the database under development. Data reformatting was a major and time-consuming task. The provided data came from various sources and thus had many different formats, data types, and more or less accompanying information. The work to do was important even when the “Excel workbook template” was used properly by the case study leader, as there are several ways to copy-paste data from various sources during data collection, and because of the very varied formats of the raw data originally collected by the case study leaders, among other reasons.

Data cleaning was done in parallel to data reformatting, when weird/impossible/unusable values were detected in a dataset (e.g., a -999 numeric value for a parameter reporting daily precipitation rate in mm, or an “ND” text value in some numerical parameter). Those values, in most cases, were deleted.

Time information required special care, due to the various time scales being used. Dates having lost their proper format due to an unchecked copy-paste or format change of dates in an Excel environment were detected and corrected, for instance. Considering the time scales varied from year to month, day and evenly instant (date & hour of the day) timeframe, we chose to report the time information using two fields: the first giving the format (field “Date_Format”) and the second giving the format-dependent time value (field “Date”). These fields are detailed in »Detailed structure of the database.

An effort was made to harmonize the parameter names as much as possible within and between the case studies. Moreover, parameters were organized into groups and divided in sub-parameters when necessary, in order to facilitate exploration and exploitation of the database. Identification of a parameter is thus given in three levels:

parameter group,
parameter name and
sub-parameter name (again, see »Detailed structure of the database for more detail).

There were even a few cases where the parameter names were changed (corrected) after we better understood the actual nature of the data provided. For instance, a parameter originally named “crop yield” (with data in kg of N per ha) in a file we received was providing information on the “nitrogen consumption by crops”, which is quite different.

Another step of data processing has been to aggregate the data in time and/or in space, when relevant. This was done most largely with the data of Island Tunø and Aalborg, DK. They had many data at the farm level (main crop type, nitrogen consumption by crops, mineral fertilisation) and detailed groundwater-quality monitoring data with several observed values per year. Aggregation of those data was done using Python and R scripts, which calculated yearly total or average values at the catchment scale. The aggregated values were then added to the database (in addition to the raw data).

Table 1: Total amount of data rows per case study in the Excel database

D3.3 tab01

Once parameters corresponding to the Agri-Drinking Water quality Indicators (Hansen et al., 2021) required to calculate a compound indicator such as “Total mineralisation” or “N surplus” became available at a common time and spatial scale in the database for a given case study, the compound was calculated and the resulting time series was added to the database. Moreover, the area of some geographical objects of interest such as the catchments or field plots was calculated and added to the database as well when possible and relevant.

In the end, the Excel database contains 389,265 “Results” values (data rows) for >65 parameters and >500 sub-parameters. It includes some long time series of pressure and state indicator data that are especially interesting to statistically analyse and compare (see examples in »Using the database).

Note: For full references to papers quoted in this article see

» References

Go To Top