Main authors: Marc Laurencelle, Nicolas Surdyk, Matjaž Glavan, Birgitte Hansen, Claudia Heidecke, Hyojin Kim, Susanne Klages
FAIRWAYiS Editor: Jane Brandt
Source document: »Laurencelle, M. et al 2021. (Short note for the) database containing harmonised datasets, 28 pp. FAIRWAY Project Deliverable 3.3

 

In this article we define all fields that are present in the Excel database (and which can be downloaded from »Indicator database architecture).

 
Contents table
1. Datasheets: detailed description of the fields (data columns)
2. Summary sheets

1. Datasheets: detailed description of the fields (data columns)

The data sheets of the Excel database use a uniform scheme in terms of field (column) names and order. This data scheme is documented in the following two tables, which describe the mandatory fields (Table 2), and the optional fields (Table 3) added to the right end of a few CS datasheets. Note that mandatory fields marked by an asterisk must always be filled with a value and that the optional fields (columns) are always placed after the mandatory ones, so that it remains easy to merge several data sheets when necessary..

Table 2: Description of the mandatory fields in all data sheets of the Excel database. Fields are presented in the order they must follow. The asterisk put on a field name indicates that this field must always have a value (cannot be blank). The leftmost column of the table gives the Excel column letter (index) for that field in an Excel worksheet environment.

  Field name Description
A Case_Study*

Number-based identifier of the case study

Example: “CS_1” for case study n°1

Note: This field would be very useful to one who wants to prepare a large table with all data from the Excel database by merging all CS datasheets.

B Date_Format* Symbolic format of the “Date” values: several formats are allowed including:
  • “yyyy”: year only (yearly data),
  • “yyyy-mm”: year and month (monthly data),
  • “yyyy-mm-dd”: full date (daily data),
  • “yyyy-mm-dd HH:MM”: full date & time (instant data),
  • “mm”: (average monthly data: e.g., 30-year climatic average monthly precipitation data)
Note: The “Date_Format” must be uniform for a given parameter / sub-parameter of a CS.
C Data_Type* Tells if the reported “Results” value is of a “numerical” or “categorical” (any text) type This greatly facilitates the reading and processing of the “Results” field values. Note: The “Data_Type” must be uniform for a given parameter / sub-parameter of a CS.
D Date*

Time (instant time or period of time) at/during which the reported “Results” value is deemed representative

The “Date” value must strictly respect the associated “Date_Format”.

E Time_Scale*

The representative time scale for the reported value. Can be: “independent” (~assumed constant), “year” (or “multiyear” if the value applies during several consecutive years), “month”, “week”, “day”, or “instant”.

To better understand: an “instant” time scale is for a measurement representing an instantaneous state (typically observed at a precise date & time), whereas a “day” time scale is for a result value that is related to a full day (e.g., daily precipitation in mm). Non-instant time scales are most often used for parameter values that are totals, averages, or temporary constants (e.g., the main crop of a field plot for a given year).

F Sampling_Point_ID*

Identifier of the sampling site location, or more precisely: unique identifier to the geographical object that was sampled to produce the value reported in the “Results” field. The same ID can be found in the corresponding GIS file (in general).

Note: The ID for catchment-scale results can be either “Catchment”, the full name of the catchment or study area, or some other variant.

G Spatial_Scale* Representative spatial scale for the reported value. Can be: “catchment”, “sub-catchment”, “local…” (plot, piezometer, groundwater well, spring, along with/near stream…), “national”, “international”, etc. International or national scales are for some reference values (e.g., of Crop yields).
H Parameter_Group*

Group (family) of parameters, which is mostly there to facilitate exploration of a CS’ data or summary sheets

Possible values: “Site data”, “Quality data from farmers”, “Simple quantit farm data”, “Compound or calculated quantit farm data”, “State indicator”, “Link indicator”. (Note: “quantit” means “quantitative”.)

I Parameter_Name*

Parameter, which sometimes also acts as a group subdivided into several sub-parameters (see below)

Note: Some parameter names are implicitly defining subgroups by using a colon (“:”) to separate the main name (left part) and the subgroup names (right part), in special cases where the “Subparameter_Name” field is already used.

J Subparameter_Name (can be blank)

Subparameter: Most often “” (blank) whenever a parameter name suffices. A Subparameter can either be a subdivision of a Parameter, or provide additional information on the Parameter.

Note: Some sub-parameter names implicitly define subgroups by using a colon (“:”) or a comma (“,”) to separate a group name (left part) and its subgroup names (right part), or to provide additional information.

K CAS_Number (can be blank)

CAS Registry Number ideally provided for all non-element chemical substances. The CAS number may prevent some confusion when reading substance names.

Search for chemicals using EACH s. engine: https://echa.europa.eu/search-for-chemicals

L Results*

Measured, reported, or calculated value for the Parameter (and Subparameter) at the given location (Sampling_Point_ID) and time (Date)

Value: number or text (consistently with the specified “Data_Type”)

M Unit*

Measurement unit for the reported “Results” value. Mandatory. Specified as simple text avoiding special characters (e.g., “μ” →”u”).

Examples: “%”, “ug/L”, “boolean”, “kg N/ha”, “mg/L”, “m3/s”, “uS/cm”, “text”, “ha” … 

N Below_LQ (can be blank) 

Boolean (yes/no) values telling whether the reported “Results” value is below the Limit of Quantification. This column contains only the reported information because it is difficult and risky to assume something missing. Therefore, this field is blank most of the time.

Possible values: “Y”, “N” or “” (unknown) 

O LQ_Value (can be blank) Limit of Quantification value: rarely reported in most of the CS. No attempt was made to fill the blanks (for the same reasons as for the “Below_LQ” field). When it is reported, it is expressed in the measurement unit (“Unit”).
P Origin (should not be blank)

Information about where the data comes from:

  • “as reported” (in most cases) means that the data in the database comes from the datasets initially provided by the CS leaders during the surveys or first effective data exchanges;
  • “added in 20XX…” indicates additional data that was added to the database later on;
  • “reformatted from reported” indicates data that needed major changes in terms of formatting;
  • “translated from reported” indicates data that is not expressed in the language originally used.

Note: The “Origin” field was used quite freely to keep a trace of the changes made in the database. It is for information only.

Q Analytical_Method (can be blank) Analytical method or instrument used for the measurement. This information, primarily used for chemical measurements, is rarely reported. Accordingly, the field was also used to store various other information on the procedures used to prepare/obtain some of the calculated values (e.g., “Calculated with ArcGIS based on…”) or on the detailed source of the data.
R Top_of_screen_Depth

(can be blank) These 3 fields contain additional information about the precise Z location of the sampling:

  • The top-of-screen depth usually gives this information.
  • The bottom-of-screen depth is used only to give this specific information.
  • The approximate water-table depth, used only in CS 6 of Greece, was otherwise not used .

The “…screen…” fields were used to store screen-related information in CS 1 and 2 of Denmark only. In CS 10 of Norway, the “Top_of_screen_Depth” field was rather used to store information about the sampling depth in a lake or river (m below the water surface).

The 3 fields are completely blank in 9 of the 13 case studies.

Unit: meters below ground surface (general), or meters below the water surface (in CS 10)

Note: The fact that “approximate water-table depth” information was almost never provided by the CS leaders most likely means that such information was not readily available to them in any of the datasets they had access to. We think that in many cases this additional information did not already exist in the monitoring data sets, so that obtaining this information would have required a lot of extra work. In other words, this blank field tells us that this information is virtually never stored alongside the water-quality monitoring data, in practice.

S Bottom_of_screen_Depth
T Approx_WaterTable_Depth
U Place (can be blank)

U Indicates where the analysis was carried out (mainly for chemical substances), or gives another name for the sampling site, as the place of analysis was very rarely reported.

Typical values: “Field”, “Lab”, or “Unknown”.

V Confidentiality

Level of confidentiality of the reported “Results” data.

Possible values: “public” or “restricted”.

Only the “public” data can be made publicly available online. The “restricted” data should not be shared.

Table 3: Description of the optional fields that may be present to the right of the mandatory fields (columns) in some of the datasheets of the Excel database. These fields provide extra information that may be useful in specific situations.

Optional field name Description
(…) Sampling_Point_NAME (optional field) This optional field is sometimes used in some CS data sheets to store the long name of the sampling site when the “Sampling_Point_ID” value is a coded identifier. It plays a role similar to the “Place” field described above. For instance: In CS 10 of Norway, “003-38229” is the coded identifier (in “Sampling_Point_ID”) corresponding to the long name “Saebyvannet” (stored in the “Sampling_Point_NAME” field)
(…) EIONET_Source_URL This optional field is used in the CS data sheets where public data from the European database EIONET has been imported. It gives a link (URL) to the web page that was accessed in order to download the data.
(…) Orig_Parameter_Name
(…) Orig_Sampling_Point_ID
(…) Orig_Results
(…) Orig_Analytical_Method (optional fields)
Extra fields added to a CS datasheet to facilitate backward linking with the previous (more raw) versions of the dataset. They are used notably when numerous texts had to be translated to English, when parameter names changed profoundly in terms of format or language, or when some Sampling_Point_ID had to be corrected for consistency reasons.

2. Summary sheets

The summary sheets in the Excel database provide an overall picture of the content of each CS datasheet. The different parameters present in the datasheet are listed, along with the measurement units, scales of observation, and data sources. Some basic statistics including the number of reported results and different sampling sites, and the minimum & maximum dates of observation, are given for each parameter. The GIS-related information required to link the tabular data to the GIS data is also provided in two complementary columns (Table 4). An illustration of the linking process between a parameter of a CS in the Excel database and the GIS data is presented again in (Figure 3).

Table 4: Description of the columns providing GIS related information in a summary sheet of the Excel database

Optional field name Description
GIS_Related_Files This column gives the filename (and relative path when relevant) of the GIS file containing the spatial features related to the parameter. If no GIS data is linked to the parameter, GIS_Related_Files = “NONE”. Note: If there is more than one GIS file linked to the parameter, the multiple filenames listed in the column are separated by a vertical bar symbol (“ | ”).
GIS_Key_Field This column specifies the name of the field of the GIS file’s attribute table that should contain the IDs that are present in the datasheet for that parameter.
GIS_Key_Value This column specifies the ID of a specific spatial feature of the GIS file. Not needed in most cases.

D3.3 fig03
Figure 3

 


Note: For full references to papers quoted in this article see

» References

 

Main authors: Marc Laurencelle, Nicolas Surdyk, Matjaž Glavan, Birgitte Hansen, Claudia Heidecke, Hyojin Kim, Susanne Klages
FAIRWAYiS Editor: Jane Brandt
Source document: »Laurencelle, M. et al 2021. (Short note for the) database containing harmonised datasets, 28 pp. FAIRWAY Project Deliverable 3.3

 

Articles cited in this section of FAIRWAYiS

  • Brown C.D, Holmes C., Williams R., Beulke S., van Beinum W., Pemberton E., Wells C. (2007) How does crop type influence risk from pesticides to the aquatic environment? Environmental Toxicology and Chemistry, 26:1818–1826
  • Cooper RJ, Hiscock KM, Lovett AA, Dugdale SJ, Sünnenberg G, Vrain E. Temporal hydrochemical dynamics of the River Wensum (2020), UK: Observations from long-term high-resolution monitoring (2011-2018). Sci Total Environ. 724:138253. doi: 10.1016/j.scitotenv.2020.138253.
  • Dubus, I.G., Surdyk N. (2006). State-of-the-art review on pesticide fate models and environmental indicators. Report DL#4 of the FP6 EU-funded FOOTPRINT project, 39p.
  • Kim, H.; Surdyk, N.; Møller, I.; Graversgaard, M.; Blicher-Mathiesen, G.; Henriot, A.; Dalgaard, T.; Hansen, B (2020). Lag Time as an Indicator of the Link between Agricultural Pressure and Drinking Water Quality State. Water , 12, 2385. https://doi.org/10.3390/w12092385
  • Hansen, B., Kim, H., Møller, I., Henriot, A., Laurencelle, M., Dalgaard, T., Graversgaard, M., Klages, S., Heidecke, C. Surdyk, N. 2021 Evaluation of ADWI's: Agri-Drinking Water quality Indicators in three case studies. Fairway Delivrable WP3 3.2
  • Klages, S., Surdyk N., Christophoridis C., Hansen, B., Heidecke, C., Henriot, A., Kim, H., Schimmelpfennig, S. (2018). Review report of Agri-Drinking Water quality Indicators and IT/sensor techniques, on farm level, study site and drinking water source. Fairway Delivrable WP3 3.1
  • Klages, S.; Heidecke, C.; Osterburg, B.; Bailey, J.; Calciu, I.; Casey, C.; Dalgaard, T.; Frick, H.; Glavan, M.; D’Haene, K.; Hofman, G.; Leitão, I.A.; Surdyk, N.; Verloop, K.; Velthof, G. (2020) Nitrogen Surplus—A Unified Indicator for Water Pollution in Europe? Water , 12, 1197. https://doi.org/10.3390/w12041197

 

Main authors: Marc Laurencelle, Nicolas Surdyk, Matjaž Glavan, Birgitte Hansen, Claudia Heidecke, Hyojin Kim, Susanne Klages
FAIRWAYiS Editor: Jane Brandt
Source document: »Laurencelle, M. et al 2021. (Short note for the) database containing harmonised datasets, 28 pp. FAIRWAY Project Deliverable 3.3

 

The indicators database can be used in several ways. It may be used to explore data or to calculate additional indicators. Depending of the case studies interests, the most commonly available State indicators are about nitrate and pesticides concentrations in water. From a practical point of view based on its actual content, the database may notably be used to explore statistical relations between related Pressure and State indicators.
Contents table
1. Cross-correlation analysis
2. Exploratory data analysis 
3. Using the database outside Excel

To illustrate our reasoning, Figure 4 compares between selected features of pesticide fate models and environmental risk indicators. Environmental indicators for pesticides are mainly devised to be used by farmers and extension advisers in each set of agro-environmental conditions that are usually specific to one country. Indicators generally have low data requirements, are easy to use, can be calculated quickly, are amenable to the non-expert, but they often suffer from a lack of scientific validation and the fact that the combination of the various processes is done on a subjective basis. Given the very significant efforts put into the evaluation and validation of pesticide fate models over the last 20 years and, at the same time, the opposite and complementary profiles of the two types of tools, an objective of the FAIRWAY project has thus been to improve the evaluation of some environmental indicators. That is why in this Monitoring & Indicators part of the research programme we focused on investigating statistical relationships (correlations) between pressure and state indicators.

D3.3 fig04
Figure 4

Two main families of relations between Pressure and State indicators were thus explored using the database, (see also »Evaluating agri-drinking water quality indicators in three case studies):

  • The relation between gross nitrogen budget (Pressure) indicators (e.g., Nitrogen surplus ) and water quality (State) indicators (e.g., Nitrate concentration in groundwater), and
  • The relation between pesticide application rate (Pressure) indicators (e.g., the estimated total volume of atrazine applied each year) and concentration of pesticide (State) indicators (e.g., the concentration of atrazine and DEA in spring water).

Note that simpler Pressure indicators may be considered when it is impossible to compute the Nitrogen surplus indicator, for instance: Mineral fertilisation or Main crop type ideally combined with an N surplus (or/and crop fertilisation) indicator. However, these simple indicators remain overall less effective than the compound N surplus indicator (see »Lag time estimation using other indicators). Similarly, a pressure indicator not directly quantifying application rate (e.g., N surplus or Mineral fertilisation) may be tried as a proxy for pesticide application rate when no such indicator is directly available (assuming a positive relation between pesticide application rate and N surplus). It may not produce significant correlations in several cases, but it is worth a try still.

1. Cross-correlation analysis

The exploration of statistical relationships between pairs of Pressure and State time series aims to provide answers to the applied question “How long do we have to wait to see the impacts of changes in agricultural practices?” This delay was investigated through cross-correlation analyses of the aggregated data series (by year, for the whole catchment) in the case studies where data were available. These analyses produced “lag time” quantitative information.

1.1 Lag time as a Link indicator

The “lag time” is an estimation of the average delay in the response of an “output” time series to the pulses (fluctuations) of an “input” time series. Technically, it corresponds to the delay that (when cancelled in the output time series by subtracting it to its time values) maximises the correlation between the two thus “aligned” time series. From an applied perspective, the lag time provides an efficient means to estimate the travel time of contaminants in the groundwater system up to the sampling points. For instance, a pair of related indicators that were studied using cross-correlation analysis consisted of Nitrogen surplus as the input signal and nitrate concentration in groundwater as the output signal.

That work on lag time estimation and interpretation for three FAIRWAY case studies (Island Tunø, Aalborg and La Voulzie), was published in a scientific paper (Kim et al., 2020) and is summarised in »Evaluating agri-drinking water quality indicators in three case studies. In that work, the computed lag times were also compared to groundwater ages estimated using environmental tracers, which revealed that calculated lag times are generally shorter than tracer-based groundwater age estimates (Kim et al., 2020).

Those lag-time and groundwater-age data sets were added to the database as “Link” indicators when available.

1.2 Lag time related to nitrate contamination of groundwater

Figure 5 shows an example of graphical results from a cross-correlation analysis of pressure and state indicators related to nitrate contamination of groundwater. The top graph in the figure shows how the correlation coefficient r varies as a function of the considered lag (delay removed from the “NO3 conc.” time series). The dotted blue line indicates the r threshold below which the r values are not statistically significant. This graph thus informs that the (peak) correlation (r = 0.79) found at a lag = +19 years is significant. However, the found lag cannot be deemed precise since several r values are very close to the peak r value. Nonetheless, the bottom graph confirms that the two-time series show a similar variation over time once the delay of the response of the “NO3 conc.” series is removed. In this particular case, it will be interesting to repeat this cross correlation analysis in 5–15 years to see if the NO3 concentrations continue to decline, as to suggest in this present-day finding.

D3.3 fig05
Figure 5
D3.3 fig06
Figure 6

Figure 6 shows another example of cross correlation analysis of NO3 concentration and N surplus. In this case, the r threshold (below which r values are not statistically significant) is very high, there is only one r-value > 0.5, and it is not significant. This is because the two times series are very short in duration and do not share enough common features. This example highlights that it is often impossible to assess the lag time when the pressure and/or state time series are too short compared to the time interval it would take to fully capture the transient feature we wish to study in the two signals.

1.3 Lag time related to pesticide contamination of groundwater

Figure 7 shows an example of results from a cross correlation analysis of pressure and state indicators related to atrazine (pesticide) contamination of groundwater. The top graph shows that a (peak) correlation (r = 0.94) found at a lag = +22 years is significant. However, as for the nitrate related analysis, the lag identified cannot be deemed precise since the r values around it are very close to the peak value. Nonetheless, the bottom graph shows that the two series fit quite well visually once the delay of the response of the state series is removed. Furthermore, this particular analysis reveals the importance of having longer time series of both pressure and state indicators. Indeed, the seemingly satisfying fitting of the two series after the lag of 22 years was removed may not be the best answer in this case. Maybe a lag time of 15 years, 9 years, or even less would be a closer estimate of the actual “age” of the atrazine contaminant since it was introduced in the groundwater system. Having longer and more continuous and detailed time series would likely help narrow down the uncertainty in the lag time estimate.

D3.3 fig07
Figure 7

Overall, the above interpretations of three examples of cross-correlation analyses underscore the importance of carefully examining the detailed results of such analysis, especially when the length or quality of the time series is limited.

1.4 Lag time estimation using other indicators

In addition, to the indicators presented above, some other indicators could be tested (as previously presented in Hansen et al, 2021). In the La Voulzie, FR case study, lag times were also estimated using the main crop area (i.e. wheat in that case) as the pressure indicator. Thus, the correlation coefficients were smaller than those obtained with other N pressure indicators (N budget and Mineral fertilisation). For example, the estimated lag time was 0 year at the main spring with a weak (r = 0.54) yet statistically significant correlation (Table 5). Using soil occupation directly as an indicator was shown insufficient but the connection between crop area and the presence of specific herbicides in surface waters is documented (e.g Brown et al, 2007).

Maybe that no correlation was found in La Voulzie because the wheat-area time series was available from 1994-2014, whereas the mineral fertilizer input and potential N leaching were available from the 1950s to 2018.

In La Voulzie statistically significant correlations were also found between mineral fertilisation and yearly average concentrations of nitrate in the springs. The estimated lag times at the top, main, and bottom springs were 7, 14 and 14 years, respectively. The lag times estimated using the potential N leaching were similar to those obtained with mineral fertilisation (Table 5). The mineral fertilisation seems to be an appropriate pressure indicator at the catchment scale if a lag time is to be calculated. Nevertheless, the mineral fertilisation indicator can only be applied in areas where no organic fertilizers are applied, that is to say, regions with no or negligible animal breeding.

Table 5: Estimated lag times (in years) using main crop surface or mineral fertilisation as pressure indicator in La Voulzie, FR case study (From Hansen et al., 2021)

Monitoring points Lag time with main crop surface as pressure indicator (correlation coefficient) Lag time with mineral fertilisation as pressure indicator (correlation coefficient)
Top spring 4(0.65)** 7( 0.95)**
Main spring 0(0.54)* 14(0.86)**
Bottom spring 8(0.70)* 14(0.93)**

*Statistically significant at p<0.05; **p<0.005; † Outer protection zone

Statistically significant correlations were also found between fertilisation and the yearly average concentrations of nitrate in Denmark and Slovenia case studies. In Tunø Island, DK, the correlation analyses yielded significant results at 15 points out of 24 monitoring points. At the 15 points with significant results, the tested pressure indicator (mineral fertilisation) showed strong correlations with the state indicator (yearly average nitrate concentrations). At the other 9 points, either the time series were too short and/or sparse, or the concentrations were relatively invariant at low concentrations ranges (10 mg/L) near the limit of quantification (LQ); why those other results were not significant nor reliable.

In Dravsko Polje, SI case study, the correlation analyses yielded significant results between the amount of fertilizers applied on the catchment (in tons of nitrogen) and the nitrate concentrations (i.e. the average concentration of nitrate in groundwater wells in the catchment). The best correlation was found for a lag time of 4 years. This lag time could seem short considering that the depth to the water table could reach 15 m in the northern part of the study area. But the Slovenian case study leader (University of Ljubljana) estimated that this lag time could be correctly calculated. In the upper terrace, the proportion of rock/gravel/pebbles is 40%. The proportion go higher with depth. On lower terrace, soils contain a high share of sand. Lag times have also connection to drought and precipitation periods (dry and wet years) that impacts nitrate leaching trough soil profile.

In the Anglian Region, UK case study, no lag times were calculated. Since the catchment is surface, the transfer times seem to be annual, and a relationship was found between the month of year and concentration. More closely, the relationship appears between monthly rainfall and nitrate concentration.

Cooper et al, 2020 proposed an explanation to this phenomenon. During winter, the monitoring point is characterised by high nitrate concentrations resulting from a precipitation-induced leaching through agricultural field (especially in exposed arable land). Conversely, during the summer, the monitoring point is characterised by low nitrate concentrations due to low precipitation rates and also because arable crop growth absorbs excess nitrogen in the soil.

2. Exploratory data analysis

More generally, the numerous time series and time-independent data of the database can be explored and analysed in various ways.

At the simplest level, the data can be explored individually, visually, by opening the Excel database and scrolling in one of its datasheets. However, since there are thousands of data rows in the case study datasheets, it is much more efficient and appropriate to explore the data sets by the parameter(s). One easy way to carry this out within Excel is to use the AutoFilter on the “…parameter name” fields to select the indicators you wish to extract. Once the filters are properly set, you can copy the visible dataset to the clipboard, and then paste it to another Excel workbook. You may then create graphs from this specific dataset. Suppose you want to prepare a graph displaying several time series. In that case, you may first create a graph with a first time series, and then use the Excel Paste Special options to add more XY series iteratively to the graph as you select each of the other indicators (X = “Date” and Y = “Results”) you want to display in the graph. You can also create histograms, box plots, and many other types of graphs, within an Excel environment. In addition, you may use selections and formulas to calculate various statistics (mean, median, min, max, counts, etc.) for the indicators of interest. Alternatively, all sorts of graphs can be generated using scripting languages (see »Using the database outside Excel).

Figure 8 provides an example of a stacked area chart that was prepared from selecting of four indicators related to Nitrogen inputs in La Voulzie. This specific graph provides an efficient means to assess the relative contributions of the “input” components of the N-surplus compound indicator.

D3.3 fig08
Figure 8
D3.3 fig09
Figure 9

Figure 9 provides an example graph where six indicators (time series) from the La Voulzie are displayed side by side. Note that the “Recharge” indicator was calculated from the rainfall and potential evapotranspiration (“PET”) indicators. This type of graph is very useful when exploring the datasets. It helps to detect similarities between the different signals. For instance, in this graph we see notably that the fluctuations in spring discharge are very similar to that of the hydraulic head and that peaks of recharge propagate in the spring discharge and hydraulic head signals after a delay of a few months. Complementary cross-correlation analyses (not shown here) allowed us to assess this hydraulic lag time more precisely: the spring appears to have an average delay of response of about 5 months.

3. Using the database outside Excel

Although it is possible to use this database within the Excel software environment to calculate statistics and produce various graphs, it may be more efficient to use other data analysis tools to exploit this database. In particular, scripting languages such as R or Python may be used (R: https://cran.r-project.org/; Python: https://www.python.org/). These programming languages and environments include hundreds of packages and functionalities that offer more flexibility in all steps of database exploitation, from data import and processing to data transformation, calculation of statistics and graphical representation. Here are two operations that may be applied to facilitate external use of the database:

  • Creation of a single large table containing the data from all case studies. As mentioned earlier, it remains possible to merge all data sheets into one large table, thanks to the uniform scheme used for the columns in all case study datasheets. An easy way to carry this out is to append (copy & paste) all rows of data from every Excel database sheet into a new Excel sheet. The resulting large table, saved as an Excel file, can then be read directly (in Excel format) by software or scripts, while other software and scripts will prefer to read a delimited text file.
  • Creation of a text file version of the Excel datasheet of interest. Indeed, it is often simpler and safer (more comprehensive) to read text files rather than Excel files within scripting environments . Any Excel sheet can be exported to a text file. The recommendation is to export as a tab-delimited text file, as this special character (tabulation) is not used in any field of the database.

Reading and exploiting the database using scripts is not straightforward, however. It requires the development of specific routines to read the data file(s) like

  1. process the flexible “Date” information;
  2. extract data for a given parameter, sub-parameter, sampling point (site) and spatial / time scale;
  3. aggregate an indicator at coarser time scales;
  4. perform various other calculations;
  5. generate graphs; etc.

Although we do not include such tools here, this document will help the reader interested in the programming of them. Here is a recommendation on how to read a case study datasheet, exported as a tab-delimited text file, efficiently and safely:

  1. Read the file using a function that creates a “data frame” from it with all columns treated as text (no automatic conversion of numeric values or dates).
    - Alternatively, if the scripting language allows reading an Excel file with automatic conversion disabled, you may try to read a data sheet directly from the Excel file.
  2. Convert fields containing only numeric values to actual numeric values (e.g., “LQ_Value”, “Top_of_screen_Depth”, etc.)
  3. Determine the list of all different “indicators” (parameters / sub-parameters) in the imported dataset. Keep a copy of this list in memory.
  4. Process the dataset one indicator at a time:
    - Select the data rows that correspond to that indicator.
    - Process the “Date” information according to the “Date_Format”.
    - Convert the “Results” text values to numeric values if “Data_Type” = ‘numerical’.
    - Split the selected data rows by site and by time scale (if there are many) so that each group of rows describes only one parameter or sub-parameter, for only one site (and spatial scale) and one-time scale (in case of time-dependent data).
  5. Gather all of these groups of data rows in a named hierarchical list.
  6. Convert time-dependent data to time series objects if the scripting language allows it.
  7. Keep summary information on every indicator along with the data series, for instance:
    ..$ State indicator.Concentration of nitrate :List of 9
    .. ..$ Parameter_Group : chr "State indicator" .. ..$ Parameter_Name : chr "Concentration of nitrate"
    .. ..$ Subparameter_Name: chr ""
    .. ..$ Spatial_Scale : chr "local, groundwater well"
    .. ..$ Time_Scale : chr "instant"
    .. ..$ Data_Type : chr "numerical"
    .. ..$ Unit : chr "mg/L"
    .. ..$ MANY_full_params : logi FALSE
    .. ..$ Nb_Results : int 3700
  8. Program an easy-to-use function allowing its user to specify the indicator he wishes to get (and for which site and time scale if applicable) right before performing the data extraction.
  9. Use this function in a script to extract indicators for a specific purpose.
  10. Carry out further data processing on the extracted indicators (if needed).
  11. Carry out analyses on the extracted indicators.

 


Note: For full references to papers quoted in this article see

» References

 

Main authors: Marc Laurencelle, Nicolas Surdyk, Matjaž Glavan, Birgitte Hansen, Claudia Heidecke, Hyojin Kim, Susanne Klages
FAIRWAYiS Editor: Jane Brandt
Source document: »Laurencelle, M. et al 2021. (Short note for the) database containing harmonised datasets, 28 pp. FAIRWAY Project Deliverable 3.3

 

As described in »Database development process, the database was not populated in the same way by all the partners. Several limiting factors and issues were identified throughout this research task.
 
Contents table
1. Definition of boundaries  
2. Local data through European databases 
3. Timescale of the monitoring data
4. Institutions with different aims

1. Definition of boundaries

When it is needed to compare a state indicator against a pressure indicator, the first question to ask is generally the scale of work. According to a hydrogeologist, the best scale is the catchment area because it represents a homogeneous zone that collects water flows. A catchment area is a physical extent based on hydrogeological/hydrologic data that cannot easily be linked to administrative or agricultural pre-existing territorial divisions. It is indeed often complicated to have access to farming data matching the catchment area. And thus, data from larger territorial divisions or smaller divisions frequently have to be used.

One possibility is to use data from scales smaller than the catchment area: generally data at the plot scale. However, listing plots is not the most time-consuming task. In a first approach, we asked for data at the plot scale but collecting data at this scale was not feasible in most case studies. Even if the case study leaders were able to collect the location of plots, collecting the rest of the data we asked at the plot scale was too time consuming. Notably, collecting soil occupation or fertilisation data at plot scale was too time consuming in many case studies, especially in the largest ones such as Arges-Vedea, RO. Some partners had to find alternative ways to define a much smaller sub-catchment to fulfil the task at this scale (Anglian Region, UK).

The problem of the plot scale, even if it gives an exact detailed representation of the catchment, is thus that it requires too much effort to collect data at this level of precision for indicators such as crop type, fertilisation and crop yields. The efforts are even more important in a deeper, more inertial aquifer since the data will need to be collected for several years or even decades. (For example in Anglian Region, UK it has been a challenge to collect the data, not helped by all of the people who collected the data c.20 years ago having either retired or died.) The partners that fulfilled this task with the more ease had started working in the study site before FAIRWAY and had already started collecting some data.

One might ask why so precise and hardly reachable data was asked. The fundamental reason is that in Monitoring & Indicators we aimed to select the best (scientifically-sound, fast-responding, and intuitive) indicators to be used by “local” stakeholders in the decision making processes. Therefore, datasets that can precisely illustrate the cause-effect relations at the local level where needed. In this part of FAIRWAY we worked at the catchment scale while other research tasks worked at regional or national scales. A second reason why precise data was asked is that most indicators could be reliably tested only if the input data was precisely known.

Plot scale data are, since 2016, even more, problematic to handle. Many member states protect the personal data. This includes the farming practices data since, in many cases, they are linked to a plot and thus to a farmer. Until 2016, this general statement could have been circumvented under certain conditions for a research project. However, in 2016, a new regulation on data protection (EU 2016/679) was enforced. Questions on confidentiality of farm data, therefore, arose in conjunction with the beginning of FAIRWAY. For instance, the Lower Saxony, DE found no possibilities to send data at the plot scale. Note, this is not issue due to data collected at the plot scale but it could be at the farm scale also. The Overijssel, NL case study was not allowed to share private data (including nitrate concentrations) of farmers.

Many case studies are not even catchment areas but are larger (Figure 10). These case studies are larger areas defined for a specific need with a certain homogeneity (farming practices, geology). These large parts of territory include several abstraction wells.

D3.3 fig10
Figure 10

For instance, the Lower Sazony, DE case study is based in a Federal State project on manure treatment and export from intensive animal production (north-west) regions to regions with arable farming (east). The main partner was the Lower Saxony Chamber of Agriculture. A large territorial scale is needed to exchange manure between areas. The Anglian Region, UK casse study comprises three quite small areas within the Anglian water areas within the larger Anglian Water Drinking Water company total area. All areas are independent surface water catchment within the same region geographically so similar altitude, latitude, rainfall, farming pressures. The water company area is one of the reasons why the case study has a large territorial scale.

When having a very large case study, more data (about water quality and farming practices) must be collected to properly calculate pressure indicators. Not all of the abstraction wells within the study area will necessarily have the same trends in pollutants, so different ways to apply indicators in these cases have to be used. This scale of work is designed to get large Multi-Actor-Platforms and the territory needs to be split to understand the impacts on each abstraction well. But in such MAPs, there is no desire to share/divide the territory into smaller parts: the MAPs are built to be large enough to create a momentum that can gather local stakeholders' goodwill and significantly impact the environment.

The Anglian Region, UK case study added data for the Waveney micro-catchment area in the database. The Waveney micro-catchment area is not a formal part of the FAIRWAY project; it just happens to be in the same Drinking Water company area and is a micro-catchment where FAIRWAY partner ADAS was able to find the detailed data that was required (Figure 10).

2. Local data through European databases

It was envisaged in the early stage of FAIRWAYto use European Databases to fill the database when local data sources were missing. As explained above, using too-large-scale data is not feasible. The European databases providing data at the national scale were thus discarded.

However, it is possible to use default values at the regional scale. For instance, data from the EMEP MSC-W model database could be used for atmospheric deposition (see »Database development process). Moreover, the European database also can provide data at the local scale. For instance, reports on water quality are published on the EIONET Central Data Repository website (EIONET Central Data Repository website: http://cdr.eionet.europa.eu/). This data is sometimes public. Data available for La Voulzie, FR, North Greece, GR and Dravsko Polje, SI were downloaded to test whether this data could easily be used. Those countries were tested because their data was public.

For France, the extraction wells used in the case study are not part of the EIONET database because the site was not selected to be reported to Europe as part of the Nitrate Directive. Data for a similar site (same groundwater body, same catchment) is available. Concentrations in NO3 are available for 2015 and 2019. For Slovenia and Greece, concentrations for several drinking water extraction wells are available on the groundwater body below the case studies (Case studies are larger than the French one for these two countries). In Slovenia, concentrations in NO3 are available from 2008 to 2019. In Greece, concentrations in NO3 are generally available for 2018 and 2019, and an average concentration is available for 2013-2015.

Thanks to the EIONET database, ten years of NO3 concentrations dataset with a yearly time step could be obtained for the Slovenian case study. In many other case studies, very little data is available (i.e., few years only) or data is not available. We contacted the EIONET team to have access to more data, but we would have to contact each member state services independently to have more data.

The EIONET / EEA website (EEA website: https://www.eea.europa.eu/data-and-maps/data/waterbase-water-quality-icm) also provides data for many surface water and groundwater points between 1960 and 2017 for almost every member state. Each point has its own pattern of data. A point could have one data only, whereas another one could have ten data gathered in the 1990s. No regular pattern could be observed. Data covers substances like nitrate or some pesticides, but the database also contains data about pH or conductivity. However, as explained above, inconsistency in data collection patterns often jeopardize the construction of time series.

3. Timescale of the monitoring data

Another major issue in the collected data is the limited time span of the indicator data. In some cases the sites are being followed for a long time (e.g., >50 years in La Voulzie, FR), whereas in other cases the sites were followed for a couple of years only (e.g. <10 years in Derg Catchment, UK). Moreover, for the most effective use of the database, it is further needed that time series share a long period in common. This is notably required to allow cross correlation analyses of pressure and state indicators with the consideration of response lags that can be long (>5 years and even >10 years in several cases) when the studied aquifer system is large and/or inertial (for example see »Using the database: cross-correlation analysis).

4. Institutions with different aims

The FAIRWAY consortium comprises many different institutes, each of them having its own field of work. None of the project partners had in its regular missions to collect both farming practices (Pressure Indicator) and hydrogeology/hydrology data (State Indicator).

In the case studies, organisations generally rely on a third party to get their missing data. For instance, BRGM (La Voulzie, FR case study) manages the French national database on groundwater quality, giving easy access to data on contaminant concentrations in groundwater. However, it still had to ask the catchment manager (Animateur de bassin) to access farming practice data. A project on this topic had started earlier (in 2015) in La Voulzie so data had already been exchanged, and data for the missing recent years was easy to acquire. AFBI (Derg Catchment, UK case study) and GEUS (Island Tunø and Aalborg, DK case studies) were involved in projects on their case studies that started before FAIRWAY and had already begun to collect data. They had little difficulty providing data to the database.

Some other partners did not have a prior project on their case study, so it was a long task to get data outside their usual field of work for their case study. First, a third party had to be identified, and then appropriate requests had to be made. The larger the catchment/case study area, the less likely the answers from the third party were positive since the work of data collection involved would have been too important (see »Definition of boundaries).

Access to a third party that can provide extra data is a key problem. Institutes (research institutes, universities or private companies) specialised in a given field of environmental sciences or management generally have easier access to the data or the third parties with the data on their area of expertise. Just knowing that a third party has the data is not enough, though, to convince the third party to share the data for a project in which he is not involved.

The Lower Saxony, DE case study is involved in the pre-existing project on manure treatment and export. The partners (University of Thünen and Chamber of Agriculture) could not supply information on the water quality of specific wells without asking a third party outside the project. Two attempts to collect data with the help of the water supplier “OOWV” were made: first, the record time scale of the data (about 15 years) was not sufficient, as water transit time was more than 30 years. A second attempt was thus made with another catchment. But data of this very small catchment did not show a good response to mitigation measures by farmers in the related cooperation area.

For new case studies and new MAPs, especially the largest ones (e.g. North Greece, GR) it was unachievable to collect farming practices data at the plot scale. Either because they could not collect data themselves or because they did not find a specialised third party that could have carried out the task for them. As reported, case studies could have difficulties getting access to the data outside of their field of work. On the other hand, they have easy access to data in their field: for instance, they know which national databases contain relevant and readily usable data. Moreover, they have easy access to all data they have already collected. For example, the Overijssel, NL provided us with a complete database of piezometric head data also available online from a Dutch website (only a sample was included in the FAIRWAY database).

 


Note: For full references to papers quoted in this article see

» References

 

Main authors: Marc Laurencelle, Nicolas Surdyk, Matjaž Glavan, Birgitte Hansen, Claudia Heidecke, Hyojin Kim, Susanne Klages
FAIRWAYiS Editor: Jane Brandt
Source document: »Laurencelle, M. et al 2021. (Short note for the) database containing harmonised datasets, 28 pp. FAIRWAY Project Deliverable 3.3

 

Data collection was carried out first through surveys sent to the leaders of the 13 FAIRWAY case studies. The case study leaders found data of various types, quality, quantity and formats, and sent us  those datasets. An Excel workbook template with detailed instructions was used for this purpose, to propose a uniform data exchange format for all case studies. Yet, many of the datasets we obtained from the case study leaders did not follow the template for various reasons. These “raw” datasets could still be processed and added to the database, in most cases. However, as expected, it was much easier and more time-efficient to integrate datasets prepared by the case study leaders using the template, versus “raw” datasets, which required special handling.

In 2020 and 2021, the case study leaders were contacted again to invite them to send additional data (for the CS that could collect usable data during the first surveys and were already present in the database) or all data (for the case studies that could not send anything earlier).

In parallel, we explored European and national databases online to find relevant data that could be added to the Excel database to enhance a case study data sheet when the case study leaders could not provide that type of information. This allowed us to add some interesting parameters to the database for some of the case studies, notably:

This was done mostly to test the feasibility and relevance of using very large-scale databases; it was not exhaustive. Overall, we found that the content of those databases was often hard to exploit either due to the data formats (too raw, unclear columns or metadata) or to the aggregated nature of the data (often averaged over one to several years and sometimes also over a large area). This is further discussed in »Conclusions.

In the end, the amounts of data that could be added to the database vary considerably between the 13 case studies (from <100 to >100,000 data rows: see Table 1. One of the main limiting factors to the effective availability of data for these research purposes is protecting personal data enforced by the new (May 2019) General Data Protection Regulation (GDPR) of the EU. In some case studies, the GDPR indeed made it difficult to get access to data, and more especially to plot-scale data (see »Evaluating agri-drinking water quality indicators in three case studies and »Conclusions: definition of boundaries).

We thus received, explored, and processed the data provided by case study leaders (along with the data from other sources) to gather it all in the database under development. Data reformatting was a major and time-consuming task. The provided data came from various sources and thus had many different formats, data types, and more or less accompanying information. The work to do was important even when the “Excel workbook template” was used properly by the case study leader, as there are several ways to copy-paste data from various sources during data collection, and because of the very varied formats of the raw data originally collected by the case study leaders, among other reasons.

Data cleaning was done in parallel to data reformatting, when weird/impossible/unusable values were detected in a dataset (e.g., a -999 numeric value for a parameter reporting daily precipitation rate in mm, or an “ND” text value in some numerical parameter). Those values, in most cases, were deleted.

Time information required special care, due to the various time scales being used. Dates having lost their proper format due to an unchecked copy-paste or format change of dates in an Excel environment were detected and corrected, for instance. Considering the time scales varied from year to month, day and evenly instant (date & hour of the day) timeframe, we chose to report the time information using two fields: the first giving the format (field “Date_Format”) and the second giving the format-dependent time value (field “Date”). These fields are detailed in »Detailed structure of the database.

An effort was made to harmonize the parameter names as much as possible within and between the case studies. Moreover, parameters were organized into groups and divided in sub-parameters when necessary, in order to facilitate exploration and exploitation of the database. Identification of a parameter is thus given in three levels:

  1. parameter group,
  2. parameter name and
  3. sub-parameter name (again, see »Detailed structure of the database for more detail).

There were even a few cases where the parameter names were changed (corrected) after we better understood the actual nature of the data provided. For instance, a parameter originally named “crop yield” (with data in kg of N per ha) in a file we received was providing information on the “nitrogen consumption by crops”, which is quite different.

Another step of data processing has been to aggregate the data in time and/or in space, when relevant. This was done most largely with the data of Island Tunø and Aalborg, DK. They had many data at the farm level (main crop type, nitrogen consumption by crops, mineral fertilisation) and detailed groundwater-quality monitoring data with several observed values per year. Aggregation of those data was done using Python and R scripts, which calculated yearly total or average values at the catchment scale. The aggregated values were then added to the database (in addition to the raw data).

Table 1: Total amount of data rows per case study in the Excel database

D3.3 tab01

Once parameters corresponding to the Agri-Drinking Water quality Indicators (Hansen et al., 2021) required to calculate a compound indicator such as “Total mineralisation” or “N surplus” became available at a common time and spatial scale in the database for a given case study, the compound was calculated and the resulting time series was added to the database. Moreover, the area of some geographical objects of interest such as the catchments or field plots was calculated and added to the database as well when possible and relevant.

In the end, the Excel database contains 389,265 “Results” values (data rows) for >65 parameters and >500 sub-parameters. It includes some long time series of pressure and state indicator data that are especially interesting to statistically analyse and compare (see examples in »Using the database).

 


Note: For full references to papers quoted in this article see

» References

 

Go To Top