|Main authors:||Marc Laurencelle, Nicolas Surdyk, Matjaž Glavan, Birgitte Hansen, Claudia Heidecke, Hyojin Kim, Susanne Klages|
|FAIRWAYiS Editor:||Jane Brandt|
|Source document:||»Laurencelle, M. et al 2021. (Short note for the) database containing harmonised datasets, 28 pp. FAIRWAY Project Deliverable 3.3|
|1. Cross-correlation analysis|
|2. Exploratory data analysis|
|3. Using the database outside Excel|
To illustrate our reasoning, Figure 4 compares between selected features of pesticide fate models and environmental risk indicators. Environmental indicators for pesticides are mainly devised to be used by farmers and extension advisers in each set of agro-environmental conditions that are usually specific to one country. Indicators generally have low data requirements, are easy to use, can be calculated quickly, are amenable to the non-expert, but they often suffer from a lack of scientific validation and the fact that the combination of the various processes is done on a subjective basis. Given the very significant efforts put into the evaluation and validation of pesticide fate models over the last 20 years and, at the same time, the opposite and complementary profiles of the two types of tools, an objective of the FAIRWAY project has thus been to improve the evaluation of some environmental indicators. That is why in this Monitoring & Indicators part of the research programme we focused on investigating statistical relationships (correlations) between pressure and state indicators.
Two main families of relations between Pressure and State indicators were thus explored using the database, (see also »Evaluating agri-drinking water quality indicators in three case studies):
- The relation between gross nitrogen budget (Pressure) indicators (e.g., Nitrogen surplus ) and water quality (State) indicators (e.g., Nitrate concentration in groundwater), and
- The relation between pesticide application rate (Pressure) indicators (e.g., the estimated total volume of atrazine applied each year) and concentration of pesticide (State) indicators (e.g., the concentration of atrazine and DEA in spring water).
Note that simpler Pressure indicators may be considered when it is impossible to compute the Nitrogen surplus indicator, for instance: Mineral fertilisation or Main crop type ideally combined with an N surplus (or/and crop fertilisation) indicator. However, these simple indicators remain overall less effective than the compound N surplus indicator (see »Lag time estimation using other indicators). Similarly, a pressure indicator not directly quantifying application rate (e.g., N surplus or Mineral fertilisation) may be tried as a proxy for pesticide application rate when no such indicator is directly available (assuming a positive relation between pesticide application rate and N surplus). It may not produce significant correlations in several cases, but it is worth a try still.
1. Cross-correlation analysis
The exploration of statistical relationships between pairs of Pressure and State time series aims to provide answers to the applied question “How long do we have to wait to see the impacts of changes in agricultural practices?” This delay was investigated through cross-correlation analyses of the aggregated data series (by year, for the whole catchment) in the case studies where data were available. These analyses produced “lag time” quantitative information.
1.1 Lag time as a Link indicator
The “lag time” is an estimation of the average delay in the response of an “output” time series to the pulses (fluctuations) of an “input” time series. Technically, it corresponds to the delay that (when cancelled in the output time series by subtracting it to its time values) maximises the correlation between the two thus “aligned” time series. From an applied perspective, the lag time provides an efficient means to estimate the travel time of contaminants in the groundwater system up to the sampling points. For instance, a pair of related indicators that were studied using cross-correlation analysis consisted of Nitrogen surplus as the input signal and nitrate concentration in groundwater as the output signal.
That work on lag time estimation and interpretation for three FAIRWAY case studies (Island Tunø, Aalborg and La Voulzie), was published in a scientific paper (Kim et al., 2020) and is summarised in »Evaluating agri-drinking water quality indicators in three case studies. In that work, the computed lag times were also compared to groundwater ages estimated using environmental tracers, which revealed that calculated lag times are generally shorter than tracer-based groundwater age estimates (Kim et al., 2020).
Those lag-time and groundwater-age data sets were added to the database as “Link” indicators when available.
1.2 Lag time related to nitrate contamination of groundwater
Figure 5 shows an example of graphical results from a cross-correlation analysis of pressure and state indicators related to nitrate contamination of groundwater. The top graph in the figure shows how the correlation coefficient r varies as a function of the considered lag (delay removed from the “NO3 conc.” time series). The dotted blue line indicates the r threshold below which the r values are not statistically significant. This graph thus informs that the (peak) correlation (r = 0.79) found at a lag = +19 years is significant. However, the found lag cannot be deemed precise since several r values are very close to the peak r value. Nonetheless, the bottom graph confirms that the two-time series show a similar variation over time once the delay of the response of the “NO3 conc.” series is removed. In this particular case, it will be interesting to repeat this cross correlation analysis in 5–15 years to see if the NO3 concentrations continue to decline, as to suggest in this present-day finding.
Figure 6 shows another example of cross correlation analysis of NO3 concentration and N surplus. In this case, the r threshold (below which r values are not statistically significant) is very high, there is only one r-value > 0.5, and it is not significant. This is because the two times series are very short in duration and do not share enough common features. This example highlights that it is often impossible to assess the lag time when the pressure and/or state time series are too short compared to the time interval it would take to fully capture the transient feature we wish to study in the two signals.
1.3 Lag time related to pesticide contamination of groundwater
Figure 7 shows an example of results from a cross correlation analysis of pressure and state indicators related to atrazine (pesticide) contamination of groundwater. The top graph shows that a (peak) correlation (r = 0.94) found at a lag = +22 years is significant. However, as for the nitrate related analysis, the lag identified cannot be deemed precise since the r values around it are very close to the peak value. Nonetheless, the bottom graph shows that the two series fit quite well visually once the delay of the response of the state series is removed. Furthermore, this particular analysis reveals the importance of having longer time series of both pressure and state indicators. Indeed, the seemingly satisfying fitting of the two series after the lag of 22 years was removed may not be the best answer in this case. Maybe a lag time of 15 years, 9 years, or even less would be a closer estimate of the actual “age” of the atrazine contaminant since it was introduced in the groundwater system. Having longer and more continuous and detailed time series would likely help narrow down the uncertainty in the lag time estimate.
Overall, the above interpretations of three examples of cross-correlation analyses underscore the importance of carefully examining the detailed results of such analysis, especially when the length or quality of the time series is limited.
1.4 Lag time estimation using other indicators
In addition, to the indicators presented above, some other indicators could be tested (as previously presented in Hansen et al, 2021). In the La Voulzie, FR case study, lag times were also estimated using the main crop area (i.e. wheat in that case) as the pressure indicator. Thus, the correlation coefficients were smaller than those obtained with other N pressure indicators (N budget and Mineral fertilisation). For example, the estimated lag time was 0 year at the main spring with a weak (r = 0.54) yet statistically significant correlation (Table 5). Using soil occupation directly as an indicator was shown insufficient but the connection between crop area and the presence of specific herbicides in surface waters is documented (e.g Brown et al, 2007).
Maybe that no correlation was found in La Voulzie because the wheat-area time series was available from 1994-2014, whereas the mineral fertilizer input and potential N leaching were available from the 1950s to 2018.
In La Voulzie statistically significant correlations were also found between mineral fertilisation and yearly average concentrations of nitrate in the springs. The estimated lag times at the top, main, and bottom springs were 7, 14 and 14 years, respectively. The lag times estimated using the potential N leaching were similar to those obtained with mineral fertilisation (Table 5). The mineral fertilisation seems to be an appropriate pressure indicator at the catchment scale if a lag time is to be calculated. Nevertheless, the mineral fertilisation indicator can only be applied in areas where no organic fertilizers are applied, that is to say, regions with no or negligible animal breeding.
Table 5: Estimated lag times (in years) using main crop surface or mineral fertilisation as pressure indicator in La Voulzie, FR case study (From Hansen et al., 2021)
|Monitoring points||Lag time with main crop surface as pressure indicator (correlation coefficient)||Lag time with mineral fertilisation as pressure indicator (correlation coefficient)|
|Top spring||4(0.65)**||7( 0.95)**|
*Statistically significant at p<0.05; **p<0.005; † Outer protection zone
Statistically significant correlations were also found between fertilisation and the yearly average concentrations of nitrate in Denmark and Slovenia case studies. In Tunø Island, DK, the correlation analyses yielded significant results at 15 points out of 24 monitoring points. At the 15 points with significant results, the tested pressure indicator (mineral fertilisation) showed strong correlations with the state indicator (yearly average nitrate concentrations). At the other 9 points, either the time series were too short and/or sparse, or the concentrations were relatively invariant at low concentrations ranges (10 mg/L) near the limit of quantification (LQ); why those other results were not significant nor reliable.
In Dravsko Polje, SI case study, the correlation analyses yielded significant results between the amount of fertilizers applied on the catchment (in tons of nitrogen) and the nitrate concentrations (i.e. the average concentration of nitrate in groundwater wells in the catchment). The best correlation was found for a lag time of 4 years. This lag time could seem short considering that the depth to the water table could reach 15 m in the northern part of the study area. But the Slovenian case study leader (University of Ljubljana) estimated that this lag time could be correctly calculated. In the upper terrace, the proportion of rock/gravel/pebbles is 40%. The proportion go higher with depth. On lower terrace, soils contain a high share of sand. Lag times have also connection to drought and precipitation periods (dry and wet years) that impacts nitrate leaching trough soil profile.
In the Anglian Region, UK case study, no lag times were calculated. Since the catchment is surface, the transfer times seem to be annual, and a relationship was found between the month of year and concentration. More closely, the relationship appears between monthly rainfall and nitrate concentration.
Cooper et al, 2020 proposed an explanation to this phenomenon. During winter, the monitoring point is characterised by high nitrate concentrations resulting from a precipitation-induced leaching through agricultural field (especially in exposed arable land). Conversely, during the summer, the monitoring point is characterised by low nitrate concentrations due to low precipitation rates and also because arable crop growth absorbs excess nitrogen in the soil.
2. Exploratory data analysis
More generally, the numerous time series and time-independent data of the database can be explored and analysed in various ways.
At the simplest level, the data can be explored individually, visually, by opening the Excel database and scrolling in one of its datasheets. However, since there are thousands of data rows in the case study datasheets, it is much more efficient and appropriate to explore the data sets by the parameter(s). One easy way to carry this out within Excel is to use the AutoFilter on the “…parameter name” fields to select the indicators you wish to extract. Once the filters are properly set, you can copy the visible dataset to the clipboard, and then paste it to another Excel workbook. You may then create graphs from this specific dataset. Suppose you want to prepare a graph displaying several time series. In that case, you may first create a graph with a first time series, and then use the Excel Paste Special options to add more XY series iteratively to the graph as you select each of the other indicators (X = “Date” and Y = “Results”) you want to display in the graph. You can also create histograms, box plots, and many other types of graphs, within an Excel environment. In addition, you may use selections and formulas to calculate various statistics (mean, median, min, max, counts, etc.) for the indicators of interest. Alternatively, all sorts of graphs can be generated using scripting languages (see »Using the database outside Excel).
Figure 8 provides an example of a stacked area chart that was prepared from selecting of four indicators related to Nitrogen inputs in La Voulzie. This specific graph provides an efficient means to assess the relative contributions of the “input” components of the N-surplus compound indicator.
Figure 9 provides an example graph where six indicators (time series) from the La Voulzie are displayed side by side. Note that the “Recharge” indicator was calculated from the rainfall and potential evapotranspiration (“PET”) indicators. This type of graph is very useful when exploring the datasets. It helps to detect similarities between the different signals. For instance, in this graph we see notably that the fluctuations in spring discharge are very similar to that of the hydraulic head and that peaks of recharge propagate in the spring discharge and hydraulic head signals after a delay of a few months. Complementary cross-correlation analyses (not shown here) allowed us to assess this hydraulic lag time more precisely: the spring appears to have an average delay of response of about 5 months.
3. Using the database outside Excel
Although it is possible to use this database within the Excel software environment to calculate statistics and produce various graphs, it may be more efficient to use other data analysis tools to exploit this database. In particular, scripting languages such as R or Python may be used (R: https://cran.r-project.org/; Python: https://www.python.org/). These programming languages and environments include hundreds of packages and functionalities that offer more flexibility in all steps of database exploitation, from data import and processing to data transformation, calculation of statistics and graphical representation. Here are two operations that may be applied to facilitate external use of the database:
- Creation of a single large table containing the data from all case studies. As mentioned earlier, it remains possible to merge all data sheets into one large table, thanks to the uniform scheme used for the columns in all case study datasheets. An easy way to carry this out is to append (copy & paste) all rows of data from every Excel database sheet into a new Excel sheet. The resulting large table, saved as an Excel file, can then be read directly (in Excel format) by software or scripts, while other software and scripts will prefer to read a delimited text file.
- Creation of a text file version of the Excel datasheet of interest. Indeed, it is often simpler and safer (more comprehensive) to read text files rather than Excel files within scripting environments . Any Excel sheet can be exported to a text file. The recommendation is to export as a tab-delimited text file, as this special character (tabulation) is not used in any field of the database.
Reading and exploiting the database using scripts is not straightforward, however. It requires the development of specific routines to read the data file(s) like
- process the flexible “Date” information;
- extract data for a given parameter, sub-parameter, sampling point (site) and spatial / time scale;
- aggregate an indicator at coarser time scales;
- perform various other calculations;
- generate graphs; etc.
Although we do not include such tools here, this document will help the reader interested in the programming of them. Here is a recommendation on how to read a case study datasheet, exported as a tab-delimited text file, efficiently and safely:
- Read the file using a function that creates a “data frame” from it with all columns treated as text (no automatic conversion of numeric values or dates).
- Alternatively, if the scripting language allows reading an Excel file with automatic conversion disabled, you may try to read a data sheet directly from the Excel file.
- Convert fields containing only numeric values to actual numeric values (e.g., “LQ_Value”, “Top_of_screen_Depth”, etc.)
- Determine the list of all different “indicators” (parameters / sub-parameters) in the imported dataset. Keep a copy of this list in memory.
- Process the dataset one indicator at a time:
- Select the data rows that correspond to that indicator.
- Process the “Date” information according to the “Date_Format”.
- Convert the “Results” text values to numeric values if “Data_Type” = ‘numerical’.
- Split the selected data rows by site and by time scale (if there are many) so that each group of rows describes only one parameter or sub-parameter, for only one site (and spatial scale) and one-time scale (in case of time-dependent data).
- Gather all of these groups of data rows in a named hierarchical list.
- Convert time-dependent data to time series objects if the scripting language allows it.
- Keep summary information on every indicator along with the data series, for instance:
..$ State indicator.Concentration of nitrate :List of 9
.. ..$ Parameter_Group : chr "State indicator" .. ..$ Parameter_Name : chr "Concentration of nitrate"
.. ..$ Subparameter_Name: chr ""
.. ..$ Spatial_Scale : chr "local, groundwater well"
.. ..$ Time_Scale : chr "instant"
.. ..$ Data_Type : chr "numerical"
.. ..$ Unit : chr "mg/L"
.. ..$ MANY_full_params : logi FALSE
.. ..$ Nb_Results : int 3700
- Program an easy-to-use function allowing its user to specify the indicator he wishes to get (and for which site and time scale if applicable) right before performing the data extraction.
- Use this function in a script to extract indicators for a specific purpose.
- Carry out further data processing on the extracted indicators (if needed).
- Carry out analyses on the extracted indicators.
Note: For full references to papers quoted in this article see