Journal of Hydrology

Volume 573, June 2019, Pages 501-515

Research papers
The comprehensive differential split-sample test: A stress-test for hydrological model robustness under climate variability

https://doi.org/10.1016/j.jhydrol.2019.03.054

Highlights

  • A systematic differential split-sample test is presented to evaluate robustness of hydrological model parameters under climate variability.

  • Choice of hydrological conditions for calibration data can be more important than data length.

  • In our case, calibrating to dry hydrological conditions proves to be most reliable for predicting arbitrary conditions.

Abstract

The choice of data periods for calibrating and evaluating conceptual hydrological models often seems ad hoc, with no objective guidance on choosing calibration periods that produce the most reliable predictions. We therefore propose to systematically investigate the effects of calibration and validation data choices on parameter identification and predictive performance. We demonstrate our analysis on the Deggendorf/Kollbach catchment in Bavaria, Germany, chosen for its long series of continuous hydrological and meteorological records. After classifying these data into three hydrological conditions (wet, dry and mixed) and combining them into periods of varied data length (2, 4, 8, 15, 20 and 25 years), we repeatedly calibrate a conceptual rainfall-runoff hydrological model, the Hydrologiska Byråns Vattenbalansavdelning (HBV) model, to these distinct data sets via Bayesian updating in a Monte Carlo setting. Then, we analyze predictive performance and posterior parameter statistics in various validation periods of distinct hydrological condition and time-series length. We call this the Comprehensive Differential Split-Sample Test (CDSST). Our results suggest that the hydrological conditions in calibration tend to have a stronger impact than time-series length, and that calibrating on dry conditions might be a robust choice when aiming to predict arbitrary future conditions (wet, dry or mixed). Furthermore, we found that posterior parameter estimates converged to a common optimum range with increasing data size under all investigated calibration scenarios, indicating that the compensation of model structural errors by parameter fitting is independent of the chosen calibration condition. However, calibrating on time series of eight years or longer led to overconfident predictions that failed to reliably envelop future data. While these findings are specific to our case study, we recommend using the CDSST to stress-test conceptual hydrological models in order to identify robust model parameters and/or deficiencies in the model structure. In general, we expect our proposed approach to be a valuable basis for model error diagnosis in any type of dynamic environmental system model, because it answers the following three questions: (1) What is the importance of physical processes not explicitly covered by the model? (2) How much overconfidence is present in the model? (3) What are case-specific recommendations for appropriate calibration and validation setups?

Introduction

Hydrologists have been investigating calibration (Bárdossy and Singh, 2008, Monsalve, 2009, Kavetski and Fenicia, 2011, Sorooshian et al., 1983) and uncertainty assessment strategies (Li et al., 2010, Zhang et al., 2011, Yen et al., 2014, Zhang et al., 2016) for lumped conceptual rainfall-runoff (CRR) models in order to optimally inform modeling decisions. Specifically, the effects of data length on CRR model calibration and validation have been studied by numerous hydrologists, whose findings and suggestions do not converge to a common data length to be used for calibration. For instance, Yapo et al. (1996) found that approximately eight years of data are required to obtain calibrations relatively insensitive to the period selected, and that the reduction of parameter uncertainty is greatest when the wettest data periods on record are used. Li et al. (2010) also found that eight years of data are sufficient to obtain steady estimates of model performance and parameters for a CRR model, and that longer calibration data series do not necessarily result in better model performance. Anctil et al. (2004) showed that artificial neural network models continue to improve with datasets longer than nine years, in contrast to a four-parameter CRR model, based on validation against the same 7-year test set. Perrin et al. (2007) found that a dataset of 350 days that includes both dry and wet conditions is sufficient to obtain robust estimates of CRR model parameters. Boughton (2007) concluded that a dataset longer than six years would not further improve CRR model performance, and that modeling results depended more on the specific data set used for calibration than on the specific model used. Razavi and Tolson (2013) stated that calibration to short data periods leads to a range of performances from poor to very good, depending on the representativeness of the short data period, which is typically unknown a priori. A consistent finding is that CRR models represent humid and semi-humid catchments better than arid catchments, regardless of the calibration data length (Li et al., 2010). Altogether, these studies lead to mixed findings, making it difficult to propose general guidelines for choosing an adequate data length (given that the data exist) for CRR model calibration.

Calibration data choices may be made with respect to data length and data type (e.g., specifically considering wet or dry records). For any such choice, it is important to recognize its implications for model error compensation. We briefly discuss the aspects of data length, data type and error compensation in the following sections and define the corresponding major research questions addressed by our proposed framework.

With regard to data set length, the literature suggests that a certain number of years might be best for calibration (e.g., Sorooshian et al., 1983, Anctil et al., 2004, Xia et al., 2004, Perrin et al., 2007). Less attention has been paid to the choice of data type, i.e., the choice of wet, dry or average conditions for calibration. Some studies found that the most informative hydrological data might correspond to years with greater-than-average precipitation (e.g., Yapo et al., 1996, Gan et al., 1997). Numerous hydrologists (e.g., Refsgaard and Knudsen, 1996, Hartmann and Bárdossy, 2005, Vaze et al., 2010, Seibert et al., 2003, Fowler et al., 2016, Dakhlaoui et al., 2017) have evaluated the performance of various CRR models and their response to parameter transferability under changing hydrological conditions. Most of these studies used a variant of the differential split-sample test (DSST) originally proposed by Klemeš (1986). The DSST is a standard method to examine parameter dependency on climate variability and the associated effects on hydrological model efficiency. With this technique, calibration and validation periods are selected based on a climatic classification; calibrations on wet and dry periods are then validated under either the same or the opposite hydrological conditions (see the sketch after this paragraph). One common finding is that parameter estimates obtained in wet calibration periods perform poorly in dry validation periods. In contrast, Wu and Johnston (2007) tested the Soil and Water Assessment Tool (SWAT) model using datasets with diverse climate behavior. They showed that parameter ensembles calibrated on drought periods performed better in validation than parameter sets calibrated on average/wet periods. Apparently, the parameters that govern evapotranspiration are more identifiable in dry periods, where evapotranspiration is a more dominant process. Seiller et al. (2012) tested the performance of 20 lumped CRR models under four different climate conditions using five non-continuous hydrological years. Their results showed that the transferability of CRR model parameters between contrasting climatic conditions is generally low. Nevertheless, the tested models showed better predictive reliability (the data fall within the predicted intervals with appropriate probability) for the respective opposite validation conditions when calibrating on dry periods than when calibrating on wet periods. Ruelland et al. (2015) found that wet calibration evaluated in dry validation periods did not generate larger errors than using a dry calibration period.
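To make the classification step concrete, the following Python sketch shows one plausible way to label hydrological years and enumerate the four classic DSST calibration/validation combinations. The median-based wet/dry criterion, the function names and the data layout are our own illustrative assumptions, not the procedure of any of the cited studies:

```python
import pandas as pd
from itertools import product

def classify_years(annual_precip: pd.Series) -> pd.Series:
    """Label each hydrological year 'wet' or 'dry' relative to the
    long-term median annual precipitation (illustrative criterion only)."""
    median = annual_precip.median()
    return annual_precip.apply(lambda p: "wet" if p > median else "dry")

def dsst_combinations(labels: pd.Series) -> list:
    """Enumerate the four classic DSST calibration/validation pairings:
    wet->wet, wet->dry, dry->wet, dry->dry."""
    years = {c: list(labels.index[labels == c]) for c in ("wet", "dry")}
    return [
        {"calibrate_on": (c, years[c]), "validate_on": (v, years[v])}
        for c, v in product(("wet", "dry"), repeat=2)
    ]

# Example: annual precipitation sums (mm) indexed by hydrological year
precip = pd.Series({1960: 820, 1961: 640, 1962: 910, 1963: 700})
print(dsst_combinations(classify_years(precip)))
```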

Collectively, there is a lack of consensus and clear guidance regarding optimal calibration and model evaluation strategies for specific modeling purposes. With the exception of a few studies (Vaze et al., 2010, Seiller et al., 2012, Troin et al., 2017), previous studies mainly focused on model calibration with less emphasis on validation. This leads us to a number of intriguing research questions. For example, since conceptual hydrological models apparently perform best in predicting wet conditions, does that mean we should calibrate and validate on wet conditions? How much data under what hydrological conditions is necessary to calibrate a CRR model so that it can make reliable predictions regardless of future hydrological conditions?

Note that if the goal is to calibrate a model for predicting a specific scenario (e.g., floods), we follow the general intuition that it should be calibrated on similar past scenarios (if available). However, the focus of our study is to calibrate a model that provides robust predictions under arbitrary hydrological conditions. Such “all-purpose” calibrated models can be useful, e.g., for quantifying the water balance of a catchment or for a general improvement of system understanding. Our own motivation, among others, is to investigate how well-suited a given model is, because this fosters insights into its specific shortcomings and model errors. For such all-purpose investigations, we find no clear recommendation on how to perform calibration and validation.

We will investigate these questions systematically by looking at the simultaneous impact of varying data set length and varying hydrological conditions on model performance in calibration and validation. As an important aspect of this analysis, we have to address the role of model errors, because any calibration data choice has implications for the treatment of model errors. Model errors (model structural deficits) are known to exist, since CRR models rely on simple process representations. Often, such model structural deficiencies are revealed only during certain hydrological conditions (Wagener et al., 2003). In calibration, these deficiencies are (partially) compensated by parameter fitting, so that parameter values are expected to vary with hydrological conditions and calibration strategies. Additionally, Abebe et al. (2010) found that parameter estimates change in time not just due to model structural errors but also because of unsteady catchment-scale processes (e.g., under climatic variability).

Hence, we aim to answer the following additional questions: How stable are parameter estimates under different classes of hydrological conditions? And from the perspective of error compensation, does it hurt or help to increase the data set length used in calibration?

Methodologically, we propose an upgrade to the DSST, which we call the Comprehensive Differential Split-Sample Test (CDSST). It serves to analyze the impact of both data-length choices and hydrological-condition choices on the predictive reliability and robustness of a model. By robust we mean that best estimates and predicted uncertainty intervals perform in validation as expected from their performance in calibration, even under climatic variability between calibration and validation.

In the CDSST, we perform CRR model calibration through Bayesian updating to account for uncertainty in model parameters and output. In this probabilistic setting, we propose to analyze posterior parameter distributions and predictive performance (accuracy and precision) as a function of dataset length and hydrological condition in calibration and validation as a way to “stress-test” CRR models. Accuracy means small residuals between the data and the model with best-fit parameter values, while precision means a low predictive variance. Our procedure can be understood as a Bayesian version of cross-validation; but instead of a common “leave-one-out” or “leave-one-block-out” analysis (the jackknife method; Efron and Gong, 1983), we systematically select data sets of different length and different hydrological condition. With this, we obtain more controlled, differential and interpretable answers than standard statistical methods would offer.
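As a minimal sketch of what Bayesian updating by likelihood weighting in a brute-force Monte Carlo setting can look like, consider the Python fragment below. The independent Gaussian error model, the `simulate` callback and all variable names are illustrative assumptions on our part, not the authors' implementation:

```python
import numpy as np

def likelihood_weights(param_sets, simulate, q_obs, sigma):
    """Brute-force Monte Carlo Bayesian updating: weight each prior
    parameter draw by its Gaussian likelihood against the observed
    discharge and normalize (importance sampling with the prior as
    proposal distribution)."""
    log_w = np.empty(len(param_sets))
    for i, theta in enumerate(param_sets):
        resid = q_obs - simulate(theta)      # model-data residuals
        log_w[i] = -0.5 * np.sum((resid / sigma) ** 2)
    log_w -= log_w.max()                     # guard against underflow
    w = np.exp(log_w)
    return w / w.sum()                       # normalized posterior weights

# Posterior statistics then follow as weighted moments, e.g.
# np.average(param_sets, axis=0, weights=w) for the posterior mean.
```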

We demonstrate our CDSST on the Deggendorf/Kollbach (DK) catchment in Bavaria, Germany, which is a well-monitored catchment with 55 years of continuous hydrological and meteorological data. We use the HBV model (Hydrologiska Byråns Vattenbalansavdelning; Bergström, 1992) to produce discharge predictions as a function of precipitation, evapotranspiration, air temperature, land cover and soil properties. We chose HBV because it is one of the most frequently used CRR models, with successful applications in environments ranging from Northern European to South American catchments (Bárdossy and Singh, 2008, Abebe et al., 2010, Dakhlaoui et al., 2012). Specifically, Seiller et al. (2012) showed that HBV is a good model choice for a nearby catchment in Southern Bavaria, Germany. Furthermore, results from Dakhlaoui et al. (2017) have shown, through a multi-model comparison, that HBV produces results similar to other CRR models (e.g., GR4J and IHACRES).

The available discharge data are classified into three hydrological conditions and combined into periods of varied data length. Then, we repeatedly calibrate the model to these distinct data sets via Bayesian updating in a brute-force Monte Carlo framework. We evaluate the resulting predictive performance and posterior parameter statistics in various validation periods, again classified according to hydrological condition and data length, as outlined in the sketch below. Our systematic analysis allows us to determine the sensitivity of model reliability to data choices in calibration under varying modeling purposes, reflected by the choice of data length and validation period.
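The combinatorial structure of this design can be outlined as a full factorial over calibration and validation scenarios. The driver below is a schematic Python sketch; `calibrate` and `validate` are hypothetical placeholders for the Bayesian updating and evaluation steps described above:

```python
from itertools import product

CONDITIONS = ("wet", "dry", "mixed")
LENGTHS_YR = (2, 4, 8, 15, 20, 25)   # period lengths used in the study design

def calibrate(condition: str, years: int) -> dict:
    """Placeholder for Bayesian updating on a calibration dataset of the
    given hydrological condition and length (cf. the BFMC sketch above)."""
    return {"condition": condition, "years": years}

def validate(posterior: dict, condition: str, years: int) -> float:
    """Placeholder for scoring posterior predictions on an independent
    validation period (e.g., NSE, or reliability of prediction intervals)."""
    return 0.0

# Full factorial of calibration x validation scenarios
scores = {
    (cc, cl, vc, vl): validate(calibrate(cc, cl), vc, vl)
    for cc, cl, vc, vl in product(CONDITIONS, LENGTHS_YR, CONDITIONS, LENGTHS_YR)
}
```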

While some of our findings may depend on our specific choice of study domain and model, we are convinced that the idea of systematically stress-testing via our suggested CDSST is useful for arbitrary catchments and hydrological models, and in fact for dynamic environmental system models in general. In order to produce clear recommendations with broader applicability (i.e., different CRR models and/or different catchments), further research would be required.

In short, the main contributions of our study are (1) proposing a systematic investigation of competing calibration strategies for a CRR model, classified according to hydrological condition and data length, (2) a sensitivity analysis of model performance (both predictive accuracy and precision) based on varied calibration strategies and validation scenarios, and (3) a sensitivity analysis of posterior parameter statistics based on alternative calibration strategies. These investigations allow us to draw conclusions about optimal calibration strategies for robust prediction of arbitrary future conditions. Finally, we provide (4) a discussion of how such a stress-test builds a solid basis for diagnosing model structural errors in dynamic environmental system models.

The article is organized as follows: Section 2 provides a description of the Deggendorf/Kollbach catchment; Section 3 presents the CRR model, the criteria used to select the calibration and validation data, and the suggested CDSST; Section 4 discusses the impacts of data choice on parameter and predictive uncertainty; and Section 5 draws conclusions about the benefit of the proposed CDSST in environmental modeling.

Section snippets

Study area

As part of the European Union's Water Framework Directive (WFD), watersheds in the state of Bavaria are currently monitored through 600 stations by the Environmental Agency of Bavaria (Bayerisches Landesamt für Umwelt BLU, 2016). From the many monitored catchments in Bavaria, we selected the Deggendorf/Kollbach catchment for this study because of its 55 years of continuous hydrological and meteorological data (Bayerisches Landesamt für Umwelt BLU, 2016).

The catchment is located in the

Model setup

For the reasons provided in the introduction, we use the HBV model (Hydrologiska Byråns Vattenbalansavdelning; Bergström, 1992) as the CRR model for our study. It can be classified as a semi-distributed CRR model, which uses sub-basins as primary hydrological units. Here, we implemented a lumped version of the model, assuming the following: 1) the land cover was dominated throughout the entire observation period by mixed forest (62%), natural vegetation (25%), and evergreen forest

Effects of data choices on model accuracy in calibration

We obtained parameter ensembles through repeated likelihood weighting with the BFMC method (Section 3.3) on 25 calibration datasets (Table 4). The calibration data sets differed in time-series length and hydrological condition (see Section 3.2). Then, we determined the NSE values of the MLE prediction for each calibration data scenario (i.e., not yet in the validation periods). Results are shown in Fig. 4, sorted by increasing data length and color-coded according to
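For reference, the Nash-Sutcliffe efficiency (NSE) used as the performance score here has the standard definition NSE = 1 - Σ(Qobs - Qsim)² / Σ(Qobs - mean(Qobs))². A plain implementation of that formula (variable names are ours):

```python
import numpy as np

def nse(q_obs: np.ndarray, q_sim: np.ndarray) -> float:
    """Nash-Sutcliffe efficiency: 1 minus the residual sum of squares over
    the variance of the observations; 1 is a perfect fit, 0 matches the
    mean of the observations, negative values are worse than the mean."""
    q_obs = np.asarray(q_obs, dtype=float)
    q_sim = np.asarray(q_sim, dtype=float)
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)
```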

Summary and conclusions

The choice of data periods for calibration and validation of hydrological models is often ad hoc, with no objective guidance on choosing periods for the most reliable calibration and predictions. Therefore, this study focuses on the effects of calibration and validation data choices on parameter identification and predictive performance. We have proposed a stress-test in the form of a Comprehensive Differential Split-Sample Test (CDSST) to reveal the impact of data choices on model quality, considering

Declaration of interests

None declared.

Acknowledgments

The authors would like to thank the German Research Foundation (DFG) for financial support of the project within the Research Training Group “Integrated Hydrosystem Modelling” (GRK 1829/2) at the University of Tübingen and within the Cluster of Excellence in Simulation Technology (EXC 310/1) at the University of Stuttgart. Additional funding is granted by the Collaborative Research Center SFB 1253 “CAMPOS – Catchments as Reactors”. We are grateful to Dario Del Giudice, Laurent Coron and Geoff

References (62)

  • H. Yen et al.

    A framework for propagation of uncertainty contributed by parameterization, input data, model structure, and calibration/validation data in watershed modeling

    Environ. Modell. Software

    (2014)
  • H. Zhang et al.

    Multi-period calibration of a semi-distributed hydrological model based on hydroclimatic clustering

    Adv. Water Resour.

    (2011)
  • J. Zhang et al.

    Assessment of parameter uncertainty in hydrological model using a Markov-Chain-Monte-Carlo-based multilevel-factorial-analysis method

    J. Hydrol.

    (2016)
  • A. Abebe et al.

    Managing uncertainty in hydrological models using complementary models

    Hydrol. Sci. J.

    (2003)
  • A. Bárdossy et al.

    Robust estimation of hydrological model parameters

    Hydrol. Earth Syst. Sci. Discuss.

    (2008)
  • Bayerisches Landesamt für Umwelt BLU (2016), National measuring network water level and discharge, retrieved from...
  • S. Bergström

The HBV Model: Its Structure and Applications, in: HBV Hydrological Rain-Runoff Model

    (1992)
  • G.E. Box et al.

    An analysis of transformations

    J. Roy. Stat. Soc. B Methodol.

    (1964)
  • Bundesanstalt für Geowissenschaften und Rohstoffe, 2016, Geoviewer: International Hydrogeological Map of Europe...
  • CGIAR CSI, 2016, CGIAR Consortium for Spatial Information (CGIAR-CSI). SRTM 90m digital elevation data,...
  • L. Coron et al.

    Crash testing hydrological models in contrasted climate conditions: an experiment on 216 Australian catchments

    Water Resour. Res.

    (2012)
  • H. Dakhlaoui et al.

    Evaluating the robustness of conceptual rainfall-runoff models under climate variability in northern Tunisia

    J. Hydrol.

    (2017)
  • D. Del Giudice et al.

    Improving uncertainty estimation in urban hydrological modeling by statistically describing bias

    Hydrol. Earth Syst. Sci.

    (2013)
  • D. Del Giudice et al.

    On the practical usefulness of least squares for assessing uncertainty in hydrologic and water quality predictions

    Environ. Modell. Software

    (2018)
  • B. Efron et al.

A leisurely look at the bootstrap, the jackknife, and cross-validation

    Am. Statistician

    (1983)
  • G. Evin et al.

    Comparison of joint versus postprocessor approaches for hydrological uncertainty estimation accounting for error autocorrelation and heteroscedasticity

    Water Resour. Res.

    (2014)
  • K.J.A. Fowler et al.

    Simulating runoff under changing climatic conditions: revisiting an apparent deficiency of conceptual rainfall runoff models

    Water Resour. Res.

    (2016)
  • Friedman, J., Hastie, T., Tibshirani, R., 2001. The elements of statistical learning. Vol. 1, Springer series in...
  • T.Y. Gan et al.

    Effects of model complexity and structure, data quality, and objective functions on hydrologic modeling

    J. Hydrol.

    (1997)
  • J. Götzinger et al.

    Generic error model for calibration and uncertainty estimation of hydrological models

    Water Resour. Res.

    (2007)