Research papers
The comprehensive differential split-sample test: A stress-test for hydrological model robustness under climate variability
Introduction
Hydrologists have been investigating calibration (Bárdossy and Singh, 2008, Monsalve, 2009, Kavetski and Fenicia, 2011, Sorooshian et al., 1983) and uncertainty assessment strategies (Li et al., 2010, Zhang et al., 2011, Yen et al., 2014, Zhang et al., 2016) for lumped conceptual rainfall-runoff (CRR) models in order to optimally inform modeling decisions. Specifically, the effects of data length on CRR model calibration and validation have been studied by numerous hydrologists, whose findings and suggestions do not converge to a common data length to be used for calibration purposes. For instance, Yapo et al. (1996) found that approximately eight years of data are required to obtain calibrations relatively insensitive to the period selected and that reduction of the parameter uncertainty is greatest when the wettest data periods on record are used. Li et al. (2010) also found that eight years of data are sufficient to obtain steady estimates of model performance and parameters for a CRR model and that longer calibration data series do not necessarily result in better model performance. Anctil et al. (2004) showed that artificial neural network models continue to improve with datasets longer than nine years, in contrast to a four-parameter CRR model, based on validation against the same 7-year test set. Perrin et al. (2007) found that having a dataset of 350 days that include both dry and wet conditions is sufficient to obtain robust estimates of CRR model parameters. Boughton (2007) concluded that a dataset longer than six years would not further improve CRR model performance and that modeling results were more dependent on the specific data set used for calibration than the specific model used. Razavi and Tolson (2013) stated that model calibration to short data periods leads to a range of performances from poor to very good depending on the representativeness of the short data period, which is typically unknown a priori.
A consistent finding is that CRR models represent humid and semi-humid catchments better than arid catchments regardless of the calibration data length (Li et al., 2010). Altogether, these studies lead to mixed findings, making it difficult to propose general guidelines for choosing an adequate data length (given the data exists) for CRR model calibration.
Calibration data choices may be made with respect to data length and data type (e.g., specifically considering wet or dry records). For any such choice, it is important to recognize its implications for model error compensation. We briefly discuss the aspects of data length, data type and error compensation in the following sections and define the corresponding major research questions addressed by our proposed framework.
With regard to data set length, literature suggests that a certain number of years might be best for calibration (e.g., Sorooshian et al., 1983, Anctil et al., 2004, Xia et al., 2004, Perrin et al., 2007). Less attention has been paid to the choice of data type, i.e., the choice of wet, dry or average conditions for calibration. Some studies found that the most informative hydrological data might correspond to years with greater than average precipitation (e.g., Yapo et al., 1996, Gan et al., 1997). Numerous hydrologists (e.g., Refsgaard and Knudsen, 1996, Hartmann and Bárdossy, 2005, Vaze et al., 2010, Seibert et al., 2003, Fowler et al., 2016, Dakhlaoui et al., 2017) have evaluated the performance of various CRR models and their response to parameter transferability under changing hydrological conditions. Most of these studies used a flavor of the differential split-sample test (DSST) originally proposed by Klemeš (1986). The DSST is a standard method to examine parameter dependency on climate variability and associated outcomes on hydrological model efficiency. With this technique, calibration and validation periods are selected based on a climatic classification. Then, calibrations on wet and dry periods are validated considering either the same or opposite hydrological conditions during validation. One common finding is that parameter estimates obtained in wet-calibration periods perform poorly in dry-validation periods. In contrast, Wu and Johnston (2007) tested the Soil and Water Assessment Tool (SWAT) model using datasets with diverse climate behavior. They showed that parameter ensembles calibrated on drought periods performed better in validation than parameter sets calibrated on average/wet periods. Apparently, the parameters that govern evapotranspiration processes are more identifiable in dry periods, where evapotranspiration is a more dominant process. Seiller et al. (2012) tested the performance of 20 lumped CRR models under four different climate conditions using five non-continuous hydrological years. Their results showed that the transferability of CRR model parameters between contrasted climatic conditions is generally low. Nevertheless, the tested models showed better predictive reliability (the data falls into the predicted intervals with appropriate probability) for the respective opposite validation conditions when calibrating on dry periods than when calibrating on wet periods. Ruelland et al. (2015) found that wet calibration evaluated in dry validation periods did not generate larger errors than when using a dry calibration period.
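The DSST logic described above (classify periods climatically, then validate each calibration on the same and on the opposite condition) can be sketched in a few lines. This is only an illustration under stated assumptions: the tercile thresholds on annual precipitation, the alternating split of years, the function names `classify_years` and `dsst_experiments`, and the precipitation values are all hypothetical and are not taken from the studies cited here.

```python
from itertools import product


def classify_years(annual_precip):
    """Label each year 'dry', 'average' or 'wet' using precipitation terciles.

    Tercile thresholds are an illustrative assumption, not the article's rule.
    """
    values = sorted(annual_precip.values())
    n = len(values)
    lower = values[int((n - 1) / 3)]      # upper bound of the dry class
    upper = values[int(2 * (n - 1) / 3)]  # lower bound of the wet class
    labels = {}
    for year, precip in annual_precip.items():
        if precip <= lower:
            labels[year] = "dry"
        elif precip >= upper:
            labels[year] = "wet"
        else:
            labels[year] = "average"
    return labels


def dsst_experiments(labels, classes=("dry", "wet")):
    """Pair every calibration class with every validation class.

    Years of each class are split alternately so that calibration and
    validation never share a year, even when both use the same class.
    """
    pools = {}
    for c in classes:
        years = sorted(y for y, label in labels.items() if label == c)
        pools[c] = {"cal": years[0::2], "val": years[1::2]}
    return [
        {
            "calibrate_on": cal,
            "validate_on": val,
            "cal_years": pools[cal]["cal"],
            "val_years": pools[val]["val"],
        }
        for cal, val in product(classes, repeat=2)
    ]


# Hypothetical annual precipitation totals in mm (made-up numbers).
annual_precip = {
    2001: 620, 2002: 910, 2003: 540, 2004: 780, 2005: 1010, 2006: 700,
    2007: 480, 2008: 860, 2009: 750, 2010: 590, 2011: 950, 2012: 730,
}
labels = classify_years(annual_precip)
experiments = dsst_experiments(labels)
for experiment in experiments:
    print(experiment)
```

The four resulting experiments cover dry-to-dry, dry-to-wet, wet-to-dry and wet-to-wet transfers, which is the comparison on which the findings quoted above rest.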
Collectively, there is a lack of consensus and clear guidance regarding optimal calibration and model evaluation strategies for specific modeling purposes. With the exception of a few studies (Vaze et al., 2010, Seiller et al., 2012, Troin et al., 2017), previous studies mainly focused on model calibration with less emphasis on validation. This leads us to a number of intriguing research questions. For example, since conceptual hydrological models apparently perform best in predicting wet conditions, does that mean we should calibrate and validate on wet conditions? How much data under what hydrological conditions is necessary to calibrate a CRR model so that it can make reliable predictions regardless of future hydrological conditions?
Note that if the goal is to calibrate a model for predicting a specific scenario (e.g., floods), we follow the general intuition that it should be calibrated on similar past scenarios (if available). However, the focus of our study is to calibrate a model that provides robust predictions under arbitrary hydrological conditions. Such “all-purpose” calibrated models can be useful, e.g., when quantifying the water balance of a catchment or for a general improvement of system understanding. Our own motivation, among others, is to investigate how fit a given model is, because this fosters insights about its specific shortcomings and model errors. For such all-purpose investigations, we find that there is no clear recommendation on how to perform calibration and validation under these premises.
We will investigate these questions systematically by looking at the simultaneous impact of varying data set length and varying hydrological condition on model performance in calibration and validation. As an important aspect of this analysis, however, we have to address the role of model errors, because any calibration data choice has implications for the treatment of model errors. Model errors (model structural deficits) are known to exist since CRR models rely on simple process representations. Often, such model structural deficiencies are revealed only during certain hydrological conditions (Wagener et al., 2003). In calibration, these deficiencies are (partially) compensated by parameter fitting, so that parameter values are expected to vary with hydrological conditions and calibration strategies. Additionally, Abebe et al. (2010) found that parameter estimates change in time not just due to model structural errors but also because of unsteady catchment-scale processes (e.g., under climatic variability).
Hence, we aim to answer the following additional questions: How stable are parameter estimates under different classes of hydrological conditions? And from the perspective of error compensation, does it hurt or help to increase the data set length used in calibration?
Methodologically, we propose an upgrade to the DSST, which we call the Comprehensive Differential Split-Sample Test (CDSST). It serves to analyze the impact of both data length choices and hydrological condition choices on the predictive reliability and robustness of a model. By robust, we mean that best estimates and predicted uncertainty intervals perform in validation as expected from their performance in calibration, even under climatic variability between calibration and validation.
In the CDSST, we perform CRR model calibration through Bayesian updating to account for uncertainty in model parameters and output. In this probabilistic setting, we propose to analyze posterior parameter distributions and predictive performance (accuracy and precision) as a function of dataset length and hydrological condition in calibration and validation as a way to “stress-test” CRR models. Accuracy means small residuals between the data and the model with best-fit parameter values, while precision means a low predictive variance. Our procedure can be understood as a Bayesian version of cross-validation; but instead of a common “leave-one-out” or “leave-one-block-out” analysis (referred to as jackknife method, Efron and Gong (1983)), we systematically select data sets of different length and different hydrological condition. With this, we obtain more controlled, differential and interpretable answers than the answers that standard statistical methods would offer.
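The two performance notions defined above can be made concrete with a small sketch. It assumes that a weighted ensemble of simulated discharge series is already available, takes the highest-weight member as the best-fit run, scores accuracy with the Nash-Sutcliffe efficiency (NSE, the metric the article later names for the MLE prediction), and scores precision as the time-averaged weighted ensemble variance. The function names and all numbers are hypothetical.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, lower is worse."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread


def predictive_variance(ensemble, weights):
    """Time-averaged weighted variance across ensemble members."""
    total = sum(weights)
    w = [x / total for x in weights]  # normalise the weights to sum to one
    n_steps = len(ensemble[0])
    variance_sum = 0.0
    for t in range(n_steps):
        mean_t = sum(wi * sim[t] for wi, sim in zip(w, ensemble))
        variance_sum += sum(wi * (sim[t] - mean_t) ** 2 for wi, sim in zip(w, ensemble))
    return variance_sum / n_steps


# Made-up observations and a two-member ensemble with posterior weights.
observed = [1.0, 2.0, 3.0, 4.0]
ensemble = [
    [1.5, 2.5, 3.5, 4.5],
    [2.5, 3.5, 4.5, 5.5],
]
weights = [0.7, 0.3]

# Accuracy: residuals of the best-fit (highest-weight) member.
best_index = max(range(len(weights)), key=lambda i: weights[i])
accuracy = nse(observed, ensemble[best_index])

# Precision: a low predictive variance means a precise prediction.
precision = predictive_variance(ensemble, weights)
print(accuracy, precision)
```

Keeping the two scores separate matters for the stress-test idea: a calibration can look accurate at its best-fit parameters while its predictive variance is too small or too large to cover the validation data.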
We demonstrate our CDSST on the Deggendorf/Kollbach (DK) catchment in Bavaria, Germany, which is a well-monitored catchment with 55 years of continuous hydrological and meteorological data. We use the HBV model (Hydrologiska Byråns Vattenbalansavdelning, Bergström (1992)) to produce discharge predictions as a function of precipitation, evapotranspiration, air temperature, land cover and soil properties. We chose HBV because it is one of the most frequently used CRR models with successful applications in different environments ranging from Northern European to South American catchments (Bárdossy and Singh, 2008, Abebe et al., 2010, Dakhlaoui et al., 2012). Specifically, Seiller et al. (2012) showed that HBV is a good model choice for a close-by catchment in Southern Bavaria, Germany. Furthermore, results from Dakhlaoui et al. (2017) have shown, through a multi-model comparison, that HBV produces similar results to other CRR models (e.g., GR4j and IHACRES).
The available discharge data are classified into three hydrological conditions and combined into periods of varied data length. Then, we repeatedly calibrate the model to these distinct data sets via Bayesian updating in a brute-force Monte Carlo framework. We evaluate the resulting predictive performance and posterior parameter statistics in various validation periods, again classified according to hydrological condition and data length. Our systematic analysis allows us to determine the sensitivity of model reliability to data choices in calibration under varying modeling purposes, reflected by the choice of data length and validation period.
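As a rough illustration of Bayesian updating by brute-force Monte Carlo, the sketch below draws parameter samples from a prior, runs a model for each sample, and weights each sample by its likelihood. Everything specific here is an assumption made for the example: the one-parameter runoff-coefficient model stands in for HBV, the prior is uniform on [0, 1], the error model is Gaussian with a known standard deviation `sigma`, and the data are synthetic and noise-free for reproducibility.

```python
import math
import random


def toy_model(k, precip):
    """Stand-in model: discharge is a fixed fraction k of precipitation."""
    return [k * p for p in precip]


def bfmc_update(precip, observed, n_samples=5000, sigma=0.5, seed=0):
    """Return prior samples of k and their normalised likelihood weights."""
    rng = random.Random(seed)
    samples = [rng.uniform(0.0, 1.0) for _ in range(n_samples)]
    log_like = []
    for k in samples:
        simulated = toy_model(k, precip)
        sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
        log_like.append(-0.5 * sse / sigma ** 2)  # Gaussian log-likelihood up to a constant
    max_ll = max(log_like)  # subtract the maximum to avoid numerical underflow
    weights = [math.exp(ll - max_ll) for ll in log_like]
    total = sum(weights)
    return samples, [w / total for w in weights]


def posterior_mean_sd(samples, weights):
    """Weighted mean and standard deviation of the parameter samples."""
    mean = sum(w * k for w, k in zip(weights, samples))
    var = sum(w * (k - mean) ** 2 for w, k in zip(weights, samples))
    return mean, math.sqrt(var)


# Synthetic data generated with a "true" runoff coefficient of 0.4.
precip = [5.0, 0.0, 12.0, 3.0, 8.0, 0.0, 20.0, 6.0]
observed = toy_model(0.4, precip)

# Calibrate on a short record and on the full record.
samples, w_short = bfmc_update(precip[:3], observed[:3])
_, w_long = bfmc_update(precip, observed)
mean_short, sd_short = posterior_mean_sd(samples, w_short)
mean_long, sd_long = posterior_mean_sd(samples, w_long)
print(mean_short, sd_short)
print(mean_long, sd_long)
```

In this idealized setting, with no model structural error, the longer record simply narrows the posterior. Whether that still holds when the model is structurally deficient is the kind of question the repeated calibrations described above are meant to probe.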
While some of our findings may depend on our specific choice of study domain and model, we are convinced that the idea of systematically stress-testing via our suggested CDSST is useful for arbitrary catchments and hydrological models, and in fact for dynamic environmental system models in general. In order to produce clear recommendations with broader applicability (i.e., different CRR models and/or different catchments), further research would be required.
In short, the main contributions of our study are (1) proposing a systematic investigation of competing calibration strategies for a CRR model, classified according to hydrological condition and data length, (2) a sensitivity analysis of model performance (both predictive accuracy and precision) based on varied calibration strategies and validation scenarios, and (3) a sensitivity analysis of posterior parameter statistics based on alternative calibration strategies. These investigations allow us to draw conclusions about optimal calibration strategies for robust prediction of arbitrary future conditions. Finally, we provide (4) a discussion of how such a stress-test builds a solid basis for diagnosing model structural errors in dynamic environmental system models.
The article is organized as follows: Section 2 provides a description of the Deggendorf/Kollbach catchment; Section 3 presents the CRR model, the criteria used to select the calibration and validation data, and the suggested CDSST; Section 4 discusses the impacts of data choice on parameter and predictive uncertainty; and Section 5 draws conclusions about the benefit of the proposed CDSST in environmental modeling.
Section snippets
Study area
As part of the European Union's Water Framework Directive (WFD), watersheds in the state of Bavaria are currently being monitored through 600 stations by the Environmental Agency of Bavaria (Bayerisches Landesamt für Umwelt BLU, 2016). From the many monitored catchments in Bavaria, we selected the Deggendorf/Kollbach catchment for this study because of its 55 years of continuous hydrological and meteorological data (Bayerisches Landesamt für Umwelt BLU, 2016).
The catchment is located in the
Model setup
For the reasons provided in the introduction, we use the HBV model (Hydrologiska Byråns Vattenbalansavdelning, Bergström (1992)) as the CRR model for our study. It can be classified as a semi-distributed CRR model, which uses sub-basins as primary hydrological units. Here, we implemented a lumped version of the model assuming the following: 1) the land cover has been primarily dominated during the entire period of observation by mixed forest (62%), natural vegetation (25%), and evergreen forest
Effects of data choices on model accuracy in calibration
We obtained parameter ensembles through repeated likelihood weighting of the BFMC method (Section 3.3) on 25 calibration datasets (Table 4). The different calibration data sets were differentiated based on time-series length and hydrological condition (see Section 3.2). Then, we determined the NSE values of the MLE prediction for each calibration data scenario (i.e., not yet in the validation periods). Results are shown in Fig. 4, sorted by increasing data length and color-coded according to
Summary and conclusions
The choice of data periods for calibration and validation of hydrological models is often ad hoc, with no objective guidance on choosing periods for the most reliable calibration and predictions. Therefore, this study focuses on the effects of calibration and validation data choices on parameter identification and predictive performance. We have proposed a stress-test in the form of a Comprehensive Differential Split-Sample Test (CDSST) to reveal the impact of data choices on model quality, considering
Declaration of interests
None declared.
Acknowledgments
The authors would like to thank the German Research Foundation (DFG) for financial support of the project within the Research Training Group “Integrated Hydrosystem Modelling” (GRK 1829/2) at the University of Tübingen and within the Cluster of Excellence in Simulation Technology (EXC 310/1) at the University of Stuttgart. Additional funding is granted by the Collaborative Research Center SFB 1253 “CAMPOS – Catchments as Reactors”. We are grateful to Dario Del Giudice, Laurent Coron and Geoff
References (62)
- et al.
Sensitivity and uncertainty analysis of the conceptual HBV rainfall-runoff model: implications for parameter estimation
J. Hydrol.
(2010)
- et al.
Impact of the length of observed records on the performance of ANN and of conceptual parsimonious rainfall-runoff forecasting models
Environ. Modell. Software
(2004)
Effect of data length on rainfall-runoff modelling
Environ. Modell. Software
(2007)
- et al.
Toward a more efficient calibration schema for HBV rainfall-runoff model
J. Hydrol.
(2012)
- et al.
HydroTest: a web-based toolbox of evaluation metrics for the standardised assessment of hydrological forecasts
Environ. Modell. Software
(2007)
- et al.
Effect of calibration data series length on performance and optimal parameters of hydrological model
Water Sci. Eng.
(2010)
- et al.
Systematic evaluation of autoregressive error models as post-processors for a probabilistic streamflow forecast system
J. Hydrol.
(2011)
- et al.
Climate non-stationarity: validity of calibrated rainfall runoff models for use in climate change studies
J. Hydrol.
(2010)
- et al.
Hydrologic response to climatic variability in a Great Lakes watershed: a case study with the SWAT model
J. Hydrol.
(2007)
- et al.
Automatic calibration of conceptual rainfall-runoff models: sensitivity to calibration data
J. Hydrol.
(1996)
A framework for propagation of uncertainty contributed by parameterization, input data, model structure, and calibration/validation data in watershed modeling
Environ. Modell. Software
Multi-period calibration of a semi-distributed hydrological model based on hydroclimatic clustering
Adv. Water Resour.
Assessment of parameter uncertainty in hydrological model using a Markov-Chain-Monte-Carlo-based multilevel-factorial-analysis method
J. Hydrol.
Managing uncertainty in hydrological models using complementary models
Hydrol. Sci. J.
Robust estimation of hydrological model parameters
Hydrol. Earth Syst. Sci. Discuss.
The HBV MODEL its Structure and Applications, in HBV Hydrological Rain-Runoff Model
An analysis of transformations
J. Roy. Stat. Soc. B Methodol.
Crash testing hydrological models in contrasted climate conditions: an experiment on 216 Australian catchments
Water Resour. Res.
Evaluating the robustness of conceptual rainfall-runoff models under climate variability in northern Tunisia
J. Hydrol.
Improving uncertainty estimation in urban hydrological modeling by statistically describing bias
Hydrol. Earth Syst. Sci.
On the practical usefulness of least squares for assessing uncertainty in hydrologic and water quality predictions
Environ. Modell. Software
A leisurely look at the bootstrap, the jack-knife, and cross-validation
Am. Statistician
Comparison of joint versus postprocessor approaches for hydrological uncertainty estimation accounting for error autocorrelation and heteroscedasticity
Water Resour. Res.
Simulating runoff under changing climatic conditions: revisiting an apparent deficiency of conceptual rainfall runoff models
Water Resour. Res.
Effects of model complexity and structure, data quality, and objective functions on hydrologic modeling
J. Hydrol.
Generic error model for calibration and uncertainty estimation of hydrological models
Water Resour. Res.