With any predictive method, we would like to know how good it is. This is model evaluation, often called model validation1. This is in contrast with model calibration, i.e., when we build (fit) the model.
To evaluate the predictive accuracy of a model, we compare the model predictions against actual values of the same items. Ideally these items come from an independent dataset representative of the target population. An alternative is cross-validation, where some items of a single dataset are held out from model building and are then predicted by a model fit to the remaining items.
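To make the cross-validation idea concrete, here is a minimal sketch (the data frame d, response z, predictor x, and the choice of lm as the predictive model are all hypothetical, just for illustration):
set.seed(1)
d <- data.frame(x = runif(100))        # hypothetical predictor
d$z <- 2*d$x + rnorm(100, sd = 0.1)    # hypothetical response
k <- 5                                 # number of folds
folds <- sample(rep(1:k, length.out = nrow(d)))    # random fold assignment
pred <- numeric(nrow(d))
for (i in 1:k) {
   fit <- lm(z ~ x, data = d[folds != i, ])          # fit on the other folds
   pred[folds == i] <- predict(fit, d[folds == i, ]) # predict the held-out fold
}
# compare pred (predicted) against d$z (actual) with the statistics below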
Notation: actual values \(y_i\), predicted values \(\hat{y}_i\), for a set of \(n\) paired values (actual and predicted). Their means are \(\bar{y}\) and \(\bar{\hat{y}}\).
The evaluation set should be representative of the population about which we want to make statements.
The root mean squared error (RMSE) of the residuals (actual – predicted) measures how close, on average, the predictions are to reality. Ideally, this is zero.
\[\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^n (\widehat{y}_i - y_i)^2 \right]^{1/2}\]
The advantage of this measure is that it can be interpreted in the context of the model application. To judge its size, it may be compared to the known range, inter-quartile range, or standard deviation of the evaluation set; these comparisons show how large the prediction error is relative to the overall variability of the dataset.
A disadvantage is that it does not show whether the errors are spread evenly across the prediction range.
A related measure is similar to the \(R^2\) “coefficient of determination”, but is computed against the 1:1 line of actual vs. predicted. It measures how much of the variance in the actual values is explained by the predictions. It is the complement of the ratio of the residual sum-of-squares, i.e., what is left over after matching 1:1, to the total sum-of-squares, i.e., the total variability in the dataset:
\[ R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_i(y_i- \hat{y}_i)^2}{\sum_i( y_i - \bar{y})^2}\]
This measure is normalized for the variability of the evaluation set, and shows how much of this variability is explained by the 1:1 line based on the model predictions. Ideally this is 1, i.e., RSS = 0. Note that a poor model can give RSS \(>\) TSS, so that \(R^2 < 0\).
Bias, also known as the mean prediction error (MPE), compares the predicted vs. actual means of the evaluation dataset:
\[ \mathrm{MPE} = \frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y}_i ) \]
This shows whether the model systematically over- or under-predicts. Ideally, this is zero.
To put this in application context, it may be compared to the mean or median of the evaluation set.
The gain is the slope \(\beta_1\) of the linear relation between actual and predicted, fitted by ordinary least squares (OLS), assuming independent and identically-distributed residuals:
\[y = \beta_0 + \beta_1 \hat{y} + \epsilon; \;\;\epsilon \sim \mathcal{N}(0,\sigma^2)\]
Ideally this should be 1: a unit increase in the predicted value should correspond to a unit increase in the actual value.
If \(\beta_1 > 1\) the higher actual values are under-predicted and the lower actual values are over-predicted. If \(\beta_1 < 1\) the higher actual values are over-predicted and the lower actual values are under-predicted. It is fairly common for \(\beta_1 > 1\) in machine-learning models, which have difficulty with the highest and lowest values due to the resampling during bagging.
The \(R^2\) (coefficient of determination) of the fitted linear model is not an evaluation measure of the model! It only shows how much of the variability in the evaluation set can be explained by a linear model (not by the 1:1 relation between actual and predicted). Thus it tells us how well an adjustment equation is able to match the two sets; this equation could be used to adjust measurements made by one method to another.
Lin’s concordance correlation coefficient \(\rho_c\) is a measure of the deviation from the 1:1 line which includes all sources of deviation. It was developed to evaluate the reproducibility of test procedures that are supposed to give the same result, but it applies equally well to comparing actual vs. predicted values from any model. It is a correlation coefficient between the two datasets (actual and predicted), corrected for their different means (bias) and standard deviations (variability) 2.
\[\rho_c = \frac{2 \rho_{1,2} \sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2}\]
If the points are independent, use the sample estimates \(r_{1,2}, S_1, S_2, \overline{Y_1}, \overline{Y_2}\) in place of the population parameters in the formula.
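As a sketch, this sample-estimate version can be coded directly (the function name f.ccc is introduced here; since \(r_{1,2} S_1 S_2\) is the covariance, it is equivalent to the lin() function used below):
f.ccc <- function(y1, y2) {
   r12 <- cor(y1, y2); S1 <- sd(y1); S2 <- sd(y2)   # sample estimates
   (2 * r12 * S1 * S2) / (S1^2 + S2^2 + (mean(y1) - mean(y2))^2)
}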
To illustrate evaluation statistics, we simulate some actual measurements and model outcomes.
Here are the presumed actual values, just an evenly-spaced sequence of integers:
n <- 256 # number of evaluation items
y <- 1:n
head(y)
## [1] 1 2 3 4 5 6
In order to have the same amount of “noise” in each example, we define it here:
Note: so that the script will be reproducible, we use set.seed. Comment this out for repeated runs to see how much the statistics change with different random noise from the same distribution.
set.seed(316)
noise <- rnorm(n=n, mean=0, sd=12)
First, we simulate a model that predicts with no bias, i.e., \(\bar{y} = \bar{\hat{y}}\), and no gain, i.e., the slope of the linear relation of actual vs. predicted is 1.
yhat.1 <- y + noise
plot(y ~ yhat.1, asp=1, pch=20,
xlab="predicted", ylab="actual",
main="Model with noise only")
abline(0,1, lty=2); grid()
We see that the errors are randomly distributed around the 1:1 line; there is nothing to be done here except improve the model to remove sources of random error, for example, by improving the precision of measurements of the data values.
Second, a model with bias but the same noise:
bias <- 20
yhat.2 <- y + bias + noise
plot(y ~ yhat.2, asp=1, pch=20,
xlab="predicted", ylab="actual",
main="Model with noise and bias")
abline(0,1, lty=2); grid()
Notice the same linear relation and scatter about it, but now the model systematically over-predicts.
Third, a model with gain and the same noise, but no bias:
gain <- 0.8
yhat.3 <- y*gain + noise
yhat.3 <- yhat.3 - (mean(yhat.3) - mean(y)) # remove bias
plot(y ~ yhat.3, asp=1, pch=20,
xlab="predicted", ylab="actual",
main="Model with noise and gain")
abline(0,1, lty=2); grid()
Notice how the higher actual values are systematically under-predicted, and the lower actual values are systematically over-predicted. This pattern is obscured somewhat by the noise. The expected values at the centroid are not changed.
Fourth, a model with the same gain and noise, plus bias:
yhat.4 <- yhat.3 + bias
plot(y ~ yhat.4, asp=1, pch=20,
xlab="predicted", ylab="actual",
main="Model with noise, bias and gain")
abline(0,1, lty=2); grid()
Here we see all three sources of error in one plot. The gain is obvious, and the whole point cloud is also displaced by the bias.
Now let’s see how the evaluation statistics compare:
f.rmse <- function(x.a, x.p) {
sqrt(mean((x.a-x.p)^2))
}
f.rmse(y, yhat.1) # noise only
## [1] 11.55209
f.rmse(y, yhat.2) # bias and noise
## [1] 22.90268
f.rmse(y, yhat.3) # gain and noise
## [1] 18.24796
f.rmse(y, yhat.4) # gain, bias, and noise
## [1] 27.07375
Both gain and bias in the predictive model increase the RMSE.
These can be compared to various measures of spread in the actual dataset: (1) the range, (2) the inter-quartile range, and (3) the standard deviation:
range(y); f.rmse(y, yhat.1)/diff(range(y))
## [1] 1 256
## [1] 0.04530231
(iqr <- as.numeric(quantile(y, .75) - quantile(y, .25))); f.rmse(y, yhat.1)/iqr
## [1] 127.5
## [1] 0.09060462
sd(y); f.rmse(y, yhat.1)/sd(y)
## [1] 74.04503
## [1] 0.1560144
The RMSE of this model is about 5% of the total range, about 9% of the IQR, and about 16% of the standard deviation of the test set.
f.R2.1.1 <- function (x.a, x.p) {
1 - sum((x.a - x.p)^2)/sum(x.a^2)
}
f.R2.1.1(y, yhat.1) # noise only
## [1] 0.9939267
f.R2.1.1(y, yhat.2) # bias and noise
## [1] 0.9761288
f.R2.1.1(y, yhat.3) # gain and noise
## [1] 0.9848459
f.R2.1.1(y, yhat.4) # gain, bias, and noise
## [1] 0.9666422
This \(R^2\) decreases with gain and bias. The first model quantifies only the signal vs. noise about the 1:1 relation; the others also include bias and/or gain as deviations from the 1:1 line.
Compare these to the linear model \(R^2\):
summary(lm(y ~ yhat.1))$adj.r.squared # noise only
## [1] 0.9765305
summary(lm(y ~ yhat.2))$adj.r.squared # bias and noise
## [1] 0.9765305
summary(lm(y ~ yhat.3))$adj.r.squared # gain and noise
## [1] 0.9639537
summary(lm(y ~ yhat.4))$adj.r.squared # gain, bias, and noise
## [1] 0.9639537
Adding a bias does not change the proportion of variance explained, because this \(R^2\) is relative to the fitted line, not the 1:1 line. The models with gain have a slightly lower \(R^2\) because their predictions span a narrower range, so the same noise is relatively more important.
f.bias <- function(x.a, x.p) {
mean((x.a-x.p))
}
f.bias(y, yhat.1) # noise only
## [1] 0.2229483
f.bias(y, yhat.2) # bias and noise
## [1] -19.77705
f.bias(y, yhat.3) # gain and noise
## [1] 5.02235e-15
f.bias(y, yhat.4) # gain, bias, and noise
## [1] -20
This measures the systematic over- or under-prediction. There is almost no bias in the noise-only and gain + noise models; any apparent bias is due to random error. In the models with added bias we see severe over-prediction, i.e., the actual values are systematically less than the predicted, leading to a negative MPE.
gain <- function(x.a, x.p) {
m <- lm(x.a ~ x.p)
return(as.numeric(coef(m)[2]))
}
gain(y, yhat.1) # noise only
## [1] 0.9682616
gain(y, yhat.2) # bias and noise
## [1] 0.9682616
gain(y, yhat.3) # gain and noise
## [1] 1.19225
gain(y, yhat.4) # gain, bias, and noise
## [1] 1.19225
Notice that the bias does not change this measure: the models without gain have slope \(\approx 1\) (apart from random error), and the models with gain show the same slope whether or not bias is present.
lin <- function (x.a, x.p) {
return((2*cov(x.a,x.p))/
(var(x.a) + var(x.p) + (mean(x.a) - mean(x.p))^2))
}
lin(y, yhat.1) # noise only
## [1] 0.9880316
lin(y, yhat.2) # bias and noise
## [1] 0.9546782
lin(y, yhat.3) # gain and noise
## [1] 0.9636686
lin(y, yhat.4) # gain, bias, and noise
## [1] 0.923521
This is a good overall measure because it accounts for all kinds of lack of fit to the 1:1 line.
Taylor 3 developed a diagram that shows three measures of modelled vs. observed pattern: (1) correlation \(\rho\) between the two patterns, (2) centred RMSE \(E'\), (3) standard deviations \(\sigma_r\) (reference) and \(\sigma_t\) (test). These are related by the formula:
\[E'^2 = \sigma^2_t + \sigma^2_r - 2 \sigma_t \sigma_r \rho\]
It does not measure bias, since the RMSE is centred by subtracting the respective means. So it shows how well the two patterns match, allowing for a systematic shift, which can be measured by the bias (see above).
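As a quick check (the helper name f.crmse and the variables sigma.r, sigma.t and rho are introduced here), we can verify this relation on the first simulated model from above; the all.equal() comparison should return TRUE:
f.crmse <- function(x.a, x.p) {   # centred RMSE: remove each set's mean first
   sqrt(mean(((x.p - mean(x.p)) - (x.a - mean(x.a)))^2))
}
sigma.r <- sqrt(mean((y - mean(y))^2))           # reference sd (divide by n)
sigma.t <- sqrt(mean((yhat.1 - mean(yhat.1))^2)) # test sd (divide by n)
rho <- cor(y, yhat.1)
all.equal(f.crmse(y, yhat.1)^2, sigma.t^2 + sigma.r^2 - 2*sigma.t*sigma.r*rho)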
A Taylor diagram can be created by the taylor.diagram function of the plotrix package. Typically several models are compared on one diagram. Here we compare the four models:
require(plotrix)
## Loading required package: plotrix
taylor.diagram(y, yhat.1, col = "red",
pch = 1, pcex = 2,
ngamma = 8,
gamma.col = "green",
sd.arcs = TRUE,
ref.sd = TRUE)
taylor.diagram(y, yhat.2, col = "green", add = TRUE)
taylor.diagram(y, yhat.3, col = "blue", add = TRUE)
taylor.diagram(y, yhat.4, col = "purple", add = TRUE)
The correlation coefficient is shown by the angle (azimuth), from perfect (\(0^\circ\)) to none (\(90^\circ\)) – ideally this would be 1. Here all models have good correlation, the third and fourth models a bit less than the other two. Note that the models with and without bias (but no gain) are identical in this diagram.
Note that only non-negative correlations can be shown, which makes sense for two patterns that are supposed to represent the same thing.
The centred RMS error in the predicted pattern is proportional to the distance from the point on the x-axis (the standard deviation of the actual pattern) marked with an open circle. Various values of the RMS error are shown with circles centred at this point, coloured green in this example – ideally this would be zero, i.e., on the x-axis. Here the models with and without bias have identical centred RMSE, because the added noise is the same; the models with gain have a somewhat larger centred RMSE, because the gain also contributes to the centred error.
The standard deviation of the predicted pattern is proportional to the radial distance from the origin (black dotted contours) – ideally this would lie on the arc through the standard deviation of the reference set. Here the first two models show slightly more variability than the reference, because the noise adds to the spread, while the models with gain show less variability, because their predictions span a narrower range.
We can show some features of the Taylor diagram with some other models. First, increasing noise, leading to poorer correlation:
yhat.t1 <- y + noise
yhat.t2 <- y + noise*2
yhat.t3 <- y + noise*3
yhat.t4 <- y + noise*4
taylor.diagram(y, yhat.t1, col = "black",
ngamma = 8,
gamma.col = "green",
sd.arcs = TRUE,
ref.sd = TRUE)
taylor.diagram(y, yhat.t2, col = "red", add = TRUE)
taylor.diagram(y, yhat.t3, col = "blue", add = TRUE)
taylor.diagram(y, yhat.t4, col = "magenta", add = TRUE)
round(c(cor(y, yhat.t1), cor(y, yhat.t2), cor(y, yhat.t3), cor(y, yhat.t4)),2)
## [1] 0.99 0.96 0.91 0.86
round(c(sd(y), sd(yhat.t1), sd(yhat.t2), sd(yhat.t3), sd(yhat.t4)), 1)
## [1] 74.0 75.6 78.8 83.5 89.5
The centred RMSE increases with noise, and the correlation decreases (each observation may be perturbed more from its actual value).
The standard deviation of the predicted set increases with noise (more extreme values) and moves increasingly far from the actual standard deviation (the radial distance from the origin) – these models predict too much variability.
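As a check, the centred RMSE of these four models can be computed directly with the f.crmse() helper sketched above; it increases in proportion to the noise multiplier:
round(c(f.crmse(y, yhat.t1), f.crmse(y, yhat.t2),
        f.crmse(y, yhat.t3), f.crmse(y, yhat.t4)), 1)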
Here are four models, the first with no gain, the other three with increasing gain.
Note that for the Taylor diagram there is no need to remove the bias, since bias is not taken into account.
# yhat.t1 has no gain, same noise
yhat.g2 <- y*0.8 + noise
yhat.g2 <- yhat.g2 - (mean(yhat.g2) - mean(y)) # remove bias
yhat.g3 <- y*0.7 + noise
yhat.g3 <- yhat.g3 - (mean(yhat.g3) - mean(y)) # remove bias
yhat.g4 <- y*0.6 + noise
yhat.g4 <- yhat.g4 - (mean(yhat.g4) - mean(y)) # remove bias
plot(y ~ yhat.t1, asp=1, pch=20,
xlab = "predicted", ylab="actual",
main = "Increasing gain")
abline(0,1, lty=2); grid()
points(y ~ yhat.g2, col="red", pch=20)
points(y ~ yhat.g3, col="blue", pch=20)
points(y ~ yhat.g4, col="magenta", pch=20)
Now show these four on a Taylor diagram:
taylor.diagram(y, yhat.1, col = "black",
ngamma = 8,
gamma.col = "green",
sd.arcs = TRUE,
ref.sd = TRUE)
taylor.diagram(y, yhat.g2, col = "red", add = TRUE)
taylor.diagram(y, yhat.g3, col = "blue", add = TRUE)
taylor.diagram(y, yhat.g4, col = "magenta", add = TRUE)
round(c(cor(y, yhat.1), cor(y, yhat.g2), cor(y, yhat.g3), cor(y, yhat.g4)), 3)
## [1] 0.988 0.982 0.977 0.969
round(c(sd(y), sd(yhat.1), sd(yhat.g2), sd(yhat.g3), sd(yhat.g4)),1)
## [1] 74.0 75.6 61.0 53.7 46.5
The correlation between actual and predicted patterns decreases with gain, but only by a small amount.
The standard deviation of the first model is slightly larger than that of the reference because of the noise; it then decreases with increasing gain, because the range of the predicted values becomes narrower, i.e., the actual vs. predicted line is steeper than the 1:1 line. The models with increasing gain are not variable enough.
A systematic approach to model quality was presented by Gauch et al. 4 This is based on a comparison of the model-based predictions and the measured values. This should be a 1:1 relation: each model prediction should equal the true value. Of course this will rarely be the case. These authors present a decomposition to show the source of the disagreement:
\[ \begin{align} \mathrm{MSD} &= \frac{1}{n} \sum_{i=1}^n \big(y_i - \widehat{y}_i\big)^2\\ \mathrm{SB} &= \Big(\overline{\widehat{y}} - \overline{y}\Big)^2\\ \mathrm{NU} &= (1 - b)^2 \, \frac{1}{n} \sum_{i=1}^n \Big(\widehat{y}_i - \overline{\widehat{y}}\Big)^2\\ \mathrm{LC} &= (1 - r^2) \, \frac{1}{n} \sum_{i=1}^n \Big(y_i - \overline{y}\Big)^2 \end{align} \]
\(n\) is the number of observations, \(y_i, \hat{y}_i\) are the actual and predicted values of observation \(i\), the overline indicates the arithmetic mean.
\(b\) is the slope of the least-squares regression of the actual values on the predicted values, i.e., \(\sum (y_i - \overline{y})(\widehat{y}_i - \overline{\widehat{y}}) \big/ \sum (\widehat{y}_i - \overline{\widehat{y}})^2\); this is also called the gain (see above).
\(r^2\) is the squared correlation coefficient between actual and predicted, i.e., \(\big[\sum (y_i - \overline{y})(\widehat{y}_i - \overline{\widehat{y}})\big]^2 \big/ \big[\sum (y_i - \overline{y})^2 \sum (\widehat{y}_i - \overline{\widehat{y}})^2\big]\).
Thus NU is the contribution of the non-unity slope of the regression of actual on predicted, scaled by the variance of the predicted values, and LC is the contribution of the lack of correlation, scaled by the variance of the actual values.
These have the relation:
\[ \mathrm{MSD} = \mathrm{SB} + \mathrm{NU} + \mathrm{LC} \]
That is, the total evaluation error consists of bias, gain, and lack of correlation. These are distinct aspects of model quality and can be interpreted as translation (SB), rotation (NU), and scatter (LC), respectively.
If there is significant translation or rotation, this indicates that the model form is not correct. If there is significant scatter, this indicates that the model does not describe the system well; perhaps there are missing factors (predictors), or perhaps the system has a lot of noise that cannot be modelled.
Thus by examining the model evaluation decomposition the analyst can decide on how to improve the model.
decompose <- function(actual, predicted) {
msd <- mean((actual - predicted)^2)
sb <- (mean(predicted) - mean(actual))^2
regr.actual.pred <- lm(actual ~ predicted)
b1 <- coef(regr.actual.pred)[2]; names(b1) <- NULL
nu <- (1 - b1)^2 * mean((predicted - mean(predicted))^2)
r2 <- summary(regr.actual.pred)$r.squared
lc <- (1 - r2) * mean((actual - mean(actual))^2)
v <- c(msd, sb, nu, lc)
names(v) <- c("msd", "sb", "nu", "lc")
return(v)
}
decompose(y, yhat.1) # noise only
## msd sb nu lc
## 133.45075823 0.04970595 5.73064719 127.67040510
decompose(y, yhat.2) # bias and noise
## msd sb nu lc
## 524.532826 391.131774 5.730647 127.670405
decompose(y, yhat.3) # gain and noise
## msd sb nu lc
## 332.9880 0.0000 136.9021 196.0859
decompose(y, yhat.4) # bias, gain, and noise
## msd sb nu lc
## 732.9880 400.0000 136.9021 196.0859
In the noise-only model, the MSD is almost completely from lack of correlation LC. There is minimal bias and non-unity slope, these are just due to randomness in the noise.
In the bias + noise model, the MSD is larger, the lack of correlation LC is the same, and the majority of the MSD is from the squared bias SB. Again the non-unity slope is minimal and due to randomness.
In the gain + noise model, the MSD is larger, the lack of correlation LC is larger, there is effectively no bias SB, and now we see a large contribution to MSD by the non-unity slope NU.
In the model with bias + gain + noise, the MSD is the largest. The lack of correlation LC is the same as with only gain, but now in addition the squared bias SB is the main contributor to MSD.
This decomposition shows where each model has the most problems.
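To summarize, the statistics developed above can be collected into one convenience wrapper (a sketch; the name evaluate is introduced here, and it assumes f.rmse, f.bias, gain, lin, f.R2.1.1 and decompose are defined as above):
evaluate <- function(actual, predicted) {
   c(rmse   = f.rmse(actual, predicted),    # average size of the residuals
     mpe    = f.bias(actual, predicted),    # bias (mean prediction error)
     gain   = gain(actual, predicted),      # slope of actual vs. predicted
     ccc    = lin(actual, predicted),       # Lin's concordance
     r2.1.1 = f.R2.1.1(actual, predicted),  # the 1:1 R-squared defined above
     decompose(actual, predicted))          # MSD = SB + NU + LC
}
# for example: round(evaluate(y, yhat.4), 3)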
1. We prefer the term “evaluation” because “validation” implies that the model is correct (“valid”); that of course is never the case. We want to evaluate how close it comes to reality. See Oreskes, N. (1998). Evaluation (not validation) of quantitative models. Environmental Health Perspectives, 106(Suppl 6), 1453–1460, and Oreskes, N., et al. (1994). Verification, validation, and confirmation of numerical models in the earth sciences. Science, 263, 641–646.
2. Lin, L. I.-K. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45(1), 255–268.
3. Taylor, K. E. (2001). Summarizing multiple aspects of model performance in a single diagram. Journal of Geophysical Research: Atmospheres, 106(D7), 7183–7192. https://doi.org/10.1029/2000JD900719
4. Gauch, H. G., Hwang, J. T. G., & Fick, G. W. (2003). Model evaluation by comparison of model-based predictions and measured values. Agronomy Journal, 95(6), 1442–1446. https://doi.org/10.2134/agronj2003.1442