This R Markdown file shows how linear models are computed by ordinary least squares (OLS) and by a robust regression variant of OLS.
We use a subset of the meuse
soil pollution dataset found in the sp
(spatial objects) package. Note that each time the source R Markdown file is executed, a different random set of observations is selected from the full dataset, so the results will be different.
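(To reproduce a particular run, the random seed can be fixed before sampling; the seed value shown here is arbitrary.)
set.seed(314) # arbitrary seed; fixes the random sample across runs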
library(sp)
data(meuse)
# if you want to see its metadata
# help(meuse)
dim(meuse)
## [1] 155 14
Add computed fields with the log-transformed values of the heavy metals:
meuse$logZn <- log10(meuse$zinc) # zinc
meuse$logCu <- log10(meuse$copper) # copper
meuse$logPb <- log10(meuse$lead) # lead
meuse$logCd <- log10(meuse$cadmium) # cadmium
dim(meuse)
## [1] 155 18
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [15] "logZn" "logCu" "logPb" "logCd"
table(meuse$ffreq) # observations in each flood frequency class
##
## 1 2 3
## 84 48 23
To be able to show the matrices, we limit the data to only a few rows and three selected columns: a dependent variable logZn
, a continuous predictor dist
(normalized distance to the river Meuse), and a categorical predictor ffreq
(flooding frequency).
To ensure we have some observations from each flood frequency class, we take a stratified sample based on this category. This uses the strata
method from the sampling
package. We choose an equal number of observations from each of the three classes.
The n.s
parameter is the sample size from each stratum; you can change this to see the effect of taking larger or smaller samples.
n.s <- 5 # sample size from each stratum
(n <- n.s * length(levels(meuse$ffreq))) # total sample size
## [1] 15
library(sampling)
# indices of the selected rows in the data frame
(ix <- strata(meuse, stratanames="ffreq", size=rep(n.s,3), method="srswor")$ID_unit)
## [1] 16 17 38 61 74 109 125 129 131 132 143 144 148 150 151
These are the positions of the selected rows in the data frame; they will differ with each random sample. (The data frame printed below is labelled with the original meuse row names, which do not always match these row positions.)
# the reduced data frame
(ds <- meuse[ix, c("logZn","dist", "ffreq")])
## logZn dist ffreq
## 16 3.013680 0.0000000 1
## 17 2.782473 0.0122243 1
## 39 2.920645 0.0484965 1
## 62 2.957607 0.0054321 1
## 83 2.773055 0.2009760 1
## 114 2.283301 0.5757520 2
## 131 2.418301 0.1683310 2
## 135 2.618048 0.0812194 2
## 161 2.100371 0.7716980 2
## 162 2.322219 0.3368290 2
## 148 2.227887 0.1030290 3
## 149 2.605305 0.0703550 3
## 153 2.893762 0.0921435 3
## 155 2.330414 0.3749400 3
## 156 2.220108 0.4238370 3
Visualize the relation with possible predictors of log(Zn):
plot(logZn ~ dist, data=ds, col=ffreq, pch=20, xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()
plot(logZn ~ ffreq, data=ds, xlab="Flood frequency class", boxwex=0.4)
Depending on the random sample, there should be some relation: further distances and less-frequent flooding are associated with lower metal concentrations.
Model form: \(y = \beta_0 + \beta_1 x + \varepsilon; \;\; \mathbf{\varepsilon} \sim \mathcal{N}(0, \sigma^2)\)
Each observation \(i\): \(y_i = \beta_0 + \beta_1 x_i + r_i\) where \(r_i\) is the residual – i.e., what the model does not explain.
There is only one set of coefficients \(\beta_p\), which must be fit to all the observations simultaneously.
Once the model is fit, i.e., we compute the \(\beta_p\), we can compute the fitted values that would be expected at each observation point: \(\widehat{y}_i = \beta_0 + \beta_1 x_i\), and from that the residuals: \(y_i - (\beta_0 + \beta_1 x_i)\), i.e., \((y_i - \widehat{y}_i)\), actual less fitted.
So the question is: what are the ‘best’ values of \(\beta_p\)?
One criterion is to select \(\beta_0\), \(\beta_1\) to minimize the sum of squared residuals: \(\sum_i (y_i - (\beta_0 + \beta_1 x_i))^2\).
Given this criterion, the optimal values of the coefficients \(\beta_0\), \(\beta_1\) can be found by simple minimization:
Solution:
\[ \begin{aligned} \widehat{\beta}_1 &= \frac{ \sum_i (x_i - \overline{x})(y_i - \overline{y}) } { \sum_i (x_i - \overline{x})^2 } \\ \widehat{\beta}_0 &= \overline{y} - \widehat{\beta}_1 \overline{x} \end{aligned} \] where \(\overline{x}\) and \(\overline{y}\) are the sample means of the predictor and the response, respectively.
This can also be expressed in terms of variances and covariances:
\[ \widehat{\beta}_1 = \frac{s_{xy}}{s_x^2} \] where \(s_{xy}\) is the sample covariance between \(x\) and \(y\), and \(s_x^2\) is the sample variance of \(x\).
These are unbiased estimates of the population variance/covariance:
\[ \widehat{\beta}_1 = \frac{\mathrm{Covar}(x,y)}{\mathrm{Var}(x)} \]
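As a check, here is a minimal sketch (using the sample ds defined above; the names x, y, b0, b1 are introduced only for this check) that computes \(\widehat{\beta}_0\) and \(\widehat{\beta}_1\) directly from these formulas; the same values are returned by lm() further below.
x <- ds$dist; y <- ds$logZn
(b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)) # slope from the deviations
(b0 <- mean(y) - b1 * mean(x)) # intercept from the means
cov(x, y) / var(x) # equivalently, sample covariance / variance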
This can be more efficiently written in matrix notation: \(\mathbf{y} = \mathbf{X} \mathbf{\beta} + \varepsilon\)
where \(\mathbf{y}\) is the \(n \times 1\) vector of observations of the response, \(\mathbf{X}\) is the \(n \times p\) design (model) matrix, \(\mathbf{\beta}\) is the \(p \times 1\) vector of coefficients, and \(\mathbf{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})\) is the \(n \times 1\) vector of residuals.
This allows extension to multiple variables and to categorical variables, as we will see below.
Notice that \(\mathbf{I}\) means there is no correlation among the residuals. If this is not true, generalized least squares (GLS) must be used.
Solution by minimizing the sum of squares of the residuals:
\[ \begin{aligned} S &= \varepsilon^T \varepsilon \\ S &= (\mathbf{y} - \mathbf{X}\mathbf{\beta})^T (\mathbf{y} - \mathbf{X}\mathbf{\beta}) \end{aligned} \]
Note that \((\mathbf{X}^T \mathbf{X})\) is the matrix equivalent of \(s_x^2\), i.e., the variance of the predictor \(x\).
This expands to: \[ \begin{aligned} S &= \mathbf{y}^T\mathbf{y} - \beta^T \mathbf{X}^T \mathbf{y} - \mathbf{y}^T \mathbf{X}\mathbf{\beta} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta} \nonumber \\ S &= \mathbf{y}^T\mathbf{y} - 2 \mathbf{\beta}^T \mathbf{X}^T \mathbf{y} + \mathbf{\beta}^T\mathbf{X}^T\mathbf{X} \mathbf{\beta} \end{aligned} \]
Minimize by finding the partial derivative with respect to the unknown coefficients \(\mathbf{\beta}\), setting this equal to \(\mathbf{0}\), and solving:
\[ \begin{aligned} \frac{\partial}{\partial \mathbf{\beta}^T} S &= -2 \mathbf{X}^T \mathbf{y} + 2 \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\ \mathbf{0} &= -\mathbf{X}^T \mathbf{y} + \mathbf{X}^T \mathbf{X} \mathbf{\beta} \\ (\mathbf{X}^T \mathbf{X} ) \mathbf{\beta} &= \mathbf{X}^T \mathbf{y} \\ (\mathbf{X}^T \mathbf{X})^{-1} (\mathbf{X}^T \mathbf{X}) \mathbf{\beta} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \\ \mathbf{\widehat{\beta}}_{\mathrm{OLS}} &= (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \end{aligned} \]
The model matrix \(\mathbf{X}\) is computed from the model formula, with the model.matrix function:
(X <- model.matrix(logZn ~ dist, data=ds))
## (Intercept) dist
## 16 1 0.0000000
## 17 1 0.0122243
## 39 1 0.0484965
## 62 1 0.0054321
## 83 1 0.2009760
## 114 1 0.5757520
## 131 1 0.1683310
## 135 1 0.0812194
## 161 1 0.7716980
## 162 1 0.3368290
## 148 1 0.1030290
## 149 1 0.0703550
## 153 1 0.0921435
## 155 1 0.3749400
## 156 1 0.4238370
## attr(,"assign")
## [1] 0 1
dim(X)
## [1] 15 2
Note the column of \(1\)s to represent the mean, and the column dist, which contains the observed normalized distances at each of the selected observations.
Its cross-product \(\mathbf{X}^T \mathbf{X}\), computed with the t (“transpose”) function and the %*% matrix multiplication operator:
t(X)
## 16 17 39 62 83 114 131
## (Intercept) 1 1.0000000 1.0000000 1.0000000 1.000000 1.000000 1.000000
## dist 0 0.0122243 0.0484965 0.0054321 0.200976 0.575752 0.168331
## 135 161 162 148 149 153
## (Intercept) 1.0000000 1.000000 1.000000 1.000000 1.000000 1.0000000
## dist 0.0812194 0.771698 0.336829 0.103029 0.070355 0.0921435
## 155 156
## (Intercept) 1.00000 1.000000
## dist 0.37494 0.423837
## attr(,"assign")
## [1] 0 1
(XTX <- t(X)%*%X)
## (Intercept) dist
## (Intercept) 15.000000 3.265263
## dist 3.265263 1.462589
dim(XTX)
## [1] 2 2
This is the product-crossproduct matrix of the two predictor variables:
sum(rep(1,n)^2); sum((ds$dist)^2); sum(rep(1,n)*ds$dist)
## [1] 15
## [1] 1.462589
## [1] 3.265263
Its inverse, using the solve
matrix solution function:
(XTXinv <- solve(XTX))
## (Intercept) dist
## (Intercept) 0.1296979 -0.2895533
## dist -0.2895533 1.3301533
dim(XTXinv)
## [1] 2 2
Compute \(\mathbf{X}^T \mathbf{y}\), the matrix equivalent of \(s_{xy}\), i.e., the covariance between two predictors (intercept and normalized distance) and predictand:
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## (Intercept) 38.467175
## dist 7.580611
dim(XTy)
## [1] 2 1
And finally the solution:
(beta <- XTXinv%*%XTy)
## [,1]
## (Intercept) 2.794119
## dist -1.054924
dim(beta)
## [1] 2 1
This is the same result as using the lm
function:
coef(m1 <- lm(logZn ~ dist, data=ds))
## (Intercept) dist
## 2.794119 -1.054924
Recall, in OLS the residuals are assumed to be independently and identically distributed, drawn from a normal distribution.
First see if they appear to be normally distributed. We use the qqPlot function from the car package, which shows the quantiles of the normal distribution against the residuals, with confidence intervals.
library(car)
qqPlot(residuals(m1))
Second, see if there is any relation between the residuals and fitted values, and also if the variance of the residuals is the same across the range of fitted values:
plot(residuals(m1) ~ fitted(m1))
grid()
abline(h=0, col="blue")
lines(stats::lowess(residuals(m1) ~ fitted(m1)), col="red")
There are not so many points here, so it’s hard to draw a firm conclusion. The empirical smoother, computed with lowess (“locally weighted scatterplot smoothing”), helps visualize any relation.
Third, look for observations with high influence on the coefficients; these may be from a different population (process). The leverage of an observation is given by its hat value, which measures the overall influence of that single observation on the predictions.
(sort(h <- hatvalues(m1)))
## 83 131 148 162 153 135
## 0.06703800 0.06990657 0.08415261 0.08554884 0.08763050 0.09143763
## 149 155 39 17 156 62
## 0.09553883 0.09956055 0.10474161 0.12281745 0.12319683 0.12659134
## 16 114 161
## 0.12969785 0.23720903 0.47493237
plot(ds$dist, h, xlab="Normalized distance", ylab="Hat value", col=ds$ffreq, pch=20,
xlim=c(0,1), ylim=c(0,0.8))
text(ds$dist, h, labels=names(residuals(m1)), pos=3)
The highest hat values are at the extremes of the predictor.
The influence of each observation can be measured by its Cook’s distance, which measures how much \(\beta\) would change if the observation were omitted. It is defined as:
\[ \begin{aligned} D_i &= \frac{r^2_i}{\mathrm{tr}(\mathbf{H})} \frac{h_i}{1-h_i} \end{aligned} \] where \(\mathbf{H}\) is the hat matrix, \(h_i\) is its \(i\)th diagonal element, and \(r_i\) is the standardized residual, i.e., \((y_i - \widehat{y}_i) / (s \sqrt{1 - h_i})\). This is based on \(s^2\), the residual variance after the model fit.
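As a sketch, this formula can be checked against R’s cooks.distance() for model m1; here \(\mathrm{tr}(\mathbf{H})\) equals the number of coefficients, computed below as the sum of the hat values.
s <- summary(m1)$sigma # residual standard error s
h <- hatvalues(m1) # diagonal of the hat matrix
r.std <- residuals(m1) / (s * sqrt(1 - h)) # standardized residuals
D <- r.std^2 / sum(h) * h / (1 - h) # tr(H) = sum of the hat values
all.equal(as.numeric(D), as.numeric(cooks.distance(m1)))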
sort(d <- cooks.distance(m1))
## 17 155 135 162 149
## 3.166605e-06 7.248844e-03 1.149586e-02 1.765843e-02 1.948093e-02
## 156 83 131 114 153
## 3.277671e-02 3.567119e-02 4.033986e-02 4.828057e-02 5.181515e-02
## 39 62 16 148 161
## 5.241002e-02 6.036023e-02 1.048543e-01 2.667905e-01 3.168586e-01
plot(fitted(m1), d, xlab="Fitted value", ylab="Cook's distance", col=ds$ffreq, pch=20)
text(fitted(m1), d, labels=names(residuals(m1)), pos=4)
The hat matrix is so named because it ‘puts the hats on’ the fitted values: from the observed \(\mathbf{y}\) to the best-fit \(\mathbf{\widehat{y}}\).
It is defined as:
\[ \begin{aligned} \mathbf{H} &= \mathbf{X} {(\mathbf{X}' \mathbf{X})}^{-1} \mathbf{X}' \end{aligned} \]
The hat matrix gives the weights by which each original observation is multiplied when computing \(\mathbf{\widehat{\beta}}\). This means that if a high-leverage observation were changed, the fit would change substantially. In terms of vector spaces, the \(\mathbf{H}\) matrix gives the orthogonal projection of the observation vector \(\mathbf{y}\) into the column space of the design (model) matrix \(\mathbf{X}\).
The overall leverage of each observation \(y_i\) is given by its hat value, which is the sum of the squared entries of the hat matrix associated with the observation. This is also the diagonal of the hat matrix, because of the fact that \(\mathbf{H} = \mathbf{H'} \mathbf{H} = \mathbf{H^2}\):
\[ \begin{aligned} h_{ii} &= \sum_{j=1}^n h_{ij}^2 \end{aligned} \]
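A minimal sketch, computing \(\mathbf{H}\) directly for the simple model m1 (the names X1 and H are introduced only for this check) and confirming the relations above:
X1 <- model.matrix(logZn ~ dist, data=ds) # design matrix of model m1
H <- X1 %*% solve(t(X1) %*% X1) %*% t(X1) # hat matrix
all.equal(as.numeric(diag(H)), as.numeric(hatvalues(m1))) # diagonal = hat values
all.equal(as.numeric(rowSums(H^2)), as.numeric(diag(H))) # h_ii = sum_j h_ij^2
all.equal(as.numeric(H %*% ds$logZn), as.numeric(fitted(m1))) # H 'puts the hats on' y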
There are various ways to set up the model matrix to separate categories. These are viewed and set with the contrasts
function:
contrasts(ds$ffreq)
## 2 3
## 1 0 0
## 2 1 0
## 3 0 1
This is the default “treatment contrasts” matrix; the term “treatment” comes from experimental design. Here the “treatment” was applied by nature, not by humans. Treatment contrasts are sometimes called “dummy variables”.
The model matrix uses these contrasts:
(X <- model.matrix(logZn ~ ffreq, data=ds))
## (Intercept) ffreq2 ffreq3
## 16 1 0 0
## 17 1 0 0
## 39 1 0 0
## 62 1 0 0
## 83 1 0 0
## 114 1 1 0
## 131 1 1 0
## 135 1 1 0
## 161 1 1 0
## 162 1 1 0
## 148 1 0 1
## 149 1 0 1
## 153 1 0 1
## 155 1 0 1
## 156 1 0 1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
Notice how the first category is included in the intercept – this is the base case; the other two classes are contrasted with it. The model matrix has created two variables, beyond the intercept, to represent the two additional classes.
Solution is exactly as before, by OLS:
(XTX <- t(X)%*%X)
## (Intercept) ffreq2 ffreq3
## (Intercept) 15 5 5
## ffreq2 5 5 0
## ffreq3 5 0 5
dim(XTX)
## [1] 3 3
This is the product-crossproduct matrix among the predictors. Note the relation to the number of observations in each class.
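For comparison, a quick check of the class counts in this sample:
table(ds$ffreq) # 5 per class; these counts appear on the diagonal and in the intercept row/column of X'X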
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## (Intercept) 38.46718
## ffreq2 11.74224
## ffreq3 12.27748
(beta <- XTXinv%*%XTy)
## [,1]
## (Intercept) 2.8894919
## ffreq2 -0.5410438
## ffreq3 -0.4339968
And again, these are the coefficients found by lm:
coef(m2 <- lm(logZn ~ ffreq, data=ds))
## (Intercept) ffreq2 ffreq3
## 2.8894919 -0.5410438 -0.4339968
The first coefficient is the mean of the dependent variable for observations in the first class:
beta[1]
## [1] 2.889492
mean(ds$logZn[ds$ffreq==1])
## [1] 2.889492
The others are the differences between the corresponding class means and this base-class mean:
beta[2]
## [1] -0.5410438
mean(ds$logZn[ds$ffreq==2])
## [1] 2.348448
mean(ds$logZn[ds$ffreq==2]) - beta[1]
## [1] -0.5410438
We can fit the model without an intercept by adding -1 to the model formula:
(X <- model.matrix(logZn ~ ffreq - 1, data=ds))
## ffreq1 ffreq2 ffreq3
## 16 1 0 0
## 17 1 0 0
## 39 1 0 0
## 62 1 0 0
## 83 1 0 0
## 114 0 1 0
## 131 0 1 0
## 135 0 1 0
## 161 0 1 0
## 162 0 1 0
## 148 0 0 1
## 149 0 0 1
## 153 0 0 1
## 155 0 0 1
## 156 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
Now each observation has a 1 only in its own ‘dummy’ variable.
(XTX <- t(X)%*%X)
## ffreq1 ffreq2 ffreq3
## ffreq1 5 0 0
## ffreq2 0 5 0
## ffreq3 0 0 5
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## ffreq1 14.44746
## ffreq2 11.74224
## ffreq3 12.27748
(beta <- XTXinv%*%XTy)
## [,1]
## ffreq1 2.889492
## ffreq2 2.348448
## ffreq3 2.455495
coef(m3 <- lm(logZn ~ ffreq - 1, data=ds))
## ffreq1 ffreq2 ffreq3
## 2.889492 2.348448 2.455495
Now the coefficients are the means of each class.
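A quick check, computing the per-class means of logZn directly; these should match the coefficients above:
tapply(ds$logZn, ds$ffreq, mean) # per-class means of logZn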
The same formulation works with any combination of predictors. For example, both the flood frequency and distance:
(X <- model.matrix(logZn ~ ffreq + dist, data=ds))
## (Intercept) ffreq2 ffreq3 dist
## 16 1 0 0 0.0000000
## 17 1 0 0 0.0122243
## 39 1 0 0 0.0484965
## 62 1 0 0 0.0054321
## 83 1 0 0 0.2009760
## 114 1 1 0 0.5757520
## 131 1 1 0 0.1683310
## 135 1 1 0 0.0812194
## 161 1 1 0 0.7716980
## 162 1 1 0 0.3368290
## 148 1 0 1 0.1030290
## 149 1 0 1 0.0703550
## 153 1 0 1 0.0921435
## 155 1 0 1 0.3749400
## 156 1 0 1 0.4238370
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
## (Intercept) ffreq2 ffreq3 dist
## (Intercept) 15.000000 5.000000 5.000000 3.265263
## ffreq2 5.000000 5.000000 0.000000 1.933829
## ffreq3 5.000000 0.000000 5.000000 1.064304
## dist 3.265263 1.933829 1.064304 1.462589
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## (Intercept) 38.467175
## ffreq2 11.742240
## ffreq3 12.277475
## dist 7.580611
(beta <- XTXinv%*%XTy)
## [,1]
## (Intercept) 2.9285211
## ffreq2 -0.2975279
## ffreq3 -0.3175242
## dist -0.7305327
coef(m4 <- lm(logZn ~ ffreq + dist, data=ds))
## (Intercept) ffreq2 ffreq3 dist
## 2.9285211 -0.2975279 -0.3175242 -0.7305327
Here the intercept is the predicted value of logZn at distance 0 in flood frequency class 1.
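To illustrate (a sketch using the fitted model m4), the predicted logZn at the river (distance 0) for each flood frequency class is the intercept plus the corresponding class contrast:
predict(m4, newdata=data.frame(dist=0, ffreq=factor(1:3, levels=levels(ds$ffreq))))
coef(m4)["(Intercept)"] + c(0, coef(m4)["ffreq2"], coef(m4)["ffreq3"]) # the same values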
A model with their interaction:
(X <- model.matrix(logZn ~ ffreq * dist, data=ds))
## (Intercept) ffreq2 ffreq3 dist ffreq2:dist ffreq3:dist
## 16 1 0 0 0.0000000 0.0000000 0.0000000
## 17 1 0 0 0.0122243 0.0000000 0.0000000
## 39 1 0 0 0.0484965 0.0000000 0.0000000
## 62 1 0 0 0.0054321 0.0000000 0.0000000
## 83 1 0 0 0.2009760 0.0000000 0.0000000
## 114 1 1 0 0.5757520 0.5757520 0.0000000
## 131 1 1 0 0.1683310 0.1683310 0.0000000
## 135 1 1 0 0.0812194 0.0812194 0.0000000
## 161 1 1 0 0.7716980 0.7716980 0.0000000
## 162 1 1 0 0.3368290 0.3368290 0.0000000
## 148 1 0 1 0.1030290 0.0000000 0.1030290
## 149 1 0 1 0.0703550 0.0000000 0.0703550
## 153 1 0 1 0.0921435 0.0000000 0.0921435
## 155 1 0 1 0.3749400 0.0000000 0.3749400
## 156 1 0 1 0.4238370 0.0000000 0.4238370
## attr(,"assign")
## [1] 0 1 1 2 3 3
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
## (Intercept) ffreq2 ffreq3 dist ffreq2:dist ffreq3:dist
## (Intercept) 15.000000 5.000000 5.000000 3.265263 1.933829 1.064304
## ffreq2 5.000000 5.000000 0.000000 1.933829 1.933829 0.000000
## ffreq3 5.000000 0.000000 5.000000 1.064304 0.000000 1.064304
## dist 3.265263 1.933829 1.064304 1.462589 1.075394 0.344273
## ffreq2:dist 1.933829 1.933829 0.000000 1.075394 1.075394 0.000000
## ffreq3:dist 1.064304 0.000000 1.064304 0.344273 0.000000 0.344273
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## (Intercept) 38.467175
## ffreq2 11.742240
## ffreq3 12.277475
## dist 7.580611
## ffreq2:dist 4.337369
## ffreq3:dist 2.494204
(beta <- XTXinv%*%XTy)
## [,1]
## (Intercept) 2.9320609
## ffreq2 -0.3425105
## ffreq3 -0.2610534
## dist -0.7967881
## ffreq2:dist 0.1734077
## ffreq3:dist -0.2156685
coef(m5 <- lm(logZn ~ ffreq * dist, data=ds))
## (Intercept) ffreq2 ffreq3 dist ffreq2:dist ffreq3:dist
## 2.9320609 -0.3425105 -0.2610534 -0.7967881 0.1734077 -0.2156685
A nested model without intercept:
(X <- model.matrix(logZn ~ ffreq/dist - 1, data=ds))
## ffreq1 ffreq2 ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist
## 16 1 0 0 0.0000000 0.0000000 0.0000000
## 17 1 0 0 0.0122243 0.0000000 0.0000000
## 39 1 0 0 0.0484965 0.0000000 0.0000000
## 62 1 0 0 0.0054321 0.0000000 0.0000000
## 83 1 0 0 0.2009760 0.0000000 0.0000000
## 114 0 1 0 0.0000000 0.5757520 0.0000000
## 131 0 1 0 0.0000000 0.1683310 0.0000000
## 135 0 1 0 0.0000000 0.0812194 0.0000000
## 161 0 1 0 0.0000000 0.7716980 0.0000000
## 162 0 1 0 0.0000000 0.3368290 0.0000000
## 148 0 0 1 0.0000000 0.0000000 0.1030290
## 149 0 0 1 0.0000000 0.0000000 0.0703550
## 153 0 0 1 0.0000000 0.0000000 0.0921435
## 155 0 0 1 0.0000000 0.0000000 0.3749400
## 156 0 0 1 0.0000000 0.0000000 0.4238370
## attr(,"assign")
## [1] 1 1 1 2 2 2
## attr(,"contrasts")
## attr(,"contrasts")$ffreq
## [1] "contr.treatment"
(XTX <- t(X)%*%X)
## ffreq1 ffreq2 ffreq3 ffreq1:dist ffreq2:dist
## ffreq1 5.0000000 0.000000 0.000000 0.2671289 0.000000
## ffreq2 0.0000000 5.000000 0.000000 0.0000000 1.933829
## ffreq3 0.0000000 0.000000 5.000000 0.0000000 0.000000
## ffreq1:dist 0.2671289 0.000000 0.000000 0.0429222 0.000000
## ffreq2:dist 0.0000000 1.933829 0.000000 0.0000000 1.075394
## ffreq3:dist 0.0000000 0.000000 1.064304 0.0000000 0.000000
## ffreq3:dist
## ffreq1 0.000000
## ffreq2 0.000000
## ffreq3 1.064304
## ffreq1:dist 0.000000
## ffreq2:dist 0.000000
## ffreq3:dist 0.344273
XTXinv <- solve(XTX)
y <- ds$logZn
(XTy <- t(X)%*%y)
## [,1]
## ffreq1 14.4474593
## ffreq2 11.7422405
## ffreq3 12.2774754
## ffreq1:dist 0.7490383
## ffreq2:dist 4.3373692
## ffreq3:dist 2.4942038
(beta <- XTXinv%*%XTy)
## [,1]
## ffreq1 2.9320609
## ffreq2 2.5895504
## ffreq3 2.6710075
## ffreq1:dist -0.7967881
## ffreq2:dist -0.6233804
## ffreq3:dist -1.0124565
coef(m6 <- lm(logZn ~ ffreq/dist - 1, data=ds))
## ffreq1 ffreq2 ffreq3 ffreq1:dist ffreq2:dist ffreq3:dist
## 2.9320609 2.5895504 2.6710075 -0.7967881 -0.6233804 -1.0124565
This directly gives, for each class, the expected value of logZn at distance 0 (the class’s intercept) and a coefficient showing how much the metal concentration decreases with normalized distance away from the river.
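The nested model is a reparameterization of the interaction model m5: the per-class slopes can be recovered from the m5 coefficients, and the two models give identical fitted values. A quick sketch:
b5 <- coef(m5)
b5["dist"] + c(0, b5["ffreq2:dist"], b5["ffreq3:dist"]) # per-class slopes, as in m6
all.equal(fitted(m5), fitted(m6)) # same fitted values, different parameterization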
A robust inference is one that is not greatly disturbed by: (1) one or a few unusual observations, or (2) violations of the model assumptions.
For linear modelling, the underlying model form should still be linear, but there may be unusual observations that obscure the general relation.
An example of the first case is a contaminated dataset, where some observations result from a different process than the one which produced the others. The issue here is that the observations do not all come from the same population, i.e., are not the result of a single process.
Many of the problems with parametric regression can be avoided by fitting a so-called robust linear regression, which attempts to fit the majority of the observations and ignore the unusual cases. Since we do not know in advance if there is contamination or problems with the model assumptions, there is no way to select an optimum method. See the discussion in J Fox and S Weisberg. Robust regression in R: An appendix to “An R companion to applied regression, second edition”. 15-Dec-2010. URL http://socserv.mcmaster.ca/jfox/Books/Companion/appendix/Appendix-Robust-Regression.pdf
One method is called least trimmed squares (LTS). It estimates the slope vector \(\mathbf{\beta}\), by the criterion of minimizing
\[ \sum_{i=1}^q \big| \;y_i - \mathbf{X}_i \mathbf{\beta} \; \big|_{(i)}^2 \]
That is, the sum of squared deviations over some subset of the residuals, indicated by the subscript \((i)\): only the \(q\) smallest residuals are used. By default \(q = \lfloor n/2 \rfloor + \lfloor (p + 1)/2 \rfloor\), where \(p\) is the number of model coefficients (including the intercept) and \(n\) is the number of observations; this is about half the observations.
To determine the observations to use, sort the squared residuals from smallest \((1)\) to largest \((n)\):
\[ (e^2)_{(1)}, (e^2)_{(2)}, \ldots (e^2)_{(n)} \] and select the observations with the smallest \(q\) of these to re-fit the model.
p <- 2 # number of model coefficients (intercept and slope)
(q <- floor((n/2) + floor((p+1)/2))) # number of 'good' observations
## [1] 8
In our example, 8 ‘good’ observations will be used; the other 7 observations with the largest residuals will be discarded. This is the ‘trimming’.
However, it is not so simple, because after re-fitting on the selected observations the residuals from all the observations must be re-computed with the new coefficients, and these sorted residuals may not produce the same \(q\) smallest residuals. So the procedure is iterative until there is no change in the order of sorted residuals.
This is solved by the lqs
function found in the MASS
“Modern Applied Statistics with S” package, explained in the textbook: Venables, W., & Ripley, B. (2002). Modern Applied Statistics with S. Fourth Edition. Springer. This function has several methods, of which least trimmed squares is the default and most advised.
We show how to compute one iteration of the robust regression line, starting from the OLS line:
Find the 7 largest absolute residuals:
m1 <- lm(logZn ~ dist, data=ds)
(r0.s <- sort(r0.a <- abs(residuals(m1))))
## 17 155 135 114 149 162
## 0.001249728 0.068171646 0.090390217 0.096557197 0.114594383 0.116570331
## 161 156 62 39 83 153
## 0.120334640 0.126894715 0.169219136 0.177686517 0.190950485 0.196847543
## 131 16 148
## 0.198240909 0.219561094 0.457544139
(hi.res <- r0.s[(q+1):n])
## 62 39 83 153 131 16 148
## 0.1692191 0.1776865 0.1909505 0.1968475 0.1982409 0.2195611 0.4575441
(hi.res.ix <- which(abs(residuals(m1)) %in% hi.res))
## [1] 1 3 4 5 7 11 13
ds.trim <- ds[-hi.res.ix,]
dim(ds); dim(ds.trim)
## [1] 15 3
## [1] 8 3
m1.r <- lm(logZn ~ dist, data=ds.trim)
Compare the regression coefficients of the OLS and robust models, and the percent change:
coef(m1.r); coef(m1); round(100*(coef(m1.r) - coef(m1))/coef(m1),1)
## (Intercept) dist
## 2.6813878 -0.8269679
## (Intercept) dist
## 2.794119 -1.054924
## (Intercept) dist
## -4.0 -21.6
Now re-compute the residuals using the predict
method, the first-iteration robust model, and the full dataset:
(r1.s <- sort(r1.a <- abs(ds$logZn - predict(m1.r, newdata=ds))))
## 135 149 155 161 114 162
## 0.00382612 0.01790144 0.04091070 0.05715218 0.07804182 0.08062176
## 156 17 131 83 39 62
## 0.11078014 0.11119392 0.12388219 0.25786758 0.27936224 0.28071165
## 153 16 148
## 0.28857367 0.33229189 0.36829943
r0.s
## 17 155 135 114 149 162
## 0.001249728 0.068171646 0.090390217 0.096557197 0.114594383 0.116570331
## 161 156 62 39 83 153
## 0.120334640 0.126894715 0.169219136 0.177686517 0.190950485 0.196847543
## 131 16 148
## 0.198240909 0.219561094 0.457544139
In general the residuals after the re-fit will not be the same, nor in the same sorted order, as from the OLS fit. Check if the same observations are discarded:
hi.res.1 <- r1.s[(q+1):n]
(hi.res.ix.1 <- which(r1.a %in% hi.res.1))
## [1] 1 3 4 5 7 11 13
hi.res.ix
## [1] 1 3 4 5 7 11 13
intersect(hi.res.ix, hi.res.ix.1)
## [1] 1 3 4 5 7 11 13
In this case 7 of the 7 largest absolute residuals are from the same observations in both the OLS regression and the first iteration of the LTS regression.
If these sets are the same, this is the robust regression. If not, the process is repeated with the new set of ‘good’ observations, until the sets are the same.
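A minimal sketch of that iteration, using the objects q, ds and m1 defined above (this is only an illustration of the idea; the lqs function introduced next uses a more thorough subset search):
keep <- order(abs(residuals(m1)))[1:q] # positions of the q smallest absolute OLS residuals
repeat {
  m.it <- lm(logZn ~ dist, data=ds[keep, ]) # refit on the retained subset
  keep.new <- order(abs(ds$logZn - predict(m.it, newdata=ds)))[1:q] # re-rank all residuals
  if (setequal(keep, keep.new)) break # stop when the retained set no longer changes
  keep <- keep.new
}
coef(m.it)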
The robust line can be computed directly with the lqs function, with the optional method argument set to "lts" for least trimmed squares. This performs the iteration as needed.
require(MASS)
## Loading required package: MASS
m1.r <- lqs(logZn ~ dist, data=ds, method = "lts")
coef(m1.r); coef(m1); round(100*(coef(m1.r) - coef(m1))/coef(m1),1)
## (Intercept) dist
## 3.025942 -1.872351
## (Intercept) dist
## 2.794119 -1.054924
## (Intercept) dist
## 8.3 77.5
Depending on the sample, the regression coefficients can change quite a lot.
Visualize the two regression lines:
plot(logZn ~ dist, data=ds, col=ffreq, pch=20, xlab="Normalized distance to river", ylab="log-ppm Zn")
grid()
abline(m1)
abline(m1.r, lty=2)
legend("topright", c("OLS", "robust"), lty=1:2)