Linear models may not meet the assumptions of Ordinary Least Squares (OLS), which are: (1) a linear relation between the predictors and the response; (2) independent residuals; (3) normally-distributed residuals; (4) homoscedasticity, i.e., constant variance of the residuals.
Problem (2) can be dealt with by Generalized Least Squares (GLS), if the nature of the dependence is known (e.g., time or space dependence, repeated measurements on the same object).
Problem (1) usually indicates a non-linear relation; (3) and (4) may be due to that, but also to unusual values, and especially skewed predictors or response variables.
For example, in the Meuse soil pollution dataset, we model one of the metal concentrations in soil (zinc) from the elevation above sea level and the distance from the Meuse (Maas) river:
library(sp); library(gstat)
data(meuse) # an example dataset supplied with the sp package
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
summary(meuse$zinc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 113.0 198.0 326.0 469.7 674.5 1839.0
summary(m.ols <- lm(zinc ~ elev + dist, data=meuse))
##
## Call:
## lm(formula = zinc ~ elev + dist, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -422.11 -160.34 -54.94 129.47 1138.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1677.98 177.66 9.445 < 2e-16 ***
## elev -123.10 23.33 -5.277 4.45e-07 ***
## dist -846.24 124.92 -6.775 2.57e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 259.9 on 152 degrees of freedom
## Multiple R-squared: 0.5053, Adjusted R-squared: 0.4988
## F-statistic: 77.64 on 2 and 152 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols)
par(mfrow=c(1,1))
We see all of the above-mentioned problems.
Box and Cox (1964) developed a family of transformations designed to reduce the non-normality of the errors in a linear model, i.e., our problem (3) above. Applying this transform often reduces non-linearity (our problem (1)) and heteroscedasticity (our problem (4)) as well.
The linear model is: \[ Y_i=\beta_0+\mathbf{\beta X}_i+\varepsilon_i, \;\;\varepsilon_i\sim\mathcal{N}(0,\sigma^2) \]
The idea is to transform the response variable \(Y\) to a replacement response variable \(Y_i^{(\lambda)}\), leaving the right-hand side of the regression model unchanged, so that the regression residuals become normally-distributed. Note that the regression coefficients will also change, because the response variable has changed; therefore the regression coefficients must be interpreted with respect to the transformed variable. Also, any predictions made with the model have to be back-transformed, to be interpreted in the original units.
The standard (simple) Box-Cox transform is:
\[ Y_i^{(\lambda)}=\begin{cases} \dfrac{Y_i^\lambda-1}{\lambda}&(\lambda\neq0)\\ \log{(Y_i)} & (\lambda=0) \end{cases} \]
The inverse transform is:
\[ Y_i=\begin{cases} \exp \left( \dfrac{\log(1 + \lambda Y_i^{(\lambda)})}{\lambda} \right) & (\lambda\neq0)\\ \exp \left( Y_i^{(\lambda)} \right) & (\lambda=0) \end{cases} \]
Obviously, this depends on a parameter \(\lambda\). Some familiar transformations are associated with certain values of \(\lambda\): \(\lambda=1\) is no transformation (apart from a shift); \(\lambda=0.5\) is the square root; \(\lambda=0\) is the (natural) logarithm; \(\lambda=-1\) is the reciprocal.
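The \(\lambda = 0\) case is not arbitrary: it is the limit of the general formula as \(\lambda \to 0\). A quick numerical check (a sketch, not part of the original code):
y <- 100
(y^1e-6 - 1) / 1e-6 # general formula with lambda near 0: approx. 4.60518
log(y) # the lambda = 0 case: 4.60517...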
The value of \(\lambda\) is chosen by maximum likelihood, i.e., we ask which value of \(\lambda\) is most likely to have given rise to the observed distribution of the model residuals.
The log-likelihood of the linear model, assuming independent observations (i.e., solution by OLS), is:
\[ \begin{aligned} \log\mathcal{L} = &-\frac{n}{2}\log(2\pi)-n\log\sigma\\&-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta X}_i)\right]^2\\&+(\lambda-1)\sum_{i=1}^n\log Y_i \end{aligned} \]
This can also be written:
\[ \log\mathcal{L} \propto -\frac{n}{2}\log\left[\sum_{i=1}^n\left(\frac{Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta X}_i)}{\dot{Y}^\lambda}\right)^2\right] \]
(the \(\propto\) is because some constants have been removed) where \(\dot{Y}=\exp\left[\frac{1}{n}\sum_{i=1}^n \log Y_i\right]\), i.e., the exponential of the mean of the logarithms of the variable.
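To make this concrete, here is a minimal sketch that evaluates this profile log-likelihood, up to an additive constant, on a grid of \(\lambda\) and finds the maximum by brute force; it assumes the meuse data and the zinc ~ elev + dist model form used above (the helper bc.profile.ll is ours, not from any package), and should agree closely with the boxcox function introduced next:
bc.profile.ll <- function(lambda, y, data) {
    # transform the response for this candidate lambda
    yt <- if (lambda == 0) log(y) else (y^lambda - 1)/lambda
    # residual sum of squares of the linear model fit to the transformed response
    rss <- sum(residuals(lm(yt ~ elev + dist, data=data))^2)
    n <- length(y)
    # maximized Gaussian log-likelihood (up to constants) plus the
    # Jacobian term (lambda - 1) * sum(log y) from the formula above
    -n/2 * log(rss/n) + (lambda - 1) * sum(log(y))
}
lambdas <- seq(-2, 2, by=0.01)
ll <- sapply(lambdas, bc.profile.ll, y=meuse$zinc, data=meuse)
lambdas[which.max(ll)] # should be close to the boxcox() estimate below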
Fortunately, the function boxcox has been implemented in the MASS package; this computes the log-likelihood at user-defined or default values of \(\lambda\) (by default \(\lambda = -2 \ldots 2\) in steps of \(0.1\)); we can then find the optimum, i.e., the maximum log-likelihood.
For the Meuse linear model example:
library(MASS)
bc <- boxcox(m.ols)
str(bc) # numbers that make up the graph
## List of 2
## $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
## $ y: num [1:100] -346 -343 -339 -335 -331 ...
(bc.power <- bc$x[which.max(bc$y)])
## [1] -0.3838384
The optimal transformation is -0.384, i.e., towards an inverse power. Note that the confidence interval does not include 1 (no transformation), so a transformation is indicated. A logarithmic transformation (\(\lambda = 0\)) would not be optimal.
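The 95% confidence interval drawn on the boxcox plot is based on the likelihood-ratio statistic, so it can be recovered from the returned grid (a sketch, assuming the bc object computed above):
# lambda values whose log-likelihood is within qchisq(0.95, 1)/2 of the maximum
range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1)/2])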
Now we need to transform the response variable accordingly, and re-compute the linear model, with this transformed variable.
We can write a function corresponding to the formula:
BCTransform <- function(y, lambda=0) {
if (lambda == 0) { log(y) } # limiting case: natural logarithm
else { (y^lambda - 1) / lambda } # general case
}
And of course we need a reverse transformation:
BCTransformInverse <- function(yt, lambda=0) {
if (lambda == 0) { exp(yt) } # inverse of the logarithm
else { exp(log(1 + lambda * yt)/lambda) } # inverse of the general case
}
# testing
yt <- BCTransform(meuse$zinc, 0)
yo <- BCTransformInverse(yt, 0)
unique(round(yo - meuse$zinc, 8)) # should be all zero
## [1] 0
yt <- BCTransform(meuse$zinc, .5)
yo <- BCTransformInverse(yt, .5)
unique(round(yo - meuse$zinc, 8)) # should be all zero
## [1] 0
Use this function with the optimal value of \(\lambda\) to transform the response variable:
hist(meuse$zinc, breaks=12); rug(meuse$zinc)
meuse$zinc.bc <- BCTransform(meuse$zinc, bc.power)
hist(meuse$zinc.bc, breaks=12); rug(meuse$zinc.bc)
The skew has been removed. The target is obviously bimodal.
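We can quantify the reduction in skew with the (approximate) sample skewness coefficient; a sketch with a hand-rolled helper, to avoid an extra package:
# third standardized moment; near 0 for a symmetric distribution
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(before = skew(meuse$zinc), after = skew(meuse$zinc.bc))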
Compare to the log-transform:
hist(log(meuse$zinc), breaks=12); rug(log(meuse$zinc))
But the test of the transformation is the distribution of the linear model residuals, not of the target variable itself.
Re-run the linear model with transformation:
summary(m.ols.bc <- lm(zinc.bc ~ elev + dist, data=meuse))
##
## Call:
## lm(formula = zinc.bc ~ elev + dist, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.111369 -0.028361 0.004142 0.027047 0.109199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.586950 0.029348 88.148 < 2e-16 ***
## elev -0.026144 0.003854 -6.784 2.44e-10 ***
## dist -0.210349 0.020635 -10.194 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04293 on 152 degrees of freedom
## Multiple R-squared: 0.6714, Adjusted R-squared: 0.667
## F-statistic: 155.3 on 2 and 152 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.bc)
par(mfrow=c(1,1))
The residuals are now close to normally-distributed and less related to the fitted values; the apparent heteroscedasticity is mostly due to the uneven numbers of observations at different fitted values.
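As noted earlier, predictions from this model are in transformed units and must be back-transformed for interpretation. A sketch, using the BCTransformInverse function defined above at a hypothetical prediction point (the elev and dist values are made up):
pred.bc <- predict(m.ols.bc, newdata=data.frame(elev=8, dist=0.2))
BCTransformInverse(pred.bc, bc.power) # zinc in the original units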
Compare this to the linear model with a log-transformation:
summary(m.ols.log <- lm(log(zinc) ~ elev + dist, data=meuse))
##
## Call:
## lm(formula = log(zinc) ~ elev + dist, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03595 -0.27154 -0.00249 0.25329 1.10223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.48453 0.29312 28.945 < 2e-16 ***
## elev -0.26065 0.03849 -6.772 2.6e-10 ***
## dist -1.96003 0.20610 -9.510 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4288 on 152 degrees of freedom
## Multiple R-squared: 0.6518, Adjusted R-squared: 0.6472
## F-statistic: 142.3 on 2 and 152 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.log)
par(mfrow=c(1,1))
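As a numerical complement to the diagnostic plots, we can compare the residuals of the three models with a normality test; a sketch using shapiro.test from the base stats package:
# Shapiro-Wilk p-values; larger values give less evidence against normality
sapply(list(ols=m.ols, bc=m.ols.bc, log=m.ols.log),
       function(m) shapiro.test(residuals(m))$p.value)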
The powerTransform function of the car package also uses the maximum likelihood-like approach outlined by Box and Cox (1964) to select a transformation of a univariate or multivariate response for normality, linearity and/or constant variance. The summary method automatically computes two or three likelihood ratio tests concerning the transformation powers.
We use this on the Meuse Zn linear model:
require(car)
## Loading required package: car
## Loading required package: carData
(bc.car <- powerTransform(m.ols))
## Estimated transformation parameter
## Y1
## -0.369844
str(bc.car, max.level=1)
## List of 15
## $ value : num 773
## $ counts : Named int [1:2] 3 3
## ..- attr(*, "names")= chr [1:2] "function" "gradient"
## $ convergence: int 0
## $ message : chr "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
## $ hessian : num [1, 1] 93.2
## $ start : num -0.37
## $ lambda : Named num -0.37
## ..- attr(*, "names")= chr "Y1"
## $ roundlam : Named num -0.5
## ..- attr(*, "names")= chr "Y1"
## $ invHess : num [1, 1] 0.0107
## $ llik : num 773
## $ family : chr "bcPower"
## $ xqr :List of 4
## ..- attr(*, "class")= chr "qr"
## $ y : num [1:155, 1] 1022 1141 640 257 269 ...
## ..- attr(*, "dimnames")=List of 2
## $ x : num [1:155, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "dimnames")=List of 2
## ..- attr(*, "assign")= int [1:3] 0 1 2
## $ weights : num [1:155] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "class")= chr "powerTransform"
print(bc.car$lambda)
## Y1
## -0.369844
summary(bc.car)
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 -0.3698 -0.5 -0.5729 -0.1668
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 13.09709 1 0.00029576
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 174.6859 1 < 2.22e-16
We see that this estimate from car, -0.37, is not quite the same as the estimate from the boxcox function in MASS, i.e., -0.384; boxcox only evaluates the likelihood on a discrete grid of \(\lambda\), whereas powerTransform optimizes numerically.
The LR-tests here show that a log-transform would not be optimal, and that a transformation is surely needed.
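A specific candidate power, e.g., the rounded power -0.5 suggested in the summary, can also be tested with the car function testTransform:
testTransform(bc.car, lambda=-0.5) # LR test that lambda = -0.5 is adequate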
The car package also has a function bcPower to carry out the transformation:
meuse$zinc.bc.car <- bcPower(meuse$zinc, lambda=bc.car$lambda)
hist(meuse$zinc.bc.car, breaks=12); rug(meuse$zinc.bc.car)
Re-run the linear model with this transformation:
summary(m.ols.bc.car <- lm(zinc.bc.car ~ elev + dist, data=meuse))
##
## Call:
## lm(formula = zinc.bc.car ~ elev + dist, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.120725 -0.030940 0.004368 0.029003 0.118037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.673271 0.031865 83.89 < 2e-16 ***
## elev -0.028412 0.004184 -6.79 2.36e-10 ***
## dist -0.227984 0.022405 -10.18 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04661 on 152 degrees of freedom
## Multiple R-squared: 0.671, Adjusted R-squared: 0.6667
## F-statistic: 155 on 2 and 152 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.bc.car)
par(mfrow=c(1,1))
This is also quite good.
Note that this model is conditioned on the data in two ways: (1) the usual estimation of the regression coefficients from noisy data points, and (2) the value of \(\lambda\) selected for the transformation. Condition (2) does not apply to user-selected, a priori, transforms.
The Box-Cox procedure can also be used to transform a single variable, e.g., one that will be used as a predictor in some other model. In that case we want the variable to be transformed to be represented as the response \(Y\); this is accomplished with the null model, i.e., a variable predicted only by its mean: \(Y_i = \beta_0 + \varepsilon_i\). The distribution of the residuals is then the same as that of the variable, just centred on its mean.
For example, supposing that Zn would be used as a predictor in a model, and we would like to transform it to near-normality:
m.null <- lm(zinc ~ 1, data=meuse)
summary(m.null)
##
## Call:
## lm(formula = zinc ~ 1, data = meuse)
##
## Residuals:
## Min 1Q Median 3Q Max
## -356.7 -271.7 -143.7 204.8 1369.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 469.72 29.48 15.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 367.1 on 154 degrees of freedom
plot(m.null, which=2)
qqnorm(meuse$zinc); qqline(meuse$zinc)
hist(meuse$zinc, breaks=12); rug(meuse$zinc)
We see that the model residuals follow the same distribution as the original data values. There is serious skew: far too many small values, and far too few large ones, for the variable to be normally-distributed.
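A quick check (a sketch) that the null-model residuals are indeed just the data values centred on their mean:
# adding the mean back to the residuals should recover the data exactly
all.equal(as.numeric(residuals(m.null)) + mean(meuse$zinc), meuse$zinc)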
Now compute the transformation parameter:
bc.null <- boxcox(m.null)
(bc.power.null <- bc.null$x[which.max(bc.null$y)])
## [1] -0.2626263
summary(bc.power.null.car <- powerTransform(m.null))
## bcPower Transformation to Normality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 -0.2728 -0.5 -0.5168 -0.0287
##
## Likelihood ratio test that transformation parameter is equal to 0
## (log transformation)
## LRT df pval
## LR test, lambda = (0) 4.906319 1 0.026759
##
## Likelihood ratio test that no transformation is needed
## LRT df pval
## LR test, lambda = (1) 112.0677 1 < 2.22e-16
The optimal transformation from MASS is -0.263 and from car is -0.273. As with the linear model residuals of the predictive model of Zn, these are negative, but here closer to a log transformation (\(\lambda = 0\)).
Transform the variable and re-compute the distribution:
meuse$zinc.bc.null <- BCTransform(meuse$zinc, bc.power.null)
qqnorm(meuse$zinc.bc.null); qqline(meuse$zinc.bc.null)
hist(meuse$zinc.bc.null, breaks=12); rug(meuse$zinc.bc.null)
This is somewhat better – at least it is symmetric, although the tails have become too “light”. This is because this variable has a bimodal distribution, which is difficult to see in the original histogram.
The Box-Cox transform should not be applied if another transformation is indicated by theory. An example from Box and Cox (1964) is a chemical reaction, where physical theory suggests that the reaction rate \(r\) should be modelled as \(r \propto c^\alpha \cdot \exp(\beta t)\), where \(c\) is a concentration of the reactants and \(t\) is the temperature. This can be linearized as \(\log r = \alpha \cdot \log c + \beta \cdot t\) and then solved for \(\alpha\) and \(\beta\). This transformation is indicated by theory, not empirically as with Box-Cox.
Also, if the suggested Box-Cox power is close to 0, many analysts use the log-transform even though it is not optimal, for ease of interpretation.
Box, G. E. P., & Cox, D. R. (1964). An Analysis of Transformations (with discussion). Journal of the Royal Statistical Society. Series B (Methodological), 26(2), 211–252.