1 Motivation

Linear models may not meet the assumptions of Ordinary Least Squares (OLS), which are:

  1. Independent residuals
  2. No relation between residuals and fitted values
  3. Normally-distributed residuals
  4. Homoscedastic (equal-variance) residuals, at each fitted value
  5. No high-leverage points with large residuals
  6. Linearity in all predictors, independently (perhaps also with interaction)

  (1) can be dealt with by Generalized Least Squares (GLS), if the nature of the dependence is known (e.g., time or space dependence, repeated measurements on the same object).

  (2) usually indicates a non-linear relation; (3) and (4) may be due to that, but also to unusual values, and especially to skewed predictor or response variables.

For example, using the Meuse soil pollution dataset, we model one of the metal concentrations in the soil (zinc) from the elevation above sea level and the distance from the Maas river:

library(sp); library(gstat)
data(meuse)  # an example dataset supplied with the sp package
names(meuse)
##  [1] "x"       "y"       "cadmium" "copper"  "lead"    "zinc"    "elev"   
##  [8] "dist"    "om"      "ffreq"   "soil"    "lime"    "landuse" "dist.m"
summary(meuse$zinc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   113.0   198.0   326.0   469.7   674.5  1839.0
summary(m.ols <- lm(zinc ~ elev + dist, data=meuse))
## 
## Call:
## lm(formula = zinc ~ elev + dist, data = meuse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -422.11 -160.34  -54.94  129.47 1138.10 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1677.98     177.66   9.445  < 2e-16 ***
## elev         -123.10      23.33  -5.277 4.45e-07 ***
## dist         -846.24     124.92  -6.775 2.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 259.9 on 152 degrees of freedom
## Multiple R-squared:  0.5053, Adjusted R-squared:  0.4988 
## F-statistic: 77.64 on 2 and 152 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols)

par(mfrow=c(1,1))

The diagnostic plots show all of the above-mentioned problems: the residuals are related to the fitted values, right-skewed rather than normal, and heteroscedastic, and there are some high-leverage points with large residuals.

2 The Box-Cox transform

Box and Cox (1964) developed a family of transformations designed to reduce the non-normality of the errors in a linear model, i.e., our problem (3) above. Applying this transform often reduces non-linearity as well, i.e., our problem (2), and heteroscedasticity, i.e., our problem (4).

The linear model is: \[ Y_i=\beta_0+\mathbf{\beta X}_i+\varepsilon_i, \;\;\varepsilon_i\sim\mathcal{N}(0,\sigma^2) \]

The idea is to transform the response variable \(Y\) to a replacement response variable \(Y_i^{(\lambda)}\), leaving the right-hand side of the regression model unchanged, so that the regression residuals become normally-distributed. Note that the regression coefficients will also change, because the response variable has changed; therefore the regression coefficients must be interpreted with respect to the transformed variable. Also, any predictions made with the model have to be back-transformed, to be interpreted in the original units.

The standard (simple) Box-Cox transform is:

\[ Y_i^{(\lambda)}=\begin{cases} \dfrac{Y_i^\lambda-1}{\lambda}&(\lambda\neq0)\\ \log{(Y_i)} & (\lambda=0) \end{cases} \]
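
The logarithm at \(\lambda = 0\) is not arbitrary: it is the continuous limit of the power case, since by L'Hôpital's rule

\[ \lim_{\lambda\to 0}\frac{Y_i^\lambda-1}{\lambda}=\lim_{\lambda\to 0}\frac{Y_i^\lambda\log Y_i}{1}=\log Y_i \]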

The inverse transform is:

\[ Y_i=\begin{cases} \exp \left( \dfrac{\log(1 + \lambda Y_i^{(\lambda)})}{\lambda} \right) & (\lambda\neq0)\\ \exp \left( Y_i^{(\lambda)} \right) & (\lambda=0) \end{cases} \]

Obviously, this depends on a parameter \(\lambda\). Some familiar transformations are associated with certain values of \(\lambda\):

  • \(\lambda\) = 1.00: no transformation needed; the transformed values \(Y_i - 1\) are just a shift of the original data
  • \(\lambda\) = 0.50: square root transformation
  • \(\lambda\) = 0.33: cube root transformation
  • \(\lambda\) = 0.25: fourth root transformation
  • \(\lambda\) = 0.00: natural log transformation
  • \(\lambda\) = -0.50: reciprocal square root transformation
  • \(\lambda\) = -1.00: reciprocal (inverse) transformation
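
For \(\lambda \neq 0\), each of these is a linear rescaling of the familiar power transform, so it orders and spaces the data in the same way; e.g., for \(\lambda = 0.5\), \((\sqrt{Y_i}-1)/0.5 = 2(\sqrt{Y_i}-1)\). A quick check (a sketch, using an arbitrary positive vector):

y <- c(1, 4, 9, 25)               # arbitrary positive values
bc <- (y^0.5 - 1) / 0.5           # Box-Cox transform with lambda = 0.5
all.equal(bc, 2 * (sqrt(y) - 1))  # TRUE: a linear rescaling of sqrt(y)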

This is solved by maximum likelihood, i.e., by finding the value of \(\lambda\) most likely to have given rise to the observed distribution of the model residuals.

The log-likelihood of the linear model, assuming independent observations (i.e., solution by OLS), is:

\[ \begin{aligned} \log\mathcal{L} = &-\frac{n}{2}\log(2\pi)-n\log\sigma\\&-\frac{1}{2\sigma^2}\sum_{i=1}^n\left[Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta X}_i)\right]^2\\&+(\lambda-1)\sum_{i=1}^n\log Y_i \end{aligned} \]

This can also be written:

\[ \log\mathcal{L} \propto -\frac{n}{2}\log\left[\sum_{i=1}^n\left(\frac{Y_i^{(\lambda)}-(\beta_0+\mathbf{\beta X}_i)}{\dot{Y}^{\lambda-1}}\right)^2\right] \]

(the \(\propto\) is because some constants have been dropped) where \(\dot{Y}=\exp\left[\frac{1}{n}\sum_{i=1}^n \log Y_i\right]\), i.e., the exponential of the mean of the logarithms of the variable (its geometric mean).
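
To make the formula concrete, here is a minimal sketch (assuming meuse and the model formula from above) that evaluates this profile log-likelihood, up to an additive constant, on a coarse grid; its maximum should agree with the packaged implementation below, up to the grid resolution:

# profile log-likelihood of lambda for zinc ~ elev + dist, up to a constant
bc.loglik <- function(lambda, y, data) {
    ydot <- exp(mean(log(y)))    # geometric mean of y
    yt <- if (lambda == 0) log(y) else (y^lambda - 1) / lambda
    z <- yt / ydot^(lambda - 1)  # Jacobian-normalized transform
    rss <- sum(residuals(lm(z ~ elev + dist, data = data))^2)
    -(length(y) / 2) * log(rss)
}
lambdas <- seq(-2, 2, by = 0.1)
ll <- sapply(lambdas, bc.loglik, y = meuse$zinc, data = meuse)
lambdas[which.max(ll)]           # expect about -0.4, near the optimum found below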

Fortunately, the function boxcox has been implemented in the MASS package; this computes the log-likelihood at user-defined or default values of \(\lambda\) (from \(-2\) to \(2\) in steps of \(0.1\)); we can then find the optimum, i.e., the maximum log-likelihood.

For the Meuse linear model example:

library(MASS)
bc <- boxcox(m.ols)

str(bc) # numbers that make up the graph
## List of 2
##  $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
##  $ y: num [1:100] -346 -343 -339 -335 -331 ...
(bc.power <- bc$x[which.max(bc$y)])
## [1] -0.3838384

The optimal transformation is -0.384, i.e., towards an inverse power. Note that the 95% confidence interval shown on the plot does not include 1 (no transformation), so a transformation is indicated. A logarithmic transformation (\(\lambda = 0\)) would not be optimal.
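
The default grid is fairly coarse; boxcox also accepts a user-supplied vector of \(\lambda\) values, so the optimum can be refined on a narrower, finer grid, e.g.:

# refine the search; plotit=FALSE suppresses the likelihood plot
bc.fine <- boxcox(m.ols, lambda = seq(-0.8, 0.2, by = 0.005), plotit = FALSE)
bc.fine$x[which.max(bc.fine$y)]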

3 Transformation and back-transformation

Now we need to transform the response variable accordingly, and re-compute the linear model, with this transformed variable.

We can write a function corresponding to the formula:

BCTransform <- function(y, lambda=0) {
    # forward Box-Cox transform; natural log in the limiting case lambda = 0
    if (lambda == 0) { log(y) }
    else { (y^lambda - 1) / lambda }
    }

And of course we need a reverse transformation:

BCTransformInverse <- function(yt, lambda=0) {
    # inverse Box-Cox transform, following the formula above
    if (lambda == 0) { exp(yt) }
    else { exp(log(1 + lambda * yt)/lambda) }
    }
# testing
yt <- BCTransform(meuse$zinc, 0)
yo <- BCTransformInverse(yt, 0)
unique(round(yo - meuse$zinc, 8))
## [1] 0
yt <- BCTransform(meuse$zinc, .5)
yo <- BCTransformInverse(yt, .5)
unique(round(yo - meuse$zinc, 8))
## [1] 0

4 Applying the transformation

Use this function with the optimal value of \(\lambda\) to transform the response variable:

hist(meuse$zinc, breaks=12); rug(meuse$zinc)

meuse$zinc.bc <- BCTransform(meuse$zinc, bc.power)
hist(meuse$zinc.bc, breaks=12); rug(meuse$zinc.bc)

The skew has been removed; the transformed target is clearly bi-modal.

Compare to the log-transform:

hist(log(meuse$zinc), breaks=12); rug(log(meuse$zinc))

But the test of the transformation is the distribution of the linear model residuals, not of the target variable.

Re-run the linear model with transformation:

summary(m.ols.bc <- lm(zinc.bc ~ elev + dist, data=meuse))
## 
## Call:
## lm(formula = zinc.bc ~ elev + dist, data = meuse)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.111369 -0.028361  0.004142  0.027047  0.109199 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.586950   0.029348  88.148  < 2e-16 ***
## elev        -0.026144   0.003854  -6.784 2.44e-10 ***
## dist        -0.210349   0.020635 -10.194  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04293 on 152 degrees of freedom
## Multiple R-squared:  0.6714, Adjusted R-squared:  0.667 
## F-statistic: 155.3 on 2 and 152 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.bc)

par(mfrow=c(1,1))

The residuals are now close to normally-distributed and less related to the fitted values; the apparent heteroscedasticity is mostly due to the uneven numbers of observations at different fitted values.
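
Recall that predictions from this model are in transformed units and must be back-transformed for interpretation in the original units; for example, with the inverse function defined above:

# predictions are in Box-Cox units; back-transform to the original units
pred.bc <- predict(m.ols.bc, newdata = meuse)
summary(BCTransformInverse(pred.bc, bc.power))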

Compare this to the linear model with a log-transformation:

summary(m.ols.log <- lm(log(zinc) ~ elev + dist, data=meuse))
## 
## Call:
## lm(formula = log(zinc) ~ elev + dist, data = meuse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03595 -0.27154 -0.00249  0.25329  1.10223 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.48453    0.29312  28.945  < 2e-16 ***
## elev        -0.26065    0.03849  -6.772  2.6e-10 ***
## dist        -1.96003    0.20610  -9.510  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4288 on 152 degrees of freedom
## Multiple R-squared:  0.6518, Adjusted R-squared:  0.6472 
## F-statistic: 142.3 on 2 and 152 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.log)

par(mfrow=c(1,1))
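
As a rough numerical summary of the three models (no substitute for the diagnostic plots), we can apply a Shapiro-Wilk normality test to each set of residuals; a larger p-value indicates residuals more compatible with normality. A quick sketch:

# Shapiro-Wilk p-values for residual normality: untransformed, Box-Cox, log
sapply(list(original = m.ols, boxcox = m.ols.bc, log = m.ols.log),
       function(m) shapiro.test(residuals(m))$p.value)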

5 Another transformation function

The powerTransform function of the car package also uses the maximum-likelihood-like approach outlined by Box and Cox (1964) to select a transformation of a univariate or multivariate response for normality, linearity and/or constant variance. The summary method automatically computes two or three likelihood-ratio tests concerning the transformation powers.

We use this on the Meuse Zn linear model:

require(car)
## Loading required package: car
## Loading required package: carData
(bc.car <- powerTransform(m.ols))
## Estimated transformation parameter 
##        Y1 
## -0.369844
str(bc.car, max.level=1)
## List of 15
##  $ value      : num 773
##  $ counts     : Named int [1:2] 3 3
##   ..- attr(*, "names")= chr [1:2] "function" "gradient"
##  $ convergence: int 0
##  $ message    : chr "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
##  $ hessian    : num [1, 1] 93.2
##  $ start      : num -0.37
##  $ lambda     : Named num -0.37
##   ..- attr(*, "names")= chr "Y1"
##  $ roundlam   : Named num -0.5
##   ..- attr(*, "names")= chr "Y1"
##  $ invHess    : num [1, 1] 0.0107
##  $ llik       : num 773
##  $ family     : chr "bcPower"
##  $ xqr        :List of 4
##   ..- attr(*, "class")= chr "qr"
##  $ y          : num [1:155, 1] 1022 1141 640 257 269 ...
##   ..- attr(*, "dimnames")=List of 2
##  $ x          : num [1:155, 1:3] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "dimnames")=List of 2
##   ..- attr(*, "assign")= int [1:3] 0 1 2
##  $ weights    : num [1:155] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "class")= chr "powerTransform"
print(bc.car$lambda)
##        Y1 
## -0.369844
summary(bc.car)
## bcPower Transformation to Normality 
##    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1   -0.3698        -0.5      -0.5729      -0.1668
## 
## Likelihood ratio test that transformation parameter is equal to 0
##  (log transformation)
##                            LRT df       pval
## LR test, lambda = (0) 13.09709  1 0.00029576
## 
## Likelihood ratio test that no transformation is needed
##                            LRT df       pval
## LR test, lambda = (1) 174.6859  1 < 2.22e-16

We see that this estimate from car (-0.370) is not quite the same as the estimate from the boxcox function in MASS (-0.384); boxcox evaluates the likelihood on a discrete grid, whereas powerTransform optimizes \(\lambda\) numerically.

The LR-tests here show that a log-transform would not be optimal, and that a transformation is surely needed.

The car package also has a function bcPower to carry out the transformation:

meuse$zinc.bc.car <- bcPower(meuse$zinc, lambda=bc.car$lambda)
hist(meuse$zinc.bc.car, breaks=12); rug(meuse$zinc.bc.car)

Re-run the linear model with this transformation:

summary(m.ols.bc.car <- lm(zinc.bc.car ~ elev + dist, data=meuse))
## 
## Call:
## lm(formula = zinc.bc.car ~ elev + dist, data = meuse)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.120725 -0.030940  0.004368  0.029003  0.118037 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.673271   0.031865   83.89  < 2e-16 ***
## elev        -0.028412   0.004184   -6.79 2.36e-10 ***
## dist        -0.227984   0.022405  -10.18  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04661 on 152 degrees of freedom
## Multiple R-squared:  0.671,  Adjusted R-squared:  0.6667 
## F-statistic:   155 on 2 and 152 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m.ols.bc.car)

par(mfrow=c(1,1))

This is also quite good.
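
This is expected: although the two estimates of \(\lambda\) differ slightly, the two transformed responses are almost perfectly linearly related, so the fitted models are nearly equivalent. A quick check:

# the two transformed responses should be very nearly collinear
cor(meuse$zinc.bc, meuse$zinc.bc.car)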

6 Analyzing the linear model of the transformed variable

Note that this model is conditioned on the data in two ways: (1) the usual estimation of the coefficients from noisy data points, and (2) the value of \(\lambda\) selected for the transformation. Condition (2) does not apply to user-selected, a priori transforms; it also means the reported standard errors are somewhat optimistic, since they do not account for the uncertainty in the estimated \(\lambda\).
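
One way to appreciate condition (2) is a small bootstrap sketch (assuming car is loaded, as above): re-estimate \(\lambda\) on resampled datasets and examine its spread, which the reported standard errors ignore:

# bootstrap the estimated lambda to see how much it depends on the sample
set.seed(314)   # arbitrary seed, for reproducibility
lambda.boot <- replicate(200, {
    i <- sample(nrow(meuse), replace = TRUE)
    coef(powerTransform(lm(zinc ~ elev + dist, data = meuse[i, ])))
})
quantile(lambda.boot, c(0.025, 0.5, 0.975))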

7 Transforming a predictor

In this case we want the predictor (to be transformed) to be represented as the response \(Y\); this is accomplished with the null model, i.e., a variable predicted only by its mean: \(Y_i = \beta_0 + \varepsilon_i\). The distribution of the residuals is the same as that of the variable, just centred at its mean.

For example, supposing that Zn would be used as a predictor in a model, and we would like to transform it to near-normality:

m.null <- lm(zinc ~ 1, data=meuse)
summary(m.null)
## 
## Call:
## lm(formula = zinc ~ 1, data = meuse)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -356.7 -271.7 -143.7  204.8 1369.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   469.72      29.48   15.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 367.1 on 154 degrees of freedom
plot(m.null, which=2)

qqnorm(meuse$zinc); qqline(meuse$zinc)

hist(meuse$zinc, breaks=12); rug(meuse$zinc)

We see that the model residuals follow the same distribution as the original data values. There is serious skew: far too many small values, and far too few large ones, for a normal distribution.
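
Indeed, the null-model residuals are exactly the variable centred on its mean; a quick check:

# null-model residuals are just the mean-centred variable
all.equal(unname(residuals(m.null)), meuse$zinc - mean(meuse$zinc))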

Now compute the transformation parameter:

bc.null <- boxcox(m.null)

(bc.power.null <- bc.null$x[which.max(bc.null$y)])
## [1] -0.2626263
summary(bc.power.null.car <- powerTransform(m.null))
## bcPower Transformation to Normality 
##    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1   -0.2728        -0.5      -0.5168      -0.0287
## 
## Likelihood ratio test that transformation parameter is equal to 0
##  (log transformation)
##                            LRT df     pval
## LR test, lambda = (0) 4.906319  1 0.026759
## 
## Likelihood ratio test that no transformation is needed
##                            LRT df       pval
## LR test, lambda = (1) 112.0677  1 < 2.22e-16

The optimal transformation from MASS is -0.263 and from car is -0.273. As with the residuals of the predictive model for Zn, these are negative, but here closer to a log transformation (\(\lambda = 0\)).

Transform the variable and re-compute the distribution:

meuse$zinc.bc.null <- BCTransform(meuse$zinc, bc.power.null)
qqnorm(meuse$zinc.bc.null); qqline(meuse$zinc.bc.null)

hist(meuse$zinc.bc.null, breaks=12); rug(meuse$zinc.bc.null)

This is somewhat better – at least it is symmetric, although the tails have become too “light”. This is because this variable has a bimodal distribution, which is difficult to see in the original histogram.

8 Caution

The Box-Cox transform should not be applied if other transformations are indicated by theory. An example from Box and Cox (1964) is a chemical reaction, where physical theory suggests that the reaction rate \(r\) should be modelled as \(r \propto c^\alpha \exp(\beta t)\), where \(c\) is the concentration of the reactants and \(t\) is the temperature. This can be linearized as \(\log r = \alpha \log c + \beta t + \text{const.}\) and then solved for \(\alpha\) and \(\beta\). This transformation is indicated by theory, not chosen empirically as with Box-Cox.
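
To illustrate with a toy simulation (hypothetical parameter values \(\alpha = 1.5\) and \(\beta = 0.02\), not taken from Box and Cox): generate data from the theoretical model and recover the parameters with the theory-indicated linearization, rather than an empirical Box-Cox fit:

# simulate the theoretical reaction-rate model with multiplicative noise
set.seed(7)                             # arbitrary seed
conc <- runif(50, 0.5, 5)               # reactant concentration (hypothetical)
temp <- runif(50, 280, 350)             # temperature (hypothetical)
rate <- conc^1.5 * exp(0.02 * temp) * exp(rnorm(50, sd = 0.1))
coef(lm(log(rate) ~ log(conc) + temp))  # recovers alpha ~ 1.5, beta ~ 0.02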

Also, if the suggested Box-Cox power is close to 0, many analysts use the log-transform even though it is not optimal, for ease of interpretation.

9 References

Box, G. E. P., & Cox, D. R. (1964). An Analysis of Transformations (with discussion). Journal of the Royal Statistical Society. Series B (Methodological), 26(2), 211–252.