This small example explains the concept of variance and how it is affected by classification, i.e., dividing a population into more homogeneous sub-populations.
The variance \(s^2\) of a set of \(n\) numbers is the average squared deviation \((z_i - \bar{z})^2\) from the mean \(\bar{z}\) of the set, using the divisor \((n-1)\) rather than \(n\) to correct the bias of the sample estimate:
\[s^2 = \frac{1}{(n-1)} \sum_{i=1}^n (z_i - \bar{z})^2\]
It measures the overall variability of the dataset. The sample \(s^2\) is an estimate of the (unknown) population variance \(\sigma^2\) – this assumes a simple random sample (SRS) from the population!
For example, here are samples of \(n = 12\) from two simulated populations of heights, normally-distributed, with known \(\mu = 165\) and variance \(\sigma^2 = 12\) (population 1), and \(\mu = 150\) and variance \(\sigma^2 = 48\) (population 2). Population 2 is more variable, perhaps all the children in grades 1–12, whereas population 1 is only the children in grade 12.
n <- 12
sort(sample.1 <- rnorm(n, mean=165, sd=sqrt(12)))
## [1] 158.2938 160.2751 161.6946 163.0842 163.5987 163.9856 165.2202
## [8] 165.6517 166.0777 166.6068 166.8974 170.4908
sort(sample.2 <- rnorm(n, mean=150, sd=sqrt(48)))
## [1] 144.1384 144.1777 145.3618 145.9731 146.0455 148.9586 150.9867
## [8] 151.4841 152.5590 153.6113 162.7390 169.2977
If we only had these samples, we would estimate the variance as:
1/(n-1) * sum((sample.1 - mean(sample.1))^2)
## [1] 10.6684
(v.1 <- var(sample.1))
## [1] 10.6684
1/(n-1) * sum((sample.2 - mean(sample.2))^2)
## [1] 60.05138
(v.2 <- var(sample.2))
## [1] 60.05138
It is quite unlikely that with a small sample we would accurately estimate the true mean and variance.
You can run the above two code chunks several times to see the variability of the estimates of a population variance from a sample variance.
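Alternatively, here is a minimal sketch (the 1000 replicates are an arbitrary choice) that summarizes this variability directly, by computing the sample variance of many simulated samples from population 1:
# sampling distribution of s^2 for samples of size n from a population with sigma^2 = 12
summary(replicate(1000, var(rnorm(n, mean=165, sd=sqrt(12)))))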
The confidence interval within which we believe the true population variance (which we suppose we do not know) lies, with a probability that we assign, is computed from the sample variance as:
\[ \frac{(n-1) s^2}{\chi^2_{\alpha/2, (n-1)}} \le \sigma^2 \le \frac{(n-1) s^2}{\chi^2_{1-\alpha/2, (n-1)}} \]
where \(\alpha\) is the probability of a Type I error, i.e., the probability that the interval does not cover the true population variance; the interval covers it with confidence \(1-\alpha\). For these two samples (here a 90% interval, using the 0.05 and 0.95 quantiles of the \(\chi^2\) distribution):
c(((n-1)*v.1)/qchisq(0.95, df=(n-1)), ((n-1)*v.1)/qchisq(0.05, df=(n-1)))
## [1] 5.96450 25.65184
c(((n-1)*v.2)/qchisq(0.95, df=(n-1)), ((n-1)*v.2)/qchisq(0.05, df=(n-1)))
## [1] 33.5736 144.3917
The sample size does not affect the true population variance, but it does affect the confidence interval in which we think the true variance lies:
These all estimate the same population variance \(\sigma^2 = 12\), although the individual estimates \(s^2\) will differ:
(v.1 <- var(sample.1 <- rnorm(n, mean=165, sd=sqrt(12))))
## [1] 17.22879
(v.2 <- var(sample.2 <- rnorm(2*n, mean=165, sd=sqrt(12))))
## [1] 13.66656
(v.3 <- var(sample.3 <- rnorm(3*n, mean=165, sd=sqrt(12))))
## [1] 10.4533
(v.4 <- var(sample.4 <- rnorm(4*n, mean=165, sd=sqrt(12))))
## [1] 16.5499
But the confidence intervals will in general get narrower. Note that the degrees of freedom must match each sample's size:
c(((n-1)*v.1)/qchisq(0.95, df=(n-1)), ((n-1)*v.1)/qchisq(0.05, df=(n-1)))
## [1] 9.63229 41.42609
c(((2*n-1)*v.2)/qchisq(0.95, df=(2*n-1)), ((2*n-1)*v.2)/qchisq(0.05, df=(2*n-1)))
c(((3*n-1)*v.3)/qchisq(0.95, df=(3*n-1)), ((3*n-1)*v.3)/qchisq(0.05, df=(3*n-1)))
c(((4*n-1)*v.4)/qchisq(0.95, df=(4*n-1)), ((4*n-1)*v.4)/qchisq(0.05, df=(4*n-1)))
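To separate the effect of sample size from the randomness of \(s^2\), note that the width of the interval relative to \(s^2\) depends only on the degrees of freedom. A minimal sketch (the sequence of sample sizes is an arbitrary choice):
# multipliers of s^2 giving the lower and upper 90% limits; both approach 1 as n grows
n.seq <- c(12, 24, 36, 48, 120)
cbind(n = n.seq,
      lower = (n.seq - 1) / qchisq(0.95, df = n.seq - 1),
      upper = (n.seq - 1) / qchisq(0.05, df = n.seq - 1))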
If there is some attribute of each sampling unit that affects the value, this may allow a reduction of variance, by using this attribute to divide the population. Some of the total variance is explained by the attribute. We can see this in a typical analysis of variance of an experiment.
This is one of the sample R data sets: an N, P, K factorial experiment on pea yield.
data(npk)
summary(npk)
## block N P K yield
## 1:4 0:12 0:12 0:12 Min. :44.20
## 2:4 1:12 1:12 1:12 1st Qu.:49.73
## 3:4 Median :55.65
## 4:4 Mean :54.88
## 5:4 3rd Qu.:58.62
## 6:4 Max. :69.50
The overall sample variance:
var(npk$yield)
## [1] 38.10283
and the sample variances for two subsets, defined by the two levels of nitrogen (N):
by(npk$yield, npk$N, var)
## npk$N: 0
## [1] 28.92242
## --------------------------------------------------------
## npk$N: 1
## [1] 33.5397
In this case the sub-populations are less variable than the overall population. This is because the means are different:
mean(npk$yield)
## [1] 54.875
by(npk$yield, npk$N, mean)
## npk$N: 0
## [1] 52.06667
## --------------------------------------------------------
## npk$N: 1
## [1] 57.68333
and the values in each set are closer to that set's mean than to the overall mean.
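A quick numerical check of this claim, as a sketch: the pooled within-group variance (which reappears below as the residual mean square of the analysis of variance) is smaller than the overall variance:
# within-group sum of squares, pooled over the two levels of N
(ss.within <- sum(tapply(npk$yield, npk$N, function(z) sum((z - mean(z))^2))))
ss.within / (nrow(npk) - 2)   # pooled within-group variance (two group means estimated)
var(npk$yield)                # overall variance, for comparison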
We can see this graphically:
require(ggplot2)
## Loading required package: ggplot2
qplot(yield, data=npk, fill=N)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice that the yields for N=0 are lower, N=1 are higher, and both have a narrower range than the full dataset.
The analysis of variance shows how much of the variance is explained by the factor:
summary(aov(yield ~ N, data=npk))
## Df Sum Sq Mean Sq F value Pr(>F)
## N 1 189.3 189.28 6.061 0.0221 *
## Residuals 22 687.1 31.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice how a portion of the total sum of squares (TSS) is accounted for by the factor; the remainder is the residual sum of squares (RSS). These can be converted to variances by dividing each by its respective degrees of freedom.
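As a sketch of that conversion (the object name a is only for illustration), the sums of squares can be extracted from the ANOVA table and divided by their degrees of freedom:
a <- anova(aov(yield ~ N, data=npk))
sum(a[["Sum Sq"]])                   # total sum of squares (TSS)
sum(a[["Sum Sq"]]) / sum(a[["Df"]])  # TSS/(n-1), i.e., the overall variance of yield
a["Residuals", "Mean Sq"]            # residual variance, after accounting for N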
The lm method shows this in a somewhat different way:
summary(lm(yield ~ N, data=npk))
##
## Call:
## lm(formula = yield ~ N, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8833 -3.7667 0.1667 3.5583 11.8167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.067 1.613 32.274 <2e-16 ***
## N1 5.617 2.281 2.462 0.0221 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.588 on 22 degrees of freedom
## Multiple R-squared: 0.216, Adjusted R-squared: 0.1803
## F-statistic: 6.061 on 1 and 22 DF, p-value: 0.02213
The multiple \(R^2\) is the proportion of the total sum of squares explained by the factor; the adjusted \(R^2\) is the corresponding proportion of variance, i.e., using the mean squares corrected for degrees of freedom.
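As a check, a minimal sketch (the object name m.n is only for illustration): the adjusted \(R^2\) equals one minus the ratio of the residual variance to the overall variance:
m.n <- lm(yield ~ N, data=npk)
s2.resid <- sum(residuals(m.n)^2) / df.residual(m.n)  # residual variance
1 - s2.resid / var(npk$yield)                         # = the adjusted R-squared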
Fit a linear model of tree volume based on its girth (circumference) and height:
data(trees)
summary(trees)
## Girth Height Volume
## Min. : 8.30 Min. :63 Min. :10.20
## 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
## Median :12.90 Median :76 Median :24.20
## Mean :13.25 Mean :76 Mean :30.17
## 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
## Max. :20.60 Max. :87 Max. :77.00
(m.v.gh <- lm(Volume ~ Girth*Height, data=trees))
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Coefficients:
## (Intercept) Girth Height Girth:Height
## 69.3963 -5.8558 -1.2971 0.1347
summary(m.v.gh)
##
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## Girth -5.85585 1.92134 -3.048 0.00511 **
## Height -1.29708 0.30984 -4.186 0.00027 ***
## Girth:Height 0.13465 0.02438 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
The model is good but not perfect: each coefficient is estimated fairly precisely, but with some uncertainty. So when we use this model to predict the volume of a newly-observed tree (from its girth and height), our prediction is also uncertain:
(volume <- predict(m.v.gh, newdata=data.frame(Girth=14, Height=70), se.fit=TRUE, interval="prediction"))
## $fit
## fit lwr upr
## 1 28.57992 22.67534 34.48449
##
## $se.fit
## [1] 0.9720972
##
## $df
## [1] 27
##
## $residual.scale
## [1] 2.70855
The se.fit here is the standard error of the fitted value, i.e., the square root of \(\sigma^2 \mathbf{X^*} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X^*}'\).
The prediction variance for a single new observation also includes the residual variance \(\sigma^2\):
\[ \sigma^2 \left[1 + \mathbf{X^*} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X^*}' \right] \]
where \(\mathbf{X}\) is the design (model) matrix and \(\mathbf{X^*}\) is the row vector of predictor values at the point to be predicted, here \([1, 14, 70, 14 \times 70]\) (the intercept, the two predictors, and their interaction).
The design matrix here is:
(mm <- model.matrix(m.v.gh))
## (Intercept) Girth Height Girth:Height
## 1 1 8.3 70 581.0
## 2 1 8.6 65 559.0
## 3 1 8.8 63 554.4
## 4 1 10.5 72 756.0
## 5 1 10.7 81 866.7
## 6 1 10.8 83 896.4
## 7 1 11.0 66 726.0
## 8 1 11.0 75 825.0
## 9 1 11.1 80 888.0
## 10 1 11.2 75 840.0
## 11 1 11.3 79 892.7
## 12 1 11.4 76 866.4
## 13 1 11.4 76 866.4
## 14 1 11.7 69 807.3
## 15 1 12.0 75 900.0
## 16 1 12.9 74 954.6
## 17 1 12.9 85 1096.5
## 18 1 13.3 86 1143.8
## 19 1 13.7 71 972.7
## 20 1 13.8 64 883.2
## 21 1 14.0 78 1092.0
## 22 1 14.2 80 1136.0
## 23 1 14.5 74 1073.0
## 24 1 16.0 72 1152.0
## 25 1 16.3 77 1255.1
## 26 1 17.3 81 1401.3
## 27 1 17.5 82 1435.0
## 28 1 17.9 80 1432.0
## 29 1 18.0 80 1440.0
## 30 1 18.0 80 1440.0
## 31 1 20.6 87 1792.2
## attr(,"assign")
## [1] 0 1 2 3
The inverse of its cross-product, \((\mathbf{X}'\mathbf{X})^{-1}\), is:
(mm2.i <- solve(t(mm) %*% mm))
## (Intercept) Girth Height Girth:Height
## (Intercept) 77.44334157 -5.983063010 -1.00093404 7.662883e-02
## Girth -5.98306301 0.503191000 0.07603970 -6.354863e-03
## Height -1.00093404 0.076039705 0.01308607 -9.843500e-04
## Girth:Height 0.07662883 -0.006354863 -0.00098435 8.100241e-05
This is pre- and post-multiplied by the predictor values. Note that a vector in R is interpreted as a column vector in matrix operations, so to get the row vector \(\mathbf{X^*}\) we take its transpose.
# the intercept is set to 1, the others calculated from the values of predictors
(obs <- c(1, 14, 70, 14*70))
## [1] 1 14 70 980
(tmp <- c(t(obs) %*% mm2.i %*% obs)) # the quadratic form X* (X'X)^{-1} X*'
## [1] 0.1288088
Completing the calculation:
# estimate of \sigma^2
(sig2 <- c(crossprod(residuals(m.v.gh))) / df.residual(m.v.gh))
## [1] 7.336243
c(sig2*(1 + tmp)) # the prediction variance for a single new observation
## [1] 8.281216
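As a check, a small sketch (assuming the objects volume, sig2, and tmp computed above are still in the workspace): the default 95% prediction interval returned by predict.lm is the fitted value plus or minus a t-quantile times the square root of this prediction variance:
pred.se.new <- sqrt(sig2 * (1 + tmp))   # standard error for predicting a new observation
volume$fit[1, "fit"] + c(-1, 1) * qt(0.975, df=df.residual(m.v.gh)) * pred.se.new
# compare with the lwr and upr columns of the predict() output above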
Here is some matrix algebra to compute the variance of the fitted value (the square of the se.fit above) directly from the linear model, using its QR decomposition.
tm <- delete.response(terms(m.v.gh)) # the right-hand side of the formula
(Xp <- model.matrix(tm, data=data.frame(Girth=14, Height=70))) # make a design matrix from the predictors
## (Intercept) Girth Height Girth:Height
## 1 1 14 70 980
## attr(,"assign")
## [1] 0 1 2 3
(pred <- c(Xp %*% coef(m.v.gh))) # directly compute the prediction
## [1] 28.57992
QR <- m.v.gh$qr # the QR decomposition of the model matrix, X = QR
# solve the lower-triangular system R'B = X*' for B, so that B'B = X* (X'X)^{-1} X*'
(B <- forwardsolve(t(QR$qr), t(Xp), k=QR$rank))
## [,1]
## [1,] -0.17960530
## [2,] 0.04372819
## [3,] -0.22774013
## [4,] -0.20681650
# estimate \sigma^2
(sig2 <- c(crossprod(residuals(m.v.gh))) / df.residual(m.v.gh)) # sum of squared residuals divided by the residual d.f.
## [1] 7.336243
sqrt(sig2) # we saw this in the summary() above, it is our estimate of \sigma
## [1] 2.70855
(pred.var <- c(crossprod(B) * sig2)) # variance of the fitted value, sigma^2 * X* (X'X)^{-1} X*'
## [1] 0.9449731
(pred.se <- sqrt(pred.var)) # we saw this in the results of predict.lm with se.fit=TRUE
## [1] 0.9720972
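Finally, as a sketch, we can confirm that the quadratic form computed via the QR decomposition agrees with the direct matrix computation above, and that the standard error matches the one reported by predict.lm:
c(crossprod(B)) - tmp      # B'B equals X* (X'X)^{-1} X*'; difference should be ~0
pred.se - volume$se.fit    # matches se.fit from predict.lm; difference should be ~0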