This small example explains the concept of variance and how it is affected by classification.

Definition of the variance of a sample

The variance \(s^2\) of a set of \(n\) numbers is the average squared deviation \((z_i - \bar{z})^2\) from the mean \(\bar{z}\) of the set, with the divisor \((n-1)\) rather than \(n\) as a small-sample correction:

\[s^2 = \frac{1}{(n-1)} \sum_{i=1}^n (z_i - \bar{z})^2\]

It measures the overall variability of the dataset. The sample \(s^2\) is an estimate of the (unknown) population variance \(\sigma^2\) – this assumes a simple random sample (SRS) from the population!
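
To see why the divisor is \((n-1)\) rather than \(n\), the two versions can be compared by simulation; this is a minimal sketch, not part of the example that follows:

# compare the n and (n-1) divisors over many repeated samples of size 12
sims <- replicate(10000, {
  z <- rnorm(12, mean=165, sd=sqrt(12))
  c(divisor.n = sum((z - mean(z))^2)/12, divisor.n.minus.1 = var(z))
})
rowMeans(sims)  # the (n-1) version should average close to the true sigma^2 = 12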

For example, here are samples of \(n = 12\) from two simulated populations of heights, normally distributed: population 1 with a known \(\mu = 165\) and variance \(\sigma^2 = 12\), and population 2 with \(\mu = 150\) and variance \(\sigma^2 = 48\). Population 2 is more variable (and has a lower mean), perhaps all the children in grades 1–12, whereas population 1 is only the children in grade 12.

n <- 12
sort(sample.1 <- rnorm(n, mean=165, sd=sqrt(12)))
##  [1] 158.2938 160.2751 161.6946 163.0842 163.5987 163.9856 165.2202
##  [8] 165.6517 166.0777 166.6068 166.8974 170.4908
sort(sample.2 <- rnorm(n, mean=150, sd=sqrt(48)))
##  [1] 144.1384 144.1777 145.3618 145.9731 146.0455 148.9586 150.9867
##  [8] 151.4841 152.5590 153.6113 162.7390 169.2977

If we only had these samples, we would estimate the variance as:

1/(n-1) * sum((sample.1 - mean(sample.1))^2)
## [1] 10.6684
(v.1 <- var(sample.1))
## [1] 10.6684
1/(n-1) * sum((sample.2 - mean(sample.2))^2)
## [1] 60.05138
(v.2 <- var(sample.2))
## [1] 60.05138

Confidence intervals

With such a small sample, it is quite unlikely that we would accurately estimate the true mean and variance.

You can run the above two code chunks several times to see the variability of the estimates of a population variance from a sample variance.
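
Alternatively, the whole sampling experiment can be repeated many times in code to see the sampling distribution of \(s^2\); a minimal sketch:

# the sampling distribution of s^2 for samples of n = 12 from population 1
v.reps <- replicate(1000, var(rnorm(n, mean=165, sd=sqrt(12))))
summary(v.reps)
hist(v.reps, main="Sample variances, n = 12, true variance 12", xlab="s^2")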

The confidence interval within which the true population variance (which we suppose we do not know) lies, between a lower and an upper limit with a probability assigned by us, is computed from the sample variance as:

\[ \frac{(n-1) s^2}{\chi^2_{\alpha/2, (n-1)}} \le \sigma^2 \le \frac{(n-1) s^2}{\chi^2_{1-\alpha/2, (n-1)}} \]

where \(\alpha\) is the probability of a Type I error, i.e., the probability that an interval constructed in this way fails to cover the true population variance. For these two samples, with \(\alpha = 0.10\) (a 90% confidence interval):

c(((n-1)*v.1)/qchisq(0.95, df=(n-1)), ((n-1)*v.1)/qchisq(0.05, df=(n-1)))
## [1]  5.96450 25.65184
c(((n-1)*v.2)/qchisq(0.95, df=(n-1)), ((n-1)*v.2)/qchisq(0.05, df=(n-1)))
## [1]  33.5736 144.3917
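
As a check on the interpretation of \(\alpha\), we can simulate many samples from population 1 and count how often the interval covers the true \(\sigma^2 = 12\); a minimal sketch:

# empirical coverage of the 90% interval for sigma^2
covered <- replicate(1000, {
  s2 <- var(rnorm(n, mean=165, sd=sqrt(12)))
  lower <- (n-1)*s2/qchisq(0.95, df=n-1)
  upper <- (n-1)*s2/qchisq(0.05, df=n-1)
  (lower <= 12) && (12 <= upper)
})
mean(covered)  # should be close to 0.90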

Effect of sample size

The sample size does not affect the true population variance, but it does affect the confidence interval in which we think the true variance lies:

These should all estimate the same \(\sigma^2\), although the individual estimates \(s^2\) will vary:

(v.1 <- var(sample.1 <- rnorm(n, mean=165, sd=sqrt(12))))
## [1] 17.22879
(v.2 <- var(sample.2 <- rnorm(2*n, mean=165, sd=sqrt(12))))
## [1] 13.66656
(v.3 <- var(sample.3 <- rnorm(3*n, mean=165, sd=sqrt(12))))
## [1] 10.4533
(v.4 <- var(sample.4 <- rnorm(4*n, mean=165, sd=sqrt(12))))
## [1] 16.5499

But the confidence intervals will in general get narrower, provided the degrees of freedom used for the \(\chi^2\) quantiles match each sample's size:

c(((n-1)*v.1)/qchisq(0.95, df=(n-1)), ((n-1)*v.1)/qchisq(0.05, df=(n-1)))
## [1]  9.63229 41.42609
c(((2*n-1)*v.2)/qchisq(0.95, df=(2*n-1)), ((2*n-1)*v.2)/qchisq(0.05, df=(2*n-1)))
c(((3*n-1)*v.3)/qchisq(0.95, df=(3*n-1)), ((3*n-1)*v.3)/qchisq(0.05, df=(3*n-1)))
c(((4*n-1)*v.4)/qchisq(0.95, df=(4*n-1)), ((4*n-1)*v.4)/qchisq(0.05, df=(4*n-1)))
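
The same calculation can be wrapped in a small helper so that the degrees of freedom always match the sample size; var.ci is a name introduced here only for illustration:

# a sketch of a helper: chi-squared confidence interval for a population variance
var.ci <- function(x, alpha = 0.10) {
  df <- length(x) - 1
  s2 <- var(x)
  c(lower = df*s2/qchisq(1 - alpha/2, df), upper = df*s2/qchisq(alpha/2, df))
}
rbind(var.ci(sample.1), var.ci(sample.2), var.ci(sample.3), var.ci(sample.4))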

Reducing variance by subpopulations

If there is some attribute of each sampling unit that affects its value, the variance may be reduced by using that attribute to divide the population into sub-populations: some of the total variance is then explained by the attribute. We can see this in a typical analysis of variance of an experiment.

This is one of the sample R datasets: npk, a factorial experiment on the effect of N, P, and K fertilizers on pea yield.

data(npk)
summary(npk)
##  block N      P      K          yield      
##  1:4   0:12   0:12   0:12   Min.   :44.20  
##  2:4   1:12   1:12   1:12   1st Qu.:49.73  
##  3:4                        Median :55.65  
##  4:4                        Mean   :54.88  
##  5:4                        3rd Qu.:58.62  
##  6:4                        Max.   :69.50

The overall sample variance:

var(npk$yield)
## [1] 38.10283

and the sample variances for two subsets, the two levels of nitrogen (N):

by(npk$yield, npk$N, var)
## npk$N: 0
## [1] 28.92242
## -------------------------------------------------------- 
## npk$N: 1
## [1] 33.5397

In this case the sub-populations are less variable than the overall population. This is because the means are different:

mean(npk$yield)
## [1] 54.875
by(npk$yield, npk$N, mean)
## npk$N: 0
## [1] 52.06667
## -------------------------------------------------------- 
## npk$N: 1
## [1] 57.68333

and the values in each set are closer to that set’s mean than to the overall mean.
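
As a quick numerical check (a minimal sketch), the total sum of squares can be split into a within-group part and a between-group part; the between-group part is the variation explained by N, and the two parts reappear below as the Residuals and N sums of squares in the analysis of variance:

# partition the total sum of squares of yield by the levels of N
ss.total <- sum((npk$yield - mean(npk$yield))^2)
ss.within <- sum(tapply(npk$yield, npk$N, function(x) sum((x - mean(x))^2)))
c(total = ss.total, within = ss.within, between = ss.total - ss.within)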

We can see this graphically:

require(ggplot2)
## Loading required package: ggplot2
qplot(yield, data=npk, fill=N)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice that the yields for N=0 are lower, N=1 are higher, and both have a narrower range than the full dataset.

The analysis of variance shows how much of the variance is explained by the factor:

summary(aov(yield ~ N, data=npk))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## N            1  189.3  189.28   6.061 0.0221 *
## Residuals   22  687.1   31.23                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice how a portion of the total sum of squares (TSS) is accounted for by the factor; the remainder is the residual sum of squares (RSS); these can be converted to variances by dividing by their respective degrees of freedom.
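
For example, the residual mean square \(687.1 / 22 \approx 31.2\) is the pooled within-group variance, noticeably smaller than the overall sample variance of \(38.1\), while the factor mean square is \(189.3 / 1 = 189.3\).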

The lm method shows this in a somewhat different way:

summary(lm(yield ~ N, data=npk))
## 
## Call:
## lm(formula = yield ~ N, data = npk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8833 -3.7667  0.1667  3.5583 11.8167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   52.067      1.613  32.274   <2e-16 ***
## N1             5.617      2.281   2.462   0.0221 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.588 on 22 degrees of freedom
## Multiple R-squared:  0.216,  Adjusted R-squared:  0.1803 
## F-statistic: 6.061 on 1 and 22 DF,  p-value: 0.02213

The multiple \(R^2\) is the proportion of the total variance explained by the factor; the adjusted \(R^2\) corrects this for the number of model degrees of freedom.
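
As a check, the multiple \(R^2\) can be recovered from the ANOVA table above: \(189.3 / (189.3 + 687.1) \approx 0.216\), matching the Multiple R-squared reported by summary.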

Prediction variance of a linear model

Fit a linear model of tree volume based on its girth (recorded as the diameter in this dataset) and height:

data(trees)
summary(trees)
##      Girth           Height       Volume     
##  Min.   : 8.30   Min.   :63   Min.   :10.20  
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40  
##  Median :12.90   Median :76   Median :24.20  
##  Mean   :13.25   Mean   :76   Mean   :30.17  
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30  
##  Max.   :20.60   Max.   :87   Max.   :77.00
(m.v.gh <- lm(Volume ~ Girth*Height, data=trees))
## 
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
## 
## Coefficients:
##  (Intercept)         Girth        Height  Girth:Height  
##      69.3963       -5.8558       -1.2971        0.1347
summary(m.v.gh)
## 
## Call:
## lm(formula = Volume ~ Girth * Height, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5821 -1.0673  0.3026  1.5641  4.6649 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.39632   23.83575   2.911  0.00713 ** 
## Girth        -5.85585    1.92134  -3.048  0.00511 ** 
## Height       -1.29708    0.30984  -4.186  0.00027 ***
## Girth:Height  0.13465    0.02438   5.524 7.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9728 
## F-statistic: 359.3 on 3 and 27 DF,  p-value: < 2.2e-16

The model is good but not perfect: each coefficient is well-estimated but has some uncertainty. So when we use this model to predict the volume of a newly-observed tree (from its girth and height), our prediction is also uncertain:

(volume <- predict(m.v.gh, newdata=data.frame(Girth=14, Height=70), se.fit=TRUE, interval="prediction"))
## $fit
##        fit      lwr      upr
## 1 28.57992 22.67534 34.48449
## 
## $se.fit
## [1] 0.9720972
## 
## $df
## [1] 27
## 
## $residual.scale
## [1] 2.70855

The se.fit here is the standard error of the fitted value, i.e., the square root of \(\hat{\sigma}^2 \, \mathbf{X^*} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X^*}'\). The prediction interval (lwr, upr) is wider, because it also accounts for the residual variance of a new observation.

The prediction variance is:

\[ \sigma^2 \left[1 + \mathbf{X^*} (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X^*}' \right] \]

where \(\mathbf{X}\) is the design (model) matrix and \(\mathbf{X^*}\) is the row vector of predictor values at the point to be predicted, including the intercept and interaction terms; here \([1, 14, 70, 14 \times 70]\).

The design matrix here is:

(mm <- model.matrix(m.v.gh))
##    (Intercept) Girth Height Girth:Height
## 1            1   8.3     70        581.0
## 2            1   8.6     65        559.0
## 3            1   8.8     63        554.4
## 4            1  10.5     72        756.0
## 5            1  10.7     81        866.7
## 6            1  10.8     83        896.4
## 7            1  11.0     66        726.0
## 8            1  11.0     75        825.0
## 9            1  11.1     80        888.0
## 10           1  11.2     75        840.0
## 11           1  11.3     79        892.7
## 12           1  11.4     76        866.4
## 13           1  11.4     76        866.4
## 14           1  11.7     69        807.3
## 15           1  12.0     75        900.0
## 16           1  12.9     74        954.6
## 17           1  12.9     85       1096.5
## 18           1  13.3     86       1143.8
## 19           1  13.7     71        972.7
## 20           1  13.8     64        883.2
## 21           1  14.0     78       1092.0
## 22           1  14.2     80       1136.0
## 23           1  14.5     74       1073.0
## 24           1  16.0     72       1152.0
## 25           1  16.3     77       1255.1
## 26           1  17.3     81       1401.3
## 27           1  17.5     82       1435.0
## 28           1  17.9     80       1432.0
## 29           1  18.0     80       1440.0
## 30           1  18.0     80       1440.0
## 31           1  20.6     87       1792.2
## attr(,"assign")
## [1] 0 1 2 3

The inverse of its crossproduct, \((\mathbf{X}'\mathbf{X})^{-1}\), is:

(mm2.i <- solve(t(mm) %*% mm))
##              (Intercept)        Girth      Height  Girth:Height
## (Intercept)  77.44334157 -5.983063010 -1.00093404  7.662883e-02
## Girth        -5.98306301  0.503191000  0.07603970 -6.354863e-03
## Height       -1.00093404  0.076039705  0.01308607 -9.843500e-04
## Girth:Height  0.07662883 -0.006354863 -0.00098435  8.100241e-05

This is pre- and post-multiplied by the predictor values. Note that a plain vector in R is interpreted as a column vector in matrix operations, so to get the row vector \(\mathbf{X^*}\) we take its transpose.

# the intercept is set to 1, the others calculated from the values of predictors
(obs <- c(1, 14, 70, 14*70))
## [1]   1  14  70 980
(tmp <- c(t(obs) %*% mm2.i %*% obs))
## [1] 0.1288088

Completing the calculation:

# estimate of \sigma^2
(sig2 <- c(crossprod(residuals(m.v.gh))) / df.residual(m.v.gh))
## [1] 7.336243
c(sig2*(1 + tmp))
## [1] 8.281216
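
The square root of this prediction variance, \(\sqrt{8.28} \approx 2.88\), multiplied by \(t_{0.975,\,27} \approx 2.05\), gives the half-width of the prediction interval reported by predict above: \(28.58 \pm 5.9\), i.e., about \((22.7, 34.5)\).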

Here is some matrix algebra, using the QR decomposition stored in the fitted model, to compute the variance of the fitted value (the se.fit above) directly from the linear model.

tm <- delete.response(terms(m.v.gh))  # the right-hand side of the formula
(Xp <- model.matrix(tm, data=data.frame(Girth=14, Height=70)))  # make a design matrix from the predictors
##   (Intercept) Girth Height Girth:Height
## 1           1    14     70          980
## attr(,"assign")
## [1] 0 1 2 3
(pred <- c(Xp %*% coef(m.v.gh)))  # directly compute the prediction
## [1] 28.57992
QR <- m.v.gh$qr  # the QR decomposition of the model matrix
# solve the lower-triangular system R'B = Xp', so that crossprod(B) = Xp (X'X)^{-1} Xp'
(B <- forwardsolve(t(QR$qr), t(Xp), k=QR$rank))
##             [,1]
## [1,] -0.17960530
## [2,]  0.04372819
## [3,] -0.22774013
## [4,] -0.20681650
# estimate \sigma^2
(sig2 <- c(crossprod(residuals(m.v.gh))) / df.residual(m.v.gh)) # squared residuals divided by d.f.
## [1] 7.336243
sqrt(sig2) # we saw this in the summary() above, it is our estimate of \sigma
## [1] 2.70855
(pred.var <- c(crossprod(B) * sig2))
## [1] 0.9449731
(pred.se <- sqrt(pred.var))  # we saw this in the results of predict.lm with se.fit=TRUE
## [1] 0.9720972
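
This matches the se.fit reported by predict above: it is the uncertainty of the fitted (mean) value. Adding the residual variance \(\hat{\sigma}^2\) to the variance of the fitted value, \(0.945 + 7.336 \approx 8.28\), recovers the full prediction variance computed earlier.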