Here we demonstrate the sensivity of regression trees to small changes in the training dataset, using the Meuse heavy metals dataset.
Libraries used in this exercise: sp
for spatial objects, rpart
for regression trees and random forests, rpart.plot
for nicer plots of trees.
library(sp)
library(rpart)
library(rpart.plot)
Sample dataset, comes with sp
; make a log10-transformed copy of the metal concentrations.
data(meuse)
str(meuse)
## 'data.frame': 155 obs. of 14 variables:
## $ x : num 181072 181025 181165 181298 181307 ...
## $ y : num 333611 333558 333537 333484 333330 ...
## $ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
## $ copper : num 85 81 68 81 48 61 31 29 37 24 ...
## $ lead : num 299 277 199 116 117 137 132 150 133 80 ...
## $ zinc : num 1022 1141 640 257 269 ...
## $ elev : num 7.91 6.98 7.8 7.66 7.48 ...
## $ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## $ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
## $ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
meuse$logZn <- log10(meuse$zinc) # 锌
meuse$logCu <- log10(meuse$copper) # 铜
meuse$logPb <- log10(meuse$lead) # é“…
meuse$logCd <- log10(meuse$cadmium) # 镉
Prediction grid:
data(meuse.grid)
coordinates(meuse.grid) <- ~ x + y
gridded(meuse.grid) <- TRUE
Use all 155 to build and prune:
m.lzn <- rpart(logZn ~ ffreq + dist + soil + x + y,
data=meuse,
minsplit=2,
cp=0.005)
cp.table <- m.lzn[["cptable"]]
cp.min <- cp.table[which.min(cp.table[,"xerror"]),"CP"]
m.lzn <- prune(m.lzn, cp=cp.min)
rpart.plot(m.lzn)
grid.map <- predict(m.lzn, newdata=meuse.grid)
meuse.grid$pred <- grid.map
print(spplot(meuse.grid, zcol="pred"))
x <- m.lzn$variable.importance
data.frame(variableImportance = 100 * x / sum(x))
## variableImportance
## dist 45.53803
## x 16.88457
## soil 15.27468
## y 11.54119
## ffreq 10.76154
Randomly sample a specific number of the 155 points and use it to build and prune a regression tree; use the remaining points as out-of-bag evaluation. The model uses all the variable in the prediction grid: coordnates, relative distance from river, flooding frequency clqss, and soil type.
We write this as a function, so we can call it any number of times. It has one argument: the number of points to include in the calibration sample.
It returns information about the tree: the minimum complexity parameter, the out-of-bag error, and the variable importance, and the prediction over the grid. It also displays the tree and the prediction over the grid.
build.rpart <- function(n.cal) {
meuse.cal.ix <- sample(1:dim(meuse)[1], size=n.cal, replace=FALSE)
meuse.cal <- meuse[meuse.cal.ix,]
meuse.val <- meuse[-meuse.cal.ix,]
m.lzn.cal <- rpart(logZn ~ ffreq + dist + soil + x + y,
data=meuse.cal,
minsplit=2,
cp=0.005)
cp.table <- m.lzn.cal[["cptable"]]
cp.min <- cp.table[which.min(cp.table[,"xerror"]),"CP"]
m.lzn.cal <- prune(m.lzn.cal, cp=cp.min)
rpart.plot(m.lzn.cal)
tmp <- m.lzn.cal$variable.importance
v.imp <- data.frame(variableImportance = 100 * tmp / sum(tmp))
grid.map <- predict(m.lzn.cal, newdata=meuse.grid)
meuse.grid$pred <- grid.map
val <- predict(m.lzn.cal, newdata=meuse.val)
rmse <- sqrt((sum(meuse.val$logZn - val)^2)/length(meuse.val$logZn))
print(spplot(meuse.grid, zcol="pred"))
# return a list
return(list(cp = cp.min, rmse = rmse, v.imp = v.imp, pred = grid.map))
}
Set up an object to collect the fitted paramter, and a field in the prediction grid to show the average prediction:
runs <- 64
fits.rf <- list(cp=vector(mode="numeric",length=runs),
rmse=vector(mode="numeric",length=runs),
v.imp=as.data.frame(matrix(data=0, nrow=runs, ncol=5))
)
names(fits.rf[["v.imp"]]) <- c("ffreq", "dist", "soil","x","y")
meuse.grid$logZn.avg <- 0 # will accumulate predictions and at the end divide by # of runs
Call the function and collect the fitted parameters. Save the graphics in a PDF for display.
for (run in 1:runs) {
tmp <- build.rpart(140)
fits.rf[["cp"]][run] <- tmp[[1]]
fits.rf[["rmse"]][run] <- tmp[[2]]
fits.rf[["v.imp"]][run,] <- t(tmp[[3]]) # returned a row, need a column
# add to average prediction
meuse.grid$logZn.avg <- meuse.grid$logZn.avg + tmp[[4]]
}
Show the distribution of the complexity parameters, RMSE, and variable importance:
summary(fits.rf[["cp"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005000 0.008878 0.010003 0.013489 0.019106 0.031515
hist(fits.rf[["cp"]], xlab="Complexity parameter", main="")
rug(fits.rf[["cp"]])
summary(fits.rf[["rmse"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007369 0.0605494 0.1407907 0.1626592 0.2248974 0.6361544
hist(fits.rf[["rmse"]], xlab="Out-of-bag RMSE", main="")
rug(fits.rf[["rmse"]])
summary(fits.rf[["v.imp"]])
## ffreq dist soil x
## Min. :37.86 Min. :13.34 Min. : 6.605 Min. : 3.721
## 1st Qu.:40.44 1st Qu.:16.65 1st Qu.:14.136 1st Qu.:11.480
## Median :43.92 Median :17.29 Median :14.901 Median :12.866
## Mean :44.63 Mean :17.26 Mean :14.736 Mean :12.533
## 3rd Qu.:46.09 3rd Qu.:17.93 3rd Qu.:15.511 3rd Qu.:14.116
## Max. :75.23 Max. :19.66 Max. :17.562 Max. :15.089
## y
## Min. : 1.099
## 1st Qu.: 9.401
## Median :11.511
## Mean :10.839
## 3rd Qu.:12.992
## Max. :14.556
Show the average prediction.
meuse.grid$logZn.avg <- meuse.grid$logZn.avg/runs
summary(meuse.grid$logZn.avg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.215 2.222 2.327 2.460 2.685 3.003
print(spplot(meuse.grid, zcol="logZn.avg"))
Still not very smooth, but better than any single tree.
Compare with the variability of linear models:
Function to build a linear model with a subset. Because some predictors are correlated the model may be over-fit, so apply backwards stepwise elimination before summarizing the model and computing the RMSE.
The distance will always be in this model; return this to see how much it varies between models. Other coefficients may be eliminated.
build.lm <- function(n.cal) {
meuse.cal.ix <- sample(1:dim(meuse)[1], size=n.cal, replace=FALSE)
meuse.cal <- meuse[meuse.cal.ix,]
meuse.val <- meuse[-meuse.cal.ix,]
m.lzn.cal <- lm(logZn ~ ffreq + dist + soil + x + y, data=meuse.cal)
m.lzn.cal <- step(m.lzn.cal, trace=0)
meuse.grid$predictedLog10Zn.cal <- predict(m.lzn.cal, newdata=meuse.grid)
val <- predict(m.lzn.cal, newdata=meuse.val)
rmse <- sqrt((sum(meuse.val$logZn - val)^2)/length(meuse.val$logZn))
print(spplot(meuse.grid, zcol="predictedLog10Zn.cal"))
return(list(rmse = rmse, coef.dist=coefficients(m.lzn.cal)["dist"]))
}
Set up an object to collect the fitted paramters:
# runs should be consistent with regression trees
# runs <- 64
fits.lm <- list(rmse=vector(mode="numeric",length=runs),
coef.dist=vector(mode="numeric",length=runs))
Call the function and collect the fitted parameters. Save the graphics in a PDF for display.
for (run in 1:runs) {
tmp <- build.lm(140)
fits.lm[["rmse"]][run] <- tmp[[1]]
fits.lm[["coef.dist"]][run] <- tmp[[2]]
}
These maps are much more similar to each other than the single RT maps.
Show the distribution of the RMSE and linear model coefficients; compare RMSE to RT RMSE. :
summary(fits.rf[["rmse"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0007369 0.0605494 0.1407907 0.1626592 0.2248974 0.6361544
summary(fits.lm[["rmse"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0004445 0.0738047 0.1251507 0.1572231 0.2261781 0.4997731
hist(fits.lm[["rmse"]], xlab="Out-of-bag RMSE", main="")
rug(fits.lm[["rmse"]])
summary(fits.lm[["coef.dist"]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9415 -0.8701 -0.8319 -0.8367 -0.8058 -0.6552
hist(fits.lm[["coef.dist"]], xlab="Linear model coefficient: relative distance", main="")
rug(fits.lm[["coef.dist"]])
The reduced linear model gives lower RMSE on average than the single trees. The distance coefficient is fairly stable.