In this exercise we return to the Meuse heavy metals dataset.
Packages used in this section: sp for the sample dataset.
library(sp)
Load the sample point dataset, which comes with sp, and make a log10-transformed copy of the metal concentrations.
Also load the prediction grid: the locations where we do not know the value of the target variable and where we want to predict it, using the model built (trained, calibrated) from the known points, where we know both the target variable and the predictors.
data(meuse)
str(meuse)
## 'data.frame': 155 obs. of 14 variables:
## $ x : num 181072 181025 181165 181298 181307 ...
## $ y : num 333611 333558 333537 333484 333330 ...
## $ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
## $ copper : num 85 81 68 81 48 61 31 29 37 24 ...
## $ lead : num 299 277 199 116 117 137 132 150 133 80 ...
## $ zinc : num 1022 1141 640 257 269 ...
## $ elev : num 7.91 6.98 7.8 7.66 7.48 ...
## $ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## $ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
## $ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
meuse$logZn <- log10(meuse$zinc) # zinc
meuse$logCu <- log10(meuse$copper) # copper
meuse$logPb <- log10(meuse$lead) # lead
meuse$logCd <- log10(meuse$cadmium) # cadmium
data(meuse.grid)
str(meuse.grid)
## 'data.frame': 3103 obs. of 7 variables:
## $ x : num 181180 181140 181180 181220 181100 ...
## $ y : num 333740 333700 333700 333700 333660 ...
## $ part.a: num 1 1 1 1 1 1 1 1 1 1 ...
## $ part.b: num 0 0 0 0 0 0 0 0 0 0 ...
## $ dist : num 0 0 0.0122 0.0435 0 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
The objective is to model the log10(Zn) concentration from covariables, not including the other metals. One approach is the regression tree, which recursively splits the observations according to thresholds of the predictor variables, so that each split reduces the spread of the target variable within the resulting groups.
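(A one-line sketch of the split criterion, for orientation: for a continuous target, rpart with its default "anova" method chooses, at each node, the predictor and threshold \(s\) that minimize the pooled within-child sum of squares \(\sum_{i \in L(s)} (y_i - \bar{y}_L)^2 + \sum_{i \in R(s)} (y_i - \bar{y}_R)^2\), where \(L(s)\) and \(R(s)\) are the two groups created by the split.)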
Reference: Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Boca Raton, FL: Chapman & Hall/CRC Press.
Packages used in this section: sp for spatial objects, rpart for regression trees, and rpart.plot for plotting them.
library(sp)
library(rpart)
library(rpart.plot)
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [15] "logZn" "logCu" "logPb" "logCd"
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
Q: What covariables are available in the point dataset? Are these all available in the prediction grid?
We will try to model the log10(Zn) concentration as a function of flooding frequency class, distance to the river in meters, soil type, whether the site has had limestone added, and elevation above sea level. We also use the coordinates to capture any geographic trend.
Q: Why could these be good predictors of the metal concentration in this site?
Q: Do you see any problem in using so many predictors? Would you do this in a linear model?
We will use the rpart function to build the regression tree. Its behaviour is controlled through the helper function rpart.control, whose arguments can also be passed directly to rpart.
Examine the help for rpart.control.
help(rpart.control)
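As a quick illustration (a sketch only; the formula here is shortened and the parameter values arbitrary), the control settings can be given either through rpart.control or passed directly to rpart:
ctrl <- rpart.control(minsplit=2, cp=0.005)  # allow tiny nodes; keep splits that improve fit by >= 0.5%
m.tmp <- rpart(logZn ~ dist.m + elev, data=meuse, control=ctrl)  # same as rpart(..., minsplit=2, cp=0.005)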
Q: What is the meaning of the minsplit and cp parameters?
Build a regression tree for this, print its results, and display it as a tree plot. Allow splits all the way down to single observations, as long as the fit (\(R^2\)) improves by at least 0.5% at each split.
m.lzn.rp <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse,
minsplit=2,
cp=0.005)
print(m.lzn.rp)
## n= 155
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 155 15.13633000 2.556160
## 2) dist.m>=145 101 4.65211000 2.388584
## 4) elev>=6.943 93 2.83545700 2.349952
## 8) dist.m>=230 78 1.73584100 2.311717
## 16) x>=179102.5 71 1.23063300 2.288808
## 32) ffreq=2,3 40 0.36471060 2.223342
## 64) y>=330942 21 0.14958940 2.167449 *
## 65) y< 330942 19 0.07701018 2.285117 *
## 33) ffreq=1 31 0.47328450 2.373281
## 66) x< 179913.5 1 0.00000000 2.075547 *
## 67) x>=179913.5 30 0.38168440 2.383205
## 134) x>=180346.5 25 0.20839560 2.355868
## 268) y< 333191 19 0.09154361 2.320174 *
## 269) y>=333191 6 0.01598697 2.468900 *
## 135) x< 180346.5 5 0.06119400 2.519889 *
## 17) x< 179102.5 7 0.08998420 2.544084 *
## 9) dist.m< 230 15 0.39263770 2.548774
## 18) x< 180563 10 0.23688250 2.494602
## 36) x>=179405.5 7 0.10694750 2.420923 *
## 37) x< 179405.5 3 0.00326675 2.666521 *
## 19) x>=180563 5 0.06771894 2.657116 *
## 5) elev< 6.943 8 0.06436102 2.837680 *
## 3) dist.m< 145 54 2.34313500 2.869589
## 6) elev>=8.1455 15 0.49618250 2.646331
## 12) elev< 8.48 4 0.07502826 2.456464 *
## 13) elev>=8.48 11 0.22452210 2.715373 *
## 7) elev< 8.1455 39 0.81172600 2.955457
## 14) dist.m>=75 11 0.08001190 2.848717 *
## 15) dist.m< 75 28 0.55715050 2.997391
## 30) x< 179565 9 0.12124970 2.914541 *
## 31) x>=179565 19 0.34486030 3.036636
## 62) x>=180194 12 0.12169000 2.966064 *
## 63) x< 180194 7 0.06095067 3.157617 *
rpart.plot(m.lzn.rp, digits=3, type=4, extra=1)
Q: Which variables were actually used in the tree?
Q: Which variable was used at the first split?
Q: Does that agree with your theory of the origin of the metals in these soils?
Q: What was the value of the splitting variable at that split?
Q: What was the average log10(Zn) before the split, and in the two leaves after the split?
Q: How many leaves (i.e., different predictions)? What proportion of the observations is this?
Q: What is the minimum number of observations among the leaves? The maximum?
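These questions can also be answered from the fitted object rather than by reading the plot: in rpart the leaves are the rows of the frame component whose var entry is "<leaf>" (a quick sketch):
leaves <- m.lzn.rp$frame[m.lzn.rp$frame$var == "<leaf>", ]
nrow(leaves)                  # number of leaves = number of distinct predictions
nrow(leaves) / nrow(meuse)    # as a proportion of the number of observations
range(leaves$n)               # fewest and most observations in a leaf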
Now we examine the variable importance:
print(tmp <- m.lzn.rp$variable.importance)
## dist.m elev lime x y soil ffreq
## 9.410837 6.134458 4.633959 2.097574 1.911179 1.169126 1.143282
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist.m 35.512035
## elev 23.148534
## lime 17.486364
## x 7.915251
## y 7.211884
## soil 4.411728
## ffreq 4.314205
Q: Which variable explained most of the metal concentration? What proportion? Were all predictor variables used somewhere in the tree?
Have we built a tree that is too complex, i.e., one that fits noise rather than signal? In other words, one that fits this particular dataset well but not the underlying process?
Examine the cross-validation relative error vs. the complexity parameter.
plotcp(m.lzn.rp)
Q: What complexity parameter minimizes the cross-validation relative error?
We can also compute this, rather than just reading it off the graph.
cp.table <- m.lzn.rp[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.006663768 15.000000000 0.099886839 0.305124342 0.037987293
cp.min <- cp.table[cp.ix,"CP"]
(Note that some other cross-validation errors are quite close to this minimum, so we might choose to use a higher value of the control parameter to have a more parsimonious model, that might extrapolate better.)
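One common way to formalize this preference for parsimony is the so-called one-standard-error rule: take the largest cp whose cross-validation error is within one standard error of the minimum. A sketch (not used further in this exercise):
xerr.limit <- cp.table[cp.ix, "xerror"] + cp.table[cp.ix, "xstd"]
(cp.1se <- max(cp.table[cp.table[, "xerror"] <= xerr.limit, "CP"]))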
Prune the tree with this complexity parameter, and examine in the same way as above.
(Note we would get the same result if we re-fit with this parameter.)
(m.lzn.rpp <- prune(m.lzn.rp, cp=cp.min))
## n= 155
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 155 15.13633000 2.556160
## 2) dist.m>=145 101 4.65211000 2.388584
## 4) elev>=6.943 93 2.83545700 2.349952
## 8) dist.m>=230 78 1.73584100 2.311717
## 16) x>=179102.5 71 1.23063300 2.288808
## 32) ffreq=2,3 40 0.36471060 2.223342
## 64) y>=330942 21 0.14958940 2.167449 *
## 65) y< 330942 19 0.07701018 2.285117 *
## 33) ffreq=1 31 0.47328450 2.373281
## 66) x< 179913.5 1 0.00000000 2.075547 *
## 67) x>=179913.5 30 0.38168440 2.383205
## 134) x>=180346.5 25 0.20839560 2.355868 *
## 135) x< 180346.5 5 0.06119400 2.519889 *
## 17) x< 179102.5 7 0.08998420 2.544084 *
## 9) dist.m< 230 15 0.39263770 2.548774
## 18) x< 180563 10 0.23688250 2.494602
## 36) x>=179405.5 7 0.10694750 2.420923 *
## 37) x< 179405.5 3 0.00326675 2.666521 *
## 19) x>=180563 5 0.06771894 2.657116 *
## 5) elev< 6.943 8 0.06436102 2.837680 *
## 3) dist.m< 145 54 2.34313500 2.869589
## 6) elev>=8.1455 15 0.49618250 2.646331
## 12) elev< 8.48 4 0.07502826 2.456464 *
## 13) elev>=8.48 11 0.22452210 2.715373 *
## 7) elev< 8.1455 39 0.81172600 2.955457
## 14) dist.m>=75 11 0.08001190 2.848717 *
## 15) dist.m< 75 28 0.55715050 2.997391
## 30) x< 179565 9 0.12124970 2.914541 *
## 31) x>=179565 19 0.34486030 3.036636
## 62) x>=180194 12 0.12169000 2.966064 *
## 63) x< 180194 7 0.06095067 3.157617 *
rpart.plot(m.lzn.rpp, digits=3, type=4, extra=1)
Q: How does this tree differ from the full (un-pruned tree)? Look at the predictors actually used, number of splits, number of leaves, number of observations in each leaf.
Q: Which leaf predicts the highest concentration of log10(Zn)? Explain that by reading the predictors and their values down the tree from the root to this leaf. Does this fit your theory of the origin of these metals?
Now, compare the model predictions to the known values:
p.rpp <- predict(m.lzn.rpp, newdata=meuse)
length(unique(p.rpp))
## [1] 16
summary(r.rpp <- meuse$logZn - p.rpp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.312253 -0.071205 -0.008655 0.000000 0.055852 0.226281
sqrt(sum(r.rpp^2)/length(r.rpp))
## [1] 0.09876398
plot(meuse$logZn ~ p.rpp, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Regression Tree")
grid()
abline(0,1)
Q: What is the RMSE?
Q: Why are there vertical lines at all the fitted values?
Another use of CART is to predict a category. Here we try to predict whether a site is in flood frequency class 1, 2, or 3. If this is a successful model, we could use it to map flood frequency, if we didn’t already have this map.
Q: What covariables might help predict flooding frequency?
Flooding is influenced by the distance from the river and the elevation above the river. Build a model with these (plus the coordinates, to capture any trend), setting the complexity parameter to 0.02, i.e., each retained split must improve the fit by at least 2%.
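(For orientation: with a categorical target, rpart by default chooses splits that maximize the reduction in the Gini impurity \(G = \sum_k p_k (1 - p_k)\), where \(p_k\) is the proportion of class \(k\) in a node, rather than a reduction in residual sum of squares.)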
m.ff.rp <- rpart(ffreq ~ x + y + dist.m + elev,
data=meuse,
minsplit=2,
cp=0.02)
print(m.ff.rp)
## n= 155
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 155 71 1 (0.54193548 0.30967742 0.14838710)
## 2) y>=331984.5 63 6 1 (0.90476190 0.07936508 0.01587302)
## 4) elev< 9.399 60 3 1 (0.95000000 0.03333333 0.01666667) *
## 5) elev>=9.399 3 0 2 (0.00000000 1.00000000 0.00000000) *
## 3) y< 331984.5 92 49 2 (0.29347826 0.46739130 0.23913043)
## 6) elev< 7.552 24 2 1 (0.91666667 0.00000000 0.08333333)
## 12) y>=330043 22 0 1 (1.00000000 0.00000000 0.00000000) *
## 13) y< 330043 2 0 3 (0.00000000 0.00000000 1.00000000) *
## 7) elev>=7.552 68 25 2 (0.07352941 0.63235294 0.29411765)
## 14) y>=330330 54 13 2 (0.09259259 0.75925926 0.14814815)
## 28) elev< 9.947 49 9 2 (0.10204082 0.81632653 0.08163265)
## 56) y>=330644 35 5 2 (0.14285714 0.85714286 0.00000000)
## 112) dist.m>=470 17 5 2 (0.29411765 0.70588235 0.00000000)
## 224) elev< 8.526 3 0 1 (1.00000000 0.00000000 0.00000000) *
## 225) elev>=8.526 14 2 2 (0.14285714 0.85714286 0.00000000) *
## 113) dist.m< 470 18 0 2 (0.00000000 1.00000000 0.00000000) *
## 57) y< 330644 14 4 2 (0.00000000 0.71428571 0.28571429)
## 114) elev< 8.7225 11 1 2 (0.00000000 0.90909091 0.09090909) *
## 115) elev>=8.7225 3 0 3 (0.00000000 0.00000000 1.00000000) *
## 29) elev>=9.947 5 1 3 (0.00000000 0.20000000 0.80000000) *
## 15) y< 330330 14 2 3 (0.00000000 0.14285714 0.85714286) *
rpart.plot(m.ff.rp)
Q: Were both predictors used?
Q: What was the first split?
Find the value of the complexity parameter that minimizes the cross-validation error, and prune the tree with that value.
printcp(m.ff.rp)
##
## Classification tree:
## rpart(formula = ffreq ~ x + y + dist.m + elev, data = meuse,
## minsplit = 2, cp = 0.02)
##
## Variables actually used in tree construction:
## [1] dist.m elev y
##
## Root node error: 71/155 = 0.45806
##
## n= 155
##
## CP nsplit rel error xerror xstd
## 1 0.267606 0 1.00000 1.00000 0.087366
## 2 0.140845 2 0.46479 0.49296 0.073316
## 3 0.042254 3 0.32394 0.38028 0.066506
## 4 0.028169 5 0.23944 0.40845 0.068385
## 5 0.021127 6 0.21127 0.36620 0.065518
## 6 0.020000 10 0.12676 0.38028 0.066506
plotcp(m.ff.rp)
cp.table.class <- m.ff.rp[["cptable"]]
cp.ix <- which.min(cp.table.class[,"xerror"])
print(cp.table.class[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.02112676 6.00000000 0.21126761 0.36619718 0.06551750
(cp.min <- cp.table.class[cp.ix,"CP"])
## [1] 0.02112676
m.ff.rpp <- prune(m.ff.rp, cp=cp.min)
print(m.ff.rpp)
## n= 155
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 155 71 1 (0.54193548 0.30967742 0.14838710)
## 2) y>=331984.5 63 6 1 (0.90476190 0.07936508 0.01587302)
## 4) elev< 9.399 60 3 1 (0.95000000 0.03333333 0.01666667) *
## 5) elev>=9.399 3 0 2 (0.00000000 1.00000000 0.00000000) *
## 3) y< 331984.5 92 49 2 (0.29347826 0.46739130 0.23913043)
## 6) elev< 7.552 24 2 1 (0.91666667 0.00000000 0.08333333)
## 12) y>=330043 22 0 1 (1.00000000 0.00000000 0.00000000) *
## 13) y< 330043 2 0 3 (0.00000000 0.00000000 1.00000000) *
## 7) elev>=7.552 68 25 2 (0.07352941 0.63235294 0.29411765)
## 14) y>=330330 54 13 2 (0.09259259 0.75925926 0.14814815)
## 28) elev< 9.947 49 9 2 (0.10204082 0.81632653 0.08163265) *
## 29) elev>=9.947 5 1 3 (0.00000000 0.20000000 0.80000000) *
## 15) y< 330330 14 2 3 (0.00000000 0.14285714 0.85714286) *
rpart.plot(m.ff.rpp, type=2, extra=1)
Q: Now how many splits? Which variables were used? Are all classes predicted? How pure are the leaves?
Q: How useful is this model? In other words, how well could flooding frequency class be mapped from knowing the distance to river and elevation?
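One simple way to judge this (a sketch added here, not part of the original code) is to cross-tabulate the classes fitted by the pruned tree against the observed classes:
table(predicted = predict(m.ff.rpp, newdata = meuse, type = "class"),
      actual    = meuse$ffreq)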
Variable importance:
m.ff.rp$variable.importance
## elev y x dist.m
## 38.26320 38.24051 16.58514 11.40992
m.ff.rpp$variable.importance
## y elev x dist.m
## 36.74909 29.85313 14.99264 9.89731
In both the unpruned and pruned trees the y-coordinate and elevation are most important.
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
Q: What variables are available for mapping over the grid?
The grid only has coordinates, relative distance, soil type, and flood frequency. So if we want a model to map log(Zn) over the grid, we can only use these variables.
Build and prune a regression tree for log(Zn) with these predictors; display the pruned tree and the variable importance.
m.lzn.g <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse,
minsplit=2,
cp=0.005)
plotcp(m.lzn.g)
rpart.plot(m.lzn.g, digits=3, type=4, extra=1)
cp.table <- m.lzn.g[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
## CP nsplit rel error xerror xstd
## 0.01792115 8.00000000 0.17301816 0.32353445 0.05034038
(cp.min <- cp.table[cp.ix,"CP"])
## [1] 0.01792115
Now prune the tree to the indicated complexity:
m.lzn.gp <- prune(m.lzn.g, cp=cp.min)
rpart.plot(m.lzn.gp)
print(tmp <- m.lzn.gp$variable.importance)
## dist x soil y ffreq
## 10.189013 3.777877 3.363417 2.473806 2.136605
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist 46.43883
## x 17.21857
## soil 15.32957
## y 11.27496
## ffreq 9.73808
Q: Which predictors were used? What was their relative importance?
Q: Why is the x-coordinate important in this dataset? Do you think this would be important in other parts of the flood plain?
Now we can map with this.
First, make an sp version of the grid, with the coordinates having special status, i.e., treated as geometry rather than as feature-space attributes:
meuse.grid.sp <- meuse.grid
coordinates(meuse.grid.sp) <- ~ x + y; gridded(meuse.grid.sp) <- TRUE
meuse.grid.sp$predictedLog10Zn <- predict(m.lzn.gp, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn")
Q: How realistic is this map?
Single regression trees can vary quite a bit depending on small differences in the dataset. Convince yourself of this.
Remove a random 15 of the 155 points, then fit and prune the first tree we attempted, i.e., log10(Zn) concentration as a function of the coordinates, flooding frequency class, distance to the river in meters, soil type, whether the site has had limestone added, and elevation above sea level.
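(Your random subsample, and therefore your tree, will differ from the one shown here. If you want a reproducible subsample, you can first fix the random seed; the value chosen is arbitrary.)
set.seed(314)   # optional; any fixed seed makes the sample() call below reproducible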
meuse.140.ix <- sample(1:dim(meuse)[1], size=140, replace=FALSE)
meuse.140 <- meuse[meuse.140.ix,]
m.lzn.rp.140 <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse.140,
minsplit=2,
cp=0.005)
rpart.plot(m.lzn.rp.140, digits=3, type=4, extra=1)
print(tmp <- m.lzn.rp.140$variable.importance)
## dist.m elev lime x y soil ffreq
## 8.632981 5.719639 4.393028 1.972617 1.555428 1.352222 1.085110
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist.m 34.935746
## elev 23.146102
## lime 17.777604
## x 7.982740
## y 6.294470
## soil 5.472139
## ffreq 4.391199
Prune the tree:
plotcp(m.lzn.rp.140)
cp.table <- m.lzn.rp.140[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.008830864 9.000000000 0.164033250 0.288800633 0.032236651
cp.min <- cp.table[cp.ix,"CP"]
m.lzn.rpp.140 <- prune(m.lzn.rp.140, cp=cp.min)
rpart.plot(m.lzn.rpp.140, digits=3, type=4, extra=1)
Q: What are the similarities and differences between your tree and the one we built at the beginning of the exercise? Consider the “optimum” complexity parameter, the number of leaves and splits, where the predictors are used, and which predictors are used.
We can do this also for the map:
m.lzn.g.140 <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse.140,
minsplit=2,
cp=0.005)
Prune the tree to the indicated complexity:
plotcp(m.lzn.g.140)
rpart.plot(m.lzn.g.140, digits=3, type=4, extra=1)
cp.table <- m.lzn.g.140[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
## CP nsplit rel error xerror xstd
## 0.02202392 5.00000000 0.25857520 0.38985094 0.06241297
(cp.min <- cp.table[cp.ix,"CP"])
## [1] 0.02202392
m.lzn.gp.140 <- prune(m.lzn.g.140, cp=cp.min)
rpart.plot(m.lzn.gp.140)
print(tmp <- m.lzn.gp.140$variable.importance)
## dist x ffreq y soil
## 8.6023189 1.8590649 1.1787964 1.0352421 0.2431314
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist 66.588870
## x 14.390658
## ffreq 9.124832
## y 8.013607
## soil 1.882033
Compute and display the map using the pruned tree:
meuse.grid.sp$predictedLog10Zn.140 <- predict(m.lzn.gp.140, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.140")
Compute and display the differences in predictions:
meuse.grid.sp$predictedLog10Zn.diff <-
meuse.grid.sp$predictedLog10Zn - meuse.grid.sp$predictedLog10Zn.140
summary(meuse.grid.sp$predictedLog10Zn.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5488081 -0.0199888 0.0054074 0.0007744 0.0054074 0.5624135
spplot(meuse.grid.sp, zcol="predictedLog10Zn.diff", col.regions=topo.colors(36))
Q: How different are the maps made with the regression tree based on all 155 observations, and the one made with 140 randomly-selected observations? Can you explain the spatial pattern of the differences? Hint: plot the locations of the omitted points.
meuse.omitted <- meuse[-meuse.140.ix,]
plot(y ~ x, data=meuse.omitted, asp=1,
cex=2*meuse.omitted$logZn/max(meuse.omitted$logZn))
data(meuse.riv)
lines(meuse.riv)
grid()
The idea of the random forest is simple: build many regression trees, each from a bootstrap sample of the observations and with a random subset of the predictors considered at each split, and average their predictions.
Reference: Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Packages used in this section: randomForest for random forests, randomForestExplainer for some diagnostics.
library(randomForest)
library(randomForestExplainer)
First build the forest, using the randomForest function:
m.lzn.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=5)
print(m.lzn.rf)
##
## Call:
## randomForest(formula = logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse, importance = T, mtry = 5, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.01956509
## % Var explained: 79.96
The mtry optional argument gives the number of variables randomly sampled as candidates at each split. By default this is \(\lfloor p/3 \rfloor\), where \(p\) is the number of predictors; here \(p = 7\), so the default would be \(\lfloor 7/3 \rfloor = 2\). We set mtry=5 so that most of the predictors are candidates at each split, which gives more accurate splits at the cost of more similar trees.
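If you want to see the effect of this choice yourself, a small sketch (not part of the original exercise; the values of mtry tried are arbitrary) compares the out-of-bag error for several settings:
for (m in c(2, 3, 5, 7)) {
  rf.m <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
                       data=meuse, mtry=m)
  cat("mtry =", m, " OOB MSE =", round(rf.m$mse[rf.m$ntree], 4), "\n")
}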
Q: How much of the variance is explained by the forest?
Here we cannot display a single tree structure, since each of the trees in the forest is different.
Display the cross-validation error rate against the number of trees:
plot(m.lzn.rf)
Q: How many trees were needed to make a reliable forest?
Each RF is random, so several attempts will not be the same. How consistent are they?
Compute the RF several times and collect statistics:
n <- 24
rf.stats <- data.frame(rep=1:n, rsq=as.numeric(NA), mse=as.numeric(NA))
for (i in 1:n) {
  model.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
                           data=meuse, importance=T, na.action=na.omit, mtry=5)
  rf.stats[i, "mse"] <- median(summary(model.rf$mse))
  rf.stats[i, "rsq"] <- median(summary(model.rf$rsq))
}
summary(rf.stats[,2:3])
## rsq mse
## Min. :0.7939 Min. :0.01881
## 1st Qu.:0.7990 1st Qu.:0.01909
## Median :0.8008 Median :0.01946
## Mean :0.8012 Mean :0.01942
## 3rd Qu.:0.8045 3rd Qu.:0.01963
## Max. :0.8074 Max. :0.02013
hist(rf.stats[,"rsq"], xlab="RandomForest R^2", main="random forests")
rug(rf.stats[,"rsq"])
hist(rf.stats[,"mse"], xlab="RandomForest RMSE", main="random forests")
rug(rf.stats[,"mse"])
Q: How consistent are the 24 runs of the random forest?
Examine the variable importance. Type 1 is the increase in MSE when that predictor's values are randomly permuted among the observations, compared with leaving them at their actual values. It is computed as follows (from the Help): “For each tree, the prediction error on the OOB portion of the data is recorded (…MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences”.
Type 2 is the increase in node purity. It is computed as follows: “The total decrease in node impurities from splitting on the [given] variable, averaged over all trees… measured by residual sum of squares.”
importance(m.lzn.rf, type=1)
## %IncMSE
## ffreq 15.784522
## x 27.679406
## y 17.651208
## dist.m 53.929499
## elev 34.311213
## soil 6.102535
## lime 10.305590
importance(m.lzn.rf, type=2)
## IncNodePurity
## ffreq 0.4758883
## x 1.1245359
## y 0.7414590
## dist.m 7.9016551
## elev 3.3382712
## soil 0.1317567
## lime 0.9734237
varImpPlot(m.lzn.rf, type=1)
Q: Does this agree with the variable importance from the single trees we built above?
Q: Why are some variables given importance, although they were not used in the regression tree models?
randomForestExplainer
Another way to look at this is by the depth in the tree at which each predictor is first used. This is found with the min_depth_distribution function of the randomForestExplainer package.
min_depth_frame <- min_depth_distribution(m.lzn.rf)
str(min_depth_frame) # has results for all the trees
## 'data.frame': 3232 obs. of 3 variables:
## $ tree : int 1 1 1 1 1 1 1 2 2 2 ...
## $ variable : chr "dist.m" "elev" "ffreq" "lime" ...
## $ minimal_depth: num 0 2 2 1 4 3 3 0 1 2 ...
plot_min_depth_distribution(min_depth_frame)
Q: Which predictor is used most as the root?
Q: Which predictor is used most at the second level?
Q: Which predictors are not used in any trees?
Q: Do these results agree with the variable importance plot?
This package gives a wider choice of importance measures:
importance_frame <- measure_importance(m.lzn.rf)
print(importance_frame)
## variable mean_min_depth no_of_nodes mse_increase node_purity_increase
## 1 dist.m 0.354000 5415 0.079512659 7.9016551
## 2 elev 1.124000 6367 0.033398644 3.3382712
## 3 ffreq 3.593508 1051 0.009845755 0.4758883
## 4 lime 4.003500 506 0.003900991 0.9734237
## 5 soil 5.362688 671 0.001496646 0.1317567
## 6 x 2.166000 5759 0.011855387 1.1245359
## 7 y 2.652000 5381 0.006667004 0.7414590
## no_of_trees times_a_root p_value
## 1 500 347 2.408062e-210
## 2 500 88 0.000000e+00
## 3 461 0 1.000000e+00
## 4 375 63 1.000000e+00
## 5 396 2 1.000000e+00
## 6 500 0 7.688689e-291
## 7 500 0 4.901770e-203
Q: Which predictor was used in the most nodes, over all trees?
Q: Which predictors were never used as the root of the tree?
Q: Which predictors were used in every tree?
The p-value measures whether the observed number of successes (number of nodes in which \(X_j\) was used for splitting) exceeds the theoretical number of successes if they were random, following a binomial distribution. See https://cran.rstudio.com/web/packages/randomForestExplainer/vignettes/randomForestExplainer.html for details.
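For intuition only (a sketch that is not necessarily the package's exact computation), the idea can be illustrated with a binomial test of one predictor's node count against a uniformly random choice among the 7 predictors:
n.nodes <- sum(importance_frame$no_of_nodes)
x.dist  <- importance_frame$no_of_nodes[importance_frame$variable == "dist.m"]
binom.test(x.dist, n.nodes, p = 1/7, alternative = "greater")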
Another interesting plot is the “multi-way importance plot”, which by default shows the number of times a predictor is used as the root vs. its mean minimum depth. There is typically a negative relation between times_a_root and mean_min_depth.
plot_multi_way_importance(m.lzn.rf, size_measure = "no_of_nodes")
Q: Which predictor is most important by both measures?
Q: Is this a consistent negative relation? If not, where are the exceptions?
Another interesting plot is the pairwise comparison of the different importance measures:
plot_importance_ggpairs(importance_frame)
Q: Which are the most and least related measures?
Examine the interactions between predictors, in terms of their use within the forest:
interactions_frame <- min_depth_interactions(m.lzn.rf)
interactions_frame[order(interactions_frame$occurrences, decreasing = TRUE)[1:12], ]
## variable root_variable mean_min_depth occurrences interaction
## 8 elev dist.m 0.4036145 498 dist.m:elev
## 36 x dist.m 1.1000321 494 dist.m:x
## 43 y dist.m 1.7028514 493 dist.m:y
## 1 dist.m dist.m 1.1482651 489 dist.m:dist.m
## 37 x elev 1.2661165 481 elev:x
## 9 elev elev 1.3224779 475 elev:elev
## 44 y elev 1.8635582 464 elev:y
## 2 dist.m elev 1.5867912 449 elev:dist.m
## 15 ffreq dist.m 3.7058715 417 dist.m:ffreq
## 41 x x 2.2486948 393 x:x
## 48 y x 2.0801406 375 x:y
## 13 elev x 2.3134538 368 x:elev
## uncond_mean_min_depth
## 8 1.124000
## 36 2.166000
## 43 2.652000
## 1 0.354000
## 37 2.166000
## 9 1.124000
## 44 2.652000
## 2 0.354000
## 15 3.593508
## 41 2.166000
## 48 2.652000
## 13 1.124000
plot_min_depth_interactions(interactions_frame)
Q: Which pair of predictors are most used near the root of the tree, and with the most co-occurrences?
“To further investigate the most frequent interaction we use the function plot_predict_interaction to plot the prediction of our forest on a grid of values for the components of each interaction. The function requires the forest, training data, variable to use on x and y-axis, respectively. In addition, one can also decrease the number of points in both dimensions of the grid from the default of 100 in case of insufficient memory using the parameter grid.”
Here we see the predicted log(Zn) content for all combinations of distance from river and elevation:
plot_predict_interaction(m.lzn.rf, meuse, "dist.m", "elev")
These partial-dependence plots show the marginal effect of one predictor on the prediction: the forest is evaluated with that predictor set to each value in turn, averaging over the observed values of all the other predictors. We show these in order of variable importance, as determined just above.
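(As a formula, under the usual partial-dependence definition, the curve plotted for a predictor \(x\) is \(\bar{f}(x) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}(x, \mathbf{x}_{i,C})\), where \(\mathbf{x}_{i,C}\) are the observed values of the other predictors for observation \(i\) and \(\hat{f}\) is the fitted forest.)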
partialPlot(m.lzn.rf, pred.data=meuse, x.var="x")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="y")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="dist.m")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="elev")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="ffreq", which.class=1)
Q: What is the effect of just the distance to the river on the RF prediction of log10(Zn), averaging over the other predictors (the continuous variables x, y, and elev, and the classified variables ffreq, soil, and lime)? Is the effect monotonic (i.e., consistently increasing or decreasing)? Why or why not?
Predict back onto the known points, and evaluate the goodness-of-fit:
p.rf <- predict(m.lzn.rf, newdata=meuse)
length(unique(p.rf))
## [1] 155
summary(r.rpp <- meuse$logZn - p.rf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1858643 -0.0354089 -0.0035181 -0.0007279 0.0346652 0.1677528
(rmse.rf <- sqrt(sum(r.rpp^2)/length(r.rpp)))
## [1] 0.06234585
plot(meuse$logZn ~ p.rf, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
Q: What is the RMSE?
Q: How many different predictions were made? Why not just a few, as in regression trees?
A more realistic view of the predictive power of the model is the out-of-bag validation. Here we use predict with no newdata argument, so the prediction is taken from the predicted component of the fitted forest, which was computed during forest building from the out-of-bag observations of each tree.
p.rf.oob <- predict(m.lzn.rf)
summary(r.rpp.oob <- meuse$logZn - p.rf.oob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.43958 -0.08056 -0.01097 -0.00142 0.07454 0.36201
(rmse.oob <- sqrt(sum(r.rpp.oob^2)/length(r.rpp.oob)))
## [1] 0.1398753
plot(meuse$logZn ~ p.rf.oob, asp=1, pch=20,
xlab="Out-of-bag cross-validation estimates",
ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3),
main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
Q: What is the OOB RMSE? How does this compare with the fitted RMSE? Which is more realistic for predictions?
Q: How does this compare to the model fits?
Q: How does this compare with the RMSE from the Regression Tree?
We can try to build an RF model to map the grid, but again we can only use the coordinates, flood frequency, soil type and relative distance to the river, because these are the only variables we have over the whole area, i.e., in the prediction grid meuse.grid.
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
m.lzn.rf.g <- randomForest(logZn ~ x + y + ffreq + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.lzn.rf.g)
(tmp <- importance(m.lzn.rf.g, type=1))
## %IncMSE
## x 32.61332
## y 19.35550
## ffreq 37.38123
## dist 51.53406
## soil 13.51469
(100*tmp/sum(tmp))
## %IncMSE
## x 21.122783
## y 12.536043
## ffreq 24.210827
## dist 33.377240
## soil 8.753107
varImpPlot(m.lzn.rf.g, type=1)
meuse.grid.sp$predictedLog10Zn.rf <- predict(m.lzn.rf.g, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.rf")
Q: Does this map look realistic? How does it compare with the map made with the single regression tree?
This method is derived from Quinlan's rule-based learning approach (the C4.5 family and the M5 model tree).
References:
Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, Calif: Morgan Kaufmann Publishers.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90. https://doi.org/10.1613/jair.279
A recent article comparing it to RF is: Houborg, R., & McCabe, M. F. (2018). A hybrid training approach for leaf area index estimation via Cubist and random forests machine-learning. ISPRS Journal of Photogrammetry and Remote Sensing, 135, 173–188. https://doi.org/10.1016/j.isprsjprs.2017.10.004
It is similar to RF in that it builds trees, but instead of single values at the leaves it fits a multivariate linear regression to the cases in each leaf, using the predictors involved in the splits leading to that leaf (see the vignette excerpt below).
The Cubist package is an R port of the Cubist GPL C code released by RuleQuest at http://rulequest.com/cubist-info.html.
An explanation from the Cubist vignette:
Cubist is a rule–based model that is an extension of Quinlan’s M5 model tree. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.
Package used in this section: Cubist for the Cubist rule-based regression algorithm.
library(Cubist)
Split the observations into training and test sets, with 80% in the training set:
inTrain <- sample(1:nrow(meuse), floor(.8*nrow(meuse)))
Build the training and test matrices. Note there is no formula method for cubist; the predictors are specified as a matrix or data frame and the outcome as a numeric vector.
preds <- c("x","y","ffreq","dist.m","elev","soil","lime")
train.pred <- meuse[ inTrain, preds]
test.pred <- meuse[-inTrain, preds]
train.resp <- meuse$logZn[ inTrain]
test.resp <- meuse$logZn[-inTrain]
Now fit the model to the training set:
(c.model <- cubist(x = train.pred, y = train.resp))
##
## Call:
## cubist.default(x = train.pred, y = train.resp)
##
## Number of samples: 124
## Number of predictors: 7
##
## Number of committees: 1
## Number of rules: 2
summary(c.model)
##
## Call:
## cubist.default(x = train.pred, y = train.resp)
##
##
## Cubist [Release 2.07 GPL Edition] Tue Jun 21 11:41:13 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 124 cases (8 attributes) from undefined.data
##
## Model:
##
## Rule 1: [80 cases, mean 2.415175, range 2.075547 to 2.89098, est err 0.117872]
##
## if
## dist.m > 140
## then
## outcome = -11.454599 + 0.000179 y - 0.000247 x - 0.116 elev
## - 7e-05 dist.m
##
## Rule 2: [44 cases, mean 2.850381, range 2.281033 to 3.264582, est err 0.153651]
##
## if
## dist.m <= 140
## then
## outcome = 3.623907 - 0.00202 dist.m - 0.082 elev
##
##
## Evaluation on training data (124 cases):
##
## Average |error| 0.122804
## Relative |error| 0.48
## Correlation coefficient 0.86
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% dist.m
## 100% elev
## 65% x
## 65% y
##
##
## Time: 0.0 secs
In this case there are only two rules, each with its own regression model; the single split is on distance to the river.
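Before assessing the model on the held-out points, we can also check the fit on the training set itself (a quick sketch, added here for comparison with the test-set figures below):
c.pred.train <- predict(c.model, train.pred)
sqrt(mean((c.pred.train - train.resp)^2))   # training-set RMSE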
We predict onto the test set to get an estimate of predictive accuracy:
c.pred <- predict(c.model, test.pred)
## Test set RMSE
sqrt(mean((c.pred - test.resp)^2))
## [1] 0.1520256
## Test set R^2
cor(c.pred, test.resp)^2
## [1] 0.8670161
Now, fit a model on the entire set and then predict over the grid. Again we can only use predictors that are in the grid.
preds <- c("x","y","dist","soil","ffreq")
all.preds <- meuse[, preds]
all.resp <- meuse$logZn
c.model.grid <- cubist(x = all.preds, y = all.resp)
summary(c.model.grid)
##
## Call:
## cubist.default(x = all.preds, y = all.resp)
##
##
## Cubist [Release 2.07 GPL Edition] Tue Jun 21 11:41:13 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 155 cases (6 attributes) from undefined.data
##
## Model:
##
## Rule 1: [66 cases, mean 2.288309, range 2.053078 to 2.89098, est err 0.103603]
##
## if
## x > 179095
## dist > 0.211846
## then
## outcome = 2.406759 - 0.32 dist
##
## Rule 2: [9 cases, mean 2.596965, range 2.330414 to 2.832509, est err 0.116378]
##
## if
## x <= 179095
## dist > 0.211846
## then
## outcome = -277.415278 + 0.000847 y + 0.56 dist
##
## Rule 3: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.157513]
##
## if
## dist <= 0.211846
## then
## outcome = 2.632508 - 2.1 dist - 2.4e-05 x + 1.4e-05 y
##
##
## Evaluation on training data (155 cases):
##
## Average |error| 0.111292
## Relative |error| 0.41
## Correlation coefficient 0.85
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% dist
## 48% 52% x
## 57% y
##
##
## Time: 0.0 secs
meuse.grid.sp$predictedLog10Zn.cubist <- predict(c.model.grid, newdata=meuse.grid[,preds])
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist", col.regions=bpy.colors(64))
Here we can see the effect of the local regression models, as well as the hard splits on x and dist.
Cubist can improve performance in two ways: (1) with “committees” and (2) adjusting by nearest neighbours:
The Cubist model can also use a boosting–like scheme called committees where iterative model trees are created in sequence. The first tree follows the procedure described in the last section. Subsequent trees are created using adjusted versions to the training set outcome: if the model over–predicted a value, the response is adjusted downward for the next model (and so on). Unlike traditional boosting, stage weights for each committee are not used to average the predictions from each model tree; the final prediction is a simple average of the predictions from each model tree.
Another innovation in Cubist using nearest–neighbors to adjust the predictions from the rule–based model. First, a model tree (with or without committees) is created. Once a sample is predicted by this model, Cubist can find it’s nearest neighbors and determine the average of these training set points.
To find the optimum values for these, we turn to the caret package.
The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques.
See also: http://topepo.github.io/caret/model-training-and-tuning.html
The train function in caret can fit many kinds of models, supplied by other packages, selected with the method argument. The tuneGrid argument gives a grid of candidate tuning-parameter combinations to compare by cross-validation. Note that we use all the points, because train does the splitting itself as it evaluates the alternative models.
require(caret)
test <- train(x = all.preds, y = all.resp, method="cubist",
tuneGrid = expand.grid(.committees = 1:12,
.neighbors = 0:5),
trControl = trainControl(method = 'cv'))
print(test)
## Cubist
##
## 155 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 138, 140, 141, 139, 140, 139, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.1894278 0.6496111 0.1419179
## 1 1 0.1814152 0.6942065 0.1257052
## 1 2 0.1670760 0.7304305 0.1197129
## 1 3 0.1652328 0.7312889 0.1196747
## 1 4 0.1666086 0.7243422 0.1231867
## 1 5 0.1690508 0.7157745 0.1270865
## 2 0 0.1710778 0.7096912 0.1281719
## 2 1 0.1860139 0.6804579 0.1255095
## 2 2 0.1707818 0.7198970 0.1196356
## 2 3 0.1660291 0.7286866 0.1178113
## 2 4 0.1637145 0.7318926 0.1184247
## 2 5 0.1640964 0.7303724 0.1198547
## 3 0 0.1685245 0.7209211 0.1283659
## 3 1 0.1823722 0.6905295 0.1229561
## 3 2 0.1643650 0.7378263 0.1149210
## 3 3 0.1586798 0.7492127 0.1140697
## 3 4 0.1563527 0.7527732 0.1145676
## 3 5 0.1565914 0.7513679 0.1164066
## 4 0 0.1713729 0.7133692 0.1316064
## 4 1 0.1831968 0.6864192 0.1249382
## 4 2 0.1663870 0.7298149 0.1166409
## 4 3 0.1600404 0.7440668 0.1154950
## 4 4 0.1574077 0.7490784 0.1159470
## 4 5 0.1573277 0.7482071 0.1171381
## 5 0 0.1681069 0.7244834 0.1264697
## 5 1 0.1802858 0.6943582 0.1229848
## 5 2 0.1636044 0.7389211 0.1142720
## 5 3 0.1565319 0.7551840 0.1133320
## 5 4 0.1535981 0.7611870 0.1135889
## 5 5 0.1539955 0.7593028 0.1153939
## 6 0 0.1686125 0.7221074 0.1275708
## 6 1 0.1817304 0.6908429 0.1243756
## 6 2 0.1650978 0.7340920 0.1160568
## 6 3 0.1584983 0.7491020 0.1143697
## 6 4 0.1555288 0.7550528 0.1145491
## 6 5 0.1557105 0.7532503 0.1162426
## 7 0 0.1666813 0.7282440 0.1263042
## 7 1 0.1796026 0.6963742 0.1228387
## 7 2 0.1632439 0.7396326 0.1148584
## 7 3 0.1567832 0.7542766 0.1138369
## 7 4 0.1538616 0.7600767 0.1140501
## 7 5 0.1539837 0.7588860 0.1159290
## 8 0 0.1690046 0.7203095 0.1273635
## 8 1 0.1825678 0.6885467 0.1249667
## 8 2 0.1664451 0.7309120 0.1172158
## 8 3 0.1598984 0.7458670 0.1154182
## 8 4 0.1568636 0.7519905 0.1153745
## 8 5 0.1569257 0.7505485 0.1173632
## 9 0 0.1686822 0.7225742 0.1265510
## 9 1 0.1808938 0.6927622 0.1231957
## 9 2 0.1652848 0.7340160 0.1155964
## 9 3 0.1585826 0.7491854 0.1137817
## 9 4 0.1558456 0.7544834 0.1140752
## 9 5 0.1561652 0.7525235 0.1164895
## 10 0 0.1669221 0.7284473 0.1268756
## 10 1 0.1808163 0.6936500 0.1233647
## 10 2 0.1647897 0.7362088 0.1154954
## 10 3 0.1578002 0.7523411 0.1135312
## 10 4 0.1549745 0.7579958 0.1136374
## 10 5 0.1551573 0.7563532 0.1157767
## 11 0 0.1655149 0.7331397 0.1246649
## 11 1 0.1793334 0.6980443 0.1224620
## 11 2 0.1632807 0.7409552 0.1145469
## 11 3 0.1563271 0.7568367 0.1124276
## 11 4 0.1537992 0.7616880 0.1125552
## 11 5 0.1542451 0.7594513 0.1149355
## 12 0 0.1649211 0.7352316 0.1256688
## 12 1 0.1795084 0.6977924 0.1227620
## 12 2 0.1632699 0.7418168 0.1146453
## 12 3 0.1562419 0.7578989 0.1129060
## 12 4 0.1537268 0.7626763 0.1129656
## 12 5 0.1541418 0.7603350 0.1154308
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 5 and neighbors = 4.
plot(test, metric="RMSE")
plot(test, metric="Rsquared")
Of course the cross-validation splits are random, so each run will give slightly different numbers. In this case the graph shows clearly that more committees improve the performance considerably, from around 0.19 to about 0.15 cross-validation RMSE. The nearest-neighbour adjustment also helps considerably, up to about four neighbours, after which the improvement levels off.
So we fit a model with (approximately) these recommended parameters; because the tuning is random, the exact optimum may differ slightly between runs. Note that the neighbour count is used only at prediction time, not in model building.
c.model.grid.2 <- cubist(x = all.preds, y = all.resp, committees=4)
summary(c.model.grid.2)
##
## Call:
## cubist.default(x = all.preds, y = all.resp, committees = 4)
##
##
## Cubist [Release 2.07 GPL Edition] Tue Jun 21 11:41:23 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 155 cases (6 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [66 cases, mean 2.288309, range 2.053078 to 2.89098, est err 0.103603]
##
## if
## x > 179095
## dist > 0.211846
## then
## outcome = 2.406759 - 0.32 dist
##
## Rule 1/2: [9 cases, mean 2.596965, range 2.330414 to 2.832509, est err 0.116378]
##
## if
## x <= 179095
## dist > 0.211846
## then
## outcome = -277.415278 + 0.000847 y + 0.56 dist
##
## Rule 1/3: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.157513]
##
## if
## dist <= 0.211846
## then
## outcome = 2.632508 - 2.1 dist - 2.4e-05 x + 1.4e-05 y
##
## Model 2:
##
## Rule 2/1: [45 cases, mean 2.418724, range 2.10721 to 2.893762, est err 0.182228]
##
## if
## x <= 179826
## ffreq in {2, 3}
## then
## outcome = 128.701732 - 0.000705 x
##
## Rule 2/2: [121 cases, mean 2.443053, range 2.053078 to 3.055378, est err 0.181513]
##
## if
## dist > 0.0703468
## then
## outcome = 30.512065 - 0.87 dist - 0.000154 x
##
## Rule 2/3: [55 cases, mean 2.543648, range 2.075547 to 3.055378, est err 0.125950]
##
## if
## dist > 0.0703468
## ffreq = 1
## then
## outcome = 37.730889 - 0.000314 x - 0.35 dist + 6.5e-05 y
##
## Rule 2/4: [34 cases, mean 2.958686, range 2.574031 to 3.264582, est err 0.139639]
##
## if
## dist <= 0.0703468
## then
## outcome = 2.982852 - 0.36 dist
##
## Model 3:
##
## Rule 3/1: [75 cases, mean 2.325347, range 2.053078 to 2.89098, est err 0.150262]
##
## if
## dist > 0.211846
## then
## outcome = 2.263175 - 0.11 dist
##
## Rule 3/2: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.156918]
##
## if
## dist <= 0.211846
## then
## outcome = 2.978074 - 2.42 dist
##
## Model 4:
##
## Rule 4/1: [121 cases, mean 2.443053, range 2.053078 to 3.055378, est err 0.161020]
##
## if
## dist > 0.0703468
## then
## outcome = 4.601694 - 0.61 dist - 0.000101 x + 4.9e-05 y
##
## Rule 4/2: [51 cases, mean 2.505994, range 2.075547 to 3.055378, est err 0.196863]
##
## if
## x <= 179731
## dist > 0.0703468
## then
## outcome = 175.779227 - 0.000967 x
##
## Rule 4/3: [55 cases, mean 2.543648, range 2.075547 to 3.055378, est err 0.147504]
##
## if
## dist > 0.0703468
## ffreq = 1
## then
## outcome = 53.629078 - 0.000343 x - 0.44 dist + 3.3e-05 y
##
## Rule 4/4: [34 cases, mean 2.958686, range 2.574031 to 3.264582, est err 0.140965]
##
## if
## dist <= 0.0703468
## then
## outcome = 2.975269 - 0.08 dist
##
##
## Evaluation on training data (155 cases):
##
## Average |error| 0.110135
## Relative |error| 0.40
## Correlation coefficient 0.84
##
##
## Attribute usage:
## Conds Model
##
## 95% 88% dist
## 21% 64% x
## 19% ffreq
## 39% y
##
##
## Time: 0.0 secs
meuse.grid.sp$predictedLog10Zn.cubist.2 <-
predict(c.model.grid.2, newdata=meuse.grid[,preds],
neighbors=5)
summary(meuse.grid.sp$predictedLog10Zn.cubist.2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.954 2.228 2.346 2.445 2.640 3.154
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist.2", col.regions=bpy.colors(64),
main="Optimized Cubist prediction")
A much nicer looking map.
The usage field of the fitted model shows the usage of each variable either in the rule conditions or in the (terminal) linear models.
c.model.grid$usage
## Conditions Model Variable
## 1 100 100 dist
## 2 48 52 x
## 3 0 57 y
## 4 0 0 soil
## 5 0 0 ffreq
c.model.grid.2$usage
## Conditions Model Variable
## 1 95 88 dist
## 2 21 64 x
## 3 19 0 ffreq
## 4 0 39 y
## 5 0 0 soil
caret::varImp(c.model.grid)
## Overall
## dist 100.0
## x 50.0
## y 28.5
## soil 0.0
## ffreq 0.0
caret::varImp(c.model.grid.2)
## Overall
## dist 91.5
## x 42.5
## ffreq 9.5
## y 19.5
## soil 0.0
Distance is used in every rule of the single model, but not in every rule of the committee model.
Notice also how flood frequency entered the model once committees were formed, i.e., in subsidiary rules that adjust for the mis-fits of the earlier committee members.
It is interesting to see how the different models fit a single predictor, in this case distance to river:
distances <- all.preds[, "dist", drop = FALSE]
rules_only <- cubist(x = distances, y = all.resp)
rules_and_com <- cubist(x = distances, y = all.resp, committees = 4)
predictions <- distances
predictions$true <- all.resp
predictions$rules <- predict(rules_only, distances, neighbors = 0)
predictions$rules_neigh <- predict(rules_only, distances, neighbors = 5)
predictions$committees <- predict(rules_and_com, distances, neighbors = 0)
predictions$committees_neigh <- predict(rules_and_com, distances, neighbors = 5)
summary(predictions)
## dist true rules rules_neigh
## Min. :0.00000 Min. :2.053 Min. :2.069 Min. :2.046
## 1st Qu.:0.07569 1st Qu.:2.297 1st Qu.:2.311 1st Qu.:2.331
## Median :0.21184 Median :2.513 Median :2.482 Median :2.541
## Mean :0.24002 Mean :2.556 Mean :2.545 Mean :2.556
## 3rd Qu.:0.36407 3rd Qu.:2.829 3rd Qu.:2.804 3rd Qu.:2.791
## Max. :0.88039 Max. :3.265 Max. :2.982 Max. :2.992
## committees committees_neigh
## Min. :2.043 Min. :2.035
## 1st Qu.:2.347 1st Qu.:2.330
## Median :2.495 Median :2.524
## Mean :2.545 Mean :2.555
## 3rd Qu.:2.771 3rd Qu.:2.791
## Max. :2.924 Max. :2.990
plot(true ~ dist, data=predictions, pch=20, col="darkgray",
xlab="covariate: normalized distance to river",
ylab="log10(Zn)", main="Cubist models")
grid()
m <- matrix(cbind(predictions$dist, predictions$rules), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=1, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$rules_neigh), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=2, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$committees), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=3, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$committees_neigh), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=4, lwd=2)
legend("topright", c("rules", "rules + neighbour",
"committee", "committee + neighbour"),
pch="----", col=1:4)
Notice how the committee model (green) smooths out the centre of the distance range compared with the single-rules model (black).
Notice also how the neighbour-adjusted predictions come much closer to some of the points – perhaps too close?
A pure “instance-based” approach is to predict at each location from the (possibly weighted) average of some set of similar observations. The caret package includes k-nearest-neighbour (KNN) regression with the knnreg function.
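(In symbols, the unweighted KNN prediction at a point \(\mathbf{x}_0\) is \(\hat{y}(\mathbf{x}_0) = \frac{1}{k} \sum_{i \in N_k(\mathbf{x}_0)} y_i\), where \(N_k(\mathbf{x}_0)\) is the set of the \(k\) training observations closest to \(\mathbf{x}_0\) in predictor space.)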
Try with five neighbours:
knn.5 <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq, data=meuse, k=5)
knn.pred <- predict(knn.5, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
However, unless we know a good value of \(k\) a priori, we should let train find the optimum number of neighbours.
knn.tune <- train(x = all.preds, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
str(knn.tune)
## List of 21
## $ method : chr "knn"
## $ modelInfo :List of 13
## ..$ label : chr "k-Nearest Neighbors"
## ..$ library : NULL
## ..$ loop : NULL
## ..$ type : chr [1:2] "Classification" "Regression"
## ..$ parameters:'data.frame': 1 obs. of 3 variables:
## .. ..$ parameter: chr "k"
## .. ..$ class : chr "numeric"
## .. ..$ label : chr "#Neighbors"
## ..$ grid :function (x, y, len = NULL, search = "grid")
## ..$ fit :function (x, y, wts, param, lev, last, classProbs, ...)
## ..$ predict :function (modelFit, newdata, submodels = NULL)
## ..$ predictors:function (x, ...)
## ..$ tags : chr "Prototype Models"
## ..$ prob :function (modelFit, newdata, submodels = NULL)
## ..$ levels :function (x)
## ..$ sort :function (x)
## $ modelType : chr "Regression"
## $ results :'data.frame': 20 obs. of 7 variables:
## ..$ k : int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ RMSE : num [1:20] 0.239 0.214 0.201 0.206 0.209 ...
## ..$ Rsquared : num [1:20] 0.512 0.562 0.615 0.594 0.576 ...
## ..$ MAE : num [1:20] 0.173 0.161 0.156 0.16 0.161 ...
## ..$ RMSESD : num [1:20] 0.0651 0.0473 0.0399 0.0355 0.0349 ...
## ..$ RsquaredSD: num [1:20] 0.19 0.16 0.139 0.127 0.152 ...
## ..$ MAESD : num [1:20] 0.0442 0.0344 0.0319 0.0282 0.0277 ...
## $ pred : NULL
## $ bestTune :'data.frame': 1 obs. of 1 variable:
## ..$ k: int 3
## $ call : language train.default(x = all.preds, y = all.resp, method = "knn", trControl = trainControl(method = "cv"), tuneGrid| __truncated__
## $ dots : list()
## $ metric : chr "RMSE"
## $ control :List of 28
## ..$ method : chr "cv"
## ..$ number : num 10
## ..$ repeats : logi NA
## ..$ search : chr "grid"
## ..$ p : num 0.75
## ..$ initialWindow : NULL
## ..$ horizon : num 1
## ..$ fixedWindow : logi TRUE
## ..$ skip : num 0
## ..$ verboseIter : logi FALSE
## ..$ returnData : logi TRUE
## ..$ returnResamp : chr "final"
## ..$ savePredictions : chr "none"
## ..$ classProbs : logi FALSE
## ..$ summaryFunction :function (data, lev = NULL, model = NULL)
## ..$ selectionFunction: chr "best"
## ..$ preProcOptions :List of 6
## .. ..$ thresh : num 0.95
## .. ..$ ICAcomp : num 3
## .. ..$ k : num 5
## .. ..$ freqCut : num 19
## .. ..$ uniqueCut: num 10
## .. ..$ cutoff : num 0.9
## ..$ sampling : NULL
## ..$ index :List of 10
## .. ..$ Fold01: int [1:140] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold02: int [1:139] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold03: int [1:141] 1 2 3 4 6 7 8 10 11 12 ...
## .. ..$ Fold04: int [1:139] 1 2 3 4 5 7 9 11 12 13 ...
## .. ..$ Fold05: int [1:139] 2 3 4 5 6 8 9 10 11 12 ...
## .. ..$ Fold06: int [1:139] 1 2 4 5 6 7 8 9 10 11 ...
## .. ..$ Fold07: int [1:140] 1 2 3 5 6 7 8 9 10 11 ...
## .. ..$ Fold08: int [1:140] 1 3 4 5 6 7 8 9 10 11 ...
## .. ..$ Fold09: int [1:139] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold10: int [1:139] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ indexOut :List of 10
## .. ..$ Resample01: int [1:15] 11 12 19 24 28 48 70 81 87 90 ...
## .. ..$ Resample02: int [1:16] 31 32 46 47 51 63 69 79 96 100 ...
## .. ..$ Resample03: int [1:14] 5 9 44 59 64 76 80 94 117 124 ...
## .. ..$ Resample04: int [1:16] 6 8 10 15 30 56 62 72 73 77 ...
## .. ..$ Resample05: int [1:16] 1 7 21 25 34 35 45 68 71 92 ...
## .. ..$ Resample06: int [1:16] 3 36 38 41 42 43 52 54 65 91 ...
## .. ..$ Resample07: int [1:15] 4 13 18 22 39 50 55 74 78 84 ...
## .. ..$ Resample08: int [1:15] 2 14 20 27 49 57 60 83 88 99 ...
## .. ..$ Resample09: int [1:16] 16 17 26 29 33 40 53 75 82 95 ...
## .. ..$ Resample10: int [1:16] 23 37 58 61 66 67 86 101 106 115 ...
## ..$ indexFinal : NULL
## ..$ timingSamps : num 0
## ..$ predictionBounds : logi [1:2] FALSE FALSE
## ..$ seeds :List of 11
## .. ..$ : int [1:20] 331478 879352 914281 736229 345552 137000 305698 936220 477590 796949 ...
## .. ..$ : int [1:20] 942228 606351 531425 563971 556165 350857 269839 165305 219210 793670 ...
## .. ..$ : int [1:20] 344001 499400 435143 425003 315768 600008 430462 674254 508596 663124 ...
## .. ..$ : int [1:20] 25303 853779 773164 637561 418224 680961 650579 986430 29211 763715 ...
## .. ..$ : int [1:20] 121081 756384 51529 496944 72256 110558 7104 941019 520871 107068 ...
## .. ..$ : int [1:20] 347957 554661 820542 192229 343071 628777 425467 597347 575189 599158 ...
## .. ..$ : int [1:20] 190643 924754 983605 127721 912488 828814 458233 144726 662184 451934 ...
## .. ..$ : int [1:20] 70305 564507 62247 439284 658329 807435 317743 900451 251307 956902 ...
## .. ..$ : int [1:20] 757682 729339 29882 784946 140110 991703 378115 589952 624084 83332 ...
## .. ..$ : int [1:20] 775494 583733 85255 556345 587481 779124 383623 180975 784581 181348 ...
## .. ..$ : int 86593
## ..$ adaptive :List of 4
## .. ..$ min : num 5
## .. ..$ alpha : num 0.05
## .. ..$ method : chr "gls"
## .. ..$ complete: logi TRUE
## ..$ trim : logi FALSE
## ..$ allowParallel : logi TRUE
## ..$ yLimits : num [1:2] 1.99 3.33
## $ finalModel :List of 8
## ..$ learn :List of 2
## .. ..$ y: num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
## .. ..$ X: chr [1:155, 1:5] "181072" "181025" "181165" "181298" ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:155] "X1" "X2" "X3" "X4" ...
## .. .. .. ..$ : chr [1:5] "x" "y" "dist" "soil" ...
## ..$ k : int 3
## ..$ theDots : list()
## ..$ xNames : chr [1:5] "x" "y" "dist" "soil" ...
## ..$ problemType: chr "Regression"
## ..$ tuneValue :'data.frame': 1 obs. of 1 variable:
## .. ..$ k: int 3
## ..$ obsLevels : logi NA
## ..$ param : list()
## ..- attr(*, "class")= chr "knnreg"
## $ preProcess : NULL
## $ trainingData:'data.frame': 155 obs. of 6 variables:
## ..$ x : num [1:155] 181072 181025 181165 181298 181307 ...
## ..$ y : num [1:155] 333611 333558 333537 333484 333330 ...
## ..$ dist : num [1:155] 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## ..$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## ..$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ .outcome: num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
## $ ptype :'data.frame': 0 obs. of 5 variables:
## ..$ x : num(0)
## ..$ y : num(0)
## ..$ dist : num(0)
## ..$ soil : Factor w/ 3 levels "1","2","3":
## ..$ ffreq: Factor w/ 3 levels "1","2","3":
## $ resample :'data.frame': 10 obs. of 4 variables:
## ..$ RMSE : num [1:10] 0.159 0.231 0.23 0.232 0.192 ...
## ..$ Rsquared: num [1:10] 0.772 0.457 0.675 0.566 0.659 ...
## ..$ MAE : num [1:10] 0.129 0.182 0.148 0.174 0.148 ...
## ..$ Resample: chr [1:10] "Fold07" "Fold02" "Fold09" "Fold06" ...
## $ resampledCM : NULL
## $ perfNames : chr [1:3] "RMSE" "Rsquared" "MAE"
## $ maximize : logi FALSE
## $ yLimits : num [1:2] 1.99 3.33
## $ times :List of 3
## ..$ everything: 'proc_time' Named num [1:5] 1.009 0.006 1.019 0 0
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ final : 'proc_time' Named num [1:5] 0.001 0 0.002 0 0
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ prediction: logi [1:3] NA NA NA
## $ levels : logi NA
## - attr(*, "class")= chr "train"
print(knn.tune)
## k-Nearest Neighbors
##
## 155 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 140, 139, 141, 139, 139, 139, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.2387850 0.5123081 0.1734232
## 2 0.2143472 0.5620450 0.1607389
## 3 0.2014253 0.6146012 0.1562451
## 4 0.2063183 0.5935644 0.1598601
## 5 0.2086454 0.5761987 0.1608352
## 6 0.2107982 0.5680247 0.1620349
## 7 0.2142721 0.5563196 0.1662006
## 8 0.2132208 0.5608793 0.1665361
## 9 0.2210012 0.5239257 0.1747875
## 10 0.2172485 0.5432590 0.1732183
## 11 0.2229628 0.5254602 0.1776785
## 12 0.2283766 0.5089757 0.1837060
## 13 0.2324025 0.4987415 0.1866540
## 14 0.2328633 0.5024031 0.1892235
## 15 0.2344155 0.4959076 0.1909953
## 16 0.2398379 0.4733137 0.1976369
## 17 0.2426532 0.4626844 0.2005701
## 18 0.2441876 0.4651065 0.2042799
## 19 0.2462564 0.4589822 0.2058679
## 20 0.2493915 0.4487920 0.2099251
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
min(knn.tune$results$RMSE)
## [1] 0.2014253
(ix <- which.min(knn.tune$results$RMSE))
## [1] 3
max(knn.tune$results$Rsquared)
## [1] 0.6146012
(ix <- which.max(knn.tune$results$Rsquared))
## [1] 3
# directly find optimum
knn.tune$finalModel$k
## [1] 3
Here there is a clear winner, k = 3. It is not a very good model: cross-validation RMSE about 0.201, \(R^2 =\) 0.615.
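We can also read the cross-validation statistics for the selected k directly from the tuning results (a quick check):
subset(knn.tune$results, k == knn.tune$finalModel$k)[, c("k", "RMSE", "Rsquared", "MAE")]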
Use this tuned k to predict back at the known points:
knn.final <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
knn.pred <- predict(knn.final, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
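A quick check of the within-sample fit (note this is a re-substitution fit, not cross-validation, so it will be optimistic compared to the tuning RMSE above):
sqrt(mean((knn.pred - meuse$logZn)^2))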
We can’t map with this because we don’t have elevation in the prediction grid, so re-tune without it:
preds.2 <- c("x","y","ffreq","dist.m","soil","lime")
all.preds.2 <- meuse[, preds.2]
knn.tune <- train(x = all.preds.2, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
min(knn.tune$results$RMSE)
## [1] 0.1923328
(ix <- which.min(knn.tune$results$RMSE))
## [1] 3
max(knn.tune$results$Rsquared)
## [1] 0.6350597
(ix <- which.max(knn.tune$results$Rsquared))
## [1] 3
# directly find optimum
knn.tune$finalModel$k
## [1] 3
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
## RMSE and R-squared agree on k = 3; check whether MAE does as well
Map with this, keeping in the final model only predictors that are also available in the prediction grid (x, y, dist, ffreq):
knn.final <- caret::knnreg(logZn ~ x + y + dist + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
meuse.grid.sp$knn <- predict(knn.final, newdata=meuse.grid)
summary(meuse.grid.sp$knn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.074 2.276 2.429 2.475 2.669 3.217
spplot(meuse.grid.sp, zcol="knn")
Interesting map! Note that here the ‘nearest neighbour’ concept mixes geographic space (the x and y coordinates and the distance to the river) with feature space (flood frequency class).
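Note also that the predictors are on very different scales (coordinates in metres, normalized distance in [0, 1]), so the Euclidean distances used by kNN are dominated by the coordinates. As an optional check (a minimal sketch, not run here; knn.tune.std is a new object name), the predictors can be standardized within the resampling via caret's preProcess argument to see whether the tuning changes:
knn.tune.std <- train(x = all.preds.2, y = all.resp,
                      method = "knn",
                      preProcess = c("center", "scale"),
                      tuneGrid = data.frame(.k = 1:20),
                      trControl = trainControl(method = 'cv'))
print(knn.tune.std)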
Random forests can use coordinates and distances to geographic features as predictors; we used the x and y coordinates in the previous section.
RF can also use distances to multiple points as predictors:
Reference: Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B. M., & Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ, 6, e5518. https://doi.org/10.7717/peerj.5518
Packages used in this section: randomForest to build the RF, scales for visualization, landmap for buffer distances.
library(randomForest)   # to build the random forests
# library(quantregForest) # only needed for quantile regression forests, not used here
library(scales) # for visualization
# devtools::install_github("envirometrix/landmap")
library(landmap) # for buffer distances
Set up quantiles of the log10-transformed Zn concentration:
library(ggplot2)
g <- ggplot(meuse, mapping=aes(x=logZn))
g + geom_histogram(bins=16, fill="lightgreen", color="blue") + geom_rug()
(q.lzn <- quantile(meuse$logZn, seq(0,1,by=0.0625)))
# include.lowest=TRUE so that the minimum value is assigned to the first class
classes.lzn.q <- cut(meuse$logZn, breaks=q.lzn, ordered_result=TRUE, include.lowest=TRUE)
levels(classes.lzn.q)
## 0% 6.25% 12.5% 18.75% 25% 31.25% 37.5% 43.75%
## 2.053078 2.139461 2.201372 2.261554 2.296665 2.322219 2.362643 2.414125
## 50% 56.25% 62.5% 68.75% 75% 81.25% 87.5% 93.75%
## 2.513218 2.603414 2.680704 2.754247 2.828981 2.873820 2.920515 3.056094
## 100%
## 3.264582
## [1] "[2.05,2.14]" "(2.14,2.2]" "(2.2,2.26]" "(2.26,2.3]" "(2.3,2.32]"
## [6] "(2.32,2.36]" "(2.36,2.41]" "(2.41,2.51]" "(2.51,2.6]" "(2.6,2.68]"
## [11] "(2.68,2.75]" "(2.75,2.83]" "(2.83,2.87]" "(2.87,2.92]" "(2.92,3.06]"
## [16] "(3.06,3.26]"
Q: Are the 16 quantiles more or less equally distributed along the range of the target variable?
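One way to investigate this (a quick sketch): look at the widths of the 16 classes along the logZn range, and at the number of observations per class.
round(diff(q.lzn), 3)   # class widths along the range of logZn
table(classes.lzn.q)    # observations per class; nearly equal by construction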
Compute distance buffers to the points in each quantile class. For this, the points object must be converted to a SpatialPointsDataFrame, and the buffer.dist function in the landmap package is used. To display the distance buffers, we stack the fields of the data frame as a set of rasters, using the stack method of the raster package.
coordinates(meuse) <- ~x + y
grid.dist.lzn <- landmap::buffer.dist(meuse["logZn"], meuse.grid.sp, classes.lzn.q)
summary(grid.dist.lzn)
## Object of class SpatialPixelsDataFrame
## Coordinates:
## min max
## x 178440 181560
## y 329600 333760
## Is projected: NA
## proj4string : [NA]
## Number of points: 3103
## Grid attributes:
## cellcentre.offset cellsize cells.dim
## s1 178460 40 78
## s2 329620 40 104
## Data attributes:
## layer.1 layer.2 layer.3 layer.4
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 291.2 1st Qu.: 233.2 1st Qu.: 233.2 1st Qu.:200.0
## Median : 605.3 Median : 377.4 Median : 377.4 Median :322.5
## Mean : 695.5 Mean : 513.4 Mean : 401.1 Mean :330.4
## 3rd Qu.:1038.5 3rd Qu.: 698.6 3rd Qu.: 536.7 3rd Qu.:441.8
## Max. :2050.6 Max. :1938.7 Max. :1193.3 Max. :937.2
## layer.5 layer.6 layer.7 layer.8
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:240.0 1st Qu.: 240.0 1st Qu.: 204.0 1st Qu.: 233.2
## Median :377.4 Median : 368.8 Median : 329.8 Median : 379.5
## Mean :393.4 Mean : 379.4 Mean : 363.2 Mean : 430.7
## 3rd Qu.:521.5 3rd Qu.: 501.2 3rd Qu.: 494.8 3rd Qu.: 560.0
## Max. :990.4 Max. :1245.8 Max. :1131.4 Max. :1631.7
## layer.9 layer.10 layer.11 layer.12
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:256.1 1st Qu.: 243.3 1st Qu.: 322.5 1st Qu.: 243.3
## Median :417.6 Median : 402.0 Median : 544.1 Median : 425.2
## Mean :431.3 Mean : 464.3 Mean : 570.4 Mean : 482.8
## 3rd Qu.:600.0 3rd Qu.: 632.5 3rd Qu.: 776.7 3rd Qu.: 648.7
## Max. :950.8 Max. :1302.9 Max. :1612.5 Max. :1531.0
## layer.13 layer.14 layer.15 layer.16
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 282.8 1st Qu.: 291.2 1st Qu.: 312.4 1st Qu.: 339.4
## Median : 483.3 Median : 468.2 Median : 526.1 Median : 536.7
## Mean : 475.3 Mean : 533.7 Mean : 563.6 Mean : 538.6
## 3rd Qu.: 651.2 3rd Qu.: 688.2 3rd Qu.: 800.0 3rd Qu.: 720.0
## Max. :1019.8 Max. :1697.5 Max. :1618.4 Max. :1322.4
plot(raster::stack(grid.dist.lzn))
Q: Explain the pattern of Layer 16.
Extract the distances to each buffer at each point using the over (“overlay point on grid”) function of the sp package, and add these as fields to the points:
buffer.dists <- over(meuse, grid.dist.lzn)
str(buffer.dists)
## 'data.frame': 155 obs. of 16 variables:
## $ layer.1 : num 1897 1809 1865 1878 1737 ...
## $ layer.2 : num 1784 1695 1755 1772 1633 ...
## $ layer.3 : num 735 680 621 561 402 ...
## $ layer.4 : num 468 412 362 330 179 ...
## $ layer.5 : num 804 730 721 699 544 ...
## $ layer.6 : num 685 612 601 582 431 ...
## $ layer.7 : num 268 283 126 0 160 ...
## $ layer.8 : num 369 330 233 160 0 ...
## $ layer.9 : num 268 226 160 170 126 ...
## $ layer.10: num 243 160 226 305 283 ...
## $ layer.11: num 369 283 344 400 330 ...
## $ layer.12: num 144 160 0 126 233 ...
## $ layer.13: num 734 645 738 794 700 ...
## $ layer.14: num 1531 1442 1537 1585 1471 ...
## $ layer.15: num 0 89.4 144.2 268.3 368.8 ...
## $ layer.16: num 89.4 0 160 282.8 344.1 ...
meuse@data <- cbind(meuse@data, buffer.dists)
Show how far point 1 is from the nearest point of each quantile:
buffer.dists[1,]
## layer.1 layer.2 layer.3 layer.4 layer.5 layer.6 layer.7 layer.8
## 1 1897.367 1783.928 735.3911 468.188 803.99 684.6897 268.3282 368.7818
## layer.9 layer.10 layer.11 layer.12 layer.13 layer.14 layer.15 layer.16
## 1 268.3282 243.3105 368.7818 144.2221 734.3024 1531.013 0 89.44272
Of course the distance to its own quantile class, in this case class 15, is zero.
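We can confirm this directly (a quick check):
meuse$logZn[1]                 # the value at point 1
classes.lzn.q[1]               # its quantile class ...
as.integer(classes.lzn.q[1])   # ... which is the 15th class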
First make a formula with the distances to all 16 quantile buffers as predictors:
dn <- paste(names(grid.dist.lzn), collapse="+")
(fm <- as.formula(paste("logZn ~", dn)))
## logZn ~ layer.1 + layer.2 + layer.3 + layer.4 + layer.5 + layer.6 +
## layer.7 + layer.8 + layer.9 + layer.10 + layer.11 + layer.12 +
## layer.13 + layer.14 + layer.15 + layer.16
Now compute the random forest model and use it to predict at the observation points.
# note: randomForest has no 'min.split' argument (its closest equivalent is 'nodesize'), so it is effectively ignored here
(rf <- randomForest(fm , meuse@data, importance=TRUE, min.split=6, mtry=6, ntree=640))
##
## Call:
## randomForest(formula = fm, data = meuse@data, importance = TRUE, min.split = 6, mtry = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.02515229
## % Var explained: 74.24
pred.rf <- predict(rf, newdata=meuse@data)
plot(rf)
Q: How many trees were needed to make a reliable forest?
Variable importance:
importance(rf, type=1)
## %IncMSE
## layer.1 15.777625
## layer.2 14.000921
## layer.3 12.192586
## layer.4 11.116130
## layer.5 11.328830
## layer.6 17.637204
## layer.7 8.513953
## layer.8 9.900595
## layer.9 10.901250
## layer.10 10.956214
## layer.11 5.475709
## layer.12 17.369070
## layer.13 11.046008
## layer.14 12.776170
## layer.15 16.842512
## layer.16 23.586323
varImpPlot(rf, type=1)
Q: Which distance buffer was most important? Explain why.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
plot(meuse$logZn ~ pred.rf, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf <- sqrt(sum((pred.rf-meuse$logZn)^2)/length(pred.rf)))
## [1] 0.06573453
#
oob.rf <- predict(rf)
plot(meuse$logZn ~ oob.rf, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob <- sqrt(sum((oob.rf-meuse$logZn)^2)/length(oob.rf)))
## [1] 0.1585947
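As a cross-check (a small sketch): squaring the OOB RMSE reproduces the ‘Mean of squared residuals’ reported by print(rf) above, and ‘% Var explained’ is approximately one minus that MSE divided by the variance of logZn.
rmse.oob^2                          # compare with the mean of squared residuals above
1 - rmse.oob^2 / var(meuse$logZn)   # roughly the '% Var explained'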
Q: What are the goodness-of-fit and Out-of-bag RMSE?
pred.grid <- predict(rf, newdata=grid.dist.lzn@data)
summary(pred.grid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.100 2.316 2.436 2.486 2.648 3.179
meuse.grid.sp$logZn.rf <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p1 <- spplot(meuse.grid.sp, zcol="logZn.rf", main="Zn, log(mg kg-1)", sub="RF, 16 distance buffers", at=breaks)
print(p1)
This random forest is based on distances from known points to prediction points. An obvious comparison is with Ordinary Kriging (OK), which is also based on these distances. However, OK requires a model of spatial dependence (i.e., a variogram model).
Compare this with OK cross-validation. First compute the ordinary variogram and fit a model:
require(gstat)
v <- variogram(logZn ~ 1, meuse)
(vmf <- fit.variogram(v, model=vgm(0.1, "Sph", 1000, 0.02)))
## model psill range
## 1 Nug 0.009554751 0.0000
## 2 Sph 0.111394769 896.9909
plot(v, model=vmf, plot.numbers=T)
Here we get a well-fit model.
Now compute the cross-validation statistics:
k.cv <- krige.cv(logZn ~ 1, meuse, model=vmf, verbose=FALSE)
plot(meuse$logZn ~ k.cv$var1.pred, pch=20, xlab="Cross-validation fit", ylab="Actual value", main="Zn, log(mg kg-1)", asp=1)
abline(0,1); grid()
(rmse.kcv <- sqrt(sum((k.cv$var1.pred-meuse$logZn)^2)/length(k.cv$var1.pred)))
## [1] 0.170157
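Collect the two error measures for a side-by-side look (a small sketch):
data.frame(method = c("Spatial RF (out-of-bag)", "OK (leave-one-out CV)"),
           RMSE   = c(rmse.oob, rmse.kcv))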
Q: Which approach gives the lower cross-validation error in this case?
Compare the kriging map with the spatial random forest map:
k.ok <- krige(logZn ~ 1, meuse, newdata=meuse.grid.sp, model=vmf)
## [using ordinary kriging]
summary(k.ok$var1.pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.074 2.275 2.420 2.479 2.680 3.231
p2 <- spplot(k.ok, zcol="var1.pred", main="Zn, log(mg kg-1)", sub="Ordinary Kriging", at=breaks)
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
Compute and display the differences:
k.ok$diff.srf.ok <- meuse.grid.sp$logZn.rf - k.ok$var1.pred
summary(k.ok$diff.srf.ok)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.237016 -0.042862 0.004085 0.007542 0.047866 0.385077
p3 <- spplot(k.ok, zcol="diff.srf.ok", main="Spatial RF - OK predictions, log(Zn)", col.regions=topo.colors(36))
print(p3)
Q: Comment on the differences/similarities between the Spatial Random Forest map and the Ordinary Kriging map. Note that for kriging we needed to specify a variogram model, whereas for RF no model was needed.
The concept of quantile buffers is similar to that of the empirical variogram: both assume stationarity of the spatial correlation structure.
Another approach is to use the distance from each known point as a predictor. Obviously this increases the size of the problem. This is the approach taken in the Hengl paper cited above.
Set up the buffer distances from each known point and visualize a few of them:
grid.dist.0 <- landmap::buffer.dist(meuse["logZn"], meuse.grid.sp, as.factor(1:nrow(meuse)))
dim(grid.dist.0)
## [1] 3103 155
summary(grid.dist.0$layer.1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1677 2685 2556 3506 4569
plot(raster::stack(grid.dist.0[1:9]))
Extract the distances to each buffer at each point, and add these as fields to the points. Note that we first have to remove the previous buffer distances (the 16 quantiles).
buffer.dists <- over(meuse, grid.dist.0)
dim(buffer.dists)
## [1] 155 155
# distances from point 1 to the first 10 points (point 1 itself included)
(buffer.dists[1,1:10])
## layer.1 layer.2 layer.3 layer.4 layer.5 layer.6 layer.7 layer.8
## 1 0 89.44272 144.2221 268.3282 368.7818 481.6638 268.3282 243.3105
## layer.9 layer.10
## 1 400 468.188
names(meuse@data)
## [1] "cadmium" "copper" "lead" "zinc" "elev" "dist"
## [7] "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [13] "logZn" "logCu" "logPb" "logCd" "layer.1" "layer.2"
## [19] "layer.3" "layer.4" "layer.5" "layer.6" "layer.7" "layer.8"
## [25] "layer.9" "layer.10" "layer.11" "layer.12" "layer.13" "layer.14"
## [31] "layer.15" "layer.16"
meuse@data <- meuse@data[,1:16]
meuse@data <- cbind(meuse@data, buffer.dists)
So now we have 155 buffer distances. Set up a formula for the RF and build the RF. Note that the default mtry will be 51, i.e., distances to about one-third of the points. These will be chosen randomly at each split, so they will in general not be the closest points.
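We can verify the default mtry (a one-line check):
floor(ncol(buffer.dists) / 3)   # default mtry for RF regression is floor(p/3), here 51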
dn0 <- paste(names(buffer.dists), collapse="+")
fm0 <- as.formula(paste("logZn ~ ", dn0))
(rf0 <- randomForest(fm0 , meuse@data, importance=TRUE,
min.split=6, ntree=640))
##
## Call:
## randomForest(formula = fm0, data = meuse@data, importance = TRUE, min.split = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 51
##
## Mean of squared residuals: 0.03369799
## % Var explained: 65.49
This does not perform as well as using the quantile buffers: the \(R^2\) is about nine percentage points lower (65.5% vs. 74.2%).
Mapping:
pred.grid <- predict(rf0, newdata=grid.dist.0@data)
summary(pred.grid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.099 2.308 2.446 2.488 2.679 3.172
meuse.grid.sp$logZn.rf0 <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p2 <- spplot(meuse.grid.sp, zcol="logZn.rf0", main="Zn, log(mg kg-1)",
sub="RF distance all points", at=breaks)
# compare with 16-quantile buffers
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
A clearly inferior map. So it appears we do not need the distances to each individual point; the distances to the quantile buffers carry the useful information.
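Out of curiosity, we can see which individual point distances the forest found most important (a quick sketch; output not shown):
imp0 <- importance(rf0, type=1)
head(sort(imp0[, 1], decreasing=TRUE), 8)   # the eight most important point-distance layers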
Random forests can also be used to predict classes. As with RF for continuous variables, the forest is made up of many trees. The prediction is the class predicted by the majority of trees.
m.ffreq.rf <- randomForest(ffreq ~ x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.ffreq.rf)
##
## Call:
## randomForest(formula = ffreq ~ x + y + dist.m + elev + soil + lime, data = meuse, importance = T, mtry = 3, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.13%
## Confusion matrix:
## 1 2 3 class.error
## 1 77 7 0 0.08333333
## 2 2 41 5 0.14583333
## 3 2 9 12 0.47826087
importance(m.ffreq.rf)
## 1 2 3 MeanDecreaseAccuracy MeanDecreaseGini
## x 18.29162 11.344016 8.060948 21.165561 13.189545
## y 34.98967 20.135347 41.654420 47.717887 32.971857
## dist.m 10.11069 1.711775 5.707771 9.391853 7.262792
## elev 57.11589 36.914994 12.138088 60.901345 32.347641
## soil 10.15313 2.885389 4.660304 10.613597 2.226856
## lime 11.18376 -2.155811 8.122504 11.160544 2.380405
varImpPlot(m.ffreq.rf)
Here we see the per-class error rate, i.e., the proportion of training samples not predicted into the correct class (based on the out-of-bag predictions). In this example class 3 is poorly predicted, whereas class 1 is well predicted.
Elevation is by far the most important predictor of accuracy for all classes. The coordinates are next, because of the geographic configuration of the study area; they can substitute for distance to the river when that is not used in the model.
However, for the Gini criterion (how often a randomly-chosen training observation would be incorrectly assigned if it were randomly labelled according to the frequency distribution of labels in the subset), the y-coordinate becomes somewhat more important, perhaps because of the one high value far to the east, at the river bend.
See where the predictors are used:
library(randomForestExplainer)   # for min_depth_distribution and its plot
min_depth_frame <- min_depth_distribution(m.ffreq.rf)
plot_min_depth_distribution(min_depth_frame)
Soil type and lime application are not used in more than half the trees. The y-coordinate is used first in most trees, followed by elevation and x-coordinate.
We can see the probabilities of each class by using the model to predict back at the known points:
summary(prob_rf <- predict(m.ffreq.rf, meuse, type = "prob"))
## 1 2 3
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0380 1st Qu.:0.0050 1st Qu.:0.0000
## Median :0.8020 Median :0.0460 Median :0.0080
## Mean :0.5461 Mean :0.3103 Mean :0.1436
## 3rd Qu.:0.9910 3rd Qu.:0.7920 3rd Qu.:0.0520
## Max. :1.0000 Max. :0.9800 Max. :0.9880
# probability of each observation being predicted to class 1
prob_rf[,1]
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0.984 1.000 0.998 1.000 1.000 1.000 0.998 0.992 0.990 0.990 0.998 0.922 0.998
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 0.976 0.984 0.988 1.000 0.996 0.992 1.000 0.946 0.990 0.984 0.996 1.000 1.000
## 27 28 29 30 31 32 33 34 35 37 38 39 40
## 1.000 1.000 0.996 0.988 1.000 0.996 1.000 1.000 1.000 1.000 0.998 0.984 1.000
## 41 42 43 44 45 46 47 48 49 50 51 52 53
## 1.000 0.974 0.988 0.940 0.772 0.952 0.984 0.952 0.802 0.680 0.774 0.766 0.964
## 54 55 56 57 58 59 60 61 62 63 64 65 66
## 0.962 0.958 0.980 0.984 0.836 0.956 0.984 0.988 0.982 0.992 0.994 0.992 0.982
## 67 69 75 76 79 80 81 82 83 84 85 86 87
## 1.000 0.964 0.720 0.942 1.000 0.996 0.976 0.976 0.988 0.984 0.988 0.896 0.772
## 88 89 90 123 160 163 70 71 91 92 93 94 95
## 1.000 0.994 1.000 0.878 0.994 0.996 0.120 0.072 0.016 0.008 0.008 0.004 0.104
## 96 97 98 99 100 101 102 103 104 105 106 108 109
## 0.042 0.070 0.076 0.002 0.096 0.032 0.030 0.016 0.006 0.064 0.062 0.038 0.016
## 110 111 112 113 114 115 116 117 118 119 120 121 122
## 0.158 0.094 0.016 0.046 0.016 0.002 0.038 0.012 0.004 0.024 0.010 0.032 0.038
## 124 125 126 127 128 129 130 131 132 133 134 135 136
## 0.072 0.074 0.048 0.010 0.102 0.216 0.308 0.074 0.066 0.050 0.158 0.118 0.074
## 161 162 137 138 140 141 142 143 144 145 146 147 148
## 0.114 0.042 0.336 0.064 0.008 0.000 0.014 0.000 0.026 0.006 0.000 0.004 0.000
## 149 150 151 152 153 154 155 156 157 158 159 164
## 0.000 0.000 0.172 0.102 0.000 0.002 0.002 0.002 0.004 0.004 0.002 0.106
# an observation in order of probability
sort(prob_rf[1,], decreasing=TRUE)
## 1 2 3
## 0.984 0.008 0.008
order(prob_rf[1,], decreasing=TRUE)
## [1] 1 2 3
# its actual class -- in this case it agrees
meuse$ffreq[1]
## [1] 1
## Levels: 1 2 3
These can be summarized into lists of the most-probable, second-most-probable, etc. classes for each observation:
# list of most likely classes
probs <- apply(prob_rf, 1, order, decreasing=TRUE)
probs[,1]
## [1] 1 2 3
probs[,121]
## [1] 2 1 3
# most likely, 2nd most, 3rd most, for each observation
table(probs.1 <- probs[1,])
##
## 1 2 3
## 84 48 23
# cross-classification of most and 2nd-most probable classes
table(probs.1, probs.2 <- probs[2,])
##
## probs.1 1 2 3
## 1 0 72 12
## 2 26 0 22
## 3 3 20 0
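As a quick check (a sketch), the most-probable class at each training point can be compared with its observed class; as is typical for within-sample RF predictions, this agreement will be much higher than the OOB error rate suggests.
mean(probs[1, ] == as.integer(meuse$ffreq))   # proportion of points whose most-probable class is the observed class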
Again we need to restrict the RF model to variables also available in the prediction grid.
(m.ffreq.rf.2 <- randomForest(ffreq ~ x + y + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3))
##
## Call:
## randomForest(formula = ffreq ~ x + y + dist + soil, data = meuse, importance = T, mtry = 3, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 27.1%
## Confusion matrix:
## 1 2 3 class.error
## 1 72 11 1 0.1428571
## 2 14 29 5 0.3958333
## 3 3 8 12 0.4782609
This is less accurate than the model with elevation. Anyway, we show the map:
meuse.grid.sp$predictedffreq.rf <- predict(m.ffreq.rf.2, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedffreq.rf", main="Most probable flood frequency class")
Not a nice-looking map. It disagrees strongly with the actual flooding-frequency map.
summary(meuse.grid.sp$ffreq.tf <-
as.factor((meuse.grid.sp$predictedffreq.rf==
meuse.grid.sp$ffreq)))
## FALSE TRUE
## 1461 1642
spplot(meuse.grid.sp, zcol="ffreq.tf",
main="Accuracy of RF model")
Probability of each class at each grid cell:
meuse.grid.sp$probffreq <-
predict(m.ffreq.rf.2, newdata=meuse.grid, type="prob")
meuse.grid.sp$probffreq.rf.1 <- meuse.grid.sp$probffreq[,1]
meuse.grid.sp$probffreq.rf.2 <- meuse.grid.sp$probffreq[,2]
meuse.grid.sp$probffreq.rf.3 <- meuse.grid.sp$probffreq[,3]
p1.p <- spplot(meuse.grid.sp, zcol="probffreq.rf.1",
main="Prob. class 1")
p2.p <- spplot(meuse.grid.sp, zcol="probffreq.rf.2",
main="Prob. class 2")
p3.p <- spplot(meuse.grid.sp, zcol="probffreq.rf.3",
main="Prob. class 3")
print(p1.p, split=c(1,1,3,1), more=T)
print(p2.p, split=c(2,1,3,1), more=T)
print(p3.p, split=c(3,1,3,1), more=F)
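An optional extra view (a sketch; probffreq.max is a new field name): map the probability of the most probable class at each cell, as a rough measure of how confident the classifier is across the area.
meuse.grid.sp$probffreq.max <- pmax(meuse.grid.sp$probffreq.rf.1,
                                    meuse.grid.sp$probffreq.rf.2,
                                    meuse.grid.sp$probffreq.rf.3)
spplot(meuse.grid.sp, zcol="probffreq.max", main="Probability of the most probable class")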
Returning to the spatial random forest prediction of logZn: does adding covariates improve the prediction?
(covar.names <- paste(intersect(names(meuse), names(meuse.grid)), collapse="+"))
## [1] "dist+ffreq+soil"
(fm.c <- as.formula(paste("logZn ~", dn, "+", covar.names)))
## logZn ~ layer.1 + layer.2 + layer.3 + layer.4 + layer.5 + layer.6 +
## layer.7 + layer.8 + layer.9 + layer.10 + layer.11 + layer.12 +
## layer.13 + layer.14 + layer.15 + layer.16 + dist + ffreq +
## soil
Now compute the random forest model including these covariates, and use it to predict at the observation points. Note that layer.1 to layer.16 in this formula are meant to be the distances to the 16 quantile buffers; since meuse@data was overwritten above with the 155 individual-point distances, those columns should first be restored (e.g. meuse@data <- cbind(meuse@data[, 1:16], over(meuse, grid.dist.lzn))) so that the point data match the prediction grid built below.
(rf.c <- randomForest(fm.c , meuse@data, importance=TRUE, min.split=6, mtry=6, ntree=640))
##
## Call:
## randomForest(formula = fm.c, data = meuse@data, importance = TRUE, min.split = 6, mtry = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.02806032
## % Var explained: 71.27
pred.rf.c <- predict(rf.c, newdata=meuse@data)
plot(rf.c)
Display the variable importance:
importance(rf.c, type=1)
## %IncMSE
## layer.1 9.048894
## layer.2 9.535551
## layer.3 8.410602
## layer.4 8.587360
## layer.5 10.529506
## layer.6 11.239619
## layer.7 8.394626
## layer.8 8.562815
## layer.9 9.408776
## layer.10 10.414668
## layer.11 10.157294
## layer.12 10.373565
## layer.13 9.935686
## layer.14 8.634690
## layer.15 8.074228
## layer.16 10.795800
## dist 44.034304
## ffreq 32.572061
## soil 17.952098
varImpPlot(rf.c, type=1)
Q: Are any of the covariables important? Are any more important than the distance buffers? Explain.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
plot(meuse$logZn ~ pred.rf.c, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf.c <- sqrt(sum((pred.rf.c-meuse$logZn)^2)/length(pred.rf.c)))
## [1] 0.07758499
#
oob.rf.c <- predict(rf.c)
plot(meuse$logZn ~ oob.rf.c, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob.c <- sqrt(sum((oob.rf.c-meuse$logZn)^2)/length(oob.rf.c)))
## [1] 0.1675122
Q: What are the goodness-of-fit and Out-of-bag RMSE? How do these compare to the spatial RF without covariates?
We first have to add the covariates to the prediction grid.
grid.dist.lzn.covar <- grid.dist.lzn
grid.dist.lzn.covar$dist <- meuse.grid$dist
grid.dist.lzn.covar$ffreq <- meuse.grid$ffreq
grid.dist.lzn.covar$soil <- meuse.grid$soil
Now predict over the grid and display:
pred.grid.c <- predict(rf.c, newdata=grid.dist.lzn.covar@data)
summary(pred.grid.c)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.293 2.345 2.431 2.495 2.614 2.925
meuse.grid.sp$logZn.rf.c <- pred.grid.c
breaks <- seq(2, 3.3, by=.05)
p1.c <- spplot(meuse.grid.sp, zcol="logZn.rf.c", main="Zn, log(mg kg-1)", sub="Random forest on distance buffers & covariates", at=breaks)
print(p1.c, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
Compute and display the differences:
meuse.grid.sp$logZn.rf.diff <- meuse.grid.sp$logZn.rf.c - meuse.grid.sp$logZn.rf
summary(meuse.grid.sp$logZn.rf.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.444049 -0.078527 0.007356 0.008760 0.096491 0.438853
p3.c <- spplot(meuse.grid.sp, zcol="logZn.rf.diff", main="Spatial RF with - without covariates, log(Zn)", col.regions=topo.colors(36))
print(p3.c)
Q: Compare this to the maps made by RF/Spatial Distances without covariates. Where does using the covariates result in higher and lower predicted values? Try to explain why.
Challenge: Map logZn over the grid by Kriging with External Drift (KED), using the same three covariates (dist, ffreq, soil), and compare with the map made by spatial random forest with covariates. Reminder: the variogram must be computed for the linear-model residuals using these covariates.
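As a possible starting point (a minimal sketch only; it assumes meuse.grid.sp contains the dist, ffreq and soil fields, as meuse.grid does, and the initial variogram parameters are rough guesses to be refined):
# variogram of the residuals from the linear trend on the covariates
v.r <- variogram(logZn ~ dist + ffreq + soil, meuse)
(vmf.r <- fit.variogram(v.r, model=vgm(0.05, "Sph", 800, 0.01)))
plot(v.r, model=vmf.r, plot.numbers=T)
# KED: the same trend formula plus the fitted residual variogram
k.ked <- krige(logZn ~ dist + ffreq + soil, meuse, newdata=meuse.grid.sp, model=vmf.r)
p.ked <- spplot(k.ked, zcol="var1.pred", main="Zn, log(mg kg-1)", sub="KED", at=breaks)
print(p.ked, split=c(1,1,2,1), more=T)
print(p1.c, split=c(2,1,2,1), more=F)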
Q: Describe the differences between the residual and ordinary variogram.
Q: Describe the differences between the OK and KED maps.
Q: Compare the KED map with the spatial random forest + covariates map.