In this exercise we return to the Meuse heavy metals dataset.
Packages used in this section: sp for the sample dataset.
library(sp)
Load the sample point dataset, which comes with sp, and make a log10-transformed copy of the metal concentrations.
Also load the prediction grid: the locations where we do not know the value of the target variable and where we want to predict it, using the model built (trained, calibrated) from the known points, where we know both the target variable and the predictors.
data(meuse)
str(meuse)
## 'data.frame': 155 obs. of 14 variables:
## $ x : num 181072 181025 181165 181298 181307 ...
## $ y : num 333611 333558 333537 333484 333330 ...
## $ cadmium: num 11.7 8.6 6.5 2.6 2.8 3 3.2 2.8 2.4 1.6 ...
## $ copper : num 85 81 68 81 48 61 31 29 37 24 ...
## $ lead : num 299 277 199 116 117 137 132 150 133 80 ...
## $ zinc : num 1022 1141 640 257 269 ...
## $ elev : num 7.91 6.98 7.8 7.66 7.48 ...
## $ dist : num 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## $ om : num 13.6 14 13 8 8.7 7.8 9.2 9.5 10.6 6.3 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## $ lime : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
## $ landuse: Factor w/ 15 levels "Aa","Ab","Ag",..: 4 4 4 11 4 11 4 2 2 15 ...
## $ dist.m : num 50 30 150 270 380 470 240 120 240 420 ...
meuse$logZn <- log10(meuse$zinc) # zinc
meuse$logCu <- log10(meuse$copper) # copper
meuse$logPb <- log10(meuse$lead) # lead
meuse$logCd <- log10(meuse$cadmium) # cadmium
data(meuse.grid)
str(meuse.grid)
## 'data.frame': 3103 obs. of 7 variables:
## $ x : num 181180 181140 181180 181220 181100 ...
## $ y : num 333740 333700 333700 333700 333660 ...
## $ part.a: num 1 1 1 1 1 1 1 1 1 1 ...
## $ part.b: num 0 0 0 0 0 0 0 0 0 0 ...
## $ dist : num 0 0 0.0122 0.0435 0 ...
## $ soil : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
The objective is to model the log10(Zn) concentration from covariables, not including the other metals. One approach is the regression tree, which recursively splits the observations according to thresholds of the predictor variables, so that each split reduces the spread of the target variable within the resulting groups.
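(A one-line sketch of the split criterion, for orientation: for a continuous target, rpart with its default "anova" method chooses, at each node, the predictor and threshold \(s\) that minimize the pooled within-child sum of squares \(\sum_{i \in L(s)} (y_i - \bar{y}_L)^2 + \sum_{i \in R(s)} (y_i - \bar{y}_R)^2\), where \(L(s)\) and \(R(s)\) are the two groups created by the split.)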
Reference: Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Boca Raton, FL: Chapman & Hall/CRC Press.
Packages used in this section: sp for spatial objects, rpart for regression trees, and rpart.plot for plotting them.
library(sp)
library(rpart)
library(rpart.plot)
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [15] "logZn" "logCu" "logPb" "logCd"
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
Q: What covariables are available in the point dataset? Are these all available in the prediction grid?
We will try to model the log10(Zn) concentration as a function of flooding frequency class, distance to the river in meters, soil type, whether the site has had limestone added, and elevation above sea level. We also use the coordinates to capture any geographic trend.
Q: Why could these be good predictors of the metal concentration in this site?
Q: Do you see any problem in using so many predictors? Would you do this in a linear model?
We will use the rpart function to build the regression tree. Its behaviour is controlled through the helper function rpart.control, whose arguments can also be passed directly to rpart.
Examine the help for rpart.control.
help(rpart.control)
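As a quick illustration (a sketch only; the formula here is shortened and the parameter values arbitrary), the control settings can be given either through rpart.control or passed directly to rpart:
ctrl <- rpart.control(minsplit=2, cp=0.005)  # allow tiny nodes; keep splits that improve fit by >= 0.5%
m.tmp <- rpart(logZn ~ dist.m + elev, data=meuse, control=ctrl)  # same as rpart(..., minsplit=2, cp=0.005)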
Q: What is the meaning of the minsplit and cp parameters?
Build a regression tree for this, print its results, and display it as a tree plot. Allow splits all the way down to single observations, as long as the fit (\(R^2\)) improves by at least 0.5% at each split.
m.lzn.rp <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse,
minsplit=2,
cp=0.005)
print(m.lzn.rp)
## n= 155
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 155 15.13633000 2.556160
## 2) dist.m>=145 101 4.65211000 2.388584
## 4) elev>=6.943 93 2.83545700 2.349952
## 8) dist.m>=230 78 1.73584100 2.311717
## 16) x>=179102.5 71 1.23063300 2.288808
## 32) ffreq=2,3 40 0.36471060 2.223342
## 64) y>=330942 21 0.14958940 2.167449 *
## 65) y< 330942 19 0.07701018 2.285117 *
## 33) ffreq=1 31 0.47328450 2.373281
## 66) x< 179913.5 1 0.00000000 2.075547 *
## 67) x>=179913.5 30 0.38168440 2.383205
## 134) x>=180346.5 25 0.20839560 2.355868
## 268) y< 333191 19 0.09154361 2.320174 *
## 269) y>=333191 6 0.01598697 2.468900 *
## 135) x< 180346.5 5 0.06119400 2.519889 *
## 17) x< 179102.5 7 0.08998420 2.544084 *
## 9) dist.m< 230 15 0.39263770 2.548774
## 18) x< 180563 10 0.23688250 2.494602
## 36) x>=179405.5 7 0.10694750 2.420923 *
## 37) x< 179405.5 3 0.00326675 2.666521 *
## 19) x>=180563 5 0.06771894 2.657116 *
## 5) elev< 6.943 8 0.06436102 2.837680 *
## 3) dist.m< 145 54 2.34313500 2.869589
## 6) elev>=8.1455 15 0.49618250 2.646331
## 12) elev< 8.48 4 0.07502826 2.456464 *
## 13) elev>=8.48 11 0.22452210 2.715373 *
## 7) elev< 8.1455 39 0.81172600 2.955457
## 14) dist.m>=75 11 0.08001190 2.848717 *
## 15) dist.m< 75 28 0.55715050 2.997391
## 30) x< 179565 9 0.12124970 2.914541 *
## 31) x>=179565 19 0.34486030 3.036636
## 62) x>=180194 12 0.12169000 2.966064 *
## 63) x< 180194 7 0.06095067 3.157617 *
rpart.plot(m.lzn.rp, digits=3, type=4, extra=1)
Q: Which variables were actually used in the tree?
Q: Which variable was used at the first split?
Q: Does that agree with your theory of the origin of the metals in these soils?
Q: What was the value of the splitting variable at that split?
Q: What was the average log10(Zn) before the split, and in the two leaves after the split?
Q: How many leaves (i.e., different predictions)? What proportion of the observations is this?
Q: What is the minimum number of observations among the leaves? The maximum?
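These questions can also be answered from the fitted object rather than by reading the plot: in rpart the leaves are the rows of the frame component whose var entry is "<leaf>" (a quick sketch):
leaves <- m.lzn.rp$frame[m.lzn.rp$frame$var == "<leaf>", ]
nrow(leaves)                  # number of leaves = number of distinct predictions
nrow(leaves) / nrow(meuse)    # as a proportion of the number of observations
range(leaves$n)               # fewest and most observations in a leaf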
Now we examine the variable importance:
print(tmp <- m.lzn.rp$variable.importance)
## dist.m elev lime x y soil ffreq
## 9.410837 6.134458 4.633959 2.097574 1.911179 1.169126 1.143282
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist.m 35.512035
## elev 23.148534
## lime 17.486364
## x 7.915251
## y 7.211884
## soil 4.411728
## ffreq 4.314205
Q: Which variable explained most of the metal concentration? What proportion? Were all predictor variables used somewhere in the tree?
Have we built a tree that is too complex, i.e., one that fits noise rather than signal? In other words, one that fits this particular dataset well but not the underlying process?
Examine the cross-validation relative error vs. the complexity parameter.
plotcp(m.lzn.rp)
Q: What complexity parameter minimizes the cross-validation relative error?
We can also compute this, rather than just reading it off the graph.
cp.table <- m.lzn.rp[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.006663768 15.000000000 0.099886839 0.305124342 0.037987293
cp.min <- cp.table[cp.ix,"CP"]
(Note that some other cross-validation errors are quite close to this minimum, so we might choose to use a higher value of the control parameter to have a more parsimonious model, that might extrapolate better.)
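One common way to formalize this preference for parsimony is the so-called one-standard-error rule: take the largest cp whose cross-validation error is within one standard error of the minimum. A sketch (not used further in this exercise):
xerr.limit <- cp.table[cp.ix, "xerror"] + cp.table[cp.ix, "xstd"]
(cp.1se <- max(cp.table[cp.table[, "xerror"] <= xerr.limit, "CP"]))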
Prune the tree with this complexity parameter, and examine in the same way as above.
(Note we would get the same result if we re-fit with this parameter.)
(m.lzn.rpp <- prune(m.lzn.rp, cp=cp.min))
## n= 155
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 155 15.13633000 2.556160
## 2) dist.m>=145 101 4.65211000 2.388584
## 4) elev>=6.943 93 2.83545700 2.349952
## 8) dist.m>=230 78 1.73584100 2.311717
## 16) x>=179102.5 71 1.23063300 2.288808
## 32) ffreq=2,3 40 0.36471060 2.223342
## 64) y>=330942 21 0.14958940 2.167449 *
## 65) y< 330942 19 0.07701018 2.285117 *
## 33) ffreq=1 31 0.47328450 2.373281
## 66) x< 179913.5 1 0.00000000 2.075547 *
## 67) x>=179913.5 30 0.38168440 2.383205
## 134) x>=180346.5 25 0.20839560 2.355868 *
## 135) x< 180346.5 5 0.06119400 2.519889 *
## 17) x< 179102.5 7 0.08998420 2.544084 *
## 9) dist.m< 230 15 0.39263770 2.548774
## 18) x< 180563 10 0.23688250 2.494602
## 36) x>=179405.5 7 0.10694750 2.420923 *
## 37) x< 179405.5 3 0.00326675 2.666521 *
## 19) x>=180563 5 0.06771894 2.657116 *
## 5) elev< 6.943 8 0.06436102 2.837680 *
## 3) dist.m< 145 54 2.34313500 2.869589
## 6) elev>=8.1455 15 0.49618250 2.646331
## 12) elev< 8.48 4 0.07502826 2.456464 *
## 13) elev>=8.48 11 0.22452210 2.715373 *
## 7) elev< 8.1455 39 0.81172600 2.955457
## 14) dist.m>=75 11 0.08001190 2.848717 *
## 15) dist.m< 75 28 0.55715050 2.997391
## 30) x< 179565 9 0.12124970 2.914541 *
## 31) x>=179565 19 0.34486030 3.036636
## 62) x>=180194 12 0.12169000 2.966064 *
## 63) x< 180194 7 0.06095067 3.157617 *
rpart.plot(m.lzn.rpp, digits=3, type=4, extra=1)
Q: How does this tree differ from the full (un-pruned tree)? Look at the predictors actually used, number of splits, number of leaves, number of observations in each leaf.
Q: Which leaf predicts the highest concentration of log10(Zn)? Explain that by reading the predictors and their values down the tree from the root to this leaf. Does this fit your theory of the origin of these metals?
Now, compare the model predictions to the known values:
p.rpp <- predict(m.lzn.rpp, newdata=meuse)
length(unique(p.rpp))
## [1] 16
summary(r.rpp <- meuse$logZn - p.rpp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.312253 -0.071205 -0.008655 0.000000 0.055852 0.226281
sqrt(sum(r.rpp^2)/length(r.rpp))
## [1] 0.09876398
plot(meuse$logZn ~ p.rpp, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Regression Tree")
grid()
abline(0,1)
Q: What is the RMSE?
Q: Why are there vertical lines at all the fitted values?
Another use of CART is to predict a category. Here we try to predict whether a site is in flood frequency class 1, 2, or 3. If this is a successful model, we could use it to map flood frequency, if we didn’t already have this map.
Q: What covariables might help predict flooding frequency?
Flooding is influenced by the distance from the river and the elevation above the river. Build a model with these (plus the coordinates, to capture any trend), setting the complexity parameter to 0.02, i.e., each retained split must improve the fit by at least 2%.
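(For orientation: with a categorical target, rpart by default chooses splits that maximize the reduction in the Gini impurity \(G = \sum_k p_k (1 - p_k)\), where \(p_k\) is the proportion of class \(k\) in a node, rather than a reduction in residual sum of squares.)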
m.ff.rp <- rpart(ffreq ~ x + y + dist.m + elev,
data=meuse,
minsplit=2,
cp=0.02)
print(m.ff.rp)
## n= 155
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 155 71 1 (0.54193548 0.30967742 0.14838710)
## 2) y>=331984.5 63 6 1 (0.90476190 0.07936508 0.01587302)
## 4) elev< 9.399 60 3 1 (0.95000000 0.03333333 0.01666667) *
## 5) elev>=9.399 3 0 2 (0.00000000 1.00000000 0.00000000) *
## 3) y< 331984.5 92 49 2 (0.29347826 0.46739130 0.23913043)
## 6) elev< 7.552 24 2 1 (0.91666667 0.00000000 0.08333333)
## 12) y>=330043 22 0 1 (1.00000000 0.00000000 0.00000000) *
## 13) y< 330043 2 0 3 (0.00000000 0.00000000 1.00000000) *
## 7) elev>=7.552 68 25 2 (0.07352941 0.63235294 0.29411765)
## 14) y>=330330 54 13 2 (0.09259259 0.75925926 0.14814815)
## 28) elev< 9.947 49 9 2 (0.10204082 0.81632653 0.08163265)
## 56) y>=330644 35 5 2 (0.14285714 0.85714286 0.00000000)
## 112) dist.m>=470 17 5 2 (0.29411765 0.70588235 0.00000000)
## 224) elev< 8.526 3 0 1 (1.00000000 0.00000000 0.00000000) *
## 225) elev>=8.526 14 2 2 (0.14285714 0.85714286 0.00000000) *
## 113) dist.m< 470 18 0 2 (0.00000000 1.00000000 0.00000000) *
## 57) y< 330644 14 4 2 (0.00000000 0.71428571 0.28571429)
## 114) elev< 8.7225 11 1 2 (0.00000000 0.90909091 0.09090909) *
## 115) elev>=8.7225 3 0 3 (0.00000000 0.00000000 1.00000000) *
## 29) elev>=9.947 5 1 3 (0.00000000 0.20000000 0.80000000) *
## 15) y< 330330 14 2 3 (0.00000000 0.14285714 0.85714286) *
rpart.plot(m.ff.rp)
Q: Were both predictors used?
Q: What was the first split?
Find the value of the complexity parameter that minimizes the cross-validation error, and prune the tree with that value.
printcp(m.ff.rp)
##
## Classification tree:
## rpart(formula = ffreq ~ x + y + dist.m + elev, data = meuse,
## minsplit = 2, cp = 0.02)
##
## Variables actually used in tree construction:
## [1] dist.m elev y
##
## Root node error: 71/155 = 0.45806
##
## n= 155
##
## CP nsplit rel error xerror xstd
## 1 0.267606 0 1.00000 1.00000 0.087366
## 2 0.140845 2 0.46479 0.49296 0.073316
## 3 0.042254 3 0.32394 0.38028 0.066506
## 4 0.028169 5 0.23944 0.40845 0.068385
## 5 0.021127 6 0.21127 0.36620 0.065518
## 6 0.020000 10 0.12676 0.38028 0.066506
plotcp(m.ff.rp)
cp.table.class <- m.ff.rp[["cptable"]]
cp.ix <- which.min(cp.table.class[,"xerror"])
print(cp.table.class[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.02112676 6.00000000 0.21126761 0.36619718 0.06551750
(cp.min <- cp.table.class[cp.ix,"CP"])
## [1] 0.02112676
m.ff.rpp <- prune(m.ff.rp, cp=cp.min)
print(m.ff.rpp)
## n= 155
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 155 71 1 (0.54193548 0.30967742 0.14838710)
## 2) y>=331984.5 63 6 1 (0.90476190 0.07936508 0.01587302)
## 4) elev< 9.399 60 3 1 (0.95000000 0.03333333 0.01666667) *
## 5) elev>=9.399 3 0 2 (0.00000000 1.00000000 0.00000000) *
## 3) y< 331984.5 92 49 2 (0.29347826 0.46739130 0.23913043)
## 6) elev< 7.552 24 2 1 (0.91666667 0.00000000 0.08333333)
## 12) y>=330043 22 0 1 (1.00000000 0.00000000 0.00000000) *
## 13) y< 330043 2 0 3 (0.00000000 0.00000000 1.00000000) *
## 7) elev>=7.552 68 25 2 (0.07352941 0.63235294 0.29411765)
## 14) y>=330330 54 13 2 (0.09259259 0.75925926 0.14814815)
## 28) elev< 9.947 49 9 2 (0.10204082 0.81632653 0.08163265) *
## 29) elev>=9.947 5 1 3 (0.00000000 0.20000000 0.80000000) *
## 15) y< 330330 14 2 3 (0.00000000 0.14285714 0.85714286) *
rpart.plot(m.ff.rpp, type=2, extra=1)
Q: Now how many splits? Which variables were used? Are all classes predicted? How pure are the leaves?
Q: How useful is this model? In other words, how well could flooding frequency class be mapped from knowing the distance to river and elevation?
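One simple way to judge this (a sketch added here, not part of the original code) is to cross-tabulate the classes fitted by the pruned tree against the observed classes:
table(predicted = predict(m.ff.rpp, newdata = meuse, type = "class"),
      actual    = meuse$ffreq)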
Variable importance:
m.ff.rp$variable.importance
## elev y x dist.m
## 38.26320 38.24051 16.58514 11.40992
m.ff.rpp$variable.importance
## y elev x dist.m
## 36.74909 29.85313 14.99264 9.89731
In both the unpruned and pruned trees the y-coordinate and elevation are most important.
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
Q: What variables are available for mapping over the grid?
The grid only has coordinates, relative distance, soil type, and flood frequency. So if we want a model to map log(Zn) over the grid, we can only use these variables.
Build and prune a regression tree for log(Zn) with these predictors; display the pruned tree and the variable importance.
m.lzn.g <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse,
minsplit=2,
cp=0.005)
plotcp(m.lzn.g)
rpart.plot(m.lzn.g, digits=3, type=4, extra=1)
cp.table <- m.lzn.g[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
## CP nsplit rel error xerror xstd
## 0.01792115 8.00000000 0.17301816 0.32353445 0.05034038
(cp.min <- cp.table[cp.ix,"CP"])
## [1] 0.01792115
Now prune the tree to the indicated complexity:
m.lzn.gp <- prune(m.lzn.g, cp=cp.min)
rpart.plot(m.lzn.gp)
print(tmp <- m.lzn.gp$variable.importance)
## dist x soil y ffreq
## 10.189013 3.777877 3.363417 2.473806 2.136605
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist 46.43883
## x 17.21857
## soil 15.32957
## y 11.27496
## ffreq 9.73808
Q: Which predictors were used? What was their relative importance?
Q: Why is the x-coordinate important in this dataset? Do you think this would be important in other parts of the flood plain?
Now we can map with this.
First, make an sp version of the grid, with the coordinates having special status, i.e., treated as geometry rather than as feature-space attributes:
meuse.grid.sp <- meuse.grid
coordinates(meuse.grid.sp) <- ~ x + y; gridded(meuse.grid.sp) <- TRUE
meuse.grid.sp$predictedLog10Zn <- predict(m.lzn.gp, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn")
Q: How realistic is this map?
Single regression trees can vary quite a bit depending on small differences in the dataset. Convince yourself of this.
Remove a random 15 of the 155 points, then fit and prune the first tree we attempted, i.e., log10(Zn) concentration as a function of the coordinates, flooding frequency class, distance to the river in meters, soil type, whether the site has had limestone added, and elevation above sea level.
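(Your random subsample, and therefore your tree, will differ from the one shown here. If you want a reproducible subsample, you can first fix the random seed; the value chosen is arbitrary.)
set.seed(314)   # optional; any fixed seed makes the sample() call below reproducible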
meuse.140.ix <- sample(1:dim(meuse)[1], size=140, replace=FALSE)
meuse.140 <- meuse[meuse.140.ix,]
m.lzn.rp.140 <- rpart(logZn ~ x + y + ffreq + dist.m + soil + lime + elev,
data=meuse.140,
minsplit=2,
cp=0.005)
rpart.plot(m.lzn.rp.140, digits=3, type=4, extra=1)
print(tmp <- m.lzn.rp.140$variable.importance)
## dist.m elev lime x y soil ffreq
## 8.632981 5.719639 4.393028 1.972617 1.555428 1.352222 1.085110
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist.m 34.935746
## elev 23.146102
## lime 17.777604
## x 7.982740
## y 6.294470
## soil 5.472139
## ffreq 4.391199
Prune the tree:
plotcp(m.lzn.rp.140)
cp.table <- m.lzn.rp.140[["cptable"]]
cp.ix <- which.min(cp.table[,"xerror"])
print(cp.table[cp.ix,])
## CP nsplit rel error xerror xstd
## 0.008830864 9.000000000 0.164033250 0.288800633 0.032236651
cp.min <- cp.table[cp.ix,"CP"]
m.lzn.rpp.140 <- prune(m.lzn.rp.140, cp=cp.min)
rpart.plot(m.lzn.rpp.140, digits=3, type=4, extra=1)
Q: What are the similarities and differences between your tree and the one we built at the beginning of the exercise? Consider the “optimum” complexity parameter, the number of leaves and splits, where the predictors are used, and which predictors are used.
We can do this also for the map:
m.lzn.g.140 <- rpart(logZn ~ x + y + ffreq + dist + soil,
data=meuse.140,
minsplit=2,
cp=0.005)
Prune the tree to the indicated complexity:
plotcp(m.lzn.g.140)
rpart.plot(m.lzn.g.140, digits=3, type=4, extra=1)
cp.table <- m.lzn.g.140[["cptable"]]
cp.table[cp.ix <- which.min(cp.table[,"xerror"]),]
## CP nsplit rel error xerror xstd
## 0.02202392 5.00000000 0.25857520 0.38985094 0.06241297
(cp.min <- cp.table[cp.ix,"CP"])
## [1] 0.02202392
m.lzn.gp.140 <- prune(m.lzn.g.140, cp=cp.min)
rpart.plot(m.lzn.gp.140)
print(tmp <- m.lzn.gp.140$variable.importance)
## dist x ffreq y soil
## 8.6023189 1.8590649 1.1787964 1.0352421 0.2431314
data.frame(variableImportance = 100 * tmp / sum(tmp))
## variableImportance
## dist 66.588870
## x 14.390658
## ffreq 9.124832
## y 8.013607
## soil 1.882033
Compute and display the map using the pruned tree:
meuse.grid.sp$predictedLog10Zn.140 <- predict(m.lzn.gp.140, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.140")
Compute and display the differences in predictions:
meuse.grid.sp$predictedLog10Zn.diff <-
meuse.grid.sp$predictedLog10Zn - meuse.grid.sp$predictedLog10Zn.140
summary(meuse.grid.sp$predictedLog10Zn.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.5488081 -0.0199888 0.0054074 0.0007744 0.0054074 0.5624135
spplot(meuse.grid.sp, zcol="predictedLog10Zn.diff", col.regions=topo.colors(36))
Q: How different are the maps made with the regression tree based on all 155 observations, and the one made with 140 randomly-selected observations? Can you explain the spatial pattern of the differences? Hint: plot the locations of the omitted points.
meuse.omitted <- meuse[-meuse.140.ix,]
plot(y ~ x, data=meuse.omitted, asp=1,
cex=2*meuse.omitted$logZn/max(meuse.omitted$logZn))
data(meuse.riv)
lines(meuse.riv)
grid()
The idea of the random forest is simple: build many regression trees, each from a bootstrap sample of the observations and with a random subset of the predictors considered at each split, and average their predictions.
Reference: Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Packages used in this section: randomForest for random forests, randomForestExplainer for some diagnostics.
library(randomForest)
library(randomForestExplainer)
First build the forest, using the randomForest function:
m.lzn.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=5)
print(m.lzn.rf)
##
## Call:
## randomForest(formula = logZn ~ ffreq + x + y + dist.m + elev + soil + lime, data = meuse, importance = T, mtry = 5, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.01956509
## % Var explained: 79.96
The mtry optional argument gives the number of variables randomly sampled as candidates at each split. By default this is \(\lfloor p/3 \rfloor\), where \(p\) is the number of predictors; here \(p = 7\), so the default would be \(\lfloor 7/3 \rfloor = 2\). We set mtry=5 so that most of the predictors are candidates at each split, which gives more accurate splits at the cost of more similar trees.
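If you want to see the effect of this choice yourself, a small sketch (not part of the original exercise; the values of mtry tried are arbitrary) compares the out-of-bag error for several settings:
for (m in c(2, 3, 5, 7)) {
  rf.m <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
                       data=meuse, mtry=m)
  cat("mtry =", m, " OOB MSE =", round(rf.m$mse[rf.m$ntree], 4), "\n")
}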
Q: How much of the variance is explained by the forest?
Here we cannot display a single tree structure, since each of the trees in the forest is different.
Display the cross-validation error rate against the number of trees:
plot(m.lzn.rf)
Q: How many trees were needed to make a reliable forest?
Each RF is random, so several attempts will not be the same. How consistent are they?
Compute the RF several times and collect statistics:
n <- 24
rf.stats <- data.frame(rep=1:n, rsq=as.numeric(NA), mse=as.numeric(NA))
for (i in 1:n) {
  model.rf <- randomForest(logZn ~ ffreq + x + y + dist.m + elev + soil + lime,
                           data=meuse, importance=T, na.action=na.omit, mtry=5)
  rf.stats[i, "mse"] <- median(summary(model.rf$mse))
  rf.stats[i, "rsq"] <- median(summary(model.rf$rsq))
}
summary(rf.stats[,2:3])
## rsq mse
## Min. :0.7939 Min. :0.01881
## 1st Qu.:0.7990 1st Qu.:0.01909
## Median :0.8008 Median :0.01946
## Mean :0.8012 Mean :0.01942
## 3rd Qu.:0.8045 3rd Qu.:0.01963
## Max. :0.8074 Max. :0.02013
hist(rf.stats[,"rsq"], xlab="RandomForest R^2", main="random forests")
rug(rf.stats[,"rsq"])
hist(rf.stats[,"mse"], xlab="RandomForest RMSE", main="random forests")
rug(rf.stats[,"mse"])
Q: How consistent are the 24 runs of the random forest?
Examine the variable importance. Type 1 is the increase in MSE when that predictor's values are randomly permuted among the observations, compared with leaving them at their actual values. It is computed as follows (from the Help): “For each tree, the prediction error on the OOB portion of the data is recorded (…MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences”.
Type 2 is the increase in node purity. It is computed as follows: “The total decrease in node impurities from splitting on the [given] variable, averaged over all trees… measured by residual sum of squares.”
importance(m.lzn.rf, type=1)
## %IncMSE
## ffreq 15.784522
## x 27.679406
## y 17.651208
## dist.m 53.929499
## elev 34.311213
## soil 6.102535
## lime 10.305590
importance(m.lzn.rf, type=2)
## IncNodePurity
## ffreq 0.4758883
## x 1.1245359
## y 0.7414590
## dist.m 7.9016551
## elev 3.3382712
## soil 0.1317567
## lime 0.9734237
varImpPlot(m.lzn.rf, type=1)
Q: Does this agree with the variable importance from the single trees we built above?
Q: Why are some variables given importance, although they were not used in the regression tree models?
randomForestExplainer
Another way to look at this is by the depth in the tree at which each predictor is first used. This is found with the min_depth_distribution function of the randomForestExplainer package.
min_depth_frame <- min_depth_distribution(m.lzn.rf)
str(min_depth_frame) # has results for all the trees
## 'data.frame': 3232 obs. of 3 variables:
## $ tree : int 1 1 1 1 1 1 1 2 2 2 ...
## $ variable : chr "dist.m" "elev" "ffreq" "lime" ...
## $ minimal_depth: num 0 2 2 1 4 3 3 0 1 2 ...
plot_min_depth_distribution(min_depth_frame)
Q: Which predictor is used most as the root?
Q: Which predictor is used most at the second level?
Q: Which predictors are not used in any trees?
Q: Do these results agree with the variable importance plot?
This package gives a wider choice of importance measures:
importance_frame <- measure_importance(m.lzn.rf)
print(importance_frame)
## variable mean_min_depth no_of_nodes mse_increase node_purity_increase
## 1 dist.m 0.354000 5415 0.079512659 7.9016551
## 2 elev 1.124000 6367 0.033398644 3.3382712
## 3 ffreq 3.593508 1051 0.009845755 0.4758883
## 4 lime 4.003500 506 0.003900991 0.9734237
## 5 soil 5.362688 671 0.001496646 0.1317567
## 6 x 2.166000 5759 0.011855387 1.1245359
## 7 y 2.652000 5381 0.006667004 0.7414590
## no_of_trees times_a_root p_value
## 1 500 347 2.408062e-210
## 2 500 88 0.000000e+00
## 3 461 0 1.000000e+00
## 4 375 63 1.000000e+00
## 5 396 2 1.000000e+00
## 6 500 0 7.688689e-291
## 7 500 0 4.901770e-203
Q: Which predictor was used in the most nodes, over all trees?
Q: Which predictors were never used as the root of the tree?
Q: Which predictors were used in every tree?
The p-value measures whether the observed number of successes (number of nodes in which \(X_j\) was used for splitting) exceeds the theoretical number of successes if they were random, following a binomial distribution. See https://cran.rstudio.com/web/packages/randomForestExplainer/vignettes/randomForestExplainer.html for details.
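For intuition only (a sketch that is not necessarily the package's exact computation), the idea can be illustrated with a binomial test of one predictor's node count against a uniformly random choice among the 7 predictors:
n.nodes <- sum(importance_frame$no_of_nodes)
x.dist  <- importance_frame$no_of_nodes[importance_frame$variable == "dist.m"]
binom.test(x.dist, n.nodes, p = 1/7, alternative = "greater")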
Another interesting plot is the “multi-way importance plot”, which by default shows the number of times a predictor is used as the root vs. its mean minimum depth. There is typically a negative relation between times_a_root and mean_min_depth.
plot_multi_way_importance(m.lzn.rf, size_measure = "no_of_nodes")
Q: Which predictor is most important by both measures?
Q: Is this a consistent negative relation? If not, where are the exceptions?
Another interesting plot is the pairwise comparison of the different importance measures:
plot_importance_ggpairs(importance_frame)
Q: Which are the most and least related measures?
Examine the interactions between predictors, in terms of their use within the forest:
interactions_frame <- min_depth_interactions(m.lzn.rf)
interactions_frame[order(interactions_frame$occurrences, decreasing = TRUE)[1:12], ]
## variable root_variable mean_min_depth occurrences interaction
## 8 elev dist.m 0.4036145 498 dist.m:elev
## 36 x dist.m 1.1000321 494 dist.m:x
## 43 y dist.m 1.7028514 493 dist.m:y
## 1 dist.m dist.m 1.1482651 489 dist.m:dist.m
## 37 x elev 1.2661165 481 elev:x
## 9 elev elev 1.3224779 475 elev:elev
## 44 y elev 1.8635582 464 elev:y
## 2 dist.m elev 1.5867912 449 elev:dist.m
## 15 ffreq dist.m 3.7058715 417 dist.m:ffreq
## 41 x x 2.2486948 393 x:x
## 48 y x 2.0801406 375 x:y
## 13 elev x 2.3134538 368 x:elev
## uncond_mean_min_depth
## 8 1.124000
## 36 2.166000
## 43 2.652000
## 1 0.354000
## 37 2.166000
## 9 1.124000
## 44 2.652000
## 2 0.354000
## 15 3.593508
## 41 2.166000
## 48 2.652000
## 13 1.124000
plot_min_depth_interactions(interactions_frame)
Q: Which pair of predictors are most used near the root of the tree, and with the most co-occurrences?
“To further investigate the most frequent interaction we use the function plot_predict_interaction to plot the prediction of our forest on a grid of values for the components of each interaction. The function requires the forest, training data, variable to use on x and y-axis, respectively. In addition, one can also decrease the number of points in both dimensions of the grid from the default of 100 in case of insufficient memory using the parameter grid.”
Here we see the predicted log(Zn) content for all combinations of distance from river and elevation:
plot_predict_interaction(m.lzn.rf, meuse, "dist.m", "elev")
These partial-dependence plots show the marginal effect of one predictor on the prediction: the forest is evaluated with that predictor set to each value in turn, averaging over the observed values of all the other predictors. We show these in order of variable importance, as determined just above.
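(As a formula, under the usual partial-dependence definition, the curve plotted for a predictor \(x\) is \(\bar{f}(x) = \frac{1}{n}\sum_{i=1}^{n} \hat{f}(x, \mathbf{x}_{i,C})\), where \(\mathbf{x}_{i,C}\) are the observed values of the other predictors for observation \(i\) and \(\hat{f}\) is the fitted forest.)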
partialPlot(m.lzn.rf, pred.data=meuse, x.var="x")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="y")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="dist.m")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="elev")
partialPlot(m.lzn.rf, pred.data=meuse, x.var="ffreq", which.class=1)
Q: What is the effect of just the distance to the river on the RF prediction of log10(Zn), averaging over the other predictors (the continuous variables x, y, and elev, and the classified variables ffreq, soil, and lime)? Is the effect monotonic (i.e., consistently increasing or decreasing)? Why or why not?
Predict back onto the known points, and evaluate the goodness-of-fit:
p.rf <- predict(m.lzn.rf, newdata=meuse)
length(unique(p.rf))
## [1] 155
summary(r.rpp <- meuse$logZn - p.rf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1858643 -0.0354089 -0.0035181 -0.0007279 0.0346652 0.1677528
(rmse.rf <- sqrt(sum(r.rpp^2)/length(r.rpp)))
## [1] 0.06234585
plot(meuse$logZn ~ p.rf, asp=1, pch=20, xlab="fitted", ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3), main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
Q: What is the RMSE?
Q: How many different predictions were made? Why not just a few, as in regression trees?
A more realistic view of the predictive power of the model is the out-of-bag validation. Here we use predict with no newdata argument, so the prediction is taken from the predicted component of the fitted forest, which was computed during forest building from the out-of-bag observations of each tree.
p.rf.oob <- predict(m.lzn.rf)
summary(r.rpp.oob <- meuse$logZn - p.rf.oob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.43958 -0.08056 -0.01097 -0.00142 0.07454 0.36201
(rmse.oob <- sqrt(sum(r.rpp.oob^2)/length(r.rpp.oob)))
## [1] 0.1398753
plot(meuse$logZn ~ p.rf.oob, asp=1, pch=20,
xlab="Out-of-bag cross-validation estimates",
ylab="actual", xlim=c(2,3.3), ylim=c(2,3.3),
main="log10(Zn), Meuse topsoils, Random Forest")
grid()
abline(0,1)
Q: What is the OOB RMSE? How does this compare with the fitted RMSE? Which is more realistic for predictions?
Q: How does this compare to the model fits?
Q: How does this compare with the RMSE from the Regression Tree?
We can try to build an RF model to map the grid, but again we can only use the coordinates, flood frequency, soil type and relative distance to the river, because these are the only variables we have over the whole area, i.e., in the prediction grid meuse.grid.
names(meuse.grid)
## [1] "x" "y" "part.a" "part.b" "dist" "soil" "ffreq"
m.lzn.rf.g <- randomForest(logZn ~ x + y + ffreq + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.lzn.rf.g)
(tmp <- importance(m.lzn.rf.g, type=1))
## %IncMSE
## x 32.61332
## y 19.35550
## ffreq 37.38123
## dist 51.53406
## soil 13.51469
(100*tmp/sum(tmp))
## %IncMSE
## x 21.122783
## y 12.536043
## ffreq 24.210827
## dist 33.377240
## soil 8.753107
varImpPlot(m.lzn.rf.g, type=1)
meuse.grid.sp$predictedLog10Zn.rf <- predict(m.lzn.rf.g, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedLog10Zn.rf")
Q: Does this map look realistic? How does it compare with the map made with the single regression tree?
This method is derived from Quinlan's rule-based learning approach (the C4.5 family and the M5 model tree).
References:
Quinlan, J. R. (1993). C4.5: programs for machine learning. San Mateo, Calif: Morgan Kaufmann Publishers.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90. https://doi.org/10.1613/jair.279
A recent article comparing it to RF is: Houborg, R., & McCabe, M. F. (2018). A hybrid training approach for leaf area index estimation via Cubist and random forests machine-learning. ISPRS Journal of Photogrammetry and Remote Sensing, 135, 173–188. https://doi.org/10.1016/j.isprsjprs.2017.10.004
It is similar to RF in that it builds trees, but instead of single values at the leaves it fits a multivariate linear regression to the cases in each leaf, using the predictors involved in the splits leading to that leaf (see the vignette excerpt below).
The Cubist package is an R port of the Cubist GPL C code released by RuleQuest at http://rulequest.com/cubist-info.html.
An explanation from the Cubist vignette:
Cubist is a rule–based model that is an extension of Quinlan’s M5 model tree. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is “smoothed” by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.
Package used in this section: Cubist for the Cubist rule-based regression algorithm.
library(Cubist)
Split the observations into training and test sets, with 80% in the training set:
inTrain <- sample(1:nrow(meuse), floor(.8*nrow(meuse)))
Build the training and test matrices. Note there is no formula method for cubist; the predictors are specified as a matrix or data frame and the outcome as a numeric vector.
preds <- c("x","y","ffreq","dist.m","elev","soil","lime")
train.pred <- meuse[ inTrain, preds]
test.pred <- meuse[-inTrain, preds]
train.resp <- meuse$logZn[ inTrain]
test.resp <- meuse$logZn[-inTrain]
Now fit the model to the training set:
(c.model <- cubist(x = train.pred, y = train.resp))
##
## Call:
## cubist.default(x = train.pred, y = train.resp)
##
## Number of samples: 124
## Number of predictors: 7
##
## Number of committees: 1
## Number of rules: 2
summary(c.model)
##
## Call:
## cubist.default(x = train.pred, y = train.resp)
##
##
## Cubist [Release 2.07 GPL Edition] Tue Jun 21 11:41:13 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 124 cases (8 attributes) from undefined.data
##
## Model:
##
## Rule 1: [80 cases, mean 2.415175, range 2.075547 to 2.89098, est err 0.117872]
##
## if
## dist.m > 140
## then
## outcome = -11.454599 + 0.000179 y - 0.000247 x - 0.116 elev
## - 7e-05 dist.m
##
## Rule 2: [44 cases, mean 2.850381, range 2.281033 to 3.264582, est err 0.153651]
##
## if
## dist.m <= 140
## then
## outcome = 3.623907 - 0.00202 dist.m - 0.082 elev
##
##
## Evaluation on training data (124 cases):
##
## Average |error| 0.122804
## Relative |error| 0.48
## Correlation coefficient 0.86
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% dist.m
## 100% elev
## 65% x
## 65% y
##
##
## Time: 0.0 secs
In this case there are only two rules, each with its own regression model; the single split is on distance to the river.
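Before assessing the model on the held-out points, we can also check the fit on the training set itself (a quick sketch, added here for comparison with the test-set figures below):
c.pred.train <- predict(c.model, train.pred)
sqrt(mean((c.pred.train - train.resp)^2))   # training-set RMSE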
We predict onto the test set to get an estimate of predictive accuracy:
c.pred <- predict(c.model, test.pred)
## Test set RMSE
sqrt(mean((c.pred - test.resp)^2))
## [1] 0.1520256
## Test set R^2
cor(c.pred, test.resp)^2
## [1] 0.8670161
Now, fit a model on the entire set and then predict over the grid. Again we can only use predictors that are in the grid.
preds <- c("x","y","dist","soil","ffreq")
all.preds <- meuse[, preds]
all.resp <- meuse$logZn
c.model.grid <- cubist(x = all.preds, y = all.resp)
summary(c.model.grid)
##
## Call:
## cubist.default(x = all.preds, y = all.resp)
##
##
## Cubist [Release 2.07 GPL Edition] Tue Jun 21 11:41:13 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 155 cases (6 attributes) from undefined.data
##
## Model:
##
## Rule 1: [66 cases, mean 2.288309, range 2.053078 to 2.89098, est err 0.103603]
##
## if
## x > 179095
## dist > 0.211846
## then
## outcome = 2.406759 - 0.32 dist
##
## Rule 2: [9 cases, mean 2.596965, range 2.330414 to 2.832509, est err 0.116378]
##
## if
## x <= 179095
## dist > 0.211846
## then
## outcome = -277.415278 + 0.000847 y + 0.56 dist
##
## Rule 3: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.157513]
##
## if
## dist <= 0.211846
## then
## outcome = 2.632508 - 2.1 dist - 2.4e-05 x + 1.4e-05 y
##
##
## Evaluation on training data (155 cases):
##
## Average |error| 0.111292
## Relative |error| 0.41
## Correlation coefficient 0.85
##
##
## Attribute usage:
## Conds Model
##
## 100% 100% dist
## 48% 52% x
## 57% y
##
##
## Time: 0.0 secs
meuse.grid.sp$predictedLog10Zn.cubist <- predict(c.model.grid, newdata=meuse.grid[,preds])
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist", col.regions=bpy.colors(64))
Here we can see the effect of the local regression models, as well as the hard splits on x and dist.
Cubist can improve performance in two ways: (1) with “committees” and (2) adjusting by nearest neighbours:
The Cubist model can also use a boosting–like scheme called committees where iterative model trees are created in sequence. The first tree follows the procedure described in the last section. Subsequent trees are created using adjusted versions to the training set outcome: if the model over–predicted a value, the response is adjusted downward for the next model (and so on). Unlike traditional boosting, stage weights for each committee are not used to average the predictions from each model tree; the final prediction is a simple average of the predictions from each model tree.
Another innovation in Cubist using nearest–neighbors to adjust the predictions from the rule–based model. First, a model tree (with or without committees) is created. Once a sample is predicted by this model, Cubist can find it’s nearest neighbors and determine the average of these training set points.
To find the optimum values for these, we turn to the caret package.
The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques.
See also: http://topepo.github.io/caret/model-training-and-tuning.html
The train function in caret can fit many kinds of models, supplied by other packages, selected with the method argument. The tuneGrid argument gives a grid of candidate tuning-parameter combinations to compare by cross-validation. Note that we use all the points, because train does the splitting itself as it evaluates the alternative models.
require(caret)
test <- train(x = all.preds, y = all.resp, method="cubist",
tuneGrid = expand.grid(.committees = 1:12,
.neighbors = 0:5),
trControl = trainControl(method = 'cv'))
print(test)
## Cubist
##
## 155 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 138, 140, 141, 139, 140, 139, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 0.1894278 0.6496111 0.1419179
## 1 1 0.1814152 0.6942065 0.1257052
## 1 2 0.1670760 0.7304305 0.1197129
## 1 3 0.1652328 0.7312889 0.1196747
## 1 4 0.1666086 0.7243422 0.1231867
## 1 5 0.1690508 0.7157745 0.1270865
## 2 0 0.1710778 0.7096912 0.1281719
## 2 1 0.1860139 0.6804579 0.1255095
## 2 2 0.1707818 0.7198970 0.1196356
## 2 3 0.1660291 0.7286866 0.1178113
## 2 4 0.1637145 0.7318926 0.1184247
## 2 5 0.1640964 0.7303724 0.1198547
## 3 0 0.1685245 0.7209211 0.1283659
## 3 1 0.1823722 0.6905295 0.1229561
## 3 2 0.1643650 0.7378263 0.1149210
## 3 3 0.1586798 0.7492127 0.1140697
## 3 4 0.1563527 0.7527732 0.1145676
## 3 5 0.1565914 0.7513679 0.1164066
## 4 0 0.1713729 0.7133692 0.1316064
## 4 1 0.1831968 0.6864192 0.1249382
## 4 2 0.1663870 0.7298149 0.1166409
## 4 3 0.1600404 0.7440668 0.1154950
## 4 4 0.1574077 0.7490784 0.1159470
## 4 5 0.1573277 0.7482071 0.1171381
## 5 0 0.1681069 0.7244834 0.1264697
## 5 1 0.1802858 0.6943582 0.1229848
## 5 2 0.1636044 0.7389211 0.1142720
## 5 3 0.1565319 0.7551840 0.1133320
## 5 4 0.1535981 0.7611870 0.1135889
## 5 5 0.1539955 0.7593028 0.1153939
## 6 0 0.1686125 0.7221074 0.1275708
## 6 1 0.1817304 0.6908429 0.1243756
## 6 2 0.1650978 0.7340920 0.1160568
## 6 3 0.1584983 0.7491020 0.1143697
## 6 4 0.1555288 0.7550528 0.1145491
## 6 5 0.1557105 0.7532503 0.1162426
## 7 0 0.1666813 0.7282440 0.1263042
## 7 1 0.1796026 0.6963742 0.1228387
## 7 2 0.1632439 0.7396326 0.1148584
## 7 3 0.1567832 0.7542766 0.1138369
## 7 4 0.1538616 0.7600767 0.1140501
## 7 5 0.1539837 0.7588860 0.1159290
## 8 0 0.1690046 0.7203095 0.1273635
## 8 1 0.1825678 0.6885467 0.1249667
## 8 2 0.1664451 0.7309120 0.1172158
## 8 3 0.1598984 0.7458670 0.1154182
## 8 4 0.1568636 0.7519905 0.1153745
## 8 5 0.1569257 0.7505485 0.1173632
## 9 0 0.1686822 0.7225742 0.1265510
## 9 1 0.1808938 0.6927622 0.1231957
## 9 2 0.1652848 0.7340160 0.1155964
## 9 3 0.1585826 0.7491854 0.1137817
## 9 4 0.1558456 0.7544834 0.1140752
## 9 5 0.1561652 0.7525235 0.1164895
## 10 0 0.1669221 0.7284473 0.1268756
## 10 1 0.1808163 0.6936500 0.1233647
## 10 2 0.1647897 0.7362088 0.1154954
## 10 3 0.1578002 0.7523411 0.1135312
## 10 4 0.1549745 0.7579958 0.1136374
## 10 5 0.1551573 0.7563532 0.1157767
## 11 0 0.1655149 0.7331397 0.1246649
## 11 1 0.1793334 0.6980443 0.1224620
## 11 2 0.1632807 0.7409552 0.1145469
## 11 3 0.1563271 0.7568367 0.1124276
## 11 4 0.1537992 0.7616880 0.1125552
## 11 5 0.1542451 0.7594513 0.1149355
## 12 0 0.1649211 0.7352316 0.1256688
## 12 1 0.1795084 0.6977924 0.1227620
## 12 2 0.1632699 0.7418168 0.1146453
## 12 3 0.1562419 0.7578989 0.1129060
## 12 4 0.1537268 0.7626763 0.1129656
## 12 5 0.1541418 0.7603350 0.1154308
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 5 and neighbors = 4.
plot(test, metric="RMSE")
plot(test, metric="Rsquared")
Of course the cross-validation splits are random, so each run will give slightly different numbers. In this case the graph shows clearly that more committees improve the performance considerably, from around 0.19 to about 0.15 cross-validation RMSE. The nearest-neighbour adjustment also helps considerably, up to about four neighbours, after which the improvement levels off.
So we fit a model with (approximately) these recommended parameters; because the tuning is random, the exact optimum may differ slightly between runs. Note that the neighbour count is used only at prediction time, not in model building.
c.model.grid.2 <- cubist(x = all.preds, y = all.resp, committees=4)
summary(c.model.grid.2)
##
## Call:
## cubist.default(x = all.preds, y = all.resp, committees = 4)
##
##
## Cubist [Release 2.07 GPL Edition] Tue Jun 21 11:41:23 2022
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 155 cases (6 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [66 cases, mean 2.288309, range 2.053078 to 2.89098, est err 0.103603]
##
## if
## x > 179095
## dist > 0.211846
## then
## outcome = 2.406759 - 0.32 dist
##
## Rule 1/2: [9 cases, mean 2.596965, range 2.330414 to 2.832509, est err 0.116378]
##
## if
## x <= 179095
## dist > 0.211846
## then
## outcome = -277.415278 + 0.000847 y + 0.56 dist
##
## Rule 1/3: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.157513]
##
## if
## dist <= 0.211846
## then
## outcome = 2.632508 - 2.1 dist - 2.4e-05 x + 1.4e-05 y
##
## Model 2:
##
## Rule 2/1: [45 cases, mean 2.418724, range 2.10721 to 2.893762, est err 0.182228]
##
## if
## x <= 179826
## ffreq in {2, 3}
## then
## outcome = 128.701732 - 0.000705 x
##
## Rule 2/2: [121 cases, mean 2.443053, range 2.053078 to 3.055378, est err 0.181513]
##
## if
## dist > 0.0703468
## then
## outcome = 30.512065 - 0.87 dist - 0.000154 x
##
## Rule 2/3: [55 cases, mean 2.543648, range 2.075547 to 3.055378, est err 0.125950]
##
## if
## dist > 0.0703468
## ffreq = 1
## then
## outcome = 37.730889 - 0.000314 x - 0.35 dist + 6.5e-05 y
##
## Rule 2/4: [34 cases, mean 2.958686, range 2.574031 to 3.264582, est err 0.139639]
##
## if
## dist <= 0.0703468
## then
## outcome = 2.982852 - 0.36 dist
##
## Model 3:
##
## Rule 3/1: [75 cases, mean 2.325347, range 2.053078 to 2.89098, est err 0.150262]
##
## if
## dist > 0.211846
## then
## outcome = 2.263175 - 0.11 dist
##
## Rule 3/2: [80 cases, mean 2.772547, range 2.187521 to 3.264582, est err 0.156918]
##
## if
## dist <= 0.211846
## then
## outcome = 2.978074 - 2.42 dist
##
## Model 4:
##
## Rule 4/1: [121 cases, mean 2.443053, range 2.053078 to 3.055378, est err 0.161020]
##
## if
## dist > 0.0703468
## then
## outcome = 4.601694 - 0.61 dist - 0.000101 x + 4.9e-05 y
##
## Rule 4/2: [51 cases, mean 2.505994, range 2.075547 to 3.055378, est err 0.196863]
##
## if
## x <= 179731
## dist > 0.0703468
## then
## outcome = 175.779227 - 0.000967 x
##
## Rule 4/3: [55 cases, mean 2.543648, range 2.075547 to 3.055378, est err 0.147504]
##
## if
## dist > 0.0703468
## ffreq = 1
## then
## outcome = 53.629078 - 0.000343 x - 0.44 dist + 3.3e-05 y
##
## Rule 4/4: [34 cases, mean 2.958686, range 2.574031 to 3.264582, est err 0.140965]
##
## if
## dist <= 0.0703468
## then
## outcome = 2.975269 - 0.08 dist
##
##
## Evaluation on training data (155 cases):
##
## Average |error| 0.110135
## Relative |error| 0.40
## Correlation coefficient 0.84
##
##
## Attribute usage:
## Conds Model
##
## 95% 88% dist
## 21% 64% x
## 19% ffreq
## 39% y
##
##
## Time: 0.0 secs
meuse.grid.sp$predictedLog10Zn.cubist.2 <-
predict(c.model.grid.2, newdata=meuse.grid[,preds],
neighbors=5)
summary(meuse.grid.sp$predictedLog10Zn.cubist.2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.954 2.228 2.346 2.445 2.640 3.154
spplot(meuse.grid.sp, zcol="predictedLog10Zn.cubist.2", col.regions=bpy.colors(64),
main="Optimized Cubist prediction")
A much nicer looking map.
The usage field of the fitted model shows the usage of each variable either in the rule conditions or in the (terminal) linear models.
c.model.grid$usage
## Conditions Model Variable
## 1 100 100 dist
## 2 48 52 x
## 3 0 57 y
## 4 0 0 soil
## 5 0 0 ffreq
c.model.grid.2$usage
## Conditions Model Variable
## 1 95 88 dist
## 2 21 64 x
## 3 19 0 ffreq
## 4 0 39 y
## 5 0 0 soil
caret::varImp(c.model.grid)
## Overall
## dist 100.0
## x 50.0
## y 28.5
## soil 0.0
## ffreq 0.0
caret::varImp(c.model.grid.2)
## Overall
## dist 91.5
## x 42.5
## ffreq 9.5
## y 19.5
## soil 0.0
Distance is used in every rule of the single model, but not in every rule of the committee model.
Notice also how flood frequency entered the model once committees were formed, i.e., in subsidiary rules that adjust for the mis-fits of the earlier committee members.
It is interesting to see how the different models fit a single predictor, in this case distance to river:
distances <- all.preds[, "dist", drop = FALSE]
rules_only <- cubist(x = distances, y = all.resp)
rules_and_com <- cubist(x = distances, y = all.resp, committees = 4)
predictions <- distances
predictions$true <- all.resp
predictions$rules <- predict(rules_only, distances, neighbors = 0)
predictions$rules_neigh <- predict(rules_only, distances, neighbors = 5)
predictions$committees <- predict(rules_and_com, distances, neighbors = 0)
predictions$committees_neigh <- predict(rules_and_com, distances, neighbors = 5)
summary(predictions)
## dist true rules rules_neigh
## Min. :0.00000 Min. :2.053 Min. :2.069 Min. :2.046
## 1st Qu.:0.07569 1st Qu.:2.297 1st Qu.:2.311 1st Qu.:2.331
## Median :0.21184 Median :2.513 Median :2.482 Median :2.541
## Mean :0.24002 Mean :2.556 Mean :2.545 Mean :2.556
## 3rd Qu.:0.36407 3rd Qu.:2.829 3rd Qu.:2.804 3rd Qu.:2.791
## Max. :0.88039 Max. :3.265 Max. :2.982 Max. :2.992
## committees committees_neigh
## Min. :2.043 Min. :2.035
## 1st Qu.:2.347 1st Qu.:2.330
## Median :2.495 Median :2.524
## Mean :2.545 Mean :2.555
## 3rd Qu.:2.771 3rd Qu.:2.791
## Max. :2.924 Max. :2.990
plot(true ~ dist, data=predictions, pch=20, col="darkgray",
xlab="covariate: normalized distance to river",
ylab="log10(Zn)", main="Cubist models")
grid()
m <- matrix(cbind(predictions$dist, predictions$rules), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=1, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$rules_neigh), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=2, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$committees), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=3, lwd=2)
m <- matrix(cbind(predictions$dist, predictions$committees_neigh), ncol=2)
ix <- order(m[,1]); m <- m[ix,]
lines(m, col=4, lwd=2)
legend("topright", c("rules", "rules + neighbour",
"committee", "committee + neighbour"),
pch="----", col=1:4)
Notice how the committee model (green) smooths out the centre of the distance range compared with the single-rules model (black).
Notice also how the neighbour-adjusted predictions come much closer to some of the points – perhaps too close?
A pure “instance-based” approach is to predict at each location from the (possibly weighted) average of some set of similar observations. The caret package includes k-nearest-neighbour (KNN) regression with the knnreg function.
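(In symbols, the unweighted KNN prediction at a point \(\mathbf{x}_0\) is \(\hat{y}(\mathbf{x}_0) = \frac{1}{k} \sum_{i \in N_k(\mathbf{x}_0)} y_i\), where \(N_k(\mathbf{x}_0)\) is the set of the \(k\) training observations closest to \(\mathbf{x}_0\) in predictor space.)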
Try with five neighbours:
knn.5 <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq, data=meuse, k=5)
knn.pred <- predict(knn.5, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
However, unless we know a good value of \(k\) a priori, we should let train find the optimum number of neighbours.
knn.tune <- train(x = all.preds, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
str(knn.tune)
## List of 21
## $ method : chr "knn"
## $ modelInfo :List of 13
## ..$ label : chr "k-Nearest Neighbors"
## ..$ library : NULL
## ..$ loop : NULL
## ..$ type : chr [1:2] "Classification" "Regression"
## ..$ parameters:'data.frame': 1 obs. of 3 variables:
## .. ..$ parameter: chr "k"
## .. ..$ class : chr "numeric"
## .. ..$ label : chr "#Neighbors"
## ..$ grid :function (x, y, len = NULL, search = "grid")
## ..$ fit :function (x, y, wts, param, lev, last, classProbs, ...)
## ..$ predict :function (modelFit, newdata, submodels = NULL)
## ..$ predictors:function (x, ...)
## ..$ tags : chr "Prototype Models"
## ..$ prob :function (modelFit, newdata, submodels = NULL)
## ..$ levels :function (x)
## ..$ sort :function (x)
## $ modelType : chr "Regression"
## $ results :'data.frame': 20 obs. of 7 variables:
## ..$ k : int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ RMSE : num [1:20] 0.239 0.214 0.201 0.206 0.209 ...
## ..$ Rsquared : num [1:20] 0.512 0.562 0.615 0.594 0.576 ...
## ..$ MAE : num [1:20] 0.173 0.161 0.156 0.16 0.161 ...
## ..$ RMSESD : num [1:20] 0.0651 0.0473 0.0399 0.0355 0.0349 ...
## ..$ RsquaredSD: num [1:20] 0.19 0.16 0.139 0.127 0.152 ...
## ..$ MAESD : num [1:20] 0.0442 0.0344 0.0319 0.0282 0.0277 ...
## $ pred : NULL
## $ bestTune :'data.frame': 1 obs. of 1 variable:
## ..$ k: int 3
## $ call : language train.default(x = all.preds, y = all.resp, method = "knn", trControl = trainControl(method = "cv"), tuneGrid| __truncated__
## $ dots : list()
## $ metric : chr "RMSE"
## $ control :List of 28
## ..$ method : chr "cv"
## ..$ number : num 10
## ..$ repeats : logi NA
## ..$ search : chr "grid"
## ..$ p : num 0.75
## ..$ initialWindow : NULL
## ..$ horizon : num 1
## ..$ fixedWindow : logi TRUE
## ..$ skip : num 0
## ..$ verboseIter : logi FALSE
## ..$ returnData : logi TRUE
## ..$ returnResamp : chr "final"
## ..$ savePredictions : chr "none"
## ..$ classProbs : logi FALSE
## ..$ summaryFunction :function (data, lev = NULL, model = NULL)
## ..$ selectionFunction: chr "best"
## ..$ preProcOptions :List of 6
## .. ..$ thresh : num 0.95
## .. ..$ ICAcomp : num 3
## .. ..$ k : num 5
## .. ..$ freqCut : num 19
## .. ..$ uniqueCut: num 10
## .. ..$ cutoff : num 0.9
## ..$ sampling : NULL
## ..$ index :List of 10
## .. ..$ Fold01: int [1:140] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold02: int [1:139] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold03: int [1:141] 1 2 3 4 6 7 8 10 11 12 ...
## .. ..$ Fold04: int [1:139] 1 2 3 4 5 7 9 11 12 13 ...
## .. ..$ Fold05: int [1:139] 2 3 4 5 6 8 9 10 11 12 ...
## .. ..$ Fold06: int [1:139] 1 2 4 5 6 7 8 9 10 11 ...
## .. ..$ Fold07: int [1:140] 1 2 3 5 6 7 8 9 10 11 ...
## .. ..$ Fold08: int [1:140] 1 3 4 5 6 7 8 9 10 11 ...
## .. ..$ Fold09: int [1:139] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ Fold10: int [1:139] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ indexOut :List of 10
## .. ..$ Resample01: int [1:15] 11 12 19 24 28 48 70 81 87 90 ...
## .. ..$ Resample02: int [1:16] 31 32 46 47 51 63 69 79 96 100 ...
## .. ..$ Resample03: int [1:14] 5 9 44 59 64 76 80 94 117 124 ...
## .. ..$ Resample04: int [1:16] 6 8 10 15 30 56 62 72 73 77 ...
## .. ..$ Resample05: int [1:16] 1 7 21 25 34 35 45 68 71 92 ...
## .. ..$ Resample06: int [1:16] 3 36 38 41 42 43 52 54 65 91 ...
## .. ..$ Resample07: int [1:15] 4 13 18 22 39 50 55 74 78 84 ...
## .. ..$ Resample08: int [1:15] 2 14 20 27 49 57 60 83 88 99 ...
## .. ..$ Resample09: int [1:16] 16 17 26 29 33 40 53 75 82 95 ...
## .. ..$ Resample10: int [1:16] 23 37 58 61 66 67 86 101 106 115 ...
## ..$ indexFinal : NULL
## ..$ timingSamps : num 0
## ..$ predictionBounds : logi [1:2] FALSE FALSE
## ..$ seeds :List of 11
## .. ..$ : int [1:20] 331478 879352 914281 736229 345552 137000 305698 936220 477590 796949 ...
## .. ..$ : int [1:20] 942228 606351 531425 563971 556165 350857 269839 165305 219210 793670 ...
## .. ..$ : int [1:20] 344001 499400 435143 425003 315768 600008 430462 674254 508596 663124 ...
## .. ..$ : int [1:20] 25303 853779 773164 637561 418224 680961 650579 986430 29211 763715 ...
## .. ..$ : int [1:20] 121081 756384 51529 496944 72256 110558 7104 941019 520871 107068 ...
## .. ..$ : int [1:20] 347957 554661 820542 192229 343071 628777 425467 597347 575189 599158 ...
## .. ..$ : int [1:20] 190643 924754 983605 127721 912488 828814 458233 144726 662184 451934 ...
## .. ..$ : int [1:20] 70305 564507 62247 439284 658329 807435 317743 900451 251307 956902 ...
## .. ..$ : int [1:20] 757682 729339 29882 784946 140110 991703 378115 589952 624084 83332 ...
## .. ..$ : int [1:20] 775494 583733 85255 556345 587481 779124 383623 180975 784581 181348 ...
## .. ..$ : int 86593
## ..$ adaptive :List of 4
## .. ..$ min : num 5
## .. ..$ alpha : num 0.05
## .. ..$ method : chr "gls"
## .. ..$ complete: logi TRUE
## ..$ trim : logi FALSE
## ..$ allowParallel : logi TRUE
## ..$ yLimits : num [1:2] 1.99 3.33
## $ finalModel :List of 8
## ..$ learn :List of 2
## .. ..$ y: num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
## .. ..$ X: chr [1:155, 1:5] "181072" "181025" "181165" "181298" ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:155] "X1" "X2" "X3" "X4" ...
## .. .. .. ..$ : chr [1:5] "x" "y" "dist" "soil" ...
## ..$ k : int 3
## ..$ theDots : list()
## ..$ xNames : chr [1:5] "x" "y" "dist" "soil" ...
## ..$ problemType: chr "Regression"
## ..$ tuneValue :'data.frame': 1 obs. of 1 variable:
## .. ..$ k: int 3
## ..$ obsLevels : logi NA
## ..$ param : list()
## ..- attr(*, "class")= chr "knnreg"
## $ preProcess : NULL
## $ trainingData:'data.frame': 155 obs. of 6 variables:
## ..$ x : num [1:155] 181072 181025 181165 181298 181307 ...
## ..$ y : num [1:155] 333611 333558 333537 333484 333330 ...
## ..$ dist : num [1:155] 0.00136 0.01222 0.10303 0.19009 0.27709 ...
## ..$ soil : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 2 1 1 2 ...
## ..$ ffreq : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ .outcome: num [1:155] 3.01 3.06 2.81 2.41 2.43 ...
## $ ptype :'data.frame': 0 obs. of 5 variables:
## ..$ x : num(0)
## ..$ y : num(0)
## ..$ dist : num(0)
## ..$ soil : Factor w/ 3 levels "1","2","3":
## ..$ ffreq: Factor w/ 3 levels "1","2","3":
## $ resample :'data.frame': 10 obs. of 4 variables:
## ..$ RMSE : num [1:10] 0.159 0.231 0.23 0.232 0.192 ...
## ..$ Rsquared: num [1:10] 0.772 0.457 0.675 0.566 0.659 ...
## ..$ MAE : num [1:10] 0.129 0.182 0.148 0.174 0.148 ...
## ..$ Resample: chr [1:10] "Fold07" "Fold02" "Fold09" "Fold06" ...
## $ resampledCM : NULL
## $ perfNames : chr [1:3] "RMSE" "Rsquared" "MAE"
## $ maximize : logi FALSE
## $ yLimits : num [1:2] 1.99 3.33
## $ times :List of 3
## ..$ everything: 'proc_time' Named num [1:5] 1.009 0.006 1.019 0 0
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ final : 'proc_time' Named num [1:5] 0.001 0 0.002 0 0
## .. ..- attr(*, "names")= chr [1:5] "user.self" "sys.self" "elapsed" "user.child" ...
## ..$ prediction: logi [1:3] NA NA NA
## $ levels : logi NA
## - attr(*, "class")= chr "train"
print(knn.tune)
## k-Nearest Neighbors
##
## 155 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 140, 139, 141, 139, 139, 139, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 0.2387850 0.5123081 0.1734232
## 2 0.2143472 0.5620450 0.1607389
## 3 0.2014253 0.6146012 0.1562451
## 4 0.2063183 0.5935644 0.1598601
## 5 0.2086454 0.5761987 0.1608352
## 6 0.2107982 0.5680247 0.1620349
## 7 0.2142721 0.5563196 0.1662006
## 8 0.2132208 0.5608793 0.1665361
## 9 0.2210012 0.5239257 0.1747875
## 10 0.2172485 0.5432590 0.1732183
## 11 0.2229628 0.5254602 0.1776785
## 12 0.2283766 0.5089757 0.1837060
## 13 0.2324025 0.4987415 0.1866540
## 14 0.2328633 0.5024031 0.1892235
## 15 0.2344155 0.4959076 0.1909953
## 16 0.2398379 0.4733137 0.1976369
## 17 0.2426532 0.4626844 0.2005701
## 18 0.2441876 0.4651065 0.2042799
## 19 0.2462564 0.4589822 0.2058679
## 20 0.2493915 0.4487920 0.2099251
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 3.
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
min(knn.tune$results$RMSE)
## [1] 0.2014253
(ix <- which.min(knn.tune$results$RMSE))
## [1] 3
max(knn.tune$results$Rsquared)
## [1] 0.6146012
(ix <- which.max(knn.tune$results$Rsquared))
## [1] 3
# directly find optimum
knn.tune$finalModel$k
## [1] 3
Here there is a clear winner, k = 3. It is not a very good model: cross-validation RMSE about 0.201, \(R^2 =\) 0.615.
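We can also read the cross-validation statistics for the selected k directly from the tuning results (a quick check):
subset(knn.tune$results, k == knn.tune$finalModel$k)[, c("k", "RMSE", "Rsquared", "MAE")]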
Use this tuned k to predict back at the known points:
knn.final <- caret::knnreg(logZn ~ x + y + dist + elev + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
knn.pred <- predict(knn.final, newdata=meuse)
plot(meuse$logZn ~ knn.pred, asp=1, col=meuse$ffreq, pch=20)
abline(0,1); grid()
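A quick check of the within-sample fit (note this is a re-substitution fit, not cross-validation, so it will be optimistic compared to the tuning RMSE above):
sqrt(mean((knn.pred - meuse$logZn)^2))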
We can’t map with this because we don’t have elevation in the prediction grid, so re-tune without it:
preds.2 <- c("x","y","ffreq","dist.m","soil","lime")
all.preds.2 <- meuse[, preds.2]
knn.tune <- train(x = all.preds.2, y = all.resp,
method="knn",
tuneGrid = data.frame(.k = 1:20),
trControl = trainControl(method = 'cv'))
min(knn.tune$results$RMSE)
## [1] 0.1923328
(ix <- which.min(knn.tune$results$RMSE))
## [1] 3
max(knn.tune$results$Rsquared)
## [1] 0.6350597
(ix <- which.max(knn.tune$results$Rsquared))
## [1] 3
# directly find optimum
knn.tune$finalModel$k
## [1] 3
plot.train(knn.tune, metric="RMSE")
plot.train(knn.tune, metric="Rsquared")
plot.train(knn.tune, metric="MAE")
## RMSE and R-squared agree on k = 3; check whether MAE does as well
Map with this, keeping in the final model only predictors that are also available in the prediction grid (x, y, dist, ffreq):
knn.final <- caret::knnreg(logZn ~ x + y + dist + ffreq,
data=meuse,
k = knn.tune$finalModel$k)
meuse.grid.sp$knn <- predict(knn.final, newdata=meuse.grid)
summary(meuse.grid.sp$knn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.074 2.276 2.429 2.475 2.669 3.217
spplot(meuse.grid.sp, zcol="knn")
Interesting map! Note that here the ‘nearest neighbour’ concept mixes geographic space (the x and y coordinates and the distance to the river) with feature space (flood frequency class).
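Note also that the predictors are on very different scales (coordinates in metres, normalized distance in [0, 1]), so the Euclidean distances used by kNN are dominated by the coordinates. As an optional check (a minimal sketch, not run here; knn.tune.std is a new object name), the predictors can be standardized within the resampling via caret's preProcess argument to see whether the tuning changes:
knn.tune.std <- train(x = all.preds.2, y = all.resp,
                      method = "knn",
                      preProcess = c("center", "scale"),
                      tuneGrid = data.frame(.k = 1:20),
                      trControl = trainControl(method = 'cv'))
print(knn.tune.std)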
Random forests can use coordinates and distances to geographic features as predictors; we used the x and y coordinates in the previous section.
RF can also use distances to multiple points as predictors:
Reference: Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B. M., & Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ, 6, e5518. https://doi.org/10.7717/peerj.5518
Packages used in this section: randomForest to build the RF, scales for visualization, landmap for buffer distances.
library(randomForest)   # to build the random forests
# library(quantregForest) # only needed for quantile regression forests, not used here
library(scales) # for visualization
# devtools::install_github("envirometrix/landmap")
library(landmap) # for buffer distances
Set up quantiles of the log10-transformed Zn concentration:
library(ggplot2)
g <- ggplot(meuse, mapping=aes(x=logZn))
g + geom_histogram(bins=16, fill="lightgreen", color="blue") + geom_rug()
(q.lzn <- quantile(meuse$logZn, seq(0,1,by=0.0625)))
# include.lowest=TRUE so that the minimum value is assigned to the first class
classes.lzn.q <- cut(meuse$logZn, breaks=q.lzn, ordered_result=TRUE, include.lowest=TRUE)
levels(classes.lzn.q)
## 0% 6.25% 12.5% 18.75% 25% 31.25% 37.5% 43.75%
## 2.053078 2.139461 2.201372 2.261554 2.296665 2.322219 2.362643 2.414125
## 50% 56.25% 62.5% 68.75% 75% 81.25% 87.5% 93.75%
## 2.513218 2.603414 2.680704 2.754247 2.828981 2.873820 2.920515 3.056094
## 100%
## 3.264582
## [1] "[2.05,2.14]" "(2.14,2.2]" "(2.2,2.26]" "(2.26,2.3]" "(2.3,2.32]"
## [6] "(2.32,2.36]" "(2.36,2.41]" "(2.41,2.51]" "(2.51,2.6]" "(2.6,2.68]"
## [11] "(2.68,2.75]" "(2.75,2.83]" "(2.83,2.87]" "(2.87,2.92]" "(2.92,3.06]"
## [16] "(3.06,3.26]"
Q: Are the 16 quantiles more or less equally distributed along the range of the target variable?
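One way to investigate this (a quick sketch): look at the widths of the 16 classes along the logZn range, and at the number of observations per class.
round(diff(q.lzn), 3)   # class widths along the range of logZn
table(classes.lzn.q)    # observations per class; nearly equal by construction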
Compute distance buffers to the points in each quantile class. For this, the points object must be converted to a SpatialPointsDataFrame, and the buffer.dist function in the landmap package is used. To display the distance buffers, we stack the fields of the data frame as a set of rasters, using the stack method of the raster package.
coordinates(meuse) <- ~x + y
grid.dist.lzn <- landmap::buffer.dist(meuse["logZn"], meuse.grid.sp, classes.lzn.q)
summary(grid.dist.lzn)
## Object of class SpatialPixelsDataFrame
## Coordinates:
## min max
## x 178440 181560
## y 329600 333760
## Is projected: NA
## proj4string : [NA]
## Number of points: 3103
## Grid attributes:
## cellcentre.offset cellsize cells.dim
## s1 178460 40 78
## s2 329620 40 104
## Data attributes:
## layer.1 layer.2 layer.3 layer.4
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 291.2 1st Qu.: 233.2 1st Qu.: 233.2 1st Qu.:200.0
## Median : 605.3 Median : 377.4 Median : 377.4 Median :322.5
## Mean : 695.5 Mean : 513.4 Mean : 401.1 Mean :330.4
## 3rd Qu.:1038.5 3rd Qu.: 698.6 3rd Qu.: 536.7 3rd Qu.:441.8
## Max. :2050.6 Max. :1938.7 Max. :1193.3 Max. :937.2
## layer.5 layer.6 layer.7 layer.8
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:240.0 1st Qu.: 240.0 1st Qu.: 204.0 1st Qu.: 233.2
## Median :377.4 Median : 368.8 Median : 329.8 Median : 379.5
## Mean :393.4 Mean : 379.4 Mean : 363.2 Mean : 430.7
## 3rd Qu.:521.5 3rd Qu.: 501.2 3rd Qu.: 494.8 3rd Qu.: 560.0
## Max. :990.4 Max. :1245.8 Max. :1131.4 Max. :1631.7
## layer.9 layer.10 layer.11 layer.12
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:256.1 1st Qu.: 243.3 1st Qu.: 322.5 1st Qu.: 243.3
## Median :417.6 Median : 402.0 Median : 544.1 Median : 425.2
## Mean :431.3 Mean : 464.3 Mean : 570.4 Mean : 482.8
## 3rd Qu.:600.0 3rd Qu.: 632.5 3rd Qu.: 776.7 3rd Qu.: 648.7
## Max. :950.8 Max. :1302.9 Max. :1612.5 Max. :1531.0
## layer.13 layer.14 layer.15 layer.16
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 282.8 1st Qu.: 291.2 1st Qu.: 312.4 1st Qu.: 339.4
## Median : 483.3 Median : 468.2 Median : 526.1 Median : 536.7
## Mean : 475.3 Mean : 533.7 Mean : 563.6 Mean : 538.6
## 3rd Qu.: 651.2 3rd Qu.: 688.2 3rd Qu.: 800.0 3rd Qu.: 720.0
## Max. :1019.8 Max. :1697.5 Max. :1618.4 Max. :1322.4
plot(raster::stack(grid.dist.lzn))
Q: Explain the pattern of Layer 16.
Extract the distances to each buffer at each point using the over (“overlay point on grid”) function of the sp package, and add these as fields to the points:
buffer.dists <- over(meuse, grid.dist.lzn)
str(buffer.dists)
## 'data.frame': 155 obs. of 16 variables:
## $ layer.1 : num 1897 1809 1865 1878 1737 ...
## $ layer.2 : num 1784 1695 1755 1772 1633 ...
## $ layer.3 : num 735 680 621 561 402 ...
## $ layer.4 : num 468 412 362 330 179 ...
## $ layer.5 : num 804 730 721 699 544 ...
## $ layer.6 : num 685 612 601 582 431 ...
## $ layer.7 : num 268 283 126 0 160 ...
## $ layer.8 : num 369 330 233 160 0 ...
## $ layer.9 : num 268 226 160 170 126 ...
## $ layer.10: num 243 160 226 305 283 ...
## $ layer.11: num 369 283 344 400 330 ...
## $ layer.12: num 144 160 0 126 233 ...
## $ layer.13: num 734 645 738 794 700 ...
## $ layer.14: num 1531 1442 1537 1585 1471 ...
## $ layer.15: num 0 89.4 144.2 268.3 368.8 ...
## $ layer.16: num 89.4 0 160 282.8 344.1 ...
meuse@data <- cbind(meuse@data, buffer.dists)
Show how far point 1 is from the nearest point of each quantile:
buffer.dists[1,]
## layer.1 layer.2 layer.3 layer.4 layer.5 layer.6 layer.7 layer.8
## 1 1897.367 1783.928 735.3911 468.188 803.99 684.6897 268.3282 368.7818
## layer.9 layer.10 layer.11 layer.12 layer.13 layer.14 layer.15 layer.16
## 1 268.3282 243.3105 368.7818 144.2221 734.3024 1531.013 0 89.44272
Of course the distance to its own quantile class, in this case class 15, is zero.
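We can confirm this directly (a quick check):
meuse$logZn[1]                 # the value at point 1
classes.lzn.q[1]               # its quantile class ...
as.integer(classes.lzn.q[1])   # ... which is the 15th class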
First make a formula with the distances to all 16 quantile buffers as predictors:
dn <- paste(names(grid.dist.lzn), collapse="+")
(fm <- as.formula(paste("logZn ~", dn)))
## logZn ~ layer.1 + layer.2 + layer.3 + layer.4 + layer.5 + layer.6 +
## layer.7 + layer.8 + layer.9 + layer.10 + layer.11 + layer.12 +
## layer.13 + layer.14 + layer.15 + layer.16
Now compute the random forest model and use it to predict at the observation points.
# note: randomForest has no 'min.split' argument (its closest equivalent is 'nodesize'), so it is effectively ignored here
(rf <- randomForest(fm , meuse@data, importance=TRUE, min.split=6, mtry=6, ntree=640))
##
## Call:
## randomForest(formula = fm, data = meuse@data, importance = TRUE, min.split = 6, mtry = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.02515229
## % Var explained: 74.24
pred.rf <- predict(rf, newdata=meuse@data)
plot(rf)
Q: How many trees were needed to make a reliable forest?
Variable importance:
importance(rf, type=1)
## %IncMSE
## layer.1 15.777625
## layer.2 14.000921
## layer.3 12.192586
## layer.4 11.116130
## layer.5 11.328830
## layer.6 17.637204
## layer.7 8.513953
## layer.8 9.900595
## layer.9 10.901250
## layer.10 10.956214
## layer.11 5.475709
## layer.12 17.369070
## layer.13 11.046008
## layer.14 12.776170
## layer.15 16.842512
## layer.16 23.586323
varImpPlot(rf, type=1)
Q: Which distance buffer was most important? Explain why.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
plot(meuse$logZn ~ pred.rf, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf <- sqrt(sum((pred.rf-meuse$logZn)^2)/length(pred.rf)))
## [1] 0.06573453
#
oob.rf <- predict(rf)
plot(meuse$logZn ~ oob.rf, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob <- sqrt(sum((oob.rf-meuse$logZn)^2)/length(oob.rf)))
## [1] 0.1585947
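As a cross-check (a small sketch): squaring the OOB RMSE reproduces the ‘Mean of squared residuals’ reported by print(rf) above, and ‘% Var explained’ is approximately one minus that MSE divided by the variance of logZn.
rmse.oob^2                          # compare with the mean of squared residuals above
1 - rmse.oob^2 / var(meuse$logZn)   # roughly the '% Var explained'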
Q: What are the goodness-of-fit and Out-of-bag RMSE?
pred.grid <- predict(rf, newdata=grid.dist.lzn@data)
summary(pred.grid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.100 2.316 2.436 2.486 2.648 3.179
meuse.grid.sp$logZn.rf <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p1 <- spplot(meuse.grid.sp, zcol="logZn.rf", main="Zn, log(mg kg-1)", sub="RF, 16 distance buffers", at=breaks)
print(p1)
This random forest is based on distances from known points to prediction points. An obvious comparison is with Ordinary Kriging (OK), which is also based on these distances. However, OK requires a model of spatial dependence (i.e., a variogram model).
Compare this with OK cross-validation. First compute the ordinary variogram and fit a model:
require(gstat)
v <- variogram(logZn ~ 1, meuse)
(vmf <- fit.variogram(v, model=vgm(0.1, "Sph", 1000, 0.02)))
## model psill range
## 1 Nug 0.009554751 0.0000
## 2 Sph 0.111394769 896.9909
plot(v, model=vmf, plot.numbers=T)
Here we get a well-fit model.
Now compute the cross-validation statistics:
k.cv <- krige.cv(logZn ~ 1, meuse, model=vmf, verbose=FALSE)
plot(meuse$logZn ~ k.cv$var1.pred, pch=20, xlab="Cross-validation fit", ylab="Actual value", main="Zn, log(mg kg-1)", asp=1)
abline(0,1); grid()
(rmse.kcv <- sqrt(sum((k.cv$var1.pred-meuse$logZn)^2)/length(k.cv$var1.pred)))
## [1] 0.170157
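Collect the two error measures for a side-by-side look (a small sketch):
data.frame(method = c("Spatial RF (out-of-bag)", "OK (leave-one-out CV)"),
           RMSE   = c(rmse.oob, rmse.kcv))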
Q: Which approach gives the lower cross-validation error in this case?
Compare the kriging map with the spatial random forest map:
k.ok <- krige(logZn ~ 1, meuse, newdata=meuse.grid.sp, model=vmf)
## [using ordinary kriging]
summary(k.ok$var1.pred)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.074 2.275 2.420 2.479 2.680 3.231
p2 <- spplot(k.ok, zcol="var1.pred", main="Zn, log(mg kg-1)", sub="Ordinary Kriging", at=breaks)
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
Compute and display the differences:
k.ok$diff.srf.ok <- meuse.grid.sp$logZn.rf - k.ok$var1.pred
summary(k.ok$diff.srf.ok)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.237016 -0.042862 0.004085 0.007542 0.047866 0.385077
p3 <- spplot(k.ok, zcol="diff.srf.ok", main="Spatial RF - OK predictions, log(Zn)", col.regions=topo.colors(36))
print(p3)
Q: Comment on the differences/similarities between the Spatial Random Forest map and the Ordinary Kriging map. Note that for kriging we needed to specify a variogram model, whereas for RF no model was needed.
The concept of quantile buffers is similar to that of the empirical variogram: both assume stationarity of the spatial correlation structure.
Another approach is to use the distance from each known point as a predictor. Obviously this increases the size of the problem. This is the approach taken in the Hengl paper cited above.
Set up the buffer distances from each known point and visualize a few of them:
grid.dist.0 <- landmap::buffer.dist(meuse["logZn"], meuse.grid.sp, as.factor(1:nrow(meuse)))
dim(grid.dist.0)
## [1] 3103 155
summary(grid.dist.0$layer.1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1677 2685 2556 3506 4569
plot(raster::stack(grid.dist.0[1:9]))
Extract the distances to each buffer at each point, and add these as fields to the points. Note that we first have to remove the previous buffer distances (the 16 quantiles).
buffer.dists <- over(meuse, grid.dist.0)
dim(buffer.dists)
## [1] 155 155
# distances from point 1 to the first 10 points (point 1 itself included)
(buffer.dists[1,1:10])
## layer.1 layer.2 layer.3 layer.4 layer.5 layer.6 layer.7 layer.8
## 1 0 89.44272 144.2221 268.3282 368.7818 481.6638 268.3282 243.3105
## layer.9 layer.10
## 1 400 468.188
names(meuse@data)
## [1] "cadmium" "copper" "lead" "zinc" "elev" "dist"
## [7] "om" "ffreq" "soil" "lime" "landuse" "dist.m"
## [13] "logZn" "logCu" "logPb" "logCd" "layer.1" "layer.2"
## [19] "layer.3" "layer.4" "layer.5" "layer.6" "layer.7" "layer.8"
## [25] "layer.9" "layer.10" "layer.11" "layer.12" "layer.13" "layer.14"
## [31] "layer.15" "layer.16"
meuse@data <- meuse@data[,1:16]
meuse@data <- cbind(meuse@data, buffer.dists)
So now we have 155 buffer distances. Set up a formula for the RF and build the RF. Note that the default mtry will be 51, i.e., distances to about one-third of the points. These will be chosen randomly at each split, so they will in general not be the closest points.
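We can verify the default mtry (a one-line check):
floor(ncol(buffer.dists) / 3)   # default mtry for RF regression is floor(p/3), here 51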
dn0 <- paste(names(buffer.dists), collapse="+")
fm0 <- as.formula(paste("logZn ~ ", dn0))
(rf0 <- randomForest(fm0 , meuse@data, importance=TRUE,
min.split=6, ntree=640))
##
## Call:
## randomForest(formula = fm0, data = meuse@data, importance = TRUE, min.split = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 51
##
## Mean of squared residuals: 0.03369799
## % Var explained: 65.49
This does not perform as well as using the quantile buffers: the \(R^2\) is about nine percentage points lower (65.5% vs. 74.2%).
Mapping:
pred.grid <- predict(rf0, newdata=grid.dist.0@data)
summary(pred.grid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.099 2.308 2.446 2.488 2.679 3.172
meuse.grid.sp$logZn.rf0 <- pred.grid
breaks <- seq(2, 3.3, by=.05)
p2 <- spplot(meuse.grid.sp, zcol="logZn.rf0", main="Zn, log(mg kg-1)",
sub="RF distance all points", at=breaks)
# compare with 16-quantile buffers
print(p2, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
A clearly inferior map. So it appears we do not need the distances to each individual point; the distances to the quantile buffers carry the useful information.
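Out of curiosity, we can see which individual point distances the forest found most important (a quick sketch; output not shown):
imp0 <- importance(rf0, type=1)
head(sort(imp0[, 1], decreasing=TRUE), 8)   # the eight most important point-distance layers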
Random forests can also be used to predict classes. As with RF for continuous variables, the forest is made up of many trees. The prediction is the class predicted by the majority of trees.
m.ffreq.rf <- randomForest(ffreq ~ x + y + dist.m + elev + soil + lime, data=meuse, importance=T, na.action=na.omit, mtry=3)
print(m.ffreq.rf)
##
## Call:
## randomForest(formula = ffreq ~ x + y + dist.m + elev + soil + lime, data = meuse, importance = T, mtry = 3, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.13%
## Confusion matrix:
## 1 2 3 class.error
## 1 77 7 0 0.08333333
## 2 2 41 5 0.14583333
## 3 2 9 12 0.47826087
importance(m.ffreq.rf)
## 1 2 3 MeanDecreaseAccuracy MeanDecreaseGini
## x 18.29162 11.344016 8.060948 21.165561 13.189545
## y 34.98967 20.135347 41.654420 47.717887 32.971857
## dist.m 10.11069 1.711775 5.707771 9.391853 7.262792
## elev 57.11589 36.914994 12.138088 60.901345 32.347641
## soil 10.15313 2.885389 4.660304 10.613597 2.226856
## lime 11.18376 -2.155811 8.122504 11.160544 2.380405
varImpPlot(m.ffreq.rf)
Here we see the per-class error rate, i.e., the proportion of training samples not predicted into the correct class (based on the out-of-bag predictions). In this example class 3 is poorly predicted, whereas class 1 is well predicted.
Elevation is by far the most important predictor of accuracy for all classes. The coordinates are next, because of the geographic configuration of the study area; they can substitute for distance to the river when that is not used in the model.
However, for the Gini criterion (how often a randomly-chosen training observation would be incorrectly assigned if it were randomly labelled according to the frequency distribution of labels in the subset), the y-coordinate becomes somewhat more important, perhaps because of the one high value far to the east, at the river bend.
See where the predictors are used:
library(randomForestExplainer)   # for min_depth_distribution and its plot
min_depth_frame <- min_depth_distribution(m.ffreq.rf)
plot_min_depth_distribution(min_depth_frame)
Soil type and lime application are not used in more than half the trees. The y-coordinate is used first in most trees, followed by elevation and x-coordinate.
We can see the probabilities of each class by using the model to predict back at the known points:
summary(prob_rf <- predict(m.ffreq.rf, meuse, type = "prob"))
## 1 2 3
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0380 1st Qu.:0.0050 1st Qu.:0.0000
## Median :0.8020 Median :0.0460 Median :0.0080
## Mean :0.5461 Mean :0.3103 Mean :0.1436
## 3rd Qu.:0.9910 3rd Qu.:0.7920 3rd Qu.:0.0520
## Max. :1.0000 Max. :0.9800 Max. :0.9880
# probability of each observation being predicted to class 1
prob_rf[,1]
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0.984 1.000 0.998 1.000 1.000 1.000 0.998 0.992 0.990 0.990 0.998 0.922 0.998
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 0.976 0.984 0.988 1.000 0.996 0.992 1.000 0.946 0.990 0.984 0.996 1.000 1.000
## 27 28 29 30 31 32 33 34 35 37 38 39 40
## 1.000 1.000 0.996 0.988 1.000 0.996 1.000 1.000 1.000 1.000 0.998 0.984 1.000
## 41 42 43 44 45 46 47 48 49 50 51 52 53
## 1.000 0.974 0.988 0.940 0.772 0.952 0.984 0.952 0.802 0.680 0.774 0.766 0.964
## 54 55 56 57 58 59 60 61 62 63 64 65 66
## 0.962 0.958 0.980 0.984 0.836 0.956 0.984 0.988 0.982 0.992 0.994 0.992 0.982
## 67 69 75 76 79 80 81 82 83 84 85 86 87
## 1.000 0.964 0.720 0.942 1.000 0.996 0.976 0.976 0.988 0.984 0.988 0.896 0.772
## 88 89 90 123 160 163 70 71 91 92 93 94 95
## 1.000 0.994 1.000 0.878 0.994 0.996 0.120 0.072 0.016 0.008 0.008 0.004 0.104
## 96 97 98 99 100 101 102 103 104 105 106 108 109
## 0.042 0.070 0.076 0.002 0.096 0.032 0.030 0.016 0.006 0.064 0.062 0.038 0.016
## 110 111 112 113 114 115 116 117 118 119 120 121 122
## 0.158 0.094 0.016 0.046 0.016 0.002 0.038 0.012 0.004 0.024 0.010 0.032 0.038
## 124 125 126 127 128 129 130 131 132 133 134 135 136
## 0.072 0.074 0.048 0.010 0.102 0.216 0.308 0.074 0.066 0.050 0.158 0.118 0.074
## 161 162 137 138 140 141 142 143 144 145 146 147 148
## 0.114 0.042 0.336 0.064 0.008 0.000 0.014 0.000 0.026 0.006 0.000 0.004 0.000
## 149 150 151 152 153 154 155 156 157 158 159 164
## 0.000 0.000 0.172 0.102 0.000 0.002 0.002 0.002 0.004 0.004 0.002 0.106
# an observation in order of probability
sort(prob_rf[1,], decreasing=TRUE)
## 1 2 3
## 0.984 0.008 0.008
order(prob_rf[1,], decreasing=TRUE)
## [1] 1 2 3
# its actual class -- in this case it agrees
meuse$ffreq[1]
## [1] 1
## Levels: 1 2 3
These can be summarized into lists of the most-probable, second-most-probable, etc. classes for each observation:
# list of most likely classes
probs <- apply(prob_rf, 1, order, decreasing=TRUE)
probs[,1]
## [1] 1 2 3
probs[,121]
## [1] 2 1 3
# most likely, 2nd most, 3rd most, for each observation
table(probs.1 <- probs[1,])
##
## 1 2 3
## 84 48 23
# cross-classification of most and 2nd-most probable classes
table(probs.1, probs.2 <- probs[2,])
##
## probs.1 1 2 3
## 1 0 72 12
## 2 26 0 22
## 3 3 20 0
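As a quick check (a sketch), the most-probable class at each training point can be compared with its observed class; as is typical for within-sample RF predictions, this agreement will be much higher than the OOB error rate suggests.
mean(probs[1, ] == as.integer(meuse$ffreq))   # proportion of points whose most-probable class is the observed class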
Again we need to restrict the RF model to variables also available in the prediction grid.
(m.ffreq.rf.2 <- randomForest(ffreq ~ x + y + dist + soil, data=meuse, importance=T, na.action=na.omit, mtry=3))
##
## Call:
## randomForest(formula = ffreq ~ x + y + dist + soil, data = meuse, importance = T, mtry = 3, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 27.1%
## Confusion matrix:
## 1 2 3 class.error
## 1 72 11 1 0.1428571
## 2 14 29 5 0.3958333
## 3 3 8 12 0.4782609
This is less accurate than the model with elevation. Anyway, we show the map:
meuse.grid.sp$predictedffreq.rf <- predict(m.ffreq.rf.2, newdata=meuse.grid)
spplot(meuse.grid.sp, zcol="predictedffreq.rf", main="Most probable flood frequency class")
Not a nice-looking map. It disagrees strongly with the actual flooding-frequency map.
summary(meuse.grid.sp$ffreq.tf <-
as.factor((meuse.grid.sp$predictedffreq.rf==
meuse.grid.sp$ffreq)))
## FALSE TRUE
## 1461 1642
spplot(meuse.grid.sp, zcol="ffreq.tf",
main="Accuracy of RF model")
Probability of each class at each grid cell:
meuse.grid.sp$probffreq <-
predict(m.ffreq.rf.2, newdata=meuse.grid, type="prob")
meuse.grid.sp$probffreq.rf.1 <- meuse.grid.sp$probffreq[,1]
meuse.grid.sp$probffreq.rf.2 <- meuse.grid.sp$probffreq[,2]
meuse.grid.sp$probffreq.rf.3 <- meuse.grid.sp$probffreq[,3]
p1.p <- spplot(meuse.grid.sp, zcol="probffreq.rf.1",
main="Prob. class 1")
p2.p <- spplot(meuse.grid.sp, zcol="probffreq.rf.2",
main="Prob. class 2")
p3.p <- spplot(meuse.grid.sp, zcol="probffreq.rf.3",
main="Prob. class 3")
print(p1.p, split=c(1,1,3,1), more=T)
print(p2.p, split=c(2,1,3,1), more=T)
print(p3.p, split=c(3,1,3,1), more=F)
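An optional extra view (a sketch; probffreq.max is a new field name): map the probability of the most probable class at each cell, as a rough measure of how confident the classifier is across the area.
meuse.grid.sp$probffreq.max <- pmax(meuse.grid.sp$probffreq.rf.1,
                                    meuse.grid.sp$probffreq.rf.2,
                                    meuse.grid.sp$probffreq.rf.3)
spplot(meuse.grid.sp, zcol="probffreq.max", main="Probability of the most probable class")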
Returning to the spatial random forest prediction of logZn: does adding covariates improve the prediction?
(covar.names <- paste(intersect(names(meuse), names(meuse.grid)), collapse="+"))
## [1] "dist+ffreq+soil"
(fm.c <- as.formula(paste("logZn ~", dn, "+", covar.names)))
## logZn ~ layer.1 + layer.2 + layer.3 + layer.4 + layer.5 + layer.6 +
## layer.7 + layer.8 + layer.9 + layer.10 + layer.11 + layer.12 +
## layer.13 + layer.14 + layer.15 + layer.16 + dist + ffreq +
## soil
Now compute the random forest model including these covariates, and use it to predict at the observation points. Note that layer.1 to layer.16 in this formula are meant to be the distances to the 16 quantile buffers; since meuse@data was overwritten above with the 155 individual-point distances, those columns should first be restored (e.g. meuse@data <- cbind(meuse@data[, 1:16], over(meuse, grid.dist.lzn))) so that the point data match the prediction grid built below.
(rf.c <- randomForest(fm.c , meuse@data, importance=TRUE, min.split=6, mtry=6, ntree=640))
##
## Call:
## randomForest(formula = fm.c, data = meuse@data, importance = TRUE, min.split = 6, mtry = 6, ntree = 640)
## Type of random forest: regression
## Number of trees: 640
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.02806032
## % Var explained: 71.27
pred.rf.c <- predict(rf.c, newdata=meuse@data)
plot(rf.c)
Display the variable importance:
importance(rf.c, type=1)
## %IncMSE
## layer.1 9.048894
## layer.2 9.535551
## layer.3 8.410602
## layer.4 8.587360
## layer.5 10.529506
## layer.6 11.239619
## layer.7 8.394626
## layer.8 8.562815
## layer.9 9.408776
## layer.10 10.414668
## layer.11 10.157294
## layer.12 10.373565
## layer.13 9.935686
## layer.14 8.634690
## layer.15 8.074228
## layer.16 10.795800
## dist 44.034304
## ffreq 32.572061
## soil 17.952098
varImpPlot(rf.c, type=1)
Q: Are any of the covariables important? Are any more important than the distance buffers? Explain.
Plot the RF fit and the OOB predictions. Compute the fitted and OOB RMSE.
plot(meuse$logZn ~ pred.rf.c, asp=1, pch=20, xlab="Random forest fit", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.rf.c <- sqrt(sum((pred.rf.c-meuse$logZn)^2)/length(pred.rf.c)))
## [1] 0.07758499
#
oob.rf.c <- predict(rf.c)
plot(meuse$logZn ~ oob.rf.c, asp=1, pch=20, xlab="Out-of-bag prediction", ylab="Actual value", main="Zn, log(mg kg-1)")
abline(0,1); grid()
(rmse.oob.c <- sqrt(sum((oob.rf.c-meuse$logZn)^2)/length(oob.rf.c)))
## [1] 0.1675122
Q: What are the goodness-of-fit and Out-of-bag RMSE? How do these compare to the spatial RF without covariates?
We first have to add the covariates to the prediction grid.
grid.dist.lzn.covar <- grid.dist.lzn
grid.dist.lzn.covar$dist <- meuse.grid$dist
grid.dist.lzn.covar$ffreq <- meuse.grid$ffreq
grid.dist.lzn.covar$soil <- meuse.grid$soil
Now predict over the grid and display:
pred.grid.c <- predict(rf.c, newdata=grid.dist.lzn.covar@data)
summary(pred.grid.c)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.293 2.345 2.431 2.495 2.614 2.925
meuse.grid.sp$logZn.rf.c <- pred.grid.c
breaks <- seq(2, 3.3, by=.05)
p1.c <- spplot(meuse.grid.sp, zcol="logZn.rf.c", main="Zn, log(mg kg-1)", sub="Random forest on distance buffers & covariates", at=breaks)
print(p1.c, split=c(1,1,2,1), more=T)
print(p1, split=c(2,1,2,1), more=F)
Compute and display the differences:
meuse.grid.sp$logZn.rf.diff <- meuse.grid.sp$logZn.rf.c - meuse.grid.sp$logZn.rf
summary(meuse.grid.sp$logZn.rf.diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.444049 -0.078527 0.007356 0.008760 0.096491 0.438853
p3.c <- spplot(meuse.grid.sp, zcol="logZn.rf.diff", main="Spatial RF with - without covariates, log(Zn)", col.regions=topo.colors(36))
print(p3.c)
Q: Compare this to the maps made by RF/Spatial Distances without covariates. Where does using the covariates result in higher and lower predicted values? Try to explain why.
Challenge: Map logZn over the grid by Kriging with External Drift (KED), using the same three covariates (dist, ffreq, soil), and compare with the map made by spatial random forest with covariates. Reminder: the variogram must be computed for the linear-model residuals using these covariates.
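As a possible starting point (a minimal sketch only; it assumes meuse.grid.sp contains the dist, ffreq and soil fields, as meuse.grid does, and the initial variogram parameters are rough guesses to be refined):
# variogram of the residuals from the linear trend on the covariates
v.r <- variogram(logZn ~ dist + ffreq + soil, meuse)
(vmf.r <- fit.variogram(v.r, model=vgm(0.05, "Sph", 800, 0.01)))
plot(v.r, model=vmf.r, plot.numbers=T)
# KED: the same trend formula plus the fitted residual variogram
k.ked <- krige(logZn ~ dist + ffreq + soil, meuse, newdata=meuse.grid.sp, model=vmf.r)
p.ked <- spplot(k.ked, zcol="var1.pred", main="Zn, log(mg kg-1)", sub="KED", at=breaks)
print(p.ked, split=c(1,1,2,1), more=T)
print(p1.c, split=c(2,1,2,1), more=F)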
Q: Describe the differences between the residual and ordinary variogram.
Q: Describe the differences between the OK and KED maps.
Q: Compare the KED map with the spatial random forest + covariates map.