Tutorials¶

Here are the many tutorials I’ve written [1] dealing with various aspects of spatial statistics. Please note the dates of each one, the older ones may be partly obsolete or require some adjustment to new versions of the computer programs.

Many of the documents are provided as R Markdown sources. These are most easily processed in the R Studio environment.

R Project for Statistical Computing¶

Introduction to the the R Project for Statistical Computing and the S language
An R Markdown source that introduces the most important features of R.
- compiled HTML
- answers to exercises (HTML)

Geostatistics¶

Spatial analysis in R is transitioning to the “Simple Features” representation of spatial objects, as implemented in the sf package. Many of the tutorials listed here were developed with the earlier “Spatial Objects in R” representation, as implemented in the sp package. Be alert also to changes in the GDAL and PROJ packages when specifying or transforming coördinate reference systems.

An introduction to (geo)statistics with R (sf version)
An introduction to (geo)statistics with R (sp version)
A brief introduction to exploratory and inferential geo- statistical analysis. At the same time, it introduces the R environment for statistical computing and visualisation] and several R packages, notably sf for spatial data structures and gstat for conventional geostatistics. The exercise assumes no prior knowledge of either geostatistics nor the R environment. You can also review an earlier version using the (now outmoded) sp package.
- R code (sf version)
- R code (sp version)
- Introduction to Rikken, M. G. J., & Van Rijn, R. P. G. (1993) : Soil pollution with heavy metals—An inquiry into spatial variation, cost of mapping and the risk evaluation of copper, cadmium, lead and zinc in the floodplains of the Meuse west of Stein, the Netherlands. Dept. of Physical Geography, Utrecht University. This is the orignal report from which the “Meuse dataset” was created.

Co-kriging with the gstat package of the R environment for statistical computing
Improving the mapping of an undersampled attribute that is co-regionalized with a more intensively sampled attribute.
- R code
- plotting functions

Spatial simulated annealing

Step-by-step explanation of a method to optimally place sample points in order to minimize the maximum or average kriging variance. This is especially useful in irregularly-shaped areas or areas with existing samples.
Distance education course Geostatistics & Open-source statistical computing
- Supplementary exercises
  - Exercise: Change of support
    
    All geographical measurements are made on some support, that is, an interval (1-D), area (2-D) or volume (3-D) of some finite size. As long as the measurements, interpretations, and predictions all refer to the same support, techniques that treat the support as a 0-D point are satisfactory. But if measurements and predictions are made on different supports, the relation between them must be determined and used to adjust the geostatistical analysis.
    
    R code (part 1)
    
    R code (part 2)
  - Exercise: Compositional variables
    
    Certain (geo)statistical variables, when considered as a group, are not independent in feature space, because they are constrained to sum to some constant; the set of these is called a composition. They should not be modelled separately, rather, as a group.
    
    R code
  - Exercise: Spatio-temporal Geostatistics
    
    Spatio-temporal observations are those for which both a spatial location (georeference) and a time of observation are recorded, as well as attributes measured at the specified location and time. This exercise introduces space-time geostatistics to analyze attributes in space and time, separately and simultaneously.
    
    R code
Interactive Excel worksheets explaining spatial autocorrelation, variograms and kriging (XLS, compressed)

Written by prof.dr.ir. Alfred Stein (University of Twente), formatted and with some more explanation by me. (1) Simulation of spatial correlation in one dimension, (2) Ordinary Kriging, (3) Universal Kriging, (4) Cokriging, (5) Selecting a grid spacing for kriging. Distributed by permission.

R Markdown tutorials¶

These illustrate some details of geostatistics. Load into R Studio and compile (“knit”) to HTML, or execute chunk-by-chunk

Constructing a prediction grid

Shows how to create a regular grid onto which kriging or another prediction method can be applied.
Kriging From scratch -- Spatial Classes version

A direct application of the ordinary kriging equations to derive kriging weights. This version uses the legacy sp “Classes and Methods for Spatial Data” representation of spatial objects.
Kriging From scratch -- Simple Features version

A direct application of the ordinary kriging equations to derive kriging weights. This version uses the newer sf “Simple Features” representation of spatial objects.
Detecting and modelling anisotropy for Ordinary Kriging
Mapping classes

Shows how to map class probabilities over a grid using indicator kriging in gstat, and then make a map of the most probable, along with a map of prediction reliability, represented by the maximum probability of any class. It also has a section on using classification trees and random forests to classify the same dataset.
(Geo)statistical simulation

Random numbers in R; simulation of binomial, normal, and mixed distributions; application to simulating random fields.

Spatial analysis¶

Spatial analysis in R is transitioning to the “Simple Features” representation of spatial objects, as implemented in the sf package. Many of the tutorials listed here were developed with the earlier “Spatial Objects in R” representation, as implemented in the sp package. Be alert also to changes in the GDAL and PROJ packages when specifying or transforming coördinate reference systems.

Trend surfaces in R by Ordinary and Generalized Least Squares
A trend surface is a map of some continuous variable, computed as a function of the coördinates. In many cases the assumption that the OLS residuals are spatially-independent is not true, so that GLS must be used to obtain a correct trend-surface formula. This one has been converted to use sf.
- Kansas aquifer dataset (TXT)
- R code
Thin-plate spline interpolation with R
Shows how to fit a surface (as in the trend surface) but adjusting to local observations (as in kriging) using 2-D smoothing splines. This is useful if one wants to quickly obtain a clear map showing the main features of the variable, without the model assumptions of trend surfaces or kriging.
- Sanford transect dataset (TXT)
- R code
Areal Data and Spatial Autocorrelation R
This tutorial gives an overview of spatial analysis of areal data, that is, attributes of polygonal entities on a map. Typical examples are political divisions, census tracts, and ownership or management parcels. The attribute relates to the whole area of the polygon, and can not be further localized.
- Central NY 8-county census tracts leukemia incidence (shapefile, neighbours, zipped)
- R code
Exercise: Exploratory Data Analysis with GeoDa
GeoDa is an open-source program, cross-platform program designed as a simple tool for exploratory spatial data analysis (ESDA) and some spatial modelling of spatial polygon data, that is, maps of polygon units such as census tracts or political divisions with a set of attributes measured on each one. This uses a portion of the 8-county data of the R exercise (see item just above).
- Syracuse census tracts leukemia incidence (shapefile, neighbours, zipped)
Point-pattern analysis with R
This tutorial gives an overview of spatial point-pattern analysis. This considers the distribution of one or more sets of points in some bounded region as the result of some stochastic process which produces a finite number of “events” or “occurrences”.
- R code
Optimal partitioning of soil transects with R
Applies the split moving window approach to optimally partition linear series of observations to find “natural” boundaries.
- R functions to implement the split moving window algorithm (R source code)
- Example transect (Jururena, Matto Grosso) (CSV)

Spatial sampling¶

Spatial Coverage Sampling
The spcosa R package is described in Walvoort, D. J. J., Brus, D. J., & de Gruijter, J. J. (2010). An R package for spatial coverage sampling and random sampling from compact geographical strata by k-means. Computers & Geosciences, 36(10), 1261–1267. https://doi.org/10.1016/j.cageo.2010.04.005

The aim is to place the sample locations so that they cover the study area as uniformly as possible. In a rectangular area grid sampling will accomplish this, but most study areas are not rectangles. Further, this package accounts for two-stage sampling: we can use the existing points and add new ones in the optimal locations.
- datasets used in this tutorial (compressed)
- compiled HTML

General statistical methods¶

Using the R Environment for Statistical Computing: An example with the Mercer & Hall wheat yield dataset
A systematic analysis of a simple dataset: the Mercer & Hall wheat yield uniformity trial: exploratory graphics, descriptive statistics, univariate & bivariate modelling, bootstrapping, robust methods, multivariate modelling, principal components analysis, model evaluation, cross-validation, spatial analysis, spatial structure, generalized least squares, geographically-weighted regression, clustering, periodicity (spectral analysis).
- Mercer & Hall wheat yield dataset (CSV)
- R code
An example of statistical data analysis using the R environment for statistical computing
This tutorial presents a data analysis sequence which may be applied to environmental datasets, using a small but typical data set of multivariate point observations: 147 soil profile observations representative of the humid forest region of southwestern Cameroon . It is aimed at students in geo-information application fields who have some experience with basic statistics, but not necessarily with statistical computing. Five aspects are emphasised:
Placing statistical analysis in the framework of research questions;

Moving from simple to complex methods: first exploration, then selection of promising modelling approaches;

Visualising as well as computing;

Making correct inferences;

Statistical computation and visualization.
- point dataset (CSV)
- R code (compressed)
Analyzing land cover change with logistic regression in R
Land cover change at 1 064 grid cells from the Chapare region of Cochabamba province, Bolivia
- point dataset (CSV)
- R code (compressed)
Curve fitting with the R Environment for Statistical Computing
R gives the user more insight and control than provided by “push the button” programs such as CurveFit.
- R code

R Markdown tutorials¶

Model evaluation
Describes and illustrates measures of model success, i.e., its presumed predictive accuracy. This is sometimes called model “validation”. Defines and illustrates RMSE, variance explained against the 1:1 line, bias, gain, Lin’s concordance, Taylor diagrams, Gauch decomposition.
- compiled HTML
Explaining the concept of variance
Explains the concept of variance and how it is affected by classification.
- compiled HTML
Box-Cox transformation
Box and Cox (1964) developed a family of transformations designed to reduce non-normality of the errors in a linear model. Applying this transform often reduces non-linearity and heteroscedascity as well.
- compiled HTML
Explaining Ordinary Least Squares (OLS) regression with R
Shows how linear models are computed by ordinary least squares (OLS) and by a robust regression variant of OLS.
- compiled HTML
Demonstrating Generalized Least Squares regression
GLS accounts for autocorrelation in the linear model residuals. This example is of spatial autocorrelation, using the Mercer & Hall wheat yield dataset (CSV)
- compiled HTML
Meuse heavy metals exercise — CART and random forests
Illustrates the use of regression trees, classification trees, Cubist, k-nearest neighbours, and spatially-explicit random forests
- compiled HTML
Regression Trees by hand
Shows how rpart decides on splits when building a regression tree.
- compiled HTML
Variability of Regression Trees
Demonstrates the sensivity of regression trees built by rpart to small changes in the training dataset, using the Meuse heavy metals dataset.
- compiled HTML
Finding the proper complexity parameter for a Regression Tree
Shows how cross-validation is used to assess the proper complexity parameter for a regression tree. This is the graph shown by printcp, based on cross-validation computations in rpart.
- compiled HTML
Comparing Random Forest packages
Compares the randomForest, ranger, and splmRF packages, also presents interpretable machine learning including Shapley and SHAP values.
- compiled HTML
- compiled PDF
Nonlinear Principal Components Analysis, a.k.a. Multivariate Analysis with Optimal Scaling
Practice with the “Gifi” approach of de Leeuw and colleagues. “The data analytic approach does not start with a model, but looks for transformations and combinations of [categorical] variables with the explicit purpose of representing the data in a simple and comprehensive, and usually graphical, way.”
- example dataset (RData)
- compiled HTML

Regional mapping¶

Regional mapping of climate variables from point samples (PDF)
Various methods for regional mapping of climate variables from station information using as predictors coördinates (Northing, Easting) and elevation, as well as the local neighbourhood: Ordinary Least Squares trend; Generalized Least Squares trend; Regression Kriging; Kriging with External Drift; Generalized Additive Models trend; Geographically-weighted regression; Data-driven methods: Regression trees, Random Forests, Cubist Thin-plate splines; Local interpolators: Ordinary kriging, inverse-distance, Thiessen polygons. Uses the ggplot2, nlme, rgdal, sp, gstat, rpart, randomForest, ranger, Cubist, caret, raster, plotKML and fields R packages.
R code

Datasets used in this tutorial:

weather station climate summaries (shapefiles, zipped, 2.5 Mb)

documentation of U.S. Climate Normals 1971-2000 products (PDF)

48-states SRTM DEM (ESRI ASCII grid, zipped, 22.3 Mb)

4-state bounding box 4km resolution DEM (NJ, NY, PA, VT) (RData)

4-state 4km resolution DEM (NJ, NY, PA, VT) (RData)

USA state boundaries (shapefile, zipped, 3.2 Mb)

Additional covariates

The above exercise uses only three regional predictors as well as the local neighbourhood. Additional covariates might have a relation to regional or local climate, and thus might improve the regional mapping. These are: distance to the Great Lakes shoreline, distance to the Atlantic Ocean coast, two terrain indices (Multi-resolution Valley-Bottom Flatness MRVBF and Terrain Ruggedness Index TRI), and population density within two radii around stations or prediction points (nominally 2.5’ and 15’).
Regional mapping of climate variables from point samples: Data preparation (PDF)
- R code
  Lakes Erie & Ontario (OGC Geopackage)
  
  Atlantic Ocean coastline, NE USA (OGC Geopackage)
  
  multiresolution index of valley bottom flatness (MRVBF) (SAGA raster)
  
  Terrain Ruggedness Index (SAGA raster)
  
  World population density (15' resolution) (GeoTiff, 1 Mb)
  
  World population density (2.5' resolution) (GeoTiff, 23.6 Mb)
  - Using the additional covariates (R Markdown source)
    
    Prepared dataset of additional covariates (RData, 2.5 Mb)

Regional mapping of climate variables from point samples: Northeast China
This applies many of the methods of the above tutorial to a region in China.
- In-class exercises Regional mapping of climate variables (R Markdown sources):
- Setting up the dataset
  - procedure to set up the dataset (PDF)
  - Procedure to set up the dataset (R Markdown source)
    
    Temperature summaries 1981-2010 (CSV)
    
    Precipitation summaries 1981-2010 (CSV)
    
    Administrative boundaries of China (Geopackage, 72.6 Mb)
  - Procedure to set up a prediction grid (R Markdown source)
    
    World SRTM 1km resolution DEM (ESRI grid, compressed, 195.2 Mb)
- Additional covariates
  
  The above exercise uses only three regional predictors as well as the local neighbourhood. Additional covariates might have a relation to regional or local climate, and thus might improve the regional mapping. These are: one terrain index (Multi-resolution Valley-Bottom Flatness MRVBF) and population density within two radii around stations or prediction points (nominally 2.5’ and 15’).
  - Setting up the additional covariates (R Markdown source)
    
    Multiresolution index of valley bottom flatness (MRVBF) (SAGA raster)
    
    World population density (15' resolution) (GeoTiff, 1 Mb)
    
    World population density (2.5' resolution) (GeoTiff, 23.6 Mb)
  - Using the additional covariates (R Markdown source)
  Prepared dataset of additional covariates (RData, 6 Mb)

Geographic information systems (GIS)¶

A “simple” analysis with QGIS

Water pollution hazard for a drinking water reservoir in Tompkins County, NY (USA), using QGIS 3.4
Creating geometrically-correct photo-interpretations, photomosaics, and base maps for a project GIS
Construcción de modelos digitales de terreno para la evaluación de tierras

R Markdown tutorials¶

Using Google Earth Engine with R

The rgee package provides an interface from R to Google Earth Engine (GEE). This tutorial leads you through its installation and basic use.
- compiled PDF
- compiled HTML
For an example use of rgee see Using SoilGrids and Google Earth Engine with R to identify “homogeneous” soil zones

Time-series analysis¶

Time-series analysis with R
- tutorial datasets (compressed)
- R code
Spatio-temporal geostatistics, includes a section on temporal structure
Fitting rational functions to time series in R
Rational functions are ratios of any two polynomials in a single variable. In this example we fit linear/quadratic rational functions to an irregular time-series of proportional changes from an initial condition of a soil property in response to land use. The fitted equation can be interpreted to find the time to reach a maximum proportional deviation, and the value of that deviation.
- tutorial dataset (CSV)

Last modified 10-May-2025