Tutorials¶
Here are the many tutorials I’ve written [1] dealing with various aspects of spatial statistics. Please note the dates of each one, the older ones may be partly obsolete or require some adjustment to new versions of the computer programs.
Many of the documents are provided as R Markdown sources. These are most easily processed in the R Studio environment.
R Project for Statistical Computing¶
Introduction to the the R Project for Statistical Computing and the S language
An R Markdown source that introduces the most important features of R.
Geostatistics¶
Spatial analysis in R is transitioning to the “Simple Features” representation of spatial objects,
as implemented in the sf package.
Many of the tutorials listed here were developed with the earlier “Spatial Objects in R” representation, as implemented in the sp
package.
Be alert also to changes in the GDAL and PROJ packages when specifying or transforming coördinate reference systems.
An introduction to (geo)statistics with R (sp version)
A brief introduction to exploratory and inferential geo- statistical analysis. At the same time, it introduces the R environment for statistical computing and visualisation] and several R packages, notably
sf
for spatial data structures andgstat
for conventional geostatistics. The exercise assumes no prior knowledge of either geostatistics nor the R environment. You can also review an earlier version using the (now outmoded)sp
package.Introduction to Rikken, M. G. J., & Van Rijn, R. P. G. (1993) : Soil pollution with heavy metals—An inquiry into spatial variation, cost of mapping and the risk evaluation of copper, cadmium, lead and zinc in the floodplains of the Meuse west of Stein, the Netherlands. Dept. of Physical Geography, Utrecht University. This is the orignal report from which the “Meuse dataset” was created.
Co-kriging with the gstat package of the R environment for statistical computing
Improving the mapping of an undersampled attribute that is co-regionalized with a more intensively sampled attribute.
-
Step-by-step explanation of a method to optimally place sample points in order to minimize the maximum or average kriging variance. This is especially useful in irregularly-shaped areas or areas with existing samples.
Distance education course Geostatistics & Open-source statistical computing
Supplementary exercises
-
All geographical measurements are made on some support, that is, an interval (1-D), area (2-D) or volume (3-D) of some finite size. As long as the measurements, interpretations, and predictions all refer to the same support, techniques that treat the support as a 0-D point are satisfactory. But if measurements and predictions are made on different supports, the relation between them must be determined and used to adjust the geostatistical analysis.
Exercise: Compositional variables
Certain (geo)statistical variables, when considered as a group, are not independent in feature space, because they are constrained to sum to some constant; the set of these is called a composition. They should not be modelled separately, rather, as a group.
Exercise: Spatio-temporal Geostatistics
Spatio-temporal observations are those for which both a spatial location (georeference) and a time of observation are recorded, as well as attributes measured at the specified location and time. This exercise introduces space-time geostatistics to analyze attributes in space and time, separately and simultaneously.
-
Interactive Excel worksheets explaining spatial autocorrelation, variograms and kriging
(XLS, compressed)Written by prof.dr.ir. Alfred Stein (University of Twente), formatted and with some more explanation by me. (1) Simulation of spatial correlation in one dimension, (2) Ordinary Kriging, (3) Universal Kriging, (4) Cokriging, (5) Selecting a grid spacing for kriging. Distributed by permission.
R Markdown tutorials¶
These illustrate some details of geostatistics. Load into R Studio and compile (“knit”) to HTML, or execute chunk-by-chunk
Constructing a prediction grid
Shows how to create a regular grid onto which kriging or another prediction method can be applied.
Kriging From scratch -- Spatial Classes version
A direct application of the ordinary kriging equations to derive kriging weights. This version uses the legacy
sp
“Classes and Methods for Spatial Data” representation of spatial objects.Kriging From scratch -- Simple Features version
A direct application of the ordinary kriging equations to derive kriging weights. This version uses the newer
sf
“Simple Features” representation of spatial objects.-
Shows how to map class probabilities over a grid using indicator kriging in
gstat
, and then make a map of the most probable, along with a map of prediction reliability, represented by the maximum probability of any class. It also has a section on using classification trees and random forests to classify the same dataset. -
Random numbers in R; simulation of binomial, normal, and mixed distributions; application to simulating random fields.
Spatial analysis¶
Spatial analysis in R is transitioning to the “Simple Features” representation of spatial objects,
as implemented in the sf package.
Many of the tutorials listed here were developed with the earlier “Spatial Objects in R” representation, as implemented in the sp
package.
Be alert also to changes in the GDAL and PROJ packages when specifying or transforming coördinate reference systems.
Trend surfaces in R by Ordinary and Generalized Least Squares
A trend surface is a map of some continuous variable, computed as a function of the coördinates. In many cases the assumption that the OLS residuals are spatially-independent is not true, so that GLS must be used to obtain a correct trend-surface formula. This one has been converted to use
sf
.Kansas aquifer dataset
(TXT)
Thin-plate spline interpolation with R
Shows how to fit a surface (as in the trend surface) but adjusting to local observations (as in kriging) using 2-D smoothing splines. This is useful if one wants to quickly obtain a clear map showing the main features of the variable, without the model assumptions of trend surfaces or kriging.
Areal Data and Spatial Autocorrelation R
This tutorial gives an overview of spatial analysis of areal data, that is, attributes of polygonal entities on a map. Typical examples are political divisions, census tracts, and ownership or management parcels. The attribute relates to the whole area of the polygon, and can not be further localized.
Central NY 8-county census tracts leukemia incidence
(shapefile, neighbours, zipped)
Exercise: Exploratory Data Analysis with GeoDa
GeoDa is an open-source program, cross-platform program designed as a simple tool for exploratory spatial data analysis (ESDA) and some spatial modelling of spatial polygon data, that is, maps of polygon units such as census tracts or political divisions with a set of attributes measured on each one. This uses a portion of the 8-county data of the R exercise (see item just above).
Syracuse census tracts leukemia incidence
(shapefile, neighbours, zipped)
-
This tutorial gives an overview of spatial point-pattern analysis. This considers the distribution of one or more sets of points in some bounded region as the result of some stochastic process which produces a finite number of “events” or “occurrences”.
Optimal partitioning of soil transects with R
Applies the split moving window approach to optimally partition linear series of observations to find “natural” boundaries.
Spatial sampling¶
-
The spcosa R package is described in Walvoort, D. J. J., Brus, D. J., & de Gruijter, J. J. (2010). An R package for spatial coverage sampling and random sampling from compact geographical strata by k-means. Computers & Geosciences, 36(10), 1261–1267. https://doi.org/10.1016/j.cageo.2010.04.005
The aim is to place the sample locations so that they cover the study area as uniformly as possible. In a rectangular area grid sampling will accomplish this, but most study areas are not rectangles. Further, this package accounts for two-stage sampling: we can use the existing points and add new ones in the optimal locations.
datasets used in this tutorial
(compressed)
General statistical methods¶
-
A systematic analysis of a simple dataset: the Mercer & Hall wheat yield uniformity trial: exploratory graphics, descriptive statistics, univariate & bivariate modelling, bootstrapping, robust methods, multivariate modelling, principal components analysis, model evaluation, cross-validation, spatial analysis, spatial structure, generalized least squares, geographically-weighted regression, clustering, periodicity (spectral analysis).
An example of statistical data analysis using the R environment for statistical computing
This tutorial presents a data analysis sequence which may be applied to environmental datasets, using a small but typical data set of multivariate point observations: 147 soil profile observations representative of the humid forest region of southwestern Cameroon . It is aimed at students in geo-information application fields who have some experience with basic statistics, but not necessarily with statistical computing. Five aspects are emphasised:
Placing statistical analysis in the framework of research questions;
Moving from simple to complex methods: first exploration, then selection of promising modelling approaches;
Visualising as well as computing;
Making correct inferences;
Statistical computation and visualization.
point dataset
(CSV)R code
(compressed)
Analyzing land cover change with logistic regression in R
Land cover change at 1 064 grid cells from the Chapare region of Cochabamba province, Bolivia
point dataset
(CSV)R code
(compressed)
Curve fitting with the R Environment for Statistical Computing
R gives the user more insight and control than provided by “push the button” programs such as CurveFit.
R Markdown tutorials¶
-
Describes and illustrates measures of model success, i.e., its presumed predictive accuracy. This is sometimes called model “validation”. Defines and illustrates RMSE, variance explained against the 1:1 line, bias, gain, Lin’s concordance, Taylor diagrams, Gauch decomposition.
Explaining the concept of variance
Explains the concept of variance and how it is affected by classification.
-
Box and Cox (1964) developed a family of transformations designed to reduce non-normality of the errors in a linear model. Applying this transform often reduces non-linearity and heteroscedascity as well.
Explaining Ordinary Least Squares (OLS) regression with R
Shows how linear models are computed by ordinary least squares (OLS) and by a robust regression variant of OLS.
Demonstrating Generalized Least Squares regression
GLS accounts for autocorrelation in the linear model residuals. This example is of spatial autocorrelation, using the
Mercer & Hall wheat yield dataset
(CSV)Meuse heavy metals exercise — CART and random forests
Illustrates the use of regression trees, classification trees, Cubist, k-nearest neighbours, and spatially-explicit random forests
-
Shows how
rpart
decides on splits when building a regression tree. Variability of Regression Trees
Demonstrates the sensivity of regression trees built by
rpart
to small changes in the training dataset, using the Meuse heavy metals dataset.Finding the proper complexity parameter for a Regression Tree
Shows how cross-validation is used to assess the proper complexity parameter for a regression tree. This is the graph shown by
printcp
, based on cross-validation computations inrpart
.Comparing Random Forest packages
Compares the
randomForest
andranger
packages.Nonlinear Principal Components Analysis, a.k.a. Multivariate Analysis with Optimal Scaling
Practice with the “Gifi” approach of de Leeuw and colleagues. “The data analytic approach does not start with a model, but looks for transformations and combinations of [categorical] variables with the explicit purpose of representing the data in a simple and comprehensive, and usually graphical, way.”
example dataset
(RData)
Regional mapping¶
Regional mapping of climate variables from point samples (PDF)
Various methods for regional mapping of climate variables from station information using as predictors coördinates (Northing, Easting) and elevation, as well as the local neighbourhood: Ordinary Least Squares trend; Generalized Least Squares trend; Regression Kriging; Kriging with External Drift; Generalized Additive Models trend; Geographically-weighted regression; Data-driven methods: Regression trees, Random Forests, Cubist Thin-plate splines; Local interpolators: Ordinary kriging, inverse-distance, Thiessen polygons. Uses the
ggplot2
,nlme
,rgdal
,sp
,gstat
,rpart
,randomForest
,ranger
,Cubist
,caret
,raster
,plotKML
andfields
R packages.Datasets used in this tutorial:
weather station climate summaries
(shapefiles, zipped, 2.5 Mb)documentation of U.S. Climate Normals 1971-2000 products (PDF)
48-states SRTM DEM
(ESRI ASCII grid, zipped, 22.3 Mb)4-state bounding box 4km resolution DEM (NJ, NY, PA, VT)
(RData)USA state boundaries
(shapefile, zipped, 3.2 Mb)
Additional covariates
The above exercise uses only three regional predictors as well as the local neighbourhood. Additional covariates might have a relation to regional or local climate, and thus might improve the regional mapping. These are: distance to the Great Lakes shoreline, distance to the Atlantic Ocean coast, two terrain indices (Multi-resolution Valley-Bottom Flatness MRVBF and Terrain Ruggedness Index TRI), and population density within two radii around stations or prediction points (nominally 2.5’ and 15’).
Regional mapping of climate variables from point samples: Data preparation (PDF)
-
Lakes Erie & Ontario
(OGC Geopackage)Atlantic Ocean coastline, NE USA
(OGC Geopackage)multiresolution index of valley bottom flatness (MRVBF)
(SAGA raster)Terrain Ruggedness Index
(SAGA raster)World population density (15' resolution)
(GeoTiff, 1 Mb)World population density (2.5' resolution)
(GeoTiff, 23.6 Mb)
Using the additional covariates
(R Markdown source)Prepared dataset of additional covariates
(RData, 2.5 Mb)
-
Regional mapping of climate variables from point samples: Northeast China
This applies many of the methods of the above tutorial to a region in China.
In-class exercises Regional mapping of climate variables (R Markdown sources):
Setting up the dataset
Procedure to set up the dataset
(R Markdown source)Administrative boundaries of China
(Geopackage, 72.6 Mb)
Procedure to set up a prediction grid
(R Markdown source)World SRTM 1km resolution DEM
(ESRI grid, compressed, 195.2 Mb)
Additional covariates
The above exercise uses only three regional predictors as well as the local neighbourhood. Additional covariates might have a relation to regional or local climate, and thus might improve the regional mapping. These are: one terrain index (Multi-resolution Valley-Bottom Flatness MRVBF) and population density within two radii around stations or prediction points (nominally 2.5’ and 15’).
Setting up the additional covariates
(R Markdown source)Multiresolution index of valley bottom flatness (MRVBF)
(SAGA raster)World population density (15' resolution)
(GeoTiff, 1 Mb)World population density (2.5' resolution)
(GeoTiff, 23.6 Mb)
Using the additional covariates
(R Markdown source)
Prepared dataset of additional covariates
(RData, 6 Mb)
Geographic information systems (GIS)¶
-
Water pollution hazard for a drinking water reservoir in Tompkins County, NY (USA), using QGIS 3.4
Creating geometrically-correct photo-interpretations, photomosaics, and base maps for a project GIS
Construcción de modelos digitales de terreno para la evaluación de tierras
R Markdown tutorials¶
Using Google Earth Engine with R
The rgee package provides an interface from R to Google Earth Engine (GEE). This tutorial leads you through its installation and basic use.
For an example use of
rgee
see Using SoilGrids and Google Earth Engine with R to identify “homogeneous” soil zones
Time-series analysis¶
-
tutorial datasets
(compressed)
Spatio-temporal geostatistics, includes a section on temporal structure
Fitting rational functions to time series in R
Rational functions are ratios of any two polynomials in a single variable. In this example we fit linear/quadratic rational functions to an irregular time-series of proportional changes from an initial condition of a soil property in response to land use. The fitted equation can be interpreted to find the time to reach a maximum proportional deviation, and the value of that deviation.
tutorial dataset
(CSV)
Last modified 16-August-2024