The spcosa R package is described in Walvoort, D. J. J., Brus, D. J., & de Gruijter, J. J. (2010). An R package for spatial coverage sampling and random sampling from compact geographical strata by k-means. Computers & Geosciences, 36(10), 1261–1267. https://doi.org/10.1016/j.cageo.2010.04.005

The aim is to place the sample locations so that they cover the study area as uniformly as possible. In a rectangular area grid sampling will accomplish this, but most study areas are not rectangles. Further, this package accounts for two-stage sampling: we can use the existing points and add new ones in the optimal locations.

The sampling units are obtained by minimising a criterion that is defined in terms of the geographic distances between the nodes of a fine discretisation grid covering the polygon(s) within which to sample and the sampling units. The criterion is to minimise the mean of the squared distances of the grid nodes to their nearest sampling unit (mean squared shortest distance, MSSD):

\[ MSSD=\frac{1}{N}\sum_{k=1}^{N}\min_{j}\left(D_{kj}^{2}\right) \;, \] where \(N\) is the total number of nodes of the discretisation grid and \(D_{kj}\) is the distance between the \(k\)th grid node and the \(j\)th sampling point. This distance measure can be minimised by the k-means algorithm.

This is a numerical, iterative procedure, as follows:

Repeat until at step (3) points are not moved:

place the required number of points \(k\) at random;
compute the clusters represented by these points with the k-means algorithm;
move the points to the centroids of their respective clusters.

See also 17.2 “Spatial coverage sampling”¹ in Brus, D. J. (2022). Spatial sampling with R. https://dickbrus.github.io/SpatialSamplingwithR/ and the vignette for the spcosa package²

1 Setup

Load the required packages:

require(sf, quietly = TRUE)        # Simple Features data structures
require(units, quietly = TRUE)
require(sp, quietly = TRUE)        # spcosa uses `sp` data structures
require(rJava, quietly = TRUE)     # required by spcosa
require(spcosa, quietly = TRUE)    # the "spatial coverage sampling" package
require(ggplot2, quietly = TRUE)   # graphics

2 Import a study area polygon

Here is the boundary of New York State (USA) as a shapefile, extracted from the US Census ³:

Read it as an sf object with the st_read function.

ny.poly <- st_read("./ds_spcosa/NYState_500k/NYState_500k.shp")

## Reading layer `NYState_500k' from data source 
##   `/Users/rossiter/Documents/data/edu/Geostatistics_tutorials/Geostats_tutorial_SpatialSampling/ds_spcosa/NYState_500k/NYState_500k.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -79.76215 ymin: 40.4961 xmax: -71.85621 ymax: 45.01585
## Geodetic CRS:  NAD83

class(ny.poly)

## [1] "sf"         "data.frame"

st_crs(ny.poly)$proj4string

## [1] "+proj=longlat +datum=NAD83 +no_defs"

st_crs(ny.poly)$epsg

## [1] 4269

st_bbox(ny.poly)

##      xmin      ymin      xmax      ymax 
## -79.76215  40.49610 -71.85621  45.01585

names(ny.poly)

##  [1] "STATEFP"  "STATENS"  "AFFGEOID" "GEOID"    "STUSPS"   "NAME"    
##  [7] "LSAD"     "ALAND"    "AWATER"   "geometry"

The CRS is WGS84 geographic coordinates. Convert to an appropriate metric projection CRS, for example NAD83 / New York Central:⁴

ny.poly <- st_transform(ny.poly, 32116)
st_crs(ny.poly)$proj4string

## [1] "+proj=tmerc +lat_0=40 +lon_0=-76.5833333333333 +k=0.9999375 +x_0=250000 +y_0=0 +datum=NAD83 +units=m +no_defs"

st_crs(ny.poly)$epsg

## [1] 32116

st_bbox(ny.poly)

##      xmin      ymin      xmax      ymax 
## -13304.82  57707.28 647329.76 561695.79

st_area(ny.poly)

## 127053762207 [m^2]

The area of the State is 127054 \(km^2\).

Display the multipolygon:

plot(ny.poly["NAME"], main = "NY State", col="lightgray", 
     graticule = st_crs(4326), axes = TRUE)

Convert this polygon from sf to sp for use with spcosa:

ny.poly.sp <- as(ny.poly, "Spatial")
class(ny.poly.sp)

## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

3 Spatial coverage sampling

Now we are ready to divide this polygon into strata. An important parameter is nTry: “the stratify method will try nTry initial configurations [for the k-means clustering] and will keep the best solution in order to reduce the risk of ending up with an unfavorable solution.”

There are 62 counties in NY, so let’s use that number to get the most geographically-compact “counties”:

ny.strata <- stratify(ny.poly.sp, nStrata = 62, nTry = 24)
plot(ny.strata) +
    scale_x_continuous(name = "Easting (km)") +
    scale_y_continuous(name = "Northing (km)") +
    labs(title = "62 compact strata")

class(ny.strata)

## [1] "CompactStratification"
## attr(,"package")
## [1] "spcosa"

slot(ny.strata@centroids, "proj4string")

## Coordinate Reference System:
## Deprecated Proj.4 representation: NA

Note that the stratification has no CRS, even though the source polygon does. However the centroids of a CompactStratification do have a slot to store a CRS, so we make it compatible with the polygon:

slot(ny.strata@centroids, "proj4string") <- slot(ny.poly.sp, "proj4string")

Examine the structure of the stratification object:

str(ny.strata)

## Formal class 'CompactStratification' [package "spcosa"] with 4 slots
##   ..@ cells    :Formal class 'SpatialPixels' [package "sp"] with 5 slots
##   .. .. ..@ grid       :Formal class 'GridTopology' [package "sp"] with 3 slots
##   .. .. .. .. ..@ cellcentre.offset: Named num [1:2] -9128 63286
##   .. .. .. .. .. ..- attr(*, "names")= chr [1:2] "x1" "x2"
##   .. .. .. .. ..@ cellsize         : Named num [1:2] 7129 7129
##   .. .. .. .. .. ..- attr(*, "names")= chr [1:2] "x1" "x2"
##   .. .. .. .. ..@ cells.dim        : Named int [1:2] 92 70
##   .. .. .. .. .. ..- attr(*, "names")= chr [1:2] "x1" "x2"
##   .. .. ..@ grid.index : int [1:2508] 6414 6322 6323 6324 6325 6326 6327 6328 6329 6232 ...
##   .. .. ..@ coords     : num [1:2508, 1:2] 454266 454266 461395 468524 475654 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : NULL
##   .. .. .. .. ..$ : chr [1:2] "x1" "x2"
##   .. .. ..@ bbox       : num [1:2, 1:2] -12693 59722 643188 558762
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "x1" "x2"
##   .. .. .. .. ..$ : chr [1:2] "min" "max"
##   .. .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. .. .. ..@ projargs: chr "+proj=tmerc +lat_0=40 +lon_0=-76.5833333333333 +k=0.9999375 +x_0=250000 +y_0=0 +datum=NAD83 +units=m +no_defs"
##   .. .. .. .. ..$ comment: chr "PROJCRS[\"NAD83 / New York Central\",\n    BASEGEOGCRS[\"NAD83\",\n        DATUM[\"North American Datum 1983\","| __truncated__
##   ..@ stratumId: int [1:2508] 32 32 32 32 32 32 32 32 41 32 ...
##   ..@ centroids:Formal class 'SpatialPoints' [package "sp"] with 3 slots
##   .. .. ..@ coords     : num [1:62, 1:2] 137247 386452 321308 443018 423967 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : NULL
##   .. .. .. .. ..$ : chr [1:2] "x1" "x2"
##   .. .. ..@ bbox       : num [1:2, 1:2] 13718 84674 601941 541533
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "x1" "x2"
##   .. .. .. .. ..$ : chr [1:2] "min" "max"
##   .. .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. .. .. ..@ projargs: chr "+proj=tmerc +lat_0=40 +lon_0=-76.5833333333333 +k=0.9999375 +x_0=250000 +y_0=0 +datum=NAD83 +units=m +no_defs"
##   .. .. .. .. ..$ comment: chr "PROJCRS[\"NAD83 / New York Central\",\n    BASEGEOGCRS[\"NAD83\",\n        DATUM[\"North American Datum 1983\","| __truncated__
##   ..@ mssd     : num 3.56e+08

Notice the automatically-computed cellsize 7129.140388 and dimension of the discretization grid. The total number of cells 2508 is approximately 2500, the default nGridCells optional argument to the stratify function. See below for how to change the resolution of the discretization grid.

We may choose to have any number of sampling points per stratum.

First, one point per “county”, at its centroid.

ny.sample.1 <- spsample(ny.strata)
plot(ny.strata, ny.sample.1) +
    scale_x_continuous(name = "Easting (km)") +
    scale_y_continuous(name = "Northing (km)") +
    labs(title = "62 compact strata, sample at centroids")

3.1 Controlling the discretization

The stratify function works with a fine discretization of the polygon. This defaults to 2500 grid cells; however this can be controlled with the nGridCells optional parameter. Or, the cell size can be set with the cellSize optional parameter.

Let’s see how the stratification changes with a finer grid.

system.time( {
  ny.strata.fine <- stratify(ny.poly.sp, nStrata = 62, nTry = 24, nGridCells = 10000)
  slot(ny.strata.fine@centroids, "proj4string") <- 
    slot(ny.poly.sp, "proj4string")
  ny.sample.1.fine <- spsample(ny.strata.fine)
})

##    user  system elapsed 
##  10.206   0.102   9.412

plot(ny.strata.fine, ny.sample.1.fine ) +
    scale_x_continuous(name = "Easting (km)") +
    scale_y_continuous(name = "Northing (km)") +
    labs(title = "62 compact strata (fine discretization)")

We see more detailed boundaries, which are more likely to be oblique.

4 Stratified random sampling

We can use the stratification and select several random points per stratum. This would be applicable as a stratified random sampling plan to estimate population parameters for the State.

For example, five points per stratum (so, 310 points total):

ny.sample.5 <- spsample(ny.strata.fine, n = 5)
plot(ny.strata.fine, ny.sample.5) +
    scale_x_continuous(name = "Easting (km)") +
    scale_y_continuous(name = "Northing (km)") +
    labs(title = "62 compact strata, 5 random points per stratum")

5 Infill sampling

Here we locate a set of new weather stations as infill to a set of existing stations, from the Northeast Regional Climate Network. These have been prepared in the same CRS as the NY State boundary polygon.

load("./ds_spcosa/ny_stations_sf.RData", verbose = TRUE)

## Loading objects:
##   ny.pts

dim(ny.pts)

## [1] 133   2

class(ny.pts)

## [1] "sf"         "data.frame"

head(ny.pts)

## Simple feature collection with 6 features and 1 field
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 70778.15 ymin: 235473.1 xmax: 477866.6 ymax: 364386.5
## Projected CRS: NAD83 / New York Central
##                 STN_NAME                  geometry
## 3014             ADDISON POINT (196527.2 235624.5)
## 3015      ALBANY INTL AP POINT (477866.6 309157.7)
## 3016         ALBION 2 NE POINT (121196.6 364386.5)
## 3017          ALCOVE DAM   POINT (468199 277709.9)
## 3018              ALFRED   POINT (151281 252776.1)
## 3019 ALLEGANY STATE PARK POINT (70778.15 235473.1)

st_crs(ny.pts)$epsg

## [1] 32116

st_crs(ny.pts)$proj4string

## [1] "+proj=tmerc +lat_0=40 +lon_0=-76.5833333333333 +k=0.9999375 +x_0=250000 +y_0=0 +datum=NAD83 +units=m +no_defs"

st_bbox(ny.pts)

##       xmin       ymin       xmax       ymax 
##   2898.118  69879.047 610676.520 549063.844

ggplot() +
  geom_sf(data = ny.poly) +
  geom_sf(data = ny.pts) +
  labs(title = "NY State weather stations")

There are 133 points. Suppose we have a budget for 196 weather stations – where should we locate the 63 new stations?

First stratify, then identify the centroids. We also show how to specify a cell size with the cellSize optional parameter. Here we select a 5 x 5 km grid cell; this results in about 5082 grid cells covering the State, which is about twice the default nGridCells (2500).

(n.new <- 196-dim(ny.pts)[1])

## [1] 63

ny.strata.new <- stratify(ny.poly.sp, 
                       # priorPoints must be of class sp::SpatialPoints*
                       priorPoints = as(ny.pts, "Spatial"),
                       nStrata = 196, nTry = 24,
                       cellSize = c(5000, 5000))
ny.pts.new <- spsample(ny.strata.new)
str(ny.pts.new)

## Formal class 'SamplingPatternPriorPoints' [package "spcosa"] with 2 slots
##   ..@ isPriorPoint: logi [1:196] TRUE TRUE TRUE TRUE TRUE TRUE ...
##   ..@ sample      :Formal class 'SpatialPoints' [package "sp"] with 3 slots
##   .. .. ..@ coords     : num [1:196, 1:2] 196527 477867 121197 468199 151281 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : NULL
##   .. .. .. .. ..$ : chr [1:2] "x1" "x2"
##   .. .. ..@ bbox       : num [1:2, 1:2] 2898 69879 610677 549064
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "x1" "x2"
##   .. .. .. .. ..$ : chr [1:2] "min" "max"
##   .. .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. .. .. ..@ projargs: chr NA

Plot with two different symbols:

plot(ny.strata.new, ny.pts.new) +
  scale_x_continuous(name = "Easting (km)") +
  scale_y_continuous(name = "Northing (km)") +
  labs(title = "196 compact strata, stations at centroid")

6 Exporting sample points

The selected locations can be exported for use in other spatial programs or GIS. The simplest format is as a two-column CSV file. We first convert to a data.frame and then export this.

We know the location of the existing points, so we only export the locations of the additional points. For ease of location in the field, we convert from the NY State metric system to WGS84 geographic coordinates.

# indices of the new points
which.infill <- which(ny.pts.new@isPriorPoint == FALSE)
# only select the rows which represent new points
ny.infill.points.df <- 
  as(ny.pts.new, "data.frame")[which.infill,]
# how many new points?
dim(ny.infill.points.df)

## [1] 67  2

# convert to sf to change CRS
ny.infill.pts.sf <- st_as_sf(ny.infill.points.df, coords = c("x1", "x2"))
head(ny.infill.pts.sf)

## Simple feature collection with 6 features and 0 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 313503.3 ymin: 177977.4 xmax: 449999.1 ymax: 401309.2
## CRS:           NA
##                    geometry
## 1   POINT (394775 177977.4)
## 2 POINT (449999.1 320880.6)
## 3   POINT (406804.6 225464)
## 4 POINT (321959.4 401309.2)
## 5   POINT (313503.3 290496)
## 6 POINT (366387.9 220880.6)

st_crs(ny.infill.pts.sf) <- st_crs(ny.poly)
ny.infill.pts.sf <- st_transform(ny.infill.pts.sf, 4326)
head(ny.infill.pts.sf)

## Simple feature collection with 6 features and 0 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.80933 ymin: 41.58966 xmax: -74.13592 ymax: 43.60988
## Geodetic CRS:  WGS 84
##                     geometry
## 1 POINT (-74.84695 41.58966)
## 2  POINT (-74.13592 42.8632)
## 3  POINT (-74.6902 42.01473)
## 4 POINT (-75.69192 43.60988)
## 5 POINT (-75.80933 42.61322)
## 6  POINT (-75.1789 41.98048)

# convert back to data.frame for export
ny.infill.pts.df <- as.data.frame(st_coordinates(ny.infill.pts.sf))
# round to 3 decimal degrees, approx. 100 m; adequate for weather stations
names(ny.infill.pts.df) <- c("Long", "Lat")
ny.infill.pts.df <- round(ny.infill.pts.df, 3)
head(ny.infill.pts.df)

##      Long    Lat
## 1 -74.847 41.590
## 2 -74.136 42.863
## 3 -74.690 42.015
## 4 -75.692 43.610
## 5 -75.809 42.613
## 6 -75.179 41.980

write.csv(ny.infill.pts.df, file="./ds_spcosa/InfillSamplePts.csv")

There are 67 new points.

Spatial Coverage Sampling

D G Rossiter

13-April-2022