The spcosa
R package is described in Walvoort, D. J. J., Brus, D. J., & de Gruijter, J. J. (2010). An R package for spatial coverage sampling and random sampling from compact geographical strata by k-means. Computers & Geosciences, 36(10), 1261–1267. https://doi.org/10.1016/j.cageo.2010.04.005
The aim is to place the sample locations so that they cover the study area as uniformly as possible. In a rectangular area grid sampling will accomplish this, but most study areas are not rectangles. Further, this package accounts for two-stage sampling: we can use the existing points and add new ones in the optimal locations.
The sampling units are obtained by minimising a criterion that is defined in terms of the geographic distances between the nodes of a fine discretisation grid covering the polygon(s) within which to sample and the sampling units. The criterion is to minimise the mean of the squared distances of the grid nodes to their nearest sampling unit (mean squared shortest distance, MSSD):
\[ MSSD=\frac{1}{N}\sum_{k=1}^{N}\min_{j}\left(D_{kj}^{2}\right) \;, \] where \(N\) is the total number of nodes of the discretisation grid and \(D_{kj}\) is the distance between the \(k\)th grid node and the \(j\)th sampling point. This distance measure can be minimised by the k-means algorithm.
This is a numerical, iterative procedure, as follows:
Repeat until at step (3) points are not moved:
See also 17.2 “Spatial coverage sampling”1 in Brus, D. J. (2022). Spatial sampling with R. https://dickbrus.github.io/SpatialSamplingwithR/ and the vignette for the spcosa
package2
Load the required packages:
require(sf, quietly = TRUE) # Simple Features data structures
require(units, quietly = TRUE)
require(sp, quietly = TRUE) # spcosa uses `sp` data structures
require(rJava, quietly = TRUE) # required by spcosa
require(spcosa, quietly = TRUE) # the "spatial coverage sampling" package
require(ggplot2, quietly = TRUE) # graphics
Here is the boundary of New York State (USA) as a shapefile, extracted from the US Census 3:
Read it as an sf
object with the st_read
function.
ny.poly <- st_read("./ds_spcosa/NYState_500k/NYState_500k.shp")
## Reading layer `NYState_500k' from data source
## `/Users/rossiter/Documents/data/edu/Geostatistics_tutorials/Geostats_tutorial_SpatialSampling/ds_spcosa/NYState_500k/NYState_500k.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -79.76215 ymin: 40.4961 xmax: -71.85621 ymax: 45.01585
## Geodetic CRS: NAD83
class(ny.poly)
## [1] "sf" "data.frame"
st_crs(ny.poly)$proj4string
## [1] "+proj=longlat +datum=NAD83 +no_defs"
st_crs(ny.poly)$epsg
## [1] 4269
st_bbox(ny.poly)
## xmin ymin xmax ymax
## -79.76215 40.49610 -71.85621 45.01585
names(ny.poly)
## [1] "STATEFP" "STATENS" "AFFGEOID" "GEOID" "STUSPS" "NAME"
## [7] "LSAD" "ALAND" "AWATER" "geometry"
The CRS is WGS84 geographic coordinates. Convert to an appropriate metric projection CRS, for example NAD83 / New York Central:4
ny.poly <- st_transform(ny.poly, 32116)
st_crs(ny.poly)$proj4string
## [1] "+proj=tmerc +lat_0=40 +lon_0=-76.5833333333333 +k=0.9999375 +x_0=250000 +y_0=0 +datum=NAD83 +units=m +no_defs"
st_crs(ny.poly)$epsg
## [1] 32116
st_bbox(ny.poly)
## xmin ymin xmax ymax
## -13304.82 57707.28 647329.76 561695.79
st_area(ny.poly)
## 127053762207 [m^2]
The area of the State is 127054 \(km^2\).
Display the multipolygon:
plot(ny.poly["NAME"], main = "NY State", col="lightgray",
graticule = st_crs(4326), axes = TRUE)
Convert this polygon from sf
to sp
for use with spcosa
:
ny.poly.sp <- as(ny.poly, "Spatial")
class(ny.poly.sp)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
Now we are ready to divide this polygon into strata. An important parameter is nTry
: “the stratify method will try nTry
initial configurations [for the k-means clustering] and will keep the best solution in order to reduce the risk of ending up with an unfavorable solution.”
There are 62 counties in NY, so let’s use that number to get the most geographically-compact “counties”:
ny.strata <- stratify(ny.poly.sp, nStrata = 62, nTry = 24)
plot(ny.strata) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
labs(title = "62 compact strata")
class(ny.strata)
## [1] "CompactStratification"
## attr(,"package")
## [1] "spcosa"
slot(ny.strata@centroids, "proj4string")
## Coordinate Reference System:
## Deprecated Proj.4 representation: NA
Note that the stratification has no CRS, even though the source polygon does. However the centroids of a CompactStratification
do have a slot to store a CRS, so we make it compatible with the polygon:
slot(ny.strata@centroids, "proj4string") <- slot(ny.poly.sp, "proj4string")
Examine the structure of the stratification object:
str(ny.strata)
## Formal class 'CompactStratification' [package "spcosa"] with 4 slots
## ..@ cells :Formal class 'SpatialPixels' [package "sp"] with 5 slots
## .. .. ..@ grid :Formal class 'GridTopology' [package "sp"] with 3 slots
## .. .. .. .. ..@ cellcentre.offset: Named num [1:2] -9128 63286
## .. .. .. .. .. ..- attr(*, "names")= chr [1:2] "x1" "x2"
## .. .. .. .. ..@ cellsize : Named num [1:2] 7129 7129
## .. .. .. .. .. ..- attr(*, "names")= chr [1:2] "x1" "x2"
## .. .. .. .. ..@ cells.dim : Named int [1:2] 92 70
## .. .. .. .. .. ..- attr(*, "names")= chr [1:2] "x1" "x2"
## .. .. ..@ grid.index : int [1:2508] 6414 6322 6323 6324 6325 6326 6327 6328 6329 6232 ...
## .. .. ..@ coords : num [1:2508, 1:2] 454266 454266 461395 468524 475654 ...
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : NULL
## .. .. .. .. ..$ : chr [1:2] "x1" "x2"
## .. .. ..@ bbox : num [1:2, 1:2] -12693 59722 643188 558762
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "x1" "x2"
## .. .. .. .. ..$ : chr [1:2] "min" "max"
## .. .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
## .. .. .. .. ..@ projargs: chr "+proj=tmerc +lat_0=40 +lon_0=-76.5833333333333 +k=0.9999375 +x_0=250000 +y_0=0 +datum=NAD83 +units=m +no_defs"
## .. .. .. .. ..$ comment: chr "PROJCRS[\"NAD83 / New York Central\",\n BASEGEOGCRS[\"NAD83\",\n DATUM[\"North American Datum 1983\","| __truncated__
## ..@ stratumId: int [1:2508] 32 32 32 32 32 32 32 32 41 32 ...
## ..@ centroids:Formal class 'SpatialPoints' [package "sp"] with 3 slots
## .. .. ..@ coords : num [1:62, 1:2] 137247 386452 321308 443018 423967 ...
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : NULL
## .. .. .. .. ..$ : chr [1:2] "x1" "x2"
## .. .. ..@ bbox : num [1:2, 1:2] 13718 84674 601941 541533
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "x1" "x2"
## .. .. .. .. ..$ : chr [1:2] "min" "max"
## .. .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
## .. .. .. .. ..@ projargs: chr "+proj=tmerc +lat_0=40 +lon_0=-76.5833333333333 +k=0.9999375 +x_0=250000 +y_0=0 +datum=NAD83 +units=m +no_defs"
## .. .. .. .. ..$ comment: chr "PROJCRS[\"NAD83 / New York Central\",\n BASEGEOGCRS[\"NAD83\",\n DATUM[\"North American Datum 1983\","| __truncated__
## ..@ mssd : num 3.56e+08
Notice the automatically-computed cellsize 7129.140388 and dimension of the discretization grid. The total number of cells 2508 is approximately 2500, the default nGridCells
optional argument to the stratify
function. See below for how to change the resolution of the discretization grid.
We may choose to have any number of sampling points per stratum.
First, one point per “county”, at its centroid.
ny.sample.1 <- spsample(ny.strata)
plot(ny.strata, ny.sample.1) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
labs(title = "62 compact strata, sample at centroids")
The stratify
function works with a fine discretization of the polygon. This defaults to 2500 grid cells; however this can be controlled with the nGridCells
optional parameter. Or, the cell size can be set with the cellSize
optional parameter.
Let’s see how the stratification changes with a finer grid.
system.time( {
ny.strata.fine <- stratify(ny.poly.sp, nStrata = 62, nTry = 24, nGridCells = 10000)
slot(ny.strata.fine@centroids, "proj4string") <-
slot(ny.poly.sp, "proj4string")
ny.sample.1.fine <- spsample(ny.strata.fine)
})
## user system elapsed
## 10.206 0.102 9.412
plot(ny.strata.fine, ny.sample.1.fine ) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
labs(title = "62 compact strata (fine discretization)")
We see more detailed boundaries, which are more likely to be oblique.
We can use the stratification and select several random points per stratum. This would be applicable as a stratified random sampling plan to estimate population parameters for the State.
For example, five points per stratum (so, 310 points total):
ny.sample.5 <- spsample(ny.strata.fine, n = 5)
plot(ny.strata.fine, ny.sample.5) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
labs(title = "62 compact strata, 5 random points per stratum")
Here we locate a set of new weather stations as infill to a set of existing stations, from the Northeast Regional Climate Network. These have been prepared in the same CRS as the NY State boundary polygon.
load("./ds_spcosa/ny_stations_sf.RData", verbose = TRUE)
## Loading objects:
## ny.pts
dim(ny.pts)
## [1] 133 2
class(ny.pts)
## [1] "sf" "data.frame"
head(ny.pts)
## Simple feature collection with 6 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 70778.15 ymin: 235473.1 xmax: 477866.6 ymax: 364386.5
## Projected CRS: NAD83 / New York Central
## STN_NAME geometry
## 3014 ADDISON POINT (196527.2 235624.5)
## 3015 ALBANY INTL AP POINT (477866.6 309157.7)
## 3016 ALBION 2 NE POINT (121196.6 364386.5)
## 3017 ALCOVE DAM POINT (468199 277709.9)
## 3018 ALFRED POINT (151281 252776.1)
## 3019 ALLEGANY STATE PARK POINT (70778.15 235473.1)
st_crs(ny.pts)$epsg
## [1] 32116
st_crs(ny.pts)$proj4string
## [1] "+proj=tmerc +lat_0=40 +lon_0=-76.5833333333333 +k=0.9999375 +x_0=250000 +y_0=0 +datum=NAD83 +units=m +no_defs"
st_bbox(ny.pts)
## xmin ymin xmax ymax
## 2898.118 69879.047 610676.520 549063.844
ggplot() +
geom_sf(data = ny.poly) +
geom_sf(data = ny.pts) +
labs(title = "NY State weather stations")
There are 133 points. Suppose we have a budget for 196 weather stations – where should we locate the 63 new stations?
First stratify, then identify the centroids. We also show how to specify a cell size with the cellSize
optional parameter. Here we select a 5 x 5 km grid cell; this results in about 5082 grid cells covering the State, which is about twice the default nGridCells
(2500).
(n.new <- 196-dim(ny.pts)[1])
## [1] 63
ny.strata.new <- stratify(ny.poly.sp,
# priorPoints must be of class sp::SpatialPoints*
priorPoints = as(ny.pts, "Spatial"),
nStrata = 196, nTry = 24,
cellSize = c(5000, 5000))
ny.pts.new <- spsample(ny.strata.new)
str(ny.pts.new)
## Formal class 'SamplingPatternPriorPoints' [package "spcosa"] with 2 slots
## ..@ isPriorPoint: logi [1:196] TRUE TRUE TRUE TRUE TRUE TRUE ...
## ..@ sample :Formal class 'SpatialPoints' [package "sp"] with 3 slots
## .. .. ..@ coords : num [1:196, 1:2] 196527 477867 121197 468199 151281 ...
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : NULL
## .. .. .. .. ..$ : chr [1:2] "x1" "x2"
## .. .. ..@ bbox : num [1:2, 1:2] 2898 69879 610677 549064
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "x1" "x2"
## .. .. .. .. ..$ : chr [1:2] "min" "max"
## .. .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
## .. .. .. .. ..@ projargs: chr NA
Plot with two different symbols:
plot(ny.strata.new, ny.pts.new) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)") +
labs(title = "196 compact strata, stations at centroid")
The selected locations can be exported for use in other spatial programs or GIS. The simplest format is as a two-column CSV file. We first convert to a data.frame
and then export this.
We know the location of the existing points, so we only export the locations of the additional points. For ease of location in the field, we convert from the NY State metric system to WGS84 geographic coordinates.
# indices of the new points
which.infill <- which(ny.pts.new@isPriorPoint == FALSE)
# only select the rows which represent new points
ny.infill.points.df <-
as(ny.pts.new, "data.frame")[which.infill,]
# how many new points?
dim(ny.infill.points.df)
## [1] 67 2
# convert to sf to change CRS
ny.infill.pts.sf <- st_as_sf(ny.infill.points.df, coords = c("x1", "x2"))
head(ny.infill.pts.sf)
## Simple feature collection with 6 features and 0 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 313503.3 ymin: 177977.4 xmax: 449999.1 ymax: 401309.2
## CRS: NA
## geometry
## 1 POINT (394775 177977.4)
## 2 POINT (449999.1 320880.6)
## 3 POINT (406804.6 225464)
## 4 POINT (321959.4 401309.2)
## 5 POINT (313503.3 290496)
## 6 POINT (366387.9 220880.6)
st_crs(ny.infill.pts.sf) <- st_crs(ny.poly)
ny.infill.pts.sf <- st_transform(ny.infill.pts.sf, 4326)
head(ny.infill.pts.sf)
## Simple feature collection with 6 features and 0 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -75.80933 ymin: 41.58966 xmax: -74.13592 ymax: 43.60988
## Geodetic CRS: WGS 84
## geometry
## 1 POINT (-74.84695 41.58966)
## 2 POINT (-74.13592 42.8632)
## 3 POINT (-74.6902 42.01473)
## 4 POINT (-75.69192 43.60988)
## 5 POINT (-75.80933 42.61322)
## 6 POINT (-75.1789 41.98048)
# convert back to data.frame for export
ny.infill.pts.df <- as.data.frame(st_coordinates(ny.infill.pts.sf))
# round to 3 decimal degrees, approx. 100 m; adequate for weather stations
names(ny.infill.pts.df) <- c("Long", "Lat")
ny.infill.pts.df <- round(ny.infill.pts.df, 3)
head(ny.infill.pts.df)
## Long Lat
## 1 -74.847 41.590
## 2 -74.136 42.863
## 3 -74.690 42.015
## 4 -75.692 43.610
## 5 -75.809 42.613
## 6 -75.179 41.980
write.csv(ny.infill.pts.df, file="./ds_spcosa/InfillSamplePts.csv")
There are 67 new points.