This tutorial shows the main features of the S language as implemented in the R Environment for Statistical Computing (https://www.r-project.org/).
R is fully described in: Venables, W N, D M Smith, and R Development Core Team. 2019. An Introduction to R; Notes on R: A Programming Environment for Data Analysis and Graphics. Version 3.6.2 (2019-12-12). Vienna: R Foundation for Statistical Computing. https://cran.r-project.org/manuals.html.
A simple introduction to R for statistical analysis is: Dalgaard, Peter. 2008. Introductory Statistics with R. Second. Springer. http://link.springer.com/book/10.1007%2F978-0-387-79054-1.
You should work through the code examples. At the end of each section is an exercise. Please create a new R Markdown document and complete these exercises; you may want to use some of the code from this document.
S code can be executed at the R console command line, from an R script, or from an R Markdown code chunk; in each case it can be used as a calculator.
The simplest use of S is to evaluate mathematical expressions.
2*pi/360
## [1] 0.01745329
The ratio of a circle’s area to its radius squared, \(\pi\), is a built-in constant pi.
S has the usual scalar arithmetic operators +, -, *, /, ^ and some less-common ones like %% (modulus) and %/% (integer division). Expressions are evaluated in accordance with the usual operator precedence; parentheses may be used to change the precedence or make it explicit.
3 / 2^2 + 2 * pi
## [1] 7.033185
((3 / 2)^2 + 2) * pi
## [1] 13.35177
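For example, the modulus and integer-division operators (a quick illustration, using values not in the original examples):
7 %% 3    # remainder
## [1] 1
7 %/% 3   # integer quotient
## [1] 2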
Exercise 1: Write an expression to compute the number of seconds in a 365-day year, and execute the expression.
Results of S expressions can be saved as objects in the R workspace. There are two (equivalent) assignment operators, <- and =; the left-hand side is the name of the workspace object, which is assigned the value of the expression on the right-hand side.
Object names are case-sensitive; they may contain letters, digits, periods (.) and underscores (_), but no spaces.
rad.deg <- 2*pi/360
Rad2Deg = 2*pi/360
By default the result of an assignment is not printed, but all of these give the same output:
(rad.deg <- 2*pi/360)
## [1] 0.01745329
rad.deg
## [1] 0.01745329
print(rad.deg)
## [1] 0.01745329
Workspace objects are listed with the ls
“list” function and deleted with the rm
“remove” function:
ls()
## [1] "rad.deg" "Rad2Deg"
rm(rad.deg)
ls()
## [1] "Rad2Deg"
If the remaining object were also removed, ls() would return character(0), meaning that the list of objects in the workspace is a zero-length character vector, i.e., null.
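A minimal demonstration (this assumes no other objects remain in your workspace; Rad2Deg is not needed again in this tutorial):
rm(Rad2Deg)
ls()
## character(0)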
Exercise 2: Define a workspace object which contains the number of seconds in a 365-day year, and display the result.
Most work in S is done with functions or methods. A function call is written as the function name followed by its arguments in parentheses.
Functions return results. These can be directly printed, assigned to workspace objects, or used directly as arguments to other functions or as parts of expressions.
For example, the c (“catenate”, i.e., “make a chain”) function builds a vector; these can be saved as workspace objects. The arguments of this function are the items to be joined into one vector:
(primes.lt.20 <- c(2, 3, 5, 7, 11, 13, 17))
## [1] 2 3 5 7 11 13 17
(colours <- c("indigo", "violet", "red", "orange", "yellow", "green", "blue"))
## [1] "indigo" "violet" "red" "orange" "yellow" "green" "blue"
Base R defines many common mathematical functions:
sum(primes.lt.20); min(primes.lt.20); max(primes.lt.20); median(primes.lt.20)
## [1] 58
## [1] 2
## [1] 17
## [1] 7
sin(pi/4); cos(pi/4); tan(pi/4); atan(1)
## [1] 0.7071068
## [1] 0.7071068
## [1] 1
## [1] 0.7853982
log(128); log2(128); exp(4.85203)
## [1] 4.85203
## [1] 7
## [1] 128
Note that multiple expressions can be written on one line, separated by ;
.
There are also many vector-manipulation functions:
rev(colours) # reverse the order of elements in a vector
## [1] "blue" "green" "yellow" "orange" "red" "violet" "indigo"
sort(colours) # sort in ascending order
## [1] "blue" "green" "indigo" "orange" "red" "violet" "yellow"
sort(colours, decreasing=TRUE) # sort in descending order
## [1] "yellow" "violet" "red" "orange" "indigo" "green" "blue"
Exercise 3: Find the function name for base-10 logarithms, and compute the base-10 logarithms of 10, 100, and 1000 (use the ?? help-search operator at the console).
The results of one function can be used as part of an expression:
sqrt(2)
## [1] 1.414214
sqrt(2)/2
## [1] 0.7071068
The results of one function can be used as an argument to another:
asin(sqrt(2)/2)
## [1] 0.7853982
rev(sort(colours)) # equivalent to `sort(colours, decreasing=TRUE)`
## [1] "yellow" "violet" "red" "orange" "indigo" "green" "blue"
An example of a function with required and optional arguments is rnorm
“random sample from a normal distribution”. The only required argument is the number of items to be returned.
rnorm(12)
## [1] -1.53450983 0.42906541 1.27942366 -1.54190547 -0.02263238 2.17587828
## [7] -1.14696705 0.90661266 1.07239549 1.24777960 -0.11737334 1.63321551
Optional arguments are the parameters of this distribution, i.e., its mean and standard deviation:
rnorm(n=12, mean=180)
## [1] 180.0752 179.9373 180.6208 179.9303 181.4353 180.0183 181.0947 180.9020
## [9] 181.3579 178.6629 181.7138 178.5944
rnorm(n=12, mean=180, sd=10)
## [1] 176.7439 182.9745 168.1655 176.9599 179.6355 166.3144 186.4413 179.3060
## [9] 162.9089 192.9902 188.9649 166.6695
How do we know all this? From the help system.
help(rnorm)
(You can also access the help for this function by entering ?rnorm at the console.)
This shows the function’s description, its arguments (with any default values), and the value it returns.
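A quick way to see just the argument list and defaults at the console is the args function (not used elsewhere in this tutorial):
args(rnorm)
## function (n, mean = 0, sd = 1) 
## NULL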
Exercise 4: What are the arguments of the rbinom
(random numbers following the binomial distribution) function? Are any default or must all be specified? What is the value returned?
Exercise 5: Display the vector of the number of successes in 24 trials with probability of success 0.2
(20%), this simulation carried out 128 times.
Since S is an expression language, the results of one function can be passed as an argument to another.
Here, the results of the rnorm
function are passed to the sort
function.
sort(rnorm(n=12, mean=180, sd=10), decreasing=TRUE)
## [1] 203.0544 191.6411 182.8871 182.3158 180.4643 179.9218 177.4824 176.4939
## [9] 173.5188 173.1279 172.4832 162.0565
Note the use here of the optional argument decreasing
of the sort
function.
An important feature of R Markdown is the ability to include results of computations in the text, using the syntax `r `
followed by any R expression, which can include any workspace object defined in previous code chunks.
For example, suppose we compute the sample mean of a simulated normal distribution, rounded to two decimal places:
print(my.mean <- round(mean(rnorm(n=12, mean=180, sd=10)), 2))
## [1] 178.12
We then could report it like this: the sample mean is 178.12. Notice how in the compiled document the in-line expression beginning with `r `
is replaced by its value from the computation.
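In the R Markdown source, the sentence above is written with an in-line expression; a minimal sketch of what that source line looks like (my.mean was defined in the previous chunk):
The sample mean is `r my.mean`.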
Exercise 6: Summarize the result of rbinom
(previous exercise) with the table
function. What is the range of results, i.e., the minimum and maximum values? Which is the most likely result? For these, write text which includes the computed results. This is necessary because the results change with each random sampling.
(Hint: convert the result of table
to a data.frame
; see ?table
; you may want to come back here after reading about data.frame
in a later section.)
A major feature of S is that most operations are vectorized, that is, they work element-wise on vectors.
Above we built a vector with the c
function:
(primes.lt.20 <- c(2, 3, 5, 7, 11, 13, 17))
## [1] 2 3 5 7 11 13 17
Elements can be selected from a vector with the []
matrix selection operator:
primes.lt.20[1]
## [1] 2
primes.lt.20[1:3]
## [1] 2 3 5
primes.lt.20[c(1,3,5)]
## [1] 2 5 11
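Selection also works with negative indices (which exclude the named positions) and with logical vectors; these forms are not shown above but are frequently used:
primes.lt.20[-1]                  # all but the first element
## [1]  3  5  7 11 13 17
primes.lt.20[primes.lt.20 > 10]   # elements satisfying a condition
## [1] 11 13 17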
For example, create a sequence of integers 1..10 and then add a normally-distributed error to it; the two vectors are added element-wise (in parallel):
(s <- seq(1, 10))
## [1] 1 2 3 4 5 6 7 8 9 10
(r <- rnorm(10, 0, 0.1))
## [1] 0.0523664118 0.0008837435 -0.1131221806 0.0720411113 0.1501412454
## [6] -0.0666157339 -0.0508666871 -0.0980716909 0.0314177778 0.0593004570
s[1] + r[1]
## [1] 1.052366
s + r
## [1] 1.052366 2.000884 2.886878 4.072041 5.150141 5.933384 6.949133
## [8] 7.901928 9.031418 10.059300
If one of the arguments is shorter than the other, it is recycled:
primes.lt.20/2
## [1] 1.0 1.5 2.5 3.5 5.5 6.5 8.5
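If the shorter vector’s length does not evenly divide the longer one’s, the values are still recycled but R issues a warning; a small illustration:
primes.lt.20 * c(1, 10)   # warns: longer object length is not a multiple of shorter object length
## [1]   2  30   5  70  11 130  17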
Many functions operate element-wise on vectors:
seq(0, 2*pi, by=pi/6)
## [1] 0.0000000 0.5235988 1.0471976 1.5707963 2.0943951 2.6179939 3.1415927
## [8] 3.6651914 4.1887902 4.7123890 5.2359878 5.7595865 6.2831853
round(sin(seq(0, 2*pi, by=pi/6)),4)
## [1] 0.000 0.500 0.866 1.000 0.866 0.500 0.000 -0.500 -0.866 -1.000
## [11] -0.866 -0.500 0.000
Others summarize a vector:
sum(primes.lt.20)
## [1] 58
Exercise 7: Create and display a vector representing latitudes in degrees from \(0^\circ\) (equator) to \(+90^\circ\) (north pole), in intervals of \(5^\circ\). Compute and display their cosines – recall, the trig functions in R expect arguments in radians. Find and display the maximum cosine.
R is built from a set of packages, several of which are loaded by default in all R installations and required for normal operation. The search
function shows the loaded packages and the order in which they are searched for function and object names:
search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
There are thousands of contributed packages; these must first be installed on each R system from a CRAN repository using the install.packages function, and then loaded into the search space with the library or require function. Installation only needs to be done once per system, but loading must be done once in each session. It is common practice to also load any packages on which the installed package depends.
Although packages are usually installed via the RStudio “Packages” tab, they can also be installed directly with install.packages
.
Here we install the MASS package, which implements statistical methods from the textbook: Venables, W N, and B D Ripley. 2002. Modern Applied Statistics with S. Fourth edition. New York: Springer-Verlag.
install.packages("MASS", dependencies=TRUE)
Once installed, a package must be loaded into the search path; then its functions are available for use.
library(MASS)
search()
## [1] ".GlobalEnv" "package:MASS" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
help(package=MASS)
Exercise 8: Check if the gstat
package is installed on your system. If not, install it. Load it into the workspace. Display its help and find the variogram
function. What is its description?
All objects in S have a class. This controls what can be done with the object. For example, it does not make sense to use arithmetic operations on character strings.
Some basic types are logical
(TRUE/FALSE), numeric
, and character
.
The class
function returns the data type of an object:
class(TRUE); class(316); class(3.1415926); class("blue")
## [1] "logical"
## [1] "numeric"
## [1] "numeric"
## [1] "character"
Note both integers and real numbers are class numeric
.
The type list
can combine any objects, each with its own class:
class(l <- list(1,FALSE,"red"))
## [1] "list"
Elements of a list are extracted with the [[]]
operator:
class(l[[2]])
## [1] "logical"
Functions are also a class:
class(sort)
## [1] "function"
The classes logical
, numeric
, and character
are all vectors with one or more elements.
class(c(TRUE, FALSE, TRUE))
## [1] "logical"
class(primes.lt.20)
## [1] "numeric"
class(colours)
## [1] "character"
Exercise 9: Display the classes of the built-in constant pi
and of the built-in constant letters
.
These are basic classes with some additional attributes.
Examples:
- an array is a vector with a dim (“dimensions”) attribute
- a matrix is a 2-D array
primes.lt.20
## [1] 2 3 5 7 11 13 17
dim(primes.lt.20)
## NULL
my.a <- array(primes.lt.20)
class(my.a)
## [1] "array"
dim(my.a)
## [1] 7
(my.matrix <- matrix(data=rep(primes.lt.20,2), nrow=length(primes.lt.20), byrow=FALSE))
## [,1] [,2]
## [1,] 2 2
## [2,] 3 3
## [3,] 5 5
## [4,] 7 7
## [5,] 11 11
## [6,] 13 13
## [7,] 17 17
class(my.matrix); dim(my.matrix)
## [1] "matrix" "array"
## [1] 7 2
my.matrix[,2] <- my.matrix[,2]^2
my.matrix
## [,1] [,2]
## [1,] 2 4
## [2,] 3 9
## [3,] 5 25
## [4,] 7 49
## [5,] 11 121
## [6,] 13 169
## [7,] 17 289
Certain operations and functions are restricted to work on certain classes. For example, a matrix
may be transposed with the t
“transpose” function:
t(my.matrix)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 2 3 5 7 11 13 17
## [2,] 4 9 25 49 121 169 289
Some functions automatically “promote” the class of their arguments, if possible:
class(t(primes.lt.20))
## [1] "matrix" "array"
Functions can define their own classes. For example, the rlm
“robust linear models” function of the MASS
package returns an object with several classes defined by both rlm
and the function lm
“linear models” on which it depends:
data(birthwt, package="MASS")
class(model <- rlm(low ~ ., birthwt))
## [1] "rlm" "lm"
Model formulas and built-in datasets will be explained below.
Exercise 10: What is the class of the object returned by the variogram
function? (Hint: see the heading “Value” in the help text.)
R has many example datasets in the datasets
package, which is loaded by default in all R installations. There are also datasets in most contributed packages. These are intended to show the features of various functions and methods.
(You can also load your own data, of course; see below.)
The data
function with the optional package
argument lists the datasets available in the named packages:
data(package="datasets")
data(package="ggplot2")
These can be loaded into the workspace with the data
function. If the package which provides the dataset is loaded, there is no need to specify it.
data(diamonds, package="ggplot2")
data("birthwt") # in MASS, this was loaded above, so no need to specify
A dataset (actually, any S object) can be summarized with the summary
function, which gives output appropriate to the object class. An object’s internal structure is revealed with the str
“structure” function:
is(diamonds)
## [1] "tbl_df"
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Exercise 11: List the datasets in the gstat
package.
Exercise 12: Load, summarize, and show the structure of the oxford
dataset.
The “data frame” is the most common class for statistical and database work. An object of class data.frame
is a matrix
with column (field) colnames
and (optional) row.names
.
Rows are generally observations or individuals (database “cases”) and columns are attributes (database “fields”).
(m.d <- as.data.frame(my.matrix))
## V1 V2
## 1 2 4
## 2 3 9
## 3 5 25
## 4 7 49
## 5 11 121
## 6 13 169
## 7 17 289
class(m.d)
## [1] "data.frame"
str(m.d)
## 'data.frame': 7 obs. of 2 variables:
## $ V1: num 2 3 5 7 11 13 17
## $ V2: num 4 9 25 49 121 169 289
colnames(m.d)
## [1] "V1" "V2"
colnames(m.d) <- c("primes", "primes.sq")
rownames(m.d)
## [1] "1" "2" "3" "4" "5" "6" "7"
rownames(m.d) <- letters[1:length(rownames(m.d))]
rownames(m.d)
## [1] "a" "b" "c" "d" "e" "f" "g"
str(m.d)
## 'data.frame': 7 obs. of 2 variables:
## $ primes : num 2 3 5 7 11 13 17
## $ primes.sq: num 4 9 25 49 121 169 289
There are a variety of ways to extract parts of a dataframe. We illustrate this with the trees
sample dataset.
The most common way to extract a field as a vector is with the $
operator:
data(trees)
str(trees)
## 'data.frame': 31 obs. of 3 variables:
## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...
# fields
trees$Volume
## [1] 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2 21.0 21.4 21.3 19.1
## [16] 22.2 33.8 27.4 25.7 24.9 34.5 31.7 36.3 38.3 42.6 55.4 55.7 58.3 51.5 51.0
## [31] 77.0
However, since a data frame is essentially a matrix with named rows and columns, the matrix selection operator [] can also be used:
trees[,1]
## [1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4 11.4 11.7 12.0
## [16] 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0 16.3 17.3 17.5 17.9 18.0 18.0
## [31] 20.6
trees[,"Volume"]
## [1] 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2 21.0 21.4 21.3 19.1
## [16] 22.2 33.8 27.4 25.7 24.9 34.5 31.7 36.3 38.3 42.6 55.4 55.7 58.3 51.5 51.0
## [31] 77.0
# cases (records, tuples)
trees[1,]
## Girth Height Volume
## 1 8.3 70 10.3
# subsets of cases and selected fields
trees[1,1]
## [1] 8.3
trees[1:3, c(1,3)]
## Girth Volume
## 1 8.3 10.3
## 2 8.6 10.3
## 3 8.8 10.2
Exercise 13: load the women
sample dataset. How many observations (cases) and how many attributes (fields) for each case? What are the column (field) and row names? What is the height of the first-listed woman?
R makes a strong distinction between continuous numeric variables and categorical variables. The latter are called R factors
and may be ordered (a natural ordering of class levels) or unordered.
is.factor(diamonds$color)
## [1] TRUE
is.ordered(diamonds$color)
## [1] TRUE
levels(diamonds$color)
## [1] "D" "E" "F" "G" "H" "I" "J"
is.factor(diamonds$carat)
## [1] FALSE
The identification as a factor or not has major implications for statistical models (see below).
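A factor can also be created directly from a character vector with the factor function; a minimal sketch (the levels and their ordering here are just an illustration, not taken from the diamonds dataset):
(ff <- factor(c("low", "high", "medium", "low"),
              levels=c("low", "medium", "high"), ordered=TRUE))
## [1] low    high   medium low   
## Levels: low < medium < high
is.ordered(ff)
## [1] TRUE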
Exercise 14: List the factors in the oxford
dataset.
S has a special NA
value for missing values. For example, in a data frame there may be no value for a certain field for one or more cases (observations). The missing value is handled correctly by functions.
Suppose there was no volume measurement for the first tree in the trees
dataset:
trees[1,]
## Girth Height Volume
## 1 8.3 70 10.3
trees[1,"Volume"] <- NA
trees[1,]
## Girth Height Volume
## 1 8.3 70 NA
summary(trees$Volume)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.20 19.75 24.55 30.83 37.80 77.00 1
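Many summary functions have an optional na.rm argument to ignore missing values; without it the NA propagates into the result:
mean(trees$Volume)
## [1] NA
mean(trees$Volume, na.rm=TRUE)
## [1] 30.83333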
This is different from “not a number”, NaN, which is the result of an undefined mathematical operation, and Inf, which results from dividing a non-zero number by zero. If these are used in further mathematical operations, the error will propagate.
(x <- sqrt(-1))
## [1] NaN
x^2
## [1] NaN
(x <- 100/0)
## [1] Inf
x+1
## [1] Inf
The usual logical S operators <
, >
, <=
, >=
, ==
, !=
return the logical values TRUE
and FALSE
. The which
function then selects the indices of the TRUE
elements of a vector.
trees$Height > 80
## [1] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE TRUE TRUE FALSE FALSE FALSE TRUE
(ix <- which(trees$Height > 80))
## [1] 5 6 17 18 26 27 31
dim(trees)
## [1] 31 3
(tall.trees <- trees[ix, ])
## Girth Height Volume
## 5 10.7 81 18.8
## 6 10.8 83 19.7
## 17 12.9 85 33.8
## 18 13.3 86 27.4
## 26 17.3 81 55.4
## 27 17.5 82 55.7
## 31 20.6 87 77.0
dim(tall.trees)
## [1] 7 3
Exercise 15: Identify the thin trees, defined as those with height/girth ratio more than 1 s.d. above the mean. You will have to define a new field in the dataframe with this ratio, and then use the mean
and sd
summary functions, along with a logical expression.
Here is an example of using logical expressions, along with simulation: “What are the chances that three people on an elevator with 13 buttons press three consecutive floors?”
mean(replicate(2^12, all(diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1)))
## [1] 0.02978516
Let’s break this down “inside-out”, but only using a small number of simulations:
sample.int(n=13, size=3, replace=TRUE) # three floors selected, can be repeats
## [1] 2 4 2
sort(sample.int(n=13, size=3, replace=TRUE)) # same, but in order
## [1] 3 7 8
diff(sort(sample.int(n=13, size=3, replace=TRUE))) # delta between each choice
## [1] 3 2
diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1 # are they consecutive 1 by 1?
## [1] FALSE FALSE
all(diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1) # are they all consecutive?
## [1] FALSE
replicate(12, all(diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1)) # repeat the test many times
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
mean(replicate(2^12, all(diff(sort(sample.int(n=13, size=3, replace=TRUE))) == 1))) # what is the average of the replicated tests?
## [1] 0.02954102
Challenge: run this several times and see how variable the simulation result is.
R has several graphics systems, beginning with base graphics which are installed by default. The plot
method is generic, and changes its behaviour according to its arguments.
The simplest plot is the scatterplot of two variables, which uses the ~
formula operator (see ‘Statistical Models’, below) to symbolize “y-axis” vs. “x-axis”.
plot(Height ~ Girth, data=trees)
# plot(trees$Height ~ trees$Girth)
Base graphics also has functions for histograms, boxplots, dot charts, and conditioning plots. See help(package=graphics)
.
hist(trees$Volume); rug(trees$Volume)
boxplot(diamonds$carat)
dotchart(trees$Volume)
Another graphics system, ggplot2, is becoming ever more popular. It views a graphic as being built up in layers, starting with a call to the ggplot function to initialize the graphic, with each further layer added with the + operator.
For example, here is a histogram of the tree volumes with a “rug” plot to show the actual values:
library(ggplot2)
g.h <- ggplot(data=trees) +
geom_histogram(mapping = aes(x=Volume), binwidth = 5,
colour="blue") +
geom_rug(mapping = aes(x=Volume))
print(g.h)
See the ggplot2 documentation for more on ggplot2.
Exercise 16: Display a histogram of the diamond prices in the diamonds
dataset.
S specifies statistical models in symbolic form with model formulae, which are arguments to many statistical methods, for example lm
(linear models, in the base packages) and rlm
(robust linear models, in the MASS
package).
These have a left-hand side (the response variable), the ~ formula operator, read as “depends on” or “is modelled by”, and a right-hand side (the predictor variables).
The simplest use is simple linear regression with the lm function. For example, to model tree Volume as a linear function of its Height, the formula is Volume ~ Height.
model.vh <- lm(Volume ~ Height, data=trees)
# equivalent to: model <- lm(trees$Volume ~ trees$Height)
summary(model.vh)
##
## Call:
## lm(formula = Volume ~ Height, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.249 -9.845 -2.434 11.806 30.100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -82.5245 29.9844 -2.752 0.010267
## Height 1.4876 0.3922 3.793 0.000729
##
## Residual standard error: 13.48 on 28 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3395, Adjusted R-squared: 0.3159
## F-statistic: 14.39 on 1 and 28 DF, p-value: 0.0007292
Exercise 17: Write a model to predict tree height from tree girth. How much of the height can be predicted from the girth?
The right-hand side can include formula operators, which look like arithmetic operators but are interpreted in terms of the model structure:
- + additive effect
- * crossed effects (main effects and their interaction)
- / nested model
- - remove terms from the model
For arithmetic within model formulas, the I “identity” function must be used.
To illustrate the use of operators, here is a model of tree volume Volume
modelled by the square root of tree height sqrt(Height)
, interacting with the tree radius I(Girth/(2*pi))
, with no intercept (-1
):
model <- lm(Volume ~ sqrt(Height)*I(Girth/(2*pi)) - 1, data=trees)
summary(model)
##
## Call:
## lm(formula = Volume ~ sqrt(Height) * I(Girth/(2 * pi)) - 1, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9522 -1.6022 0.1426 2.0223 4.8618
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## sqrt(Height) -3.6253 0.3557 -10.191 9.44e-11
## I(Girth/(2 * pi)) -19.8235 8.0063 -2.476 0.0198
## sqrt(Height):I(Girth/(2 * pi)) 5.6003 0.8383 6.681 3.60e-07
##
## Residual standard error: 3.175 on 27 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9925, Adjusted R-squared: 0.9916
## F-statistic: 1188 on 3 and 27 DF, p-value: < 2.2e-16
This model may or may not make sense physically!
Exercise 18: Write a model to predict tree volume as a linear function of tree height and tree girth, with no interaction.
Variables in statistical models may be categorical, i.e., R factors
; these are specified the same way in models but interpreted differently. For example in the linear model lm
function, a factor is converted into contrasts, typically so-called “dummy” variables:
summary(m.c <- lm(price ~ carat, data=diamonds))
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16
## carat 7756.43 14.07 551.4 <2e-16
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
head(model.matrix(m.c))
## (Intercept) carat
## 1 1 0.23
## 2 1 0.21
## 3 1 0.23
## 4 1 0.29
## 5 1 0.31
## 6 1 0.24
summary(m.f <- lm(price ~ clarity - 1, data=diamonds))
##
## Call:
## lm(formula = price ~ clarity - 1, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4737 -2727 -1429 1262 16254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## clarityI1 3924.17 144.56 27.14 <2e-16
## claritySI2 5063.03 41.04 123.37 <2e-16
## claritySI1 3996.00 34.43 116.07 <2e-16
## clarityVS2 3924.99 35.54 110.43 <2e-16
## clarityVS1 3839.46 43.53 88.19 <2e-16
## clarityVVS2 3283.74 55.29 59.39 <2e-16
## clarityVVS1 2523.11 65.09 38.76 <2e-16
## clarityIF 2864.84 93.01 30.80 <2e-16
##
## Residual standard error: 3935 on 53932 degrees of freedom
## Multiple R-squared: 0.5066, Adjusted R-squared: 0.5066
## F-statistic: 6923 on 8 and 53932 DF, p-value: < 2.2e-16
head(model.matrix(m.f))
## clarityI1 claritySI2 claritySI1 clarityVS2 clarityVS1 clarityVVS2 clarityVVS1
## 1 0 1 0 0 0 0 0
## 2 0 0 1 0 0 0 0
## 3 0 0 0 0 1 0 0
## 4 0 0 0 1 0 0 0
## 5 0 1 0 0 0 0 0
## 6 0 0 0 0 0 1 0
## clarityIF
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Within an expression the ifelse
function can be used to choose between two outcomes, based on a logical condition.
For example, to select a plotting colour based on whether one or the other of two variables in a scatterplot is greater:
# two vectors of n = 128 random uniform variates
n <- 128
x <- runif(n); y <- runif(n)
plot(y ~ x, asp=1, col=ifelse(y > x, "red", "green"), pch=20)
abline(0, 1, lty=2); grid()
S has ALGOL-like control structures:
- if … else
- for
- while, repeat
- break, next
Note that vectorized functions or the apply family of functions can often be used where other programming languages would use for loops.
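The if … else and break constructs are not demonstrated in the examples below, so here is a minimal sketch: loop over the primes defined earlier and report the first one greater than 10.
for (p in primes.lt.20) {
  if (p > 10) {
    print(paste("First prime greater than 10:", p))
    break
  }
}
## [1] "First prime greater than 10: 11"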
Here is an example of the while
control structure. For some simulation we want to draw a sample from the normal distribution but make sure that there is an extreme value, so we repeat the sampling until we get what we want:
n <- 1
while (max(abs(sample <- rnorm(80))) < 3) {
print("No extreme")
n <- n + 1
}
## [1] "No extreme"
## [1] "No extreme"
print(paste("Sample required",n,"draws"))
## [1] "Sample required 3 draws"
The for
loop works as in other languages:
v <- vector(mode="numeric", length=11)
for (power in 0:10) {
v[power+1] <- 2^power
}
print(v)
## [1] 1 2 4 8 16 32 64 128 256 512 1024
But of course this is much easier with a vectorized operation:
0:10
## [1] 0 1 2 3 4 5 6 7 8 9 10
(v <- 2^(0:10))
## [1] 1 2 4 8 16 32 64 128 256 512 1024
The apply family of functions applies a function across either the rows or the columns of a matrix, including data frames.
For example, to find the maximum value of all the variables in a data frame we can use the sapply
“apply a function and simplify the result” function:
sapply(trees, FUN=max)
## Girth Height Volume
## 20.6 87.0 NA
This applies the built-in max function along the fields (columns) of the data frame; the NA for Volume reflects the missing value we introduced earlier.
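The apply function itself works on a matrix, with the MARGIN argument selecting rows (1) or columns (2). For example, the column maxima of the matrix built earlier:
apply(my.matrix, MARGIN=2, FUN=max)
## [1]  17 289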
The by
function applies a function to subsets defined by a factor (categorical variable):
by(diamonds[,c(1,3:5)], diamonds$cut, summary)
## diamonds$cut: Fair
## carat color clarity depth
## Min. :0.220 D:163 SI2 :466 Min. :43.00
## 1st Qu.:0.700 E:224 SI1 :408 1st Qu.:64.40
## Median :1.000 F:312 VS2 :261 Median :65.00
## Mean :1.046 G:314 I1 :210 Mean :64.04
## 3rd Qu.:1.200 H:303 VS1 :170 3rd Qu.:65.90
## Max. :5.010 I:175 VVS2 : 69 Max. :79.00
## J:119 (Other): 26
## ------------------------------------------------------------
## diamonds$cut: Good
## carat color clarity depth
## Min. :0.2300 D:662 SI1 :1560 Min. :54.30
## 1st Qu.:0.5000 E:933 SI2 :1081 1st Qu.:61.30
## Median :0.8200 F:909 VS2 : 978 Median :63.40
## Mean :0.8492 G:871 VS1 : 648 Mean :62.37
## 3rd Qu.:1.0100 H:702 VVS2 : 286 3rd Qu.:63.80
## Max. :3.0100 I:522 VVS1 : 186 Max. :67.00
## J:307 (Other): 167
## ------------------------------------------------------------
## diamonds$cut: Very Good
## carat color clarity depth
## Min. :0.2000 D:1513 SI1 :3240 Min. :56.80
## 1st Qu.:0.4100 E:2400 VS2 :2591 1st Qu.:60.90
## Median :0.7100 F:2164 SI2 :2100 Median :62.10
## Mean :0.8064 G:2299 VS1 :1775 Mean :61.82
## 3rd Qu.:1.0200 H:1824 VVS2 :1235 3rd Qu.:62.90
## Max. :4.0000 I:1204 VVS1 : 789 Max. :64.90
## J: 678 (Other): 352
## ------------------------------------------------------------
## diamonds$cut: Premium
## carat color clarity depth
## Min. :0.200 D:1603 SI1 :3575 Min. :58.00
## 1st Qu.:0.410 E:2337 VS2 :3357 1st Qu.:60.50
## Median :0.860 F:2331 SI2 :2949 Median :61.40
## Mean :0.892 G:2924 VS1 :1989 Mean :61.26
## 3rd Qu.:1.200 H:2360 VVS2 : 870 3rd Qu.:62.20
## Max. :4.010 I:1428 VVS1 : 616 Max. :63.00
## J: 808 (Other): 435
## ------------------------------------------------------------
## diamonds$cut: Ideal
## carat color clarity depth
## Min. :0.2000 D:2834 VS2 :5071 Min. :43.00
## 1st Qu.:0.3500 E:3903 SI1 :4282 1st Qu.:61.30
## Median :0.5400 F:3826 VS1 :3589 Median :61.80
## Mean :0.7028 G:4884 VVS2 :2606 Mean :61.71
## 3rd Qu.:1.0100 H:3115 SI2 :2598 3rd Qu.:62.20
## Max. :3.5000 I:2093 VVS1 :2047 Max. :66.70
## J: 896 (Other):1358
A function is defined with the function
function (!), specifying the arguments, body of the function, and the return values:
For example, here is a function to compute the geometric mean of a vector, \(\overline{v}_g = \left[ \prod_{i=1\ldots n} \; v_i \right]^{1/n}\):
hm <- function(v) {
return(exp(sum(log(v))/length(v)))
}
class(hm)
## [1] "function"
hm(1:99); mean(1:99)
## [1] 37.6231
## [1] 50
Of course, a function should check for valid inputs. This example shows the use of the if
, else if
, else
control structure, as well as the functions is.numeric
, any
, and length
, and the return
function to return a value from a function.
g.mean <- function(v) {
if (!is.numeric(v)) {
print("Argument must be numeric"); return(NULL)
}
else if (any(v <= 0)) {
print("All elements must be positive"); return(NULL)
}
else return(exp(sum(log(v))/length(v)))
}
g.mean(letters)
## [1] "Argument must be numeric"
## NULL
g.mean(c(-1, -2, 1, 2))
## [1] "All elements must be positive"
## NULL
g.mean(1:99)
## [1] 37.6231
Exercise 19: Write a function to restrict the values of a vector to the range \(0 \ldots 1\). Any values \(< 0\) should be replaced with \(0\), and any values \(>1\) should be replaced with \(1\). Test the function on a vector with elements from \(-1.2\) to \(+1.2\) in increments of \(0.1\) – see the seq
“sequence” function.
R can exchange data in any conceivable format.
Reference: “R Data Import/Export”, an R manual installed with R; available under the Help menu as R Help.
The most common interchange format for “flat” files is the Comma-Separated Values (CSV). Here we first download a sample CSV file and then import to R:
download.file(
url="https://data.ny.gov/api/views/vpv5-zd4k/rows.csv?accessType=DOWNLOAD",
destfile="enplanements.csv")
file.show("enplanements.csv") # examine in another window
airports <- read.csv("enplanements.csv")
str(airports)
## 'data.frame': 378 obs. of 4 variables:
## $ Airport.Group: chr "UPSTATE NON-PRIMARY" "UPSTATE HUB" "OTHER UPSTATE PRIMARY" "UPSTATE HUB" ...
## $ Airport.Name : chr "ADIRONDACK REGIONAL" "ALBANY INTERNATIONAL" "BINGHAMTON REGIONAL AIRPORT" "BUFFALO-NIAGARA INTERNATIONAL" ...
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Enplanements : int 5770 1216626 108172 2582597 3483 139698 152582 1190967 121733 23664832 ...
This file is described on its data.ny.gov page.
The more general read.table
function can read tables in many formats.
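For example, the CSV file downloaded above could equally be read with read.table by specifying the separator and header explicitly (read.csv is essentially read.table with CSV-friendly defaults); a minimal sketch:
# equivalent to read.csv("enplanements.csv") for this file
airports.2 <- read.table("enplanements.csv", header=TRUE, sep=",", quote="\"")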
In the past few years “an opinionated collection of R packages designed for data science” called the tidyverse
has been developed by Hadley Wickham and colleagues. The ggplot2
package is part of this; other important packages are dplyr
for data manipulation, readr
for data import, and stringr
for string manipulation.
The complete tidyverse can be installed with:
install.packages("tidyverse")
The dplyr
package provides a flexible grammar of data manipulation, including the “pipe” operator %>%, which passes the result of the expression on its left as the first argument of the function on its right, so pipelines read from left to right (unlike the conventional <- assignment operator).
library(dplyr)
The typical use is to do some data manipulation by filtering, as well as to summarize. Note that recent versions of read.csv leave text columns as character rather than converting them to factors, so levels() returns NULL here; unique(airports$Airport.Group) would list the group names.
levels(airports$Airport.Group)
## NULL
For example, to subset the airports to just the upstate hubs and summarize the enplanements by year:
airports %>% filter(Airport.Group=="UPSTATE HUB") %>%
group_by(Year) %>%
summarize(Enplanements=sum(Enplanements))
## # A tibble: 21 × 2
## Year Enplanements
## <int> <int>
## 1 1997 4890591
## 2 1998 5059778
## 3 1999 5283594
## 4 2000 5826243
## 5 2001 5736766
## 6 2002 5580027
## 7 2003 5632693
## 8 2004 6237753
## 9 2005 6643091
## 10 2006 6511005
## # … with 11 more rows
Another typical use is to define new variables, possibly as combinations of others. For example, load the Meuse River geochemical dataset, select only the most frequently-flooded areas, rename the columns for the heavy metals, compute new columns with their base-10 logarithms, compute a ratio, and finally save as a new object:
library(sp)
data(meuse)
names(meuse)
## [1] "x" "y" "cadmium" "copper" "lead" "zinc" "elev"
## [8] "dist" "om" "ffreq" "soil" "lime" "landuse" "dist.m"
meuse %>%
filter(ffreq==1) %>%
select(-ffreq) %>% # column is no longer needed, it is constant
rename(Cd=cadmium, Cu=copper, Pb=lead, Zn=zinc) %>%
mutate(lCd=log10(Cd), lCu=log10(Cu), lPb=log10(Pb), lZn=log10(Zn)) %>%
mutate(ratio.lCu.lPb=lCu/lPb) ->
meuse.new
meuse.new[1:6,]
## x y Cd Cu Pb Zn elev dist om soil lime landuse dist.m
## 1 181072 333611 11.7 85 299 1022 7.909 0.00135803 13.6 1 1 Ah 50
## 2 181025 333558 8.6 81 277 1141 6.983 0.01222430 14.0 1 1 Ah 30
## 3 181165 333537 6.5 68 199 640 7.800 0.10302900 13.0 1 1 Ah 150
## 4 181298 333484 2.6 81 116 257 7.655 0.19009400 8.0 2 0 Ga 270
## 5 181307 333330 2.8 48 117 269 7.480 0.27709000 8.7 2 0 Ah 380
## 6 181390 333260 3.0 61 137 281 7.791 0.36406700 7.8 2 0 Ga 470
## lCd lCu lPb lZn ratio.lCu.lPb
## 1 1.0681859 1.929419 2.475671 3.009451 0.7793519
## 2 0.9344985 1.908485 2.442480 3.057286 0.7813719
## 3 0.8129134 1.832509 2.298853 2.806180 0.7971405
## 4 0.4149733 1.908485 2.064458 2.409933 0.9244485
## 5 0.4471580 1.681241 2.068186 2.429752 0.8129063
## 6 0.4771213 1.785330 2.136721 2.448706 0.8355467
Note the use here of the right-hand assignment operator ->
to save the result into a new object (we could also overwrite the old object meuse
if we had no use for its original form).
Bonus Exercise : Use tidyverse
functions and pipes on the trees
dataset, to select the trees (use the filter
function) with a volume greater than the median volume (use the median
function), compute the ratio of girth to height as a new variable (use the mutate
function), and sort by this (use the arrange
function) from thin to thick trees.
names(trees)
## [1] "Girth" "Height" "Volume"
mv <- median(trees$Volume)
trees %>%
filter(Volume > median(Volume)) %>%
mutate(thickness=round(Girth/Height,3)) %>%
arrange(thickness)
## [1] Girth Height Volume thickness
## <0 rows> (or 0-length row.names)
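The result has zero rows because the first tree’s Volume was set to NA earlier in this tutorial, so median(Volume) evaluates to NA and the comparison excludes every row. A sketch of one fix is to reload the dataset (or add na.rm=TRUE to median):
data(trees)   # restore the original dataset, replacing the NA introduced above
trees %>%
   filter(Volume > median(Volume)) %>%
   mutate(thickness=round(Girth/Height,3)) %>%
   arrange(thickness)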