Exercise 1: Write an expression to compute the number of seconds in a 365-day year, and execute the expression.
The number of seconds in a 365-day year:
365*24*60*60
## [1] 31536000
Exercise 2: Define a workspace object which contains the number of seconds in a 365-day year, and display the results.
A workspace object containing the number of seconds in a 365-day year, and its value:
(s.in.yr <- 365*24*60*60)
## [1] 31536000
Exercise 3: Find the function name for base-10 logarithms, and compute the base-10 logarithm of 10, 100, and 1000 (use the ?? function at the console to search).
??log10
?log10
The function name for base-10 logarithms, and the base-10 logarithm of 10, 100, and 1000:
log10(10); log10(100); log10(1000)
## [1] 1
## [1] 2
## [1] 3
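The general-purpose log function offers the same capability: it takes an optional base argument (defaulting to the natural base \(e\)), so base-10 logarithms can also be computed without log10. A small sketch:

```r
# log() with an explicit base gives the same results as log10()
log(c(10, 100, 1000), base = 10)
# [1] 1 2 3
```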
Exercise 4: What are the arguments of the rbinom (random numbers following the binomial distribution) function? Are any default or must all be specified? What is the value returned?
help(rbinom)
There are three arguments, all of which must be specified: n, the number of observations; size, the number of trials; and prob, the probability of success on each trial. The value returned is a vector of length n with the number of successes in each trial.
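This can be checked programmatically: formals() lists a function's arguments together with any default values, and an empty entry means the argument has no default. A quick sketch:

```r
# list rbinom's arguments; the empty values show that none has a default
formals(rbinom)
names(formals(rbinom))
# [1] "n"    "size" "prob"
```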
Exercise 5: Display the vector of the number of successes in 24 trials with probability of success 0.2 (20%), this simulation carried out 128 times.
(v <- rbinom(128, 24, 0.2))
## [1] 3 10 6 3 4 7 3 9 4 7 5 4 4 3 2 4 10 1 5 2 4 4 4 6 9
## [26] 6 3 5 3 5 6 6 2 7 2 5 7 8 5 7 3 4 3 5 5 6 7 5 7 5
## [51] 5 5 2 4 6 5 4 6 5 1 5 3 5 2 2 5 6 4 4 4 3 2 1 3 5
## [76] 3 5 4 3 7 7 6 5 4 6 4 4 2 3 4 9 6 6 6 3 6 5 5 4 7
## [101] 2 3 4 2 5 5 5 6 6 3 3 4 6 6 3 4 6 3 5 7 2 3 6 5 4
## [126] 3 4 9
Exercise 6: Summarize the result of rbinom (previous exercise) with the table function. What is the range of results, i.e., the minimum and maximum values? Which is the most likely result? For these, write text which includes the computed results. This is necessary because the results change with each random sampling.
print(tv <- table(v <- rbinom(128, 24, 0.2)))
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1 2 8 15 29 23 25 13 7 4 1
(tv.df <- as.data.frame(tv))
## Var1 Freq
## 1 0 1
## 2 1 2
## 3 2 8
## 4 3 15
## 5 4 29
## 6 5 23
## 7 6 25
## 8 7 13
## 9 8 7
## 10 9 4
## 11 10 1
max.count <- max(tv.df$Freq)
ix <- which(tv.df$Freq == max.count)
tv.df[ix, ]
## Var1 Freq
## 5 4 29
The range is from 0 to 10; the modal value is 4; in this simulation that value is found 29 times.
Displaying the modal value is tricky: it requires you to convert the results of table() to a data.frame, find the highest frequency value(s), and then report that value (or those values).
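A more compact (though less transparent) alternative: which.max() applied to the table gives the position of the first maximum count, and names() recovers the corresponding value. Note that if several values tie for the highest frequency, this reports only the first of them:

```r
v <- rbinom(128, 24, 0.2)
tv <- table(v)
# the (first) most frequent value, and its count
names(which.max(tv))
max(tv)
```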
Exercise 7: Create and display a vector representing latitudes in degrees from \(0^\circ\) (equator) to \(+90^\circ\) (north pole), in intervals of \(5^\circ\). Compute and display their cosines – recall, the trig functions in R expect arguments in radians. Find and display the maximum cosine.
A vector representing latitudes in degrees from \(0^\circ\) (equator) to \(+90^\circ\) (north pole), in intervals of \(5^\circ\):
(angles <- seq(0, 90, by=5))
## [1] 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
Their cosines, and the maximum value:
deg.rad <- pi/180
angles*deg.rad
## [1] 0.00000000 0.08726646 0.17453293 0.26179939 0.34906585 0.43633231
## [7] 0.52359878 0.61086524 0.69813170 0.78539816 0.87266463 0.95993109
## [13] 1.04719755 1.13446401 1.22173048 1.30899694 1.39626340 1.48352986
## [19] 1.57079633
round(angles.cos <- cos(angles*deg.rad),4)
## [1] 1.0000 0.9962 0.9848 0.9659 0.9397 0.9063 0.8660 0.8192 0.7660 0.7071
## [11] 0.6428 0.5736 0.5000 0.4226 0.3420 0.2588 0.1736 0.0872 0.0000
max(angles.cos)
## [1] 1
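As a cross-check, base R also provides cospi(x), which computes cos(pi * x) without an explicit conversion factor; dividing the angles in degrees by 180 expresses them as multiples of \(\pi\):

```r
angles <- seq(0, 90, by = 5)
# cospi(x) computes cos(pi * x) accurately, even at the right angle
round(cospi(angles / 180), 4)
```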
Exercise 8: Check if the gstat package is installed on your system. If not, install it. Load it into the workspace. Display its help and find the variogram function. What is its description?
Installing gstat:
install.packages("gstat", dependencies=TRUE)
Or, use the Install toolbar button of the Packages tab in RStudio.
Loading gstat into the workspace:
library(gstat)
Or, check the box next to the package's name in the Packages tab in RStudio.
Help for the variogram function:
help(variogram, package="gstat")
Description: “Calculates the sample variogram from data, or in case of a linear model is given, for the residuals, with options for directional, robust, and pooled variogram, and for irregular distance intervals”.
Exercise 9: Display the classes of the built-in constant pi and of the built-in constant letters.
class(pi)
## [1] "numeric"
class(letters)
## [1] "character"
Exercise 10: What is the class of the object returned by the variogram function? (Hint: see the heading “Value” in the help text.)
help(variogram)
The variogram function returns an object of class gstatVariogram.
Exercise 11: List the datasets in the gstat package.
Datasets in the gstat package:
data(package="gstat")
Exercise 12: Load, summarize, and show the structure of the oxford dataset.
library(gstat)
data(oxford)
summary(oxford)
## PROFILE XCOORD YCOORD ELEV PROFCLASS
## Min. : 1.00 Min. :100 Min. : 100 Min. :540.0 Cr:19
## 1st Qu.: 32.25 1st Qu.:200 1st Qu.: 600 1st Qu.:558.0 Ct:36
## Median : 63.50 Median :350 Median :1100 Median :573.0 Ia:71
## Mean : 63.50 Mean :350 Mean :1100 Mean :573.6
## 3rd Qu.: 94.75 3rd Qu.:500 3rd Qu.:1600 3rd Qu.:584.5
## Max. :126.00 Max. :600 Max. :2100 Max. :632.0
## MAPCLASS VAL1 CHR1 LIME1 VAL2
## Cr:31 Min. :2.000 Min. :1.000 Min. :0.000 Min. :4.00
## Ct:36 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:4.00
## Ia:59 Median :4.000 Median :2.000 Median :4.000 Median :8.00
## Mean :3.508 Mean :2.468 Mean :2.643 Mean :6.23
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:8.00
## Max. :4.000 Max. :4.000 Max. :4.000 Max. :8.00
## CHR2 LIME2 DEPTHCM DEP2LIME PCLAY1
## Min. :2 Min. :0.000 Min. :10.00 Min. :20.00 Min. :10.00
## 1st Qu.:2 1st Qu.:4.000 1st Qu.:25.00 1st Qu.:20.00 1st Qu.:20.00
## Median :2 Median :5.000 Median :36.00 Median :20.00 Median :24.50
## Mean :3 Mean :3.889 Mean :46.25 Mean :30.32 Mean :24.44
## 3rd Qu.:4 3rd Qu.:5.000 3rd Qu.:64.75 3rd Qu.:40.00 3rd Qu.:28.00
## Max. :6 Max. :5.000 Max. :91.00 Max. :90.00 Max. :37.00
## PCLAY2 MG1 OM1 CEC1
## Min. :10.00 Min. : 19.00 Min. : 2.600 Min. : 7.00
## 1st Qu.:10.00 1st Qu.: 44.00 1st Qu.: 4.100 1st Qu.:12.00
## Median :10.00 Median : 72.00 Median : 5.350 Median :15.00
## Mean :14.76 Mean : 93.53 Mean : 5.995 Mean :18.88
## 3rd Qu.:20.00 3rd Qu.:123.25 3rd Qu.: 7.175 3rd Qu.:25.25
## Max. :40.00 Max. :308.00 Max. :13.100 Max. :43.00
## PH1 PHOS1 POT1
## Min. :4.200 Min. : 1.700 Min. : 83.0
## 1st Qu.:7.200 1st Qu.: 6.200 1st Qu.:127.0
## Median :7.500 Median : 8.500 Median :164.0
## Mean :7.152 Mean : 8.752 Mean :181.7
## 3rd Qu.:7.600 3rd Qu.:10.500 3rd Qu.:194.8
## Max. :7.700 Max. :25.000 Max. :847.0
str(oxford)
## 'data.frame': 126 obs. of 22 variables:
## $ PROFILE : num 1 2 3 4 5 6 7 8 9 10 ...
## $ XCOORD : num 100 100 100 100 100 100 100 100 100 100 ...
## $ YCOORD : num 2100 2000 1900 1800 1700 1600 1500 1400 1300 1200 ...
## $ ELEV : num 598 597 610 615 610 595 580 590 598 588 ...
## $ PROFCLASS: Factor w/ 3 levels "Cr","Ct","Ia": 2 2 2 3 3 2 3 2 3 3 ...
## $ MAPCLASS : Factor w/ 3 levels "Cr","Ct","Ia": 2 3 3 3 3 2 2 3 3 3 ...
## $ VAL1 : num 3 3 4 4 3 3 4 4 4 3 ...
## $ CHR1 : num 3 3 3 3 3 2 2 3 3 3 ...
## $ LIME1 : num 4 4 4 4 4 0 2 1 0 4 ...
## $ VAL2 : num 4 4 5 8 8 4 8 4 8 8 ...
## $ CHR2 : num 4 4 4 2 2 4 2 4 2 2 ...
## $ LIME2 : num 4 4 4 5 5 4 5 4 5 5 ...
## $ DEPTHCM : num 61 91 46 20 20 91 30 61 38 25 ...
## $ DEP2LIME : num 20 20 20 20 20 20 20 20 40 20 ...
## $ PCLAY1 : num 15 25 20 20 18 25 25 35 35 12 ...
## $ PCLAY2 : num 10 10 20 10 10 20 10 20 10 10 ...
## $ MG1 : num 63 58 55 60 88 168 99 59 233 87 ...
## $ OM1 : num 5.7 5.6 5.8 6.2 8.4 6.4 7.1 3.8 5 9.2 ...
## $ CEC1 : num 20 22 17 23 27 27 21 14 27 20 ...
## $ PH1 : num 7.7 7.7 7.5 7.6 7.6 7 7.5 7.6 6.6 7.5 ...
## $ PHOS1 : num 13 9.2 10.5 8.8 13 9.3 10 9 15 12.6 ...
## $ POT1 : num 196 157 115 172 238 164 312 184 123 282 ...
Exercise 13: Load the women sample dataset. How many observations (cases) and how many attributes (fields) for each case? What are the column (field) and row names? What is the height of the first-listed woman?
data(women)
dim(women)
## [1] 15 2
colnames(women)
## [1] "height" "weight"
row.names(women)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
women[1,"height"]
## [1] 58
There are 15 observations (cases) and 2 attributes (fields) for each case. The column (field) names are height and weight; the row (case) names are "1" through "15". The first woman is 58 inches tall.
Exercise 14: List the factors in the oxford dataset.
names(which(sapply(oxford, is.factor)))
## [1] "PROFCLASS" "MAPCLASS"
Exercise 15: Identify the thin trees, defined as those with height/girth ratio more than 1 s.d. above the mean. You will have to define a new field in the dataframe with this ratio, and then use the mean and sd summary functions, along with a logical expression.
trees$hg <- trees$Height/trees$Girth
# thin trees have a height/girth ratio more than 1 s.d. above the mean
(thin.trees <- subset(trees, hg > (mean(trees$hg) + sd(trees$hg))))
## Girth Height Volume hg
## 1 8.3 70 10.3 8.433735
## 2 8.6 65 10.3 7.558140
## 5 10.7 81 18.8 7.570093
## 6 10.8 83 19.7 7.685185
## 9 11.1 80 22.6 7.207207
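The same selection can also be written with logical subscripting, without subset():

```r
data(trees)
trees$hg <- trees$Height / trees$Girth
# logical index: TRUE where the ratio exceeds the mean by more than 1 s.d.
(thin.trees <- trees[trees$hg > mean(trees$hg) + sd(trees$hg), ])
```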
Exercise 16: Display a histogram of the diamond prices in the diamonds dataset.
data(diamonds, package="ggplot2")
hist(diamonds$price)
Exercise 17: Write a model to predict tree height from tree girth. How much of the height can be predicted from the girth?
model.hg <- lm(Height ~ Girth, data=trees)
# equivalent to: model.hg <- lm(trees$Height ~ trees$Girth)
summary(model.hg)
##
## Call:
## lm(formula = Height ~ Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5816 -2.7686 0.3163 2.4728 9.9456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.0313 4.3833 14.152 1.49e-14 ***
## Girth 1.0544 0.3222 3.272 0.00276 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.538 on 29 degrees of freedom
## Multiple R-squared: 0.2697, Adjusted R-squared: 0.2445
## F-statistic: 10.71 on 1 and 29 DF, p-value: 0.002758
Only about 24.4% of the variance in tree height (the adjusted R-squared) can be explained by girth.
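To include such computed values in the report text rather than retyping them, the adjusted R-squared can be extracted from the summary object:

```r
model.hg <- lm(Height ~ Girth, data = trees)
# adjusted R^2, expressed as a percentage of variance explained
round(summary(model.hg)$adj.r.squared * 100, 1)
```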
Exercise 18: Write a model to predict tree volume as a linear function of tree height and tree girth, with no interaction.
model.vhg <- lm(Volume ~ Height + Girth, data=trees)
summary(model.vhg)
##
## Call:
## lm(formula = Volume ~ Height + Girth, data = trees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## Girth 4.7082 0.2643 17.816 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
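The fitted model can then be used with predict() to estimate the volume of a new tree; the height and girth values below are chosen only for illustration:

```r
model.vhg <- lm(Volume ~ Height + Girth, data = trees)
# predicted volume for one hypothetical tree: 75 ft tall, 15 in girth
predict(model.vhg, newdata = data.frame(Height = 75, Girth = 15))
```

From the coefficients above this should be roughly 38 cubic feet.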
Exercise 19: Write a function to restrict the values of a vector to the range \(0 \ldots 1\). Any values \(< 0\) should be replaced with \(0\), and any values \(>1\) should be replaced with \(1\). Test the function on a vector with elements from \(-1.2\) to \(+1.2\) in increments of \(0.1\) – see the seq “sequence” function.
limit.01 <- function(v) {
  changed <- 0
  # replace negative values with 0
  ix <- which(v < 0); v[ix] <- 0
  changed <- changed + length(ix)
  # replace values above 1 with 1
  ix <- which(v > 1); v[ix] <- 1
  changed <- changed + length(ix)
  print(paste("Number of elements limited to 0..1:", changed))
  return(v)
}
Test of this function, here on a vector from \(-0.2\) to \(+1.2\):
(test.v <- seq(-0.2, 1.2, by=0.1))
## [1] -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2
limit.01(test.v)
## [1] "Number of elements limited to 0..1: 5"
## [1] 0.0 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.0 1.0
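A vectorized alternative uses pmin() and pmax() to clamp each element to the interval, though unlike limit.01() it does not report how many elements were changed:

```r
# clamp each element of v into [0, 1]
clamp.01 <- function(v) pmax(0, pmin(1, v))
clamp.01(seq(-0.2, 1.2, by = 0.1))
```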
Bonus Exercise: Use tidyverse functions and pipes on the trees dataset, to select the trees (use the filter function) with a volume greater than the median volume (use the median function), compute the ratio of girth to height as a new variable (use the mutate function), and sort by this (use the arrange function) from thin to thick trees.
library(dplyr)
data(trees)
names(trees)
## [1] "Girth" "Height" "Volume"
trees %>%
  filter(Volume > median(Volume)) %>%
  mutate(thickness = round(Girth/Height, 3)) %>%
  arrange(thickness)
## Girth Height Volume thickness
## 1 12.9 85 33.8 0.152
## 2 13.3 86 27.4 0.155
## 3 14.2 80 31.7 0.178
## 4 14.0 78 34.5 0.179
## 5 13.7 71 25.7 0.193
## 6 14.5 74 36.3 0.196
## 7 16.3 77 42.6 0.212
## 8 17.5 82 55.7 0.213
## 9 17.3 81 55.4 0.214
## 10 13.8 64 24.9 0.216
## 11 16.0 72 38.3 0.222
## 12 17.9 80 58.3 0.224
## 13 18.0 80 51.5 0.225
## 14 18.0 80 51.0 0.225
## 15 20.6 87 77.0 0.237
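For comparison, the same pipeline can be written in base R without dplyr:

```r
data(trees)
# filter: keep the trees with above-median volume
big <- trees[trees$Volume > median(trees$Volume), ]
# mutate: add the girth/height ratio as a new variable
big$thickness <- round(big$Girth / big$Height, 3)
# arrange: sort from thin to thick
big[order(big$thickness), ]
```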