- RStudio user interface
- R Objects
- R Functions
- R Scripts
- R Packages
- R Lists
- R Notation
- R Missing Data
- dplyr
2018-01-22
Component | Mark |
---|---|
Attendance for the entire tutorial | 1 |
Assigned homework completion | 1 |
In-class exercises | 4 |
Total | 6 |
x <- 1
x
## [1] 1
For example, you can round a number with the function round(), or calculate its absolute value with abs().
Write the name of the function and then the data you want the function to operate on in parentheses:
round(-2.718282, 2)
## [1] -2.72
abs(-5)
## [1] 5
abs(round(-2.718282, 2))
## [1] 2.72
To write your own function, call function() and follow it with a pair of braces, {}:
my_function <- function() {}
die <- 1:6
dice <- sample(die, size = 2, replace = TRUE)
sum(dice)
## [1] 3
roll <- function() {
  die <- 1:6
  dice <- sample(die, size = 2, replace = TRUE)
  sum(dice)
}
Call the function roll()
roll() # call the function. NB: result will differ with every call
## [1] 10
Instead of rolling two dice, consider rolling four or ten dice and then adding the results of all the rolls together.
roll2 <- function(numrolls) { # numrolls is the argument of the function roll2
  die <- 1:6
  dice <- sample(die, size = numrolls, replace = TRUE) # the size of the sample
  sum(dice) # add up the roll results
}
numrolls is called an argument of the function roll2().
Let's simulate rolling ten dice and adding the results together.
roll2(10)
## [1] 31
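Because sample() draws at random, roll2(10) returns a different total on most calls. The sketch below shows one common way to make such results repeatable with set.seed(), and how replicate() can repeat the simulation many times. The seed value 130 and the names first, second, and totals are arbitrary choices made for illustration.

```r
# roll2() as defined in the text
roll2 <- function(numrolls) {
  die <- 1:6
  dice <- sample(die, size = numrolls, replace = TRUE)
  sum(dice)
}

set.seed(130)            # arbitrary seed, chosen only for this example
first <- roll2(10)
set.seed(130)            # resetting the seed replays the same random draws
second <- roll2(10)
identical(first, second) # TRUE

# replicate() repeats the call; each total of ten dice lies between 10 and 60
totals <- replicate(1000, roll2(10))
range(totals)
```

Without the calls to set.seed(), the two totals would usually differ.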
If we want to edit the function roll2(), then we will want to save it in a script.
To do this in RStudio, choose File > New File > R Script in the menu bar.
The tidyverse is a collection of R packages that includes ggplot2 and dplyr.
To install the package tidyverse in RStudio, go to the Packages tab and click Install.
To load a package, type:
library(tidyverse)
You can make an atomic vector by grouping some values of data together with c:
die <- c(1, 2, 3, 4, 5, 6)
die
## [1] 1 2 3 4 5 6
is.vector(die)
## [1] TRUE
length(die)
## [1] 6
You can also make an atomic vector with just one value. R saves single values as an atomic vector of length 1:
two <- 2
two
## [1] 2
Each atomic vector can only store one type of data. You can save different types of data in R by using different types of atomic vectors.
R recognizes six basic types of atomic vectors: doubles, integers, characters, logicals, complex, and raw.
We will not be using complex or raw types in STA130.
Integer vectors include a capital L after each input value, and character vectors have input surrounded by quotation marks.
mynums <- c(2L, 3L)
courses <- "STA130"
courses <- c("STA130", "MAT137")
sum(mynums)
## [1] 5
sum(courses)
## Error in sum(courses): invalid 'type' (character) of argument
sum(courses == "STA130")
## [1] 1
die <- c(1, 2, 3, 4, 5, 6)
typeof(die)
## [1] "double"
3 > 4
## [1] FALSE
logic <- c(TRUE, FALSE, TRUE)
logic
## [1] TRUE FALSE TRUE
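The four types used in this course can be checked directly with typeof(). The short sketch below also shows what c() does when it is given mixed input: since an atomic vector holds only one type, the values are converted to a common type. The variable names are made up for illustration.

```r
typeof(2)          # "double": numbers are doubles unless marked otherwise
typeof(2L)         # "integer": the L suffix requests an integer
typeof("STA130")   # "character"
typeof(TRUE)       # "logical"

# Mixed input is converted to a single common type
mixed_num <- c(1L, 2.5)   # integer and double
mixed_chr <- c(1, "a")    # double and character
typeof(mixed_num)         # "double"
typeof(mixed_chr)         # "character"
```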
dim()
You can transform an atomic vector into an n-dimensional array by giving it a dimensions attribute with dim().
die <- c(1, 2, 3, 4, 5, 6)
dim(die) <- c(2, 3) # a 2x3 matrix
die
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
die <- c(1, 2, 3, 4, 5, 6)
dim(die) <- c(3, 2) # a 3x2 matrix
die
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
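R fills each matrix by columns rather than by rows. The helper functions matrix() and array() give more control over this process; for example, matrix() fills by rows when called with byrow = TRUE.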
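As a small illustration of the fill order, the sketch below builds the same six values into a 2x3 matrix twice with matrix(): once with the default column-wise fill and once with byrow = TRUE. The names by_col and by_row are made up for this example.

```r
die <- 1:6

by_col <- matrix(die, nrow = 2)                # default: filled column by column
by_row <- matrix(die, nrow = 2, byrow = TRUE)  # filled row by row

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3
```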
sex <- factor(c("male", "female", "female", "male"))
typeof(sex)
## [1] "integer"
unclass(sex) # shows how R is storing the factor vector
## [1] 2 1 1 2
## attr(,"levels")
## [1] "female" "male"
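The integer codes and the labels of a factor can also be pulled apart with base R helpers. The sketch below reuses the sex vector from the example above; the names codes and labels are chosen only for illustration.

```r
sex <- factor(c("male", "female", "female", "male"))

levels(sex)                  # "female" "male": levels are sorted alphabetically by default
codes <- as.integer(sex)     # 2 1 1 2: the integer codes that unclass() reveals
labels <- as.character(sex)  # the original labels as a character vector
table(sex)                   # number of observations at each level
```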
R always follows the same rules when it coerces data types. Once you are familiar with these rules, you can use R’s coercion behavior to do surprisingly useful things.
For example, sum(c(TRUE, TRUE, FALSE, FALSE)) will become sum(c(1, 1, 0, 0)).
sum(c(TRUE, TRUE, FALSE, FALSE))
## [1] 2
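One practical use of this coercion is counting, or finding the proportion of, values that satisfy a condition. The sketch below is a minimal example; the vector age and its values are made up for illustration.

```r
passed <- c(TRUE, TRUE, FALSE, FALSE)
sum(passed)    # 2: the number of TRUE values
mean(passed)   # 0.5: the proportion of TRUE values

age <- c(19, 20, 17, 20)    # hypothetical ages
n_adult <- sum(age >= 19)   # 3: how many are at least 19
p_adult <- mean(age >= 19)  # 0.75: what proportion are at least 19
```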
list1 <- list(1:31, "Prof. Taback", list(TRUE, FALSE))
list1
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31
##
## [[2]]
## [1] "Prof. Taback"
##
## [[3]]
## [[3]][[1]]
## [1] TRUE
##
## [[3]][[2]]
## [1] FALSE
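The double brackets in the printed output are also how list elements are extracted. The sketch below shows the difference between [[ ]] and [ ] on the same list, and adds a named list as a variation; the element names days and instructor are hypothetical.

```r
list1 <- list(1:31, "Prof. Taback", list(TRUE, FALSE))

second <- list1[[2]]       # "Prof. Taback": double brackets return the element itself
nested <- list1[[3]][[1]]  # TRUE: indexing into the nested list
wrapped <- list1[2]        # single brackets return a list of length 1
class(wrapped)             # "list"

# Elements can be named and then extracted with $
named <- list(days = 1:31, instructor = "Prof. Taback")
named$instructor           # "Prof. Taback"
```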
student_num <- c(1, 2, 3, 4)
name <- c("Nadia", "Shiyi", "Yizhe", "Wei")
mydat <- data.frame(obsnum = student_num, student_name = name)
mydat
##   obsnum student_name
## 1      1        Nadia
## 2      2        Shiyi
## 3      3        Yizhe
## 4      4          Wei
Create a data frame with the data.frame() function.
Give data.frame() any number of vectors, each separated with a comma.
data.frame() will turn each vector into a column of the new data frame.
You can view a data frame in RStudio by clicking on the data frame name in the Environment tab.
mydat[ , ]
mydat
##   obsnum student_name
## 1      1        Nadia
## 2      2        Shiyi
## 3      3        Yizhe
## 4      4          Wei
mydat[1,2] # the value in row 1 and column 2
## [1] Nadia
## Levels: Nadia Shiyi Wei Yizhe
mydat[c(1,2),2] # all values in rows 1 and 2 in second column
## [1] Nadia Shiyi
## Levels: Nadia Shiyi Wei Yizhe
The $ tells R to return all of the values in a column as a vector.
mydat$student_name
## [1] Nadia Shiyi Yizhe Wei
## Levels: Nadia Shiyi Wei Yizhe
vec <- mydat$student_name # assign it to vec
attributes(vec) # info associated with object vec
## $levels
## [1] "Nadia" "Shiyi" "Wei"   "Yizhe"
##
## $class
## [1] "factor"
vec[2] # get second element of vector
## [1] Shiyi
## Levels: Nadia Shiyi Wei Yizhe
mydat[mydat$obsnum == 1,] # first row of data frame and all columns
##   obsnum student_name
## 1      1        Nadia
mydat[mydat$obsnum == 1 | mydat$obsnum == 4 ,] # first and fourth rows of data frame and all columns
##   obsnum student_name
## 1      1        Nadia
## 4      4          Wei
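A few equivalent ways of selecting rows and columns are sketched below. The data frame is rebuilt with stringsAsFactors = FALSE, an assumption made here so that student_name is a character vector in every R version (before R 4.0 the default was to convert strings to factors, which is what the Levels lines in the output above reflect). The names picked and later are made up for illustration.

```r
mydat <- data.frame(
  obsnum = c(1, 2, 3, 4),
  student_name = c("Nadia", "Shiyi", "Yizhe", "Wei"),
  stringsAsFactors = FALSE   # keep names as character strings
)

# %in% is a compact alternative to chaining == with |
picked <- mydat[mydat$obsnum %in% c(1, 4), ]
picked$student_name                               # "Nadia" "Wei"

# Columns can be selected by name instead of by position
later <- mydat[mydat$obsnum > 2, "student_name"]  # "Yizhe" "Wei"

# [[ ]] with a column name returns the same vector as $
identical(mydat[["student_name"]], mydat$student_name)  # TRUE
```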
NA
The NA character is a special symbol in R. It stands for “not available” and can be used as a placeholder for missing information.
1 + NA
## [1] NA
na.rm
age <- c(19, 20, 17, 20, NA)
mean(age) # mean will be NA
## [1] NA
age <- c(19, 20, 17, 20, NA)
mean(age, na.rm = TRUE) # R will ignore missing values
## [1] 19
is.na()
age <- c(19, 20, 17, 20, NA)
is.na(age) # check which elements of age are missing
## [1] FALSE FALSE FALSE FALSE TRUE
age[1] <- NA # set the first element of age to NA
age
## [1] NA 20 17 20 NA
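Since is.na() returns a logical vector, it combines naturally with sum(), which(), and subsetting. The sketch below starts again from the original age vector; the names n_missing, where_missing, and observed are chosen only for illustration.

```r
age <- c(19, 20, 17, 20, NA)

n_missing <- sum(is.na(age))        # 1: how many values are missing
where_missing <- which(is.na(age))  # 5: position of the missing value
observed <- age[!is.na(age)]        # 19 20 17 20: keep only the non-missing values
mean(observed)                      # 19, the same as mean(age, na.rm = TRUE)

# Comparing with == does not detect missing values, because the result is itself NA
NA == NA                            # NA
```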
dplyr
The provincial rates for the week ending January 6, 2018 are in the file fludat_prov.csv and the size of the population in each province is in the file popdat.csv. The code below reads the files into R data frames.
library(tidyverse)
fludat_prov <- read_csv("fludat_prov.csv") # import data from file
popdat <- read_csv("popdat.csv") # import data from file
head(fludat_prov) # head shows the first six rows of a data frame
## # A tibble: 6 x 3
##   prov                 testpop_size  fluA
##   <chr>                       <int> <int>
## 1 Newfoundland                   96    12
## 2 Prince Edward Island           64    11
## 3 Nova Scotia                   144    23
## 4 New Brunswick                 347    80
## 5 Province of Québec           6361  1190
## 6 Province of Ontario          2320   344
head(popdat)
## # A tibble: 6 x 3
##   prov             prov_pop_size region
##   <chr>                    <int> <chr>
## 1 Nunavut                  35944 Territories
## 2 Alberta                4067175 <NA>
## 3 Saskatchewan           1098352 West
## 4 Yukon                    35874 Territories
## 5 Manitoba               1278365 West
## 6 British Columbia       4648055 West
How many Provinces/Territories are in the fludat_prov data frame?
fludat_prov %>% summarise(numprov = n()) # n() counts the number of rows in the data frame
## # A tibble: 1 x 1
##   numprov
##     <int>
## 1      13
Do any variables in fludat_prov or popdat have missing values?
fludat_prov %>%
  filter(is.na(prov) | is.na(testpop_size) | is.na(fluA))
## # A tibble: 0 x 3
## # ... with 3 variables: prov <chr>, testpop_size <int>, fluA <int>
popdat %>%
  filter(is.na(prov) | is.na(prov_pop_size) | is.na(region))
## # A tibble: 2 x 3
##   prov    prov_pop_size region
##   <chr>           <int> <chr>
## 1 Alberta       4067175 <NA>
## 2 Quebec        8164361 <NA>
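The same check can be done in base R without writing out every column name. The sketch below is an alternative to the filter() approach, not a replacement for it: it uses a small stand-in data frame called toy (a made-up name) containing three of the popdat rows shown above.

```r
# Small stand-in for popdat, using three rows from the output above
toy <- data.frame(
  prov = c("Alberta", "Yukon", "Quebec"),
  prov_pop_size = c(4067175, 35874, 8164361),
  region = c(NA, "Territories", NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
missing_counts <- colSums(is.na(toy))
missing_counts

# Rows that have at least one missing value
incomplete <- toy[!complete.cases(toy), ]
incomplete$prov   # "Alberta" "Quebec"
```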
Recode specific values using R data frame notation [,] and $.
popdat$region[popdat$prov == "Alberta"] <- "West" # recode only the region value for Alberta
popdat$region[popdat$prov == "Quebec"] <- "East" # recode only the region value for Quebec
popdat$region # print region variable in popdat data
##  [1] "Territories" "West"        "West"        "Territories" "West"
##  [6] "West"        "East"        "East"        "Atlantic"    "Atlantic"
## [11] "Territories" "Atlantic"    "Atlantic"
dplyr
- Joining Two Tables with inner_join()
We can join two data frames with inner_join(x, y): it returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
fludat_prov %>% inner_join(popdat, by = "prov")
## # A tibble: 9 x 5
##   prov                  testpop_size  fluA prov_pop_size region
##   <chr>                        <int> <int>         <int> <chr>
## 1 Newfoundland                    96    12        519716 Atlantic
## 2 Prince Edward Island            64    11        142907 Atlantic
## 3 Nova Scotia                    144    23        923598 Atlantic
## 4 New Brunswick                  347    80        747101 Atlantic
## 5 Manitoba                       849   186       1278365 West
## 6 British Columbia              1078   198       4648055 West
## 7 Yukon                           15     1         35874 Territories
## 8 Northwest Territories           28    10         41786 Territories
## 9 Nunavut                         18     1         35944 Territories
Why are there only 9 observations when there are 13 Provinces/Territories?
fludat_prov$prov
##  [1] "Newfoundland"             "Prince Edward Island"
##  [3] "Nova Scotia"              "New Brunswick"
##  [5] "Province of Québec"       "Province of Ontario"
##  [7] "Manitoba"                 "Province of Saskatchewan"
##  [9] "Province of Alberta"      "British Columbia"
## [11] "Yukon"                    "Northwest Territories"
## [13] "Nunavut"
popdat$prov
##  [1] "Nunavut"               "Alberta"
##  [3] "Saskatchewan"          "Yukon"
##  [5] "Manitoba"              "British Columbia"
##  [7] "Ontario"               "Quebec"
##  [9] "Prince Edward Island"  "Newfoundland"
## [11] "Northwest Territories" "Nova Scotia"
## [13] "New Brunswick"
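The province names do not match across the two data frames (for example, "Province of Ontario" in fludat_prov versus "Ontario" in popdat), so inner_join() drops those four rows. The prov variable needs to be recoded before joining. This is an exercise in this week's practice problems.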
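Before recoding, it can help to list exactly which key values fail to match. The sketch below uses base R set operations on two short character vectors that stand in for the prov columns; flu_names and pop_names are made-up names holding only a few of the values printed above.

```r
# Short stand-ins for fludat_prov$prov and popdat$prov
flu_names <- c("Manitoba", "Province of Ontario", "Province of Alberta", "Yukon")
pop_names <- c("Manitoba", "Ontario", "Alberta", "Yukon")

only_flu <- setdiff(flu_names, pop_names)  # names with no match in pop_names
only_pop <- setdiff(pop_names, flu_names)  # names with no match in flu_names
both <- intersect(flu_names, pop_names)    # names an inner join would keep

only_flu   # "Province of Ontario" "Province of Alberta"
only_pop   # "Ontario" "Alberta"
both       # "Manitoba" "Yukon"
```

With the full data frames, dplyr's anti_join(fludat_prov, popdat, by = "prov") gives a similar view: it returns the rows of the first data frame that have no match in the second.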