- Histograms and density functions
- Statistical data
- Tidy data
- Data wrangling
- Transforming data
2018-01-15
dat <- data_frame(x = c(1,2,2.5,3,7)) dat$x
[1] 1.0 2.0 2.5 3.0 7.0
Let \(x_0=0.5, h=0.25, m=1,\dots,29\)
seq(0.5,7.5,by = 0.25)
[1] 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75 [15] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00 6.25 6.50 6.75 7.00 7.25 [29] 7.50
The bins are: \([0.50,0.75),[0.75,1.00),[1.00,1.25),\ldots,[7.25,7.50).\)
\[\hat f(x) = \frac{1}{hn} \#\{X_i \text{ in same bin as } x\}\] is called the histogram estimator.
\(\hat f(x)\) is an estimate of the density at a point \(x\).
To construct the histogram we have to choose an origin \(x_0\) and bin width \(h\).
Same bin width but different origin
Collecting this data will generate three variables: height
, years
, and sex
.
height <- c() years <- c() sex <- c()
Put the variables into an R data frame.
NB: data_frame
is the tidyverse
version of base R data.frame
.
sta130_dat <- data_frame(height, years, sex)
We could have entred this in a spreadsheet program like MS Excel, saved it as a CSV file, then imported the file into R.
There are three interrelated rules which make a dataset tidy:
Which data set is tidy?
## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
## # A tibble: 6 x 3 ## country year rate ## * <chr> <int> <chr> ## 1 Afghanistan 1999 745/19987071 ## 2 Afghanistan 2000 2666/20595360 ## 3 Brazil 1999 37737/172006362 ## 4 Brazil 2000 80488/174504898 ## 5 China 1999 212258/1272915272 ## 6 China 2000 213766/1280428583
"For a given dataset, it is usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general." (Wickham, 2014)
A general rule of thumb:
It is easier to describe functional relationships between variables (e.g., z is a linear combination of x and y, density is the ratio of weight to volume) than between rows.
It is easier to make comparisons between groups of observations (e.g., average of group a vs. average of group b) than between groups of columns.
(Wickham, 2014)
ggplot
library implements a grammer of graphics.dplyr
library presents a grammer for data wrangling."…A college degree is no guarantee of economic success. But through their choice of major, they can take at least some steps toward boosting their odds."
fivethirtyeight
library in R.library(fivethirtyeight) # load the library glimpse(college_recent_grads)
## Observations: 173 ## Variables: 21 ## $ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,... ## $ major_code <int> 2419, 2416, 2415, 2417, 2405, 2418... ## $ major <chr> "Petroleum Engineering", "Mining A... ## $ major_category <chr> "Engineering", "Engineering", "Eng... ## $ total <int> 2339, 756, 856, 1258, 32260, 2573,... ## $ sample_size <int> 36, 7, 3, 16, 289, 17, 51, 10, 102... ## $ men <int> 2057, 679, 725, 1123, 21239, 2200,... ## $ women <int> 282, 77, 131, 135, 11021, 373, 960... ## $ sharewomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0... ## $ employed <int> 1976, 640, 648, 758, 25694, 1857, ... ## $ employed_fulltime <int> 1849, 556, 558, 1069, 23170, 2038,... ## $ employed_parttime <int> 270, 170, 133, 150, 5180, 264, 296... ## $ employed_fulltime_yearround <int> 1207, 388, 340, 692, 16697, 1449, ... ## $ unemployed <int> 37, 85, 16, 40, 1672, 400, 308, 33... ## $ unemployment_rate <dbl> 0.018380527, 0.117241379, 0.024096... ## $ p25th <dbl> 95000, 55000, 50000, 43000, 50000,... ## $ median <dbl> 110000, 75000, 73000, 70000, 65000... ## $ p75th <dbl> 125000, 90000, 105000, 80000, 7500... ## $ college_jobs <int> 1534, 350, 456, 529, 18314, 1142, ... ## $ non_college_jobs <int> 364, 257, 176, 102, 4440, 657, 314... ## $ low_wage_jobs <int> 193, 50, 0, 0, 972, 244, 259, 220,...
select()
To retrieve a data frame with only major, number of male and female graduates we use the select()
function in the dplyr
library.
select(college_recent_grads,major, men,women)
## # A tibble: 173 x 3 ## major men women ## <chr> <int> <int> ## 1 Petroleum Engineering 2057 282 ## 2 Mining And Mineral Engineering 679 77 ## 3 Metallurgical Engineering 725 131 ## 4 Naval Architecture And Marine Engineering 1123 135 ## 5 Chemical Engineering 21239 11021 ## 6 Nuclear Engineering 2200 373 ## 7 Actuarial Science 832 960 ## 8 Astronomy And Astrophysics 2110 1667 ## 9 Mechanical Engineering 12953 2105 ## 10 Electrical Engineering 8407 6548 ## # ... with 163 more rows
filter()
If we want to retrieve only those observations (rows) that pertain to engineering majors then we need to specify that the value of the major
variable is Electrical Engineering.
EE <- filter(college_recent_grads, major == "Electrical Engineering") glimpse(EE)
## Observations: 1 ## Variables: 21 ## $ rank <int> 10 ## $ major_code <int> 2408 ## $ major <chr> "Electrical Engineering" ## $ major_category <chr> "Engineering" ## $ total <int> 81527 ## $ sample_size <int> 631 ## $ men <int> 8407 ## $ women <int> 6548 ## $ sharewomen <dbl> 0.4378469 ## $ employed <int> 61928 ## $ employed_fulltime <int> 55450 ## $ employed_parttime <int> 12695 ## $ employed_fulltime_yearround <int> 41413 ## $ unemployed <int> 3895 ## $ unemployment_rate <dbl> 0.05917385 ## $ p25th <dbl> 45000 ## $ median <dbl> 60000 ## $ p75th <dbl> 72000 ## $ college_jobs <int> 45829 ## $ non_college_jobs <int> 10874 ## $ low_wage_jobs <int> 3170
NB: ==
is a test for equality and is different than =
.
select()
and filter()
We can drill down to get certain pieces of information using filter()
and select()
together.
The median
variable is median salary.
select(filter(college_recent_grads, median >= 60000), major, men, women)
%>%
In the code:
select(filter(college_recent_grads, median >= 60000), major,men,women)
filter is nested inside select.
The pipe operator allows is an alternative to nesting and yields easier to read code. The same expression can be written with the pipe operator
college_recent_grads %>% filter(median >= 60000) %>% select(major, men, women)
mutate()
What percentage of graduates from each major where the median earnings is at least $60,000 are men ?
college_recent_grads %>% filter(median >= 60000) %>% select(major, men, women) %>% mutate(total = men + women, pct_male = round((men / total)*100, 2))
Compare to nested code:
mutate(select(filter(college_recent_grads,median >= 60000), major, men, women), total = men + women, pct_male = round((men / total)*100, 2))
mutate()
major | men | women | total | pct_male |
---|---|---|---|---|
Petroleum Engineering | 2057 | 282 | 2339 | 87.94 |
Mining And Mineral Engineering | 679 | 77 | 756 | 89.81 |
Metallurgical Engineering | 725 | 131 | 856 | 84.70 |
Naval Architecture And Marine Engineering | 1123 | 135 | 1258 | 89.27 |
Chemical Engineering | 21239 | 11021 | 32260 | 65.84 |
Nuclear Engineering | 2200 | 373 | 2573 | 85.50 |
Actuarial Science | 832 | 960 | 1792 | 46.43 |
Astronomy And Astrophysics | 2110 | 1667 | 3777 | 55.86 |
Mechanical Engineering | 12953 | 2105 | 15058 | 86.02 |
Electrical Engineering | 8407 | 6548 | 14955 | 56.22 |
Computer Engineering | 33258 | 8284 | 41542 | 80.06 |
Aerospace Engineering | 65511 | 16016 | 81527 | 80.35 |
Biomedical Engineering | 80320 | 10907 | 91227 | 88.04 |
Materials Science | 2949 | 1330 | 4279 | 68.92 |
mutate()
ifelse()
in a mutate()
statement.college_recent_grads %>% select(major, men, women) %>% mutate(total = men + women, pct_female = round((women / total)*100, 2), male.bias = ifelse(pct_female >= 45 & pct_female <= 55, "No","Yes")) %>% select(major,male.bias)
## # A tibble: 173 x 2 ## major male.bias ## <chr> <chr> ## 1 Petroleum Engineering Yes ## 2 Mining And Mineral Engineering Yes ## 3 Metallurgical Engineering Yes ## 4 Naval Architecture And Marine Engineering Yes ## 5 Chemical Engineering Yes ## 6 Nuclear Engineering Yes ## 7 Actuarial Science No ## 8 Astronomy And Astrophysics Yes ## 9 Mechanical Engineering Yes ## 10 Electrical Engineering Yes ## # ... with 163 more rows
rename()
rename()
to change the name of sex.equal
to sex_equal
.my_college_dat <- college_recent_grads %>% select(major, men, women, median) %>% mutate(total = men + women, pct_female = round((women / total)*100, 2), sex.equal = ifelse(pct_female >= 45 & pct_female <= 55, "No","Yes")) %>% select(major,sex.equal, median) my_college_dat <- my_college_dat %>% rename(sex_equal = sex.equal, salary_median = median) glimpse(my_college_dat)
## Observations: 173 ## Variables: 3 ## $ major <chr> "Petroleum Engineering", "Mining And Mineral Eng... ## $ sex_equal <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", ... ## $ salary_median <dbl> 110000, 75000, 73000, 70000, 65000, 65000, 62000...
arrange()
my_college_dat %>% arrange(salary_median) %>% select(major, salary_median) %>% arrange(desc(salary_median))
## # A tibble: 173 x 2 ## major salary_median ## <chr> <dbl> ## 1 Petroleum Engineering 110000 ## 2 Mining And Mineral Engineering 75000 ## 3 Metallurgical Engineering 73000 ## 4 Naval Architecture And Marine Engineering 70000 ## 5 Chemical Engineering 65000 ## 6 Nuclear Engineering 65000 ## 7 Actuarial Science 62000 ## 8 Astronomy And Astrophysics 62000 ## 9 Mechanical Engineering 60000 ## 10 Electrical Engineering 60000 ## # ... with 163 more rows
summarize()
The average number of female grads and the total number of majors in the data set.
college_recent_grads %>% select(major, men, women) %>% summarise(femgrad_mean = mean(women), N = n())
## # A tibble: 1 x 2 ## femgrad_mean N ## <dbl> <int> ## 1 22530.36 173
summarize()
and group_by()
The median salary in majors with 45%-55% female students.
my_college_dat %>% group_by(sex_equal) %>% summarise(median(salary_median))
## # A tibble: 2 x 2 ## sex_equal `median(salary_median)` ## <chr> <dbl> ## 1 No 37400 ## 2 Yes 36000
A data frame with Trump's Tweets.
trumptweets <- read_csv("trumptweets.csv") #import from csv file glimpse(trumptweets)
## Observations: 53,333 ## Variables: 4 ## $ source <chr> "Android", "Android", "Android", "Android", "Androi... ## $ created_at <dttm> 2013-02-06 01:53:40, 2013-02-06 01:53:40, 2013-02-... ## $ id_str <dbl> 2.989727e+17, 2.989727e+17, 2.989727e+17, 2.989727e... ## $ word <chr> "@sherrieshepherd", "nice", "comments", "view", "te...
trumptweets %>% count(word) %>% mutate(word = reorder(word,n)) %>% top_n(20) %>% ggplot(aes(word, n)) + geom_col() + coord_flip() + labs(x = "Word",y = "Number of times word ocurres in a Tweet")
tidytext
library contains several lexicons.library(tidytext) sentiments
## # A tibble: 27,314 x 4 ## word sentiment lexicon score ## <chr> <chr> <chr> <int> ## 1 abacus trust nrc NA ## 2 abandon fear nrc NA ## 3 abandon negative nrc NA ## 4 abandon sadness nrc NA ## 5 abandoned anger nrc NA ## 6 abandoned fear nrc NA ## 7 abandoned negative nrc NA ## 8 abandoned sadness nrc NA ## 9 abandonment anger nrc NA ## 10 abandonment fear nrc NA ## # ... with 27,304 more rows
getsentiments()
function provides a way to get specific sentiment lexicons without the columns that are not used in that lexicon.get_sentiments("nrc")
## # A tibble: 13,901 x 2 ## word sentiment ## <chr> <chr> ## 1 abacus trust ## 2 abandon fear ## 3 abandon negative ## 4 abandon sadness ## 5 abandoned anger ## 6 abandoned fear ## 7 abandoned negative ## 8 abandoned sadness ## 9 abandonment anger ## 10 abandonment fear ## # ... with 13,891 more rows
To examine the sentiment of the words Trump used in tweets we need to join the data frame containing the NRC lexicon and the data frame of Trump's words used in tweets.
inner_join(x,y)
: return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
trumptweets %>% inner_join(get_sentiments("nrc"))
## # A tibble: 33,043 x 5 ## source created_at id_str word sentiment ## <chr> <dttm> <dbl> <chr> <chr> ## 1 Android 2013-02-06 01:53:40 2.989727e+17 terrific sadness ## 2 Android 2013-02-18 23:36:36 3.036492e+17 sky positive ## 3 Android 2013-02-18 23:36:36 3.036492e+17 rocket anger ## 4 Android 2013-02-18 23:36:36 3.036492e+17 payback anger ## 5 Android 2013-02-18 23:36:36 3.036492e+17 payback negative ## 6 Android 2013-02-19 00:25:48 3.036616e+17 surprised surprise ## 7 Android 2013-02-19 12:36:19 3.038455e+17 buss joy ## 8 Android 2013-02-19 12:36:19 3.038455e+17 buss positive ## 9 Android 2013-02-19 12:36:19 3.038455e+17 friend joy ## 10 Android 2013-02-19 12:36:19 3.038455e+17 friend positive ## # ... with 33,033 more rows
trumptweets %>% inner_join(get_sentiments("nrc")) %>% group_by(sentiment,source) %>% summarise(n=n()) %>% mutate(pct= round(n/sum(n)*100,2)) %>% arrange(desc(pct))
## # A tibble: 20 x 4 ## # Groups: sentiment [10] ## sentiment source n pct ## <chr> <chr> <int> <dbl> ## 1 disgust Android 1537 80.68 ## 2 negative Android 4040 78.68 ## 3 sadness Android 2117 78.32 ## 4 anger Android 2228 78.31 ## 5 fear Android 2057 77.80 ## 6 surprise Android 1297 72.70 ## 7 joy Android 1777 71.65 ## 8 anticipation Android 2240 71.25 ## 9 positive Android 4328 69.72 ## 10 trust Android 2924 69.70 ## 11 trust iPhone 1271 30.30 ## 12 positive iPhone 1880 30.28 ## 13 anticipation iPhone 904 28.75 ## 14 joy iPhone 703 28.35 ## 15 surprise iPhone 487 27.30 ## 16 fear iPhone 587 22.20 ## 17 anger iPhone 617 21.69 ## 18 sadness iPhone 586 21.68 ## 19 negative iPhone 1095 21.32 ## 20 disgust iPhone 368 19.32
In the dplyr
library there are several other ways to join tables: left_join()
, right_join()
, full_join()
, semi_join()
, anti_join()
.
See dplyr
documentation.
\[x_i \mapsto f(x_i).\] - Common transformations include: \(f(x)=\ln(x)\), and \(f(x)=x^p, \mbox{ }p \in \mathbb{R}.\) For example, if \(p=1/2\) then \(f\) is the square-root transformation.
The relationship between Salary (median
) and percentage of male graduates.
college_recent_grads %>% ggplot(aes(x = men / total, y = median)) + geom_point()
The same plot but on the log-log scale.
college_recent_grads %>% mutate(log_men = log(men / total), log_salary = log(median)) %>% ggplot(aes(x = log_men, y = log_salary)) + geom_point()