STA130 - Class #2:

2018-01-15

Today's Class

Histograms and density functions
Statistical data
Tidy data
Data wrangling
Transforming data

Histograms and Density Functions

The histogram of a variable is a graphical method to vizualize the distribution of a single variable.
To construct a basic histogram:

Divide the data into intervals (called bins).
Count the number of observations that are contained in the bin.
Plot rectangles with height equal to the count from (2) and width equal to the width of the bin.

Histograms and Density Functions

Different bin width will yield different histograms

Mathematical Definition of Histogram

The bins of the histogram are the intervals: \[[x_0+mh,x_0+(m+1)h).\] $x_0$ is the origin, $m = \ldots,-1,0,1,\ldots$ indexes the bins, and $h = (x_0+(m+1)h) - (x_0+mh)$ is the bin width.

Example - Mathematical Definition of Histogram

dat <- data_frame(x = c(1,2,2.5,3,7))
dat$x

[1] 1.0 2.0 2.5 3.0 7.0

Let $x_0=0.5, h=0.25, m=1,\dots,29$

seq(0.5,7.5,by = 0.25)

 [1] 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75
[15] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00 6.25 6.50 6.75 7.00 7.25
[29] 7.50

The bins are: $[0.50,0.75),[0.75,1.00),[1.00,1.25),\ldots,[7.25,7.50).$

Example - Mathematical Definition of Histogram

The bins can be used to construct rectangles with width $h=0.25$ and height $y$.
$y$ will be called density.
The area of these rectangles is $hy$.
We would like the area of these rectangles, $hy$, to be the same as the proportion of data in the bin. This will make the sum of all areas equal 1.
Let $n$ be the number of observations. Then, \[hy = \frac{\#\{X_i \text{ in bin} \}}{n} \]
In this example, $n=5$, and $X_1=1,X_2=2,X_3=2.5,X_4=3,X_5=7.$

Example - Mathematical Definition of Histogram

Mathematical Definition of Histogram

\[\hat f(x) = \frac{1}{hn} \#\{X_i \text{ in same bin as } x\}\] is called the histogram estimator.

$\hat f(x)$ is an estimate of the density at a point $x$.

To construct the histogram we have to choose an origin $x_0$ and bin width $h$.

Choosing Origin and Bin Width in R

Same bin width but different origin

Statistical data

What is statistical data?

Statistical data is obtained by observing (random) variables.
A random variable can be given a precise mathematical definition that we will cover later in the course.
In this class we will discuss examples.

Observing a few variables on STA130 students

What is your height?
How many years have been at UofT?
What is your sex (male or female)?

Collecting this data will generate three variables: height, years, and sex.

Enter variables on STA130 students

height <- c()
years <- c()
sex <- c()

Put the variables into an R data frame.

NB: data_frame is the tidyverse version of base R data.frame.

sta130_dat <- data_frame(height, years, sex)

We could have entred this in a spreadsheet program like MS Excel, saved it as a CSV file, then imported the file into R.

Tidy data

There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Tidy data

Which data set is tidy?

## # A tibble: 6 x 4
##       country  year  cases population
##         <chr> <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3      Brazil  1999  37737  172006362
## 4      Brazil  2000  80488  174504898
## 5       China  1999 212258 1272915272
## 6       China  2000 213766 1280428583

## # A tibble: 6 x 3
##       country  year              rate
## *       <chr> <int>             <chr>
## 1 Afghanistan  1999      745/19987071
## 2 Afghanistan  2000     2666/20595360
## 3      Brazil  1999   37737/172006362
## 4      Brazil  2000   80488/174504898
## 5       China  1999 212258/1272915272
## 6       China  2000 213766/1280428583

Tidy data

"For a given dataset, it is usually easy to figure out what are observations and what are variables, but it is surprisingly difficult to precisely define variables and observations in general." (Wickham, 2014)

A general rule of thumb:

It is easier to describe functional relationships between variables (e.g., z is a linear combination of x and y, density is the ratio of weight to volume) than between rows.
It is easier to make comparisons between groups of observations (e.g., average of group a vs. average of group b) than between groups of columns.

(Wickham, 2014)

Data wrangling

The ggplot library implements a grammer of graphics.
Similarily the dplyr library presents a grammer for data wrangling.

The Economic Guide to Picking a Major

"…A college degree is no guarantee of economic success. But through their choice of major, they can take at least some steps toward boosting their odds."

The Economic Guide to Picking a Major

The data used in the article is from the American Community Survey 2010-2012 Public Use Microdata Series.
We can use the fivethirtyeight library in R.

Data behind the article

library(fivethirtyeight) # load the library
glimpse(college_recent_grads)

## Observations: 173
## Variables: 21
## $ rank                        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,...
## $ major_code                  <int> 2419, 2416, 2415, 2417, 2405, 2418...
## $ major                       <chr> "Petroleum Engineering", "Mining A...
## $ major_category              <chr> "Engineering", "Engineering", "Eng...
## $ total                       <int> 2339, 756, 856, 1258, 32260, 2573,...
## $ sample_size                 <int> 36, 7, 3, 16, 289, 17, 51, 10, 102...
## $ men                         <int> 2057, 679, 725, 1123, 21239, 2200,...
## $ women                       <int> 282, 77, 131, 135, 11021, 373, 960...
## $ sharewomen                  <dbl> 0.1205643, 0.1018519, 0.1530374, 0...
## $ employed                    <int> 1976, 640, 648, 758, 25694, 1857, ...
## $ employed_fulltime           <int> 1849, 556, 558, 1069, 23170, 2038,...
## $ employed_parttime           <int> 270, 170, 133, 150, 5180, 264, 296...
## $ employed_fulltime_yearround <int> 1207, 388, 340, 692, 16697, 1449, ...
## $ unemployed                  <int> 37, 85, 16, 40, 1672, 400, 308, 33...
## $ unemployment_rate           <dbl> 0.018380527, 0.117241379, 0.024096...
## $ p25th                       <dbl> 95000, 55000, 50000, 43000, 50000,...
## $ median                      <dbl> 110000, 75000, 73000, 70000, 65000...
## $ p75th                       <dbl> 125000, 90000, 105000, 80000, 7500...
## $ college_jobs                <int> 1534, 350, 456, 529, 18314, 1142, ...
## $ non_college_jobs            <int> 364, 257, 176, 102, 4440, 657, 314...
## $ low_wage_jobs               <int> 193, 50, 0, 0, 972, 244, 259, 220,...

Select variables/columns using `select()`

To retrieve a data frame with only major, number of male and female graduates we use the select() function in the dplyr library.

select(college_recent_grads,major, men,women)

## # A tibble: 173 x 3
##                                        major   men women
##                                        <chr> <int> <int>
##  1                     Petroleum Engineering  2057   282
##  2            Mining And Mineral Engineering   679    77
##  3                 Metallurgical Engineering   725   131
##  4 Naval Architecture And Marine Engineering  1123   135
##  5                      Chemical Engineering 21239 11021
##  6                       Nuclear Engineering  2200   373
##  7                         Actuarial Science   832   960
##  8                Astronomy And Astrophysics  2110  1667
##  9                    Mechanical Engineering 12953  2105
## 10                    Electrical Engineering  8407  6548
## # ... with 163 more rows

Select observations/rows using `filter()`

If we want to retrieve only those observations (rows) that pertain to engineering majors then we need to specify that the value of the major variable is Electrical Engineering.

EE <- filter(college_recent_grads, major == "Electrical Engineering")
glimpse(EE)

## Observations: 1
## Variables: 21
## $ rank                        <int> 10
## $ major_code                  <int> 2408
## $ major                       <chr> "Electrical Engineering"
## $ major_category              <chr> "Engineering"
## $ total                       <int> 81527
## $ sample_size                 <int> 631
## $ men                         <int> 8407
## $ women                       <int> 6548
## $ sharewomen                  <dbl> 0.4378469
## $ employed                    <int> 61928
## $ employed_fulltime           <int> 55450
## $ employed_parttime           <int> 12695
## $ employed_fulltime_yearround <int> 41413
## $ unemployed                  <int> 3895
## $ unemployment_rate           <dbl> 0.05917385
## $ p25th                       <dbl> 45000
## $ median                      <dbl> 60000
## $ p75th                       <dbl> 72000
## $ college_jobs                <int> 45829
## $ non_college_jobs            <int> 10874
## $ low_wage_jobs               <int> 3170

NB: == is a test for equality and is different than =.

Combine `select()` and `filter()`

We can drill down to get certain pieces of information using filter() and select() together.
The median variable is median salary.

select(filter(college_recent_grads, median >= 60000), major, men, women)

The pipe operator `%>%`

In the code:

select(filter(college_recent_grads, median >= 60000), major,men,women)

filter is nested inside select.

The pipe operator allows is an alternative to nesting and yields easier to read code. The same expression can be written with the pipe operator

college_recent_grads %>% 
  filter(median >= 60000) %>% 
  select(major, men, women)

Create new variables from existing variables using `mutate()`

What percentage of graduates from each major where the median earnings is at least $60,000 are men ?

college_recent_grads %>% 
  filter(median >= 60000) %>% 
  select(major, men, women) %>% 
  mutate(total = men + women, pct_male = round((men / total)*100, 2))

Compare to nested code:

  mutate(select(filter(college_recent_grads,median >= 60000), 
                major, men, women), 
         total = men + women, pct_male = round((men / total)*100, 2))

Create new variables from existing variables using `mutate()`

major	men	women	total	pct_male
Petroleum Engineering	2057	282	2339	87.94
Mining And Mineral Engineering	679	77	756	89.81
Metallurgical Engineering	725	131	856	84.70
Naval Architecture And Marine Engineering	1123	135	1258	89.27
Chemical Engineering	21239	11021	32260	65.84
Nuclear Engineering	2200	373	2573	85.50
Actuarial Science	832	960	1792	46.43
Astronomy And Astrophysics	2110	1667	3777	55.86
Mechanical Engineering	12953	2105	15058	86.02
Electrical Engineering	8407	6548	14955	56.22
Computer Engineering	33258	8284	41542	80.06
Aerospace Engineering	65511	16016	81527	80.35
Biomedical Engineering	80320	10907	91227	88.04
Materials Science	2949	1330	4279	68.92

Create new variables from existing variables using `mutate()`

Suppose that we would like to create a categorical variable to identify majors with 45% and 55% women (ie., approximately equal numbers of males and females).
We can use ifelse() in a mutate() statement.

college_recent_grads %>% 
  select(major, men, women) %>% 
  mutate(total = men + women, pct_female = round((women / total)*100, 2), 
         male.bias = ifelse(pct_female >= 45 & pct_female <= 55, "No","Yes")) %>%
  select(major,male.bias)

## # A tibble: 173 x 2
##                                        major male.bias
##                                        <chr>     <chr>
##  1                     Petroleum Engineering       Yes
##  2            Mining And Mineral Engineering       Yes
##  3                 Metallurgical Engineering       Yes
##  4 Naval Architecture And Marine Engineering       Yes
##  5                      Chemical Engineering       Yes
##  6                       Nuclear Engineering       Yes
##  7                         Actuarial Science        No
##  8                Astronomy And Astrophysics       Yes
##  9                    Mechanical Engineering       Yes
## 10                    Electrical Engineering       Yes
## # ... with 163 more rows

Rename variables using `rename()`

It's considered bad practice in R to use periods in variable names.
We can use rename() to change the name of sex.equal to sex_equal.

my_college_dat <- college_recent_grads %>% 
  select(major, men, women, median) %>% 
  mutate(total = men + women, pct_female = round((women / total)*100, 2), 
         sex.equal = ifelse(pct_female >= 45 & pct_female <= 55, "No","Yes")) %>%
  select(major,sex.equal, median)

my_college_dat <- my_college_dat %>% 
  rename(sex_equal = sex.equal, salary_median = median)
glimpse(my_college_dat)

## Observations: 173
## Variables: 3
## $ major         <chr> "Petroleum Engineering", "Mining And Mineral Eng...
## $ sex_equal     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", ...
## $ salary_median <dbl> 110000, 75000, 73000, 70000, 65000, 65000, 62000...

Sort a data frame using `arrange()`

my_college_dat %>% 
  arrange(salary_median) %>% 
  select(major, salary_median) %>% 
  arrange(desc(salary_median))

## # A tibble: 173 x 2
##                                        major salary_median
##                                        <chr>         <dbl>
##  1                     Petroleum Engineering        110000
##  2            Mining And Mineral Engineering         75000
##  3                 Metallurgical Engineering         73000
##  4 Naval Architecture And Marine Engineering         70000
##  5                      Chemical Engineering         65000
##  6                       Nuclear Engineering         65000
##  7                         Actuarial Science         62000
##  8                Astronomy And Astrophysics         62000
##  9                    Mechanical Engineering         60000
## 10                    Electrical Engineering         60000
## # ... with 163 more rows

Summarize a data frame using `summarize()`

The average number of female grads and the total number of majors in the data set.

college_recent_grads %>% 
  select(major, men, women) %>% 
  summarise(femgrad_mean = mean(women), N = n())

## # A tibble: 1 x 2
##   femgrad_mean     N
##          <dbl> <int>
## 1     22530.36   173

Summarize groups in a data frame using `summarize()` and `group_by()`

The median salary in majors with 45%-55% female students.

my_college_dat %>% 
  group_by(sex_equal) %>%
  summarise(median(salary_median))

## # A tibble: 2 x 2
##   sex_equal `median(salary_median)`
##       <chr>                   <dbl>
## 1        No                   37400
## 2       Yes                   36000

Combining Multiple Tables

Sentiment of Trump's Tweets

Donald Trump likes to tweet a lot.
Some tweets have an angry sentiment or contain insults, and some are not.
Trump supposedly used to send tweets from a Samsung Galaxy when he is insulting people, places, and things, from other devices such as an iPhone when he is not.
Trump's last tweet from Android were March 25, 2017

Trump's Tweets

A data frame with Trump's Tweets.

trumptweets <- read_csv("trumptweets.csv") #import from csv file
glimpse(trumptweets)

## Observations: 53,333
## Variables: 4
## $ source     <chr> "Android", "Android", "Android", "Android", "Androi...
## $ created_at <dttm> 2013-02-06 01:53:40, 2013-02-06 01:53:40, 2013-02-...
## $ id_str     <dbl> 2.989727e+17, 2.989727e+17, 2.989727e+17, 2.989727e...
## $ word       <chr> "@sherrieshepherd", "nice", "comments", "view", "te...

Trump's tweets

trumptweets %>% 
  count(word) %>% 
  mutate(word = reorder(word,n)) %>% 
  top_n(20) %>% 
  ggplot(aes(word, n)) + geom_col() + coord_flip()  +
  labs(x = "Word",y = "Number of times word ocurres in a Tweet")

Sentiment Lexicon

Several lexicons (dictionaries) have been developed that categorize words according to sentiment (feeling or emotion).
The tidytext library contains several lexicons.

library(tidytext)
sentiments

## # A tibble: 27,314 x 4
##           word sentiment lexicon score
##          <chr>     <chr>   <chr> <int>
##  1      abacus     trust     nrc    NA
##  2     abandon      fear     nrc    NA
##  3     abandon  negative     nrc    NA
##  4     abandon   sadness     nrc    NA
##  5   abandoned     anger     nrc    NA
##  6   abandoned      fear     nrc    NA
##  7   abandoned  negative     nrc    NA
##  8   abandoned   sadness     nrc    NA
##  9 abandonment     anger     nrc    NA
## 10 abandonment      fear     nrc    NA
## # ... with 27,304 more rows

NRC Lexicon

The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
The getsentiments() function provides a way to get specific sentiment lexicons without the columns that are not used in that lexicon.

NRC Lexicon

get_sentiments("nrc")

## # A tibble: 13,901 x 2
##           word sentiment
##          <chr>     <chr>
##  1      abacus     trust
##  2     abandon      fear
##  3     abandon  negative
##  4     abandon   sadness
##  5   abandoned     anger
##  6   abandoned      fear
##  7   abandoned  negative
##  8   abandoned   sadness
##  9 abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows

Sentiment of Words used in Tweets

To examine the sentiment of the words Trump used in tweets we need to join the data frame containing the NRC lexicon and the data frame of Trump's words used in tweets.
inner_join(x,y): return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.

trumptweets %>% inner_join(get_sentiments("nrc"))

## # A tibble: 33,043 x 5
##     source          created_at       id_str      word sentiment
##      <chr>              <dttm>        <dbl>     <chr>     <chr>
##  1 Android 2013-02-06 01:53:40 2.989727e+17  terrific   sadness
##  2 Android 2013-02-18 23:36:36 3.036492e+17       sky  positive
##  3 Android 2013-02-18 23:36:36 3.036492e+17    rocket     anger
##  4 Android 2013-02-18 23:36:36 3.036492e+17   payback     anger
##  5 Android 2013-02-18 23:36:36 3.036492e+17   payback  negative
##  6 Android 2013-02-19 00:25:48 3.036616e+17 surprised  surprise
##  7 Android 2013-02-19 12:36:19 3.038455e+17      buss       joy
##  8 Android 2013-02-19 12:36:19 3.038455e+17      buss  positive
##  9 Android 2013-02-19 12:36:19 3.038455e+17    friend       joy
## 10 Android 2013-02-19 12:36:19 3.038455e+17    friend  positive
## # ... with 33,033 more rows

Sentiment of Words used in Tweets

trumptweets %>% 
  inner_join(get_sentiments("nrc")) %>% 
  group_by(sentiment,source) %>% 
  summarise(n=n()) %>% 
  mutate(pct= round(n/sum(n)*100,2)) %>%
  arrange(desc(pct))

## # A tibble: 20 x 4
## # Groups:   sentiment [10]
##       sentiment  source     n   pct
##           <chr>   <chr> <int> <dbl>
##  1      disgust Android  1537 80.68
##  2     negative Android  4040 78.68
##  3      sadness Android  2117 78.32
##  4        anger Android  2228 78.31
##  5         fear Android  2057 77.80
##  6     surprise Android  1297 72.70
##  7          joy Android  1777 71.65
##  8 anticipation Android  2240 71.25
##  9     positive Android  4328 69.72
## 10        trust Android  2924 69.70
## 11        trust  iPhone  1271 30.30
## 12     positive  iPhone  1880 30.28
## 13 anticipation  iPhone   904 28.75
## 14          joy  iPhone   703 28.35
## 15     surprise  iPhone   487 27.30
## 16         fear  iPhone   587 22.20
## 17        anger  iPhone   617 21.69
## 18      sadness  iPhone   586 21.68
## 19     negative  iPhone  1095 21.32
## 20      disgust  iPhone   368 19.32

Sentiment of Words used in Tweets

Join two tables together

In the dplyr library there are several other ways to join tables: left_join(), right_join(), full_join(), semi_join(), anti_join() .
See dplyr documentation.

Transforming data

Statistical Transformations

In statistical anaysis it's often necessary to transform data.
Transforming data takes each value of a variable $x_i$ and transforms it into $f(x_i)$:

\[x_i \mapsto f(x_i).\] - Common transformations include: $f(x)=\ln(x)$, and $f(x)=x^p, \mbox{ }p \in \mathbb{R}.$ For example, if $p=1/2$ then $f$ is the square-root transformation.

Logarithmic transformation

Logarithmic transformation refers to the natural logarithm: \[y=\log_e(x) \iff \exp(y)=e^y=x\]

Transforming Variables in R

The relationship between Salary (median) and percentage of male graduates.

college_recent_grads %>% 
  ggplot(aes(x = men / total, y = median)) + geom_point()

Transforming Variables in R

The same plot but on the log-log scale.

college_recent_grads %>% 
  mutate(log_men = log(men / total), log_salary = log(median)) %>%
  ggplot(aes(x = log_men, y = log_salary)) + geom_point()

Today's Class

Histograms and Density Functions

Histograms and Density Functions

Histograms and Density Functions

Mathematical Definition of Histogram

Example - Mathematical Definition of Histogram

Example - Mathematical Definition of Histogram

Example - Mathematical Definition of Histogram

Mathematical Definition of Histogram

Choosing Origin and Bin Width in R

Statistical data

What is statistical data?

Observing a few variables on STA130 students

Enter variables on STA130 students

Tidy data

Tidy data

Tidy data

Tidy data

Data wrangling

Data wrangling

The Economic Guide to Picking a Major

The Economic Guide to Picking a Major

Data behind the article

Select variables/columns using select()

Select observations/rows using filter()

Combine select() and filter()

The pipe operator %>%

Create new variables from existing variables using mutate()

Create new variables from existing variables using mutate()

Create new variables from existing variables using mutate()

Rename variables using rename()

Sort a data frame using arrange()

Summarize a data frame using summarize()

Summarize groups in a data frame using summarize() and group_by()

Combining Multiple Tables

Sentiment of Trump's Tweets

Trump's Tweets

Trump's tweets

Sentiment Lexicon

NRC Lexicon

NRC Lexicon

Sentiment of Words used in Tweets

Sentiment of Words used in Tweets

Sentiment of Words used in Tweets

Join two tables together

Transforming data

Statistical Transformations

Logarithmic transformation

Transforming Variables in R

Transforming Variables in R

Select variables/columns using `select()`

Select observations/rows using `filter()`

Combine `select()` and `filter()`

The pipe operator `%>%`

Create new variables from existing variables using `mutate()`

Create new variables from existing variables using `mutate()`

Create new variables from existing variables using `mutate()`

Rename variables using `rename()`

Sort a data frame using `arrange()`

Summarize a data frame using `summarize()`

Summarize groups in a data frame using `summarize()` and `group_by()`