library(tidytext)
library(dplyr)
library(MASS)
library(stringr)

Today’s class

Scraping Twitter
Sentiment Analysis
Supervised Machine Learning

By the end of today:

Sentiment Analysis

Are these tweets related to #ParisClimateDeal positive or negative?

tweet 1

tweet 2

Below is a basic sentiment analysis using tidytext.

library(dplyr)
library(tidytext)
library(sentimentr)

bing_lex <- get_sentiments("bing")

tweet1 <- data_frame(text = c("Pulling out of the #ParisClimateDeal is reckless and regressive. Instead of handholding, I'll work for a sustainable future for our planet."))

fn_sentiment <- tweet1 %>% unnest_tokens(word,text) %>%  left_join(bing_lex)

fn_sentiment %>% filter(!is.na(sentiment)) %>% group_by(sentiment) %>% summarise(n=n())

afinn_lex <- get_sentiments("afinn")

fn_sentiment <- tweet1 %>% unnest_tokens(word,text) %>%  left_join(afinn_lex)

fn_sentiment %>% filter(!is.na(score)) %>% summarise(mean=mean(score))

sentiment(tweet1)
mean(sentiment(tweet1)$sentiment)

tweet2 <- data_frame(text=c("The USA is not an ulimited bank account for rich countries pretending to be poor. Pay your fair share for #NATO and the #ParisClimateDeal."))

fn_sentiment <- tweet2 %>% unnest_tokens(word,text) %>%  left_join(bing_lex)

fn_sentiment %>% filter(!is.na(sentiment)) %>% group_by(sentiment) %>% summarise(n=n())

fn_sentiment <- tweet1 %>% unnest_tokens(word,text) %>%  left_join(afinn_lex)

fn_sentiment %>% filter(!is.na(score)) %>% summarise(mean=mean(score))

sentiment(tweet2)
mean(sentiment(tweet2)$sentiment)

What is Sentiment analysis?

Often called opinion mining.

When we read text we use our understanding of the emotional intent to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust.

We can use the tools of text mining to approach the emotional content of text programmatically,

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.

(Silage and Robinson, 2017)

The `sentiments` dataset in `tidytext`

library(tidytext)

sentiments

## # A tibble: 23,165 × 4
##           word sentiment lexicon score
##          <chr>     <chr>   <chr> <int>
## 1       abacus     trust     nrc    NA
## 2      abandon      fear     nrc    NA
## 3      abandon  negative     nrc    NA
## 4      abandon   sadness     nrc    NA
## 5    abandoned     anger     nrc    NA
## 6    abandoned      fear     nrc    NA
## 7    abandoned  negative     nrc    NA
## 8    abandoned   sadness     nrc    NA
## 9  abandonment     anger     nrc    NA
## 10 abandonment      fear     nrc    NA
## # ... with 23,155 more rows

There are three general sentiment lexicons in the sentiments dataset.

AFINN from Finn Årup Nielsen,

bing from Bing Liu and collaborators, and

nrc from Saif Mohammad and Peter Turney.

Comments:

All three are based on single words (i.e., unigrams).
nrc has sentiment categories (positive, negative), and emotional categories (anger, anticipation, disgust, fear, joy, sadness, surprise, trust).

library(tidytext)
get_sentiments("nrc")

## # A tibble: 13,901 × 2
##           word sentiment
##          <chr>     <chr>
## 1       abacus     trust
## 2      abandon      fear
## 3      abandon  negative
## 4      abandon   sadness
## 5    abandoned     anger
## 6    abandoned      fear
## 7    abandoned  negative
## 8    abandoned   sadness
## 9  abandonment     anger
## 10 abandonment      fear
## # ... with 13,891 more rows

bing has sentiment categories (positive, negative).

library(tidytext)
get_sentiments("bing")

## # A tibble: 6,788 × 2
##           word sentiment
##          <chr>     <chr>
## 1      2-faced  negative
## 2      2-faces  negative
## 3           a+  positive
## 4     abnormal  negative
## 5      abolish  negative
## 6   abominable  negative
## 7   abominably  negative
## 8    abominate  negative
## 9  abomination  negative
## 10       abort  negative
## # ... with 6,778 more rows

AFINN assigns a score from -5 (negative sentiment) to +5 (positive sentiment).

library(tidytext)
get_sentiments("afinn")

## # A tibble: 2,476 × 2
##          word score
##         <chr> <int>
## 1     abandon    -2
## 2   abandoned    -2
## 3    abandons    -2
## 4    abducted    -2
## 5   abduction    -2
## 6  abductions    -2
## 7       abhor    -3
## 8    abhorred    -3
## 9   abhorrent    -3
## 10     abhors    -3
## # ... with 2,466 more rows

How were these sentiment lexicons put together and validated? They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data. Given this information, we may hesitate to apply these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago. While it is true that using these sentiment lexicons with, for example, Jane Austen’s novels may give us less accurate results than with tweets sent by a contemporary writer, we still can measure the sentiment content for words that are shared across the lexicon and the text. (Silage and Robinson, 2017)

Dictionary-based methods like the ones we are discussing find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text. (Silage and Robinson, 2017)

Not every English word is in the lexicons because many English words are pretty neutral. It is important to keep in mind that these methods do not take into account qualifiers before a word, such as in “no good” or “not true”; a lexicon-based method like this is based on unigrams only. (Silage and Robinson, 2017)

The `sentimentr` library

sentimentr attempts to take into account valence shifters (i.e., negators, amplifiers, de-amplifiers, and adversative conjunctions) while maintaining speed. Simply put, sentimentr is an augmented dictionary lookup.

Sentiment analysis of Yelp reviews

This material is based on a blog post by David Robinson.

Does sentiment analysis work?
Can you predict positivity or negativity by assigning sentiment scores to words?

The Yelp Dataset

This section needs to be finished …

The dataset is from the Yelp dataset challenge

library(readr)
library(dplyr)

infile <- "~/Downloads/yelp_dataset_challenge_round9/yelp_academic_dataset_review.json"
review_lines <- read_lines(infile, n_max = 200000, progress = FALSE)

library(stringr)
library(jsonlite)

# Each line is a JSON object- the fastest way to process is to combine into a
# single JSON string and use fromJSON and flatten
reviews_combined <- str_c("[", str_c(review_lines, collapse = ", "), "]")

reviews <- fromJSON(reviews_combined) %>%
  flatten() %>%
  tbl_df()
write.csv(reviews,"yelp.csv")

library(readr)
reviews <- read_csv("~/Dropbox/Docs/DHSI-2017/day2/yelp.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   review_id = col_character(),
##   user_id = col_character(),
##   business_id = col_character(),
##   stars = col_integer(),
##   date = col_date(format = ""),
##   text = col_character(),
##   useful = col_integer(),
##   funny = col_integer(),
##   cool = col_integer(),
##   type = col_character()
## )

A data frame with one row per review.

reviews

## # A tibble: 200,000 × 11
##       X1              review_id                user_id
##    <int>                  <chr>                  <chr>
## 1      1 NxL8SIC5yqOdnlXCg18IBg KpkOkG6RIf4Ra25Lhhxf1A
## 2      2 pXbbIgOXvLuTi_SPs1hQEQ bQ7fQq1otn9hKX-gXRsrgA
## 3      3 wslW2Lu4NYylb1jEapAGsw r1NUhdNmL6yU9Bn-Yx6FTw
## 4      4 GP6YEearUWrzPtQYSF1vVg aW3ix1KNZAvoM8q-WghA3Q
## 5      5 25RlYGq2s5qShi-pn3ufVA YOo-Cip8HqvKp_p9nEGphw
## 6      6 Uf1Ki1yyH_JDKhLvn2e4FQ bgl3j8yJcRO-00NkUYsXGQ
## 7      7 oFmVZh-La7SuvpHrH_Al4Q CWKF9de-nskLYEqDDCfubg
## 8      8 bRvdVt88MJ_YMTlLbjDLxQ GJ7PTY7huYORFKKg3db3Gw
## 9      9 zNUSxqflZKgKD1NQH3jdFA rxqp9eXZj1jYTn0UIsm3Hg
## 10    10 LkP1l7sZIwOV6IKNLqQp_A UU0nHQtHPMAfLidk8tOHTg
## # ... with 199,990 more rows, and 8 more variables: business_id <chr>,
## #   stars <int>, date <date>, text <chr>, useful <int>, funny <int>,
## #   cool <int>, type <chr>

Can we predict the star rating based on the text?

This is an example of a supervised machine learning problem.

Now we will use the unnest_tokens() function to get one row-per-term-per-document.

We will also remove stop words and formattimng text such as “–”

library(tidytext)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)

review_words <- reviews %>%
  select(review_id, business_id, stars, text) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

review_words

## # A tibble: 8,774,699 × 4
##                 review_id            business_id stars       word
##                     <chr>                  <chr> <int>      <chr>
## 1  NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5      enjoy
## 2  NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5    service
## 3  NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5  competent
## 4  NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5 personable
## 5  NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5  recommend
## 6  NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5      corey
## 7  NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5     kaplan
## 8  NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5     highly
## 9  NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5       time
## 10 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A     5      spent
## # ... with 8,774,689 more rows

Use AFINN lexicon and do an inner-join operation.

library(tidytext)
library(dplyr)

AFINN <- sentiments %>%
  filter(lexicon == "AFINN") %>%
  select(word, afinn_score = score)

reviews_sentiment <- review_words %>%
  inner_join(AFINN, by = "word") %>%
  group_by(review_id, stars) %>%
  summarize(sentiment = mean(afinn_score))

reviews_sentiment

## Source: local data frame [188,543 x 3]
## Groups: review_id [?]
## 
##                 review_id stars  sentiment
##                     <chr> <int>      <dbl>
## 1  __--ULL5SCcn24lYW0Fdfg     5  1.5000000
## 2  __-A_0YYKrw7j_woYU9gJw     4 -1.5000000
## 3  __-D6nkN0qL5wepgSwxzCg     5  1.3333333
## 4  __0YvJRhBnEkcBVfsg3vFw     3  1.9090909
## 5  __1UMTVq9wFMFYo58p9G-w     4  2.2500000
## 6  __2D3wkMJMFhFMyIf4PEcQ     3 -3.0000000
## 7  __2DnhNHN1rNX0TqB03hHw     1 -0.3333333
## 8  __2S7tqIMAHWIy4dyIFoiw     1  0.0000000
## 9  __3gdV_ALx9QQzdXWHHUew     3 -0.6000000
## 10 __3WDpybnBMoQ8QfMaeMBw     5 -3.0000000
## # ... with 188,533 more rows

Now we have an average sentiment for each review with a star rating.

library(ggplot2)
theme_set(theme_bw())

ggplot(reviews_sentiment, aes(stars, sentiment, group = stars)) +
  geom_boxplot() +
  ylab("Average sentiment score")

Case Study 1 - Supervised Machine Learning: Can we use Machine Learning to Predict Yelp Star Rating from Text?

Linear Discriminant Analysis

Now let’s fit a classification model using Linear Discriminant analysis.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

set.seed(1)
n <- length(reviews_sentiment$stars) 
n_build <- round(2/3*n)
train <- sample(1:n,n_build)
length(train)

## [1] 125695

reviews_sentiment_test <- reviews_sentiment[-train,]

length(reviews_sentiment_test$stars)

## [1] 62848

table(reviews_sentiment$stars[train])

## 
##     1     2     3     4     5 
## 18960 10994 16921 30849 47971

train_model <- lda(stars~sentiment,data=reviews_sentiment,subset = train)
train_model

## Call:
## lda(stars ~ sentiment, data = reviews_sentiment, subset = train)
## 
## Prior probabilities of groups:
##          1          2          3          4          5 
## 0.15084132 0.08746569 0.13461952 0.24542742 0.38164605 
## 
## Group means:
##     sentiment
## 1 -0.72891985
## 2 -0.02633536
## 3  0.63755904
## 4  1.20671945
## 5  1.58053129
## 
## Coefficients of linear discriminants:
##                LD1
## sentiment 0.717838

pred.stars.train <- predict(train_model)$class
table(pred.stars.train,reviews_sentiment[train,]$stars)

##                 
## pred.stars.train     1     2     3     4     5
##                1 11987  4252  3646  3619  3867
##                2     0     0     0     0     0
##                3     0     0     0     0     0
##                4     0     0     0     0     0
##                5  6973  6742 13275 27230 44104

# test model on hold-out data

predict(train_model,newdata = data.frame(sentiment=-3))

## $class
## [1] 1
## Levels: 1 2 3 4 5
## 
## $posterior
##           1         2          3          4          5
## 1 0.6929537 0.1554911 0.07724234 0.04457611 0.02973668
## 
## $x
##         LD1
## 1 -2.780143

pred.stars <-predict(train_model,newdata = reviews_sentiment_test)
table(pred.stars$class,reviews_sentiment_test$stars)

##    
##         1     2     3     4     5
##   1  5983  2095  1812  1829  1958
##   2     0     0     0     0     0
##   3     0     0     0     0     0
##   4     0     0     0     0     0
##   5  3605  3367  6502 13471 22226

library(ElemStatLearn)
library(rpart)
library(tree)
library(maptree)

## Loading required package: cluster

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

library(ggplot2)

DATASET <- spam
head(DATASET)

##    A.1  A.2  A.3 A.4  A.5  A.6  A.7  A.8  A.9 A.10 A.11 A.12 A.13 A.14
## 1 0.00 0.64 0.64   0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 0.00
## 2 0.21 0.28 0.50   0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65 0.21
## 3 0.06 0.00 0.71   0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12 0.00
## 4 0.00 0.00 0.00   0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00
## 5 0.00 0.00 0.00   0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00
## 6 0.00 0.00 0.00   0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00 0.00 0.00
##   A.15 A.16 A.17 A.18 A.19 A.20 A.21 A.22 A.23 A.24 A.25 A.26 A.27 A.28
## 1 0.00 0.32 0.00 1.29 1.93 0.00 0.96    0 0.00 0.00    0    0    0    0
## 2 0.14 0.14 0.07 0.28 3.47 0.00 1.59    0 0.43 0.43    0    0    0    0
## 3 1.75 0.06 0.06 1.03 1.36 0.32 0.51    0 1.16 0.06    0    0    0    0
## 4 0.00 0.31 0.00 0.00 3.18 0.00 0.31    0 0.00 0.00    0    0    0    0
## 5 0.00 0.31 0.00 0.00 3.18 0.00 0.31    0 0.00 0.00    0    0    0    0
## 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00    0 0.00 0.00    0    0    0    0
##   A.29 A.30 A.31 A.32 A.33 A.34 A.35 A.36 A.37 A.38 A.39 A.40 A.41 A.42
## 1    0    0    0    0    0    0    0    0 0.00    0    0 0.00    0    0
## 2    0    0    0    0    0    0    0    0 0.07    0    0 0.00    0    0
## 3    0    0    0    0    0    0    0    0 0.00    0    0 0.06    0    0
## 4    0    0    0    0    0    0    0    0 0.00    0    0 0.00    0    0
## 5    0    0    0    0    0    0    0    0 0.00    0    0 0.00    0    0
## 6    0    0    0    0    0    0    0    0 0.00    0    0 0.00    0    0
##   A.43 A.44 A.45 A.46 A.47 A.48 A.49  A.50 A.51  A.52  A.53  A.54  A.55
## 1 0.00    0 0.00 0.00    0    0 0.00 0.000    0 0.778 0.000 0.000 3.756
## 2 0.00    0 0.00 0.00    0    0 0.00 0.132    0 0.372 0.180 0.048 5.114
## 3 0.12    0 0.06 0.06    0    0 0.01 0.143    0 0.276 0.184 0.010 9.821
## 4 0.00    0 0.00 0.00    0    0 0.00 0.137    0 0.137 0.000 0.000 3.537
## 5 0.00    0 0.00 0.00    0    0 0.00 0.135    0 0.135 0.000 0.000 3.537
## 6 0.00    0 0.00 0.00    0    0 0.00 0.223    0 0.000 0.000 0.000 3.000
##   A.56 A.57 spam
## 1   61  278 spam
## 2  101 1028 spam
## 3  485 2259 spam
## 4   40  191 spam
## 5   40  191 spam
## 6   15   54 spam

dim(DATASET)

## [1] 4601   58

nrow(DATASET)

## [1] 4601

ncol(DATASET)

## [1] 58

colnames(DATASET)

##  [1] "A.1"  "A.2"  "A.3"  "A.4"  "A.5"  "A.6"  "A.7"  "A.8"  "A.9"  "A.10"
## [11] "A.11" "A.12" "A.13" "A.14" "A.15" "A.16" "A.17" "A.18" "A.19" "A.20"
## [21] "A.21" "A.22" "A.23" "A.24" "A.25" "A.26" "A.27" "A.28" "A.29" "A.30"
## [31] "A.31" "A.32" "A.33" "A.34" "A.35" "A.36" "A.37" "A.38" "A.39" "A.40"
## [41] "A.41" "A.42" "A.43" "A.44" "A.45" "A.46" "A.47" "A.48" "A.49" "A.50"
## [51] "A.51" "A.52" "A.53" "A.54" "A.55" "A.56" "A.57" "spam"

# Change Column Names

newColNames <- c("word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", 
    "word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet", 
    "word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will", 
    "word_freq_people", "word_freq_report", "word_freq_addresses", "word_freq_free", 
    "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit", 
    "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", 
    "word_freq_hp", "word_freq_hpl", "word_freq_george", "word_freq_650", "word_freq_lab", 
    "word_freq_labs", "word_freq_telnet", "word_freq_857", "word_freq_data", 
    "word_freq_415", "word_freq_85", "word_freq_technology", "word_freq_1999", 
    "word_freq_parts", "word_freq_pm", "word_freq_direct", "word_freq_cs", "word_freq_meeting", 
    "word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu", 
    "word_freq_table", "word_freq_conference", "char_freq_ch;", "char_freq_ch(", 
    "char_freq_ch[", "char_freq_ch!", "char_freq_ch$", "char_freq_ch#", "capital_run_length_average", 
    "capital_run_length_longest", "capital_run_length_total", "spam")

colnames(DATASET) <- newColNames

dataset.email <- sapply(DATASET[which(DATASET$spam == "email"),1:54],mean)
dataset.spam <- sapply(DATASET[which(DATASET$spam == "spam"),1:54],mean)

dataset.email.order <- data.frame(name=names(dataset.email[order(-dataset.email)[1:10]]),
                                  mean=dataset.email[order(-dataset.email)[1:10]],class=rep("email",10))
dataset.spam.order <- data.frame(name=names(dataset.spam[order(-dataset.spam)[1:10]]),
                                            mean=dataset.spam[order(-dataset.spam)[1:10]],class=rep("spam",10))

dataset.plot <-rbind(dataset.email.order,dataset.spam.order)

ggplot(dataset.plot,aes(x=name, y=mean,fill=class))+
  geom_bar(stat="identity",position="dodge")+
  theme(axis.text.x=element_text(angle=90,hjust=1))

# training and test sets

set.seed(1423)
index <- 1:nrow(DATASET)
trainIndex <- sample(index, trunc(length(index) * 0.666666666666667))
DATASET.train <- DATASET[trainIndex, ]

DATASET.train %>% group_by(spam) %>% summarise(n=n()) %>% ggplot(aes(x=spam,y=n))+geom_bar(stat="identity")

DATASET.test <- DATASET[-trainIndex, ]

DATASET.test %>% group_by(spam) %>% summarise(n=n()) %>% ggplot(aes(x=spam,y=n))+geom_bar(stat="identity")

# classification tree

model.rpart <- rpart(spam ~ ., method = "class", data = DATASET.train)
printcp(model.rpart)

## 
## Classification tree:
## rpart(formula = spam ~ ., data = DATASET.train, method = "class")
## 
## Variables actually used in tree construction:
## [1] capital_run_length_total char_freq_ch!           
## [3] char_freq_ch$            word_freq_hp            
## [5] word_freq_remove        
## 
## Root node error: 1206/3067 = 0.39322
## 
## n= 3067 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.485075      0   1.00000 1.00000 0.022431
## 2 0.141791      1   0.51493 0.55638 0.018985
## 3 0.057214      2   0.37313 0.47761 0.017935
## 4 0.030680      3   0.31592 0.35738 0.015959
## 5 0.025705      4   0.28524 0.32836 0.015399
## 6 0.010000      5   0.25954 0.28690 0.014528

draw.tree(model.rpart, cex = 0.5, nodeinfo = TRUE, col = gray(0:8/8))

# Confusion Matrix

prediction.rpart <- predict(model.rpart, newdata = DATASET.test, type = "class")
table(`Actual Class` = DATASET.test$spam, `Predicted Class` = prediction.rpart)

##             Predicted Class
## Actual Class email spam
##        email   876   51
##        spam    110  497

error.rate.rpart <- sum(DATASET.test$spam != prediction.rpart)/nrow(DATASET.test)

#Accuracy
1-error.rate.rpart

## [1] 0.8950456

Word and Document Frequency

How can we quantify what a document is about?
Count the frequency of terms that make up a document.
Words such as “the”, “is”, etc. can occur frequently.
Could use stop words, but this approach is limited, especially for commonly used words.

Term Frequency

The term frequency in a document is number of times a term \(\text t\) occurs in document \(\text d\),

\[\text{tf}_\text{t,d}.\]

Inverse Document Frequency

The inverse document frequency of a term \(\text t\) is,

\[\text{idf}_\text{t}=\log\left(\frac{N}{\text{df}_\text{t}}\right).\]

\(N\) is the total number of documents in a collection (or corpus) of documents, and \(\text{df}_\text{t}\) is the number of documents in a collection that contain the term \(\text t\).

Tf-idf Weighting

A weight for each term in each document is given by multiplying term frequency and inverse document frequency.

\[\text{tf-idf}_\text{t,d}= \text{tf}_\text{t,d} \times \log\left(\frac{N}{\text{df}_\text{t}}\right).\]

Some properties of Tf-idf (see Manning et al.):

highest when \(t\) occurs many times within a small number of documents (thus lending high discriminating power to those documents);
lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
lowest when the term occurs in virtually all documents.

Jane Austen’s novels

library(dplyr)
library(janeaustenr)
library(tidytext)

book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)

## Joining, by = "book"

book_words

## # A tibble: 40,379 × 4
##                 book  word     n  total
##               <fctr> <chr> <int>  <int>
## 1     Mansfield Park   the  6206 160460
## 2     Mansfield Park    to  5475 160460
## 3     Mansfield Park   and  5438 160460
## 4               Emma    to  5239 160996
## 5               Emma   the  5201 160996
## 6               Emma   and  4896 160996
## 7     Mansfield Park    of  4778 160460
## 8  Pride & Prejudice   the  4331 122204
## 9               Emma    of  4291 160996
## 10 Pride & Prejudice    to  4162 122204
## # ... with 40,369 more rows

One row for each word-book combination.

library(ggplot2)

ggplot(book_words, aes(n/total, fill = book)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~book, ncol = 2, scales = "free_y")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 896 rows containing non-finite values (stat_bin).

Many novels have similar distributions of words.
This distribution is called Zipf’s law.

freq_by_rank <- book_words %>% 
  group_by(book) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total)

freq_by_rank

## Source: local data frame [40,379 x 6]
## Groups: book [6]
## 
##                 book  word     n  total  rank `term frequency`
##               <fctr> <chr> <int>  <int> <int>            <dbl>
## 1     Mansfield Park   the  6206 160460     1       0.03867631
## 2     Mansfield Park    to  5475 160460     2       0.03412065
## 3     Mansfield Park   and  5438 160460     3       0.03389007
## 4               Emma    to  5239 160996     1       0.03254118
## 5               Emma   the  5201 160996     2       0.03230515
## 6               Emma   and  4896 160996     3       0.03041069
## 7     Mansfield Park    of  4778 160460     4       0.02977689
## 8  Pride & Prejudice   the  4331 122204     1       0.03544074
## 9               Emma    of  4291 160996     4       0.02665284
## 10 Pride & Prejudice    to  4162 122204     2       0.03405780
## # ... with 40,369 more rows

freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = book)) + 
  geom_line(size = 1.2, alpha = 0.8) + 
  scale_x_log10() +
  scale_y_log10()

Introduction to Machine Learning in the Digital Humanities - Day 2