
Today’s class

By the end of today:

Sentiment Analysis

Are these tweets related to #ParisClimateDeal positive or negative?

tweet 1

tweet 2

Below is a basic sentiment analysis using tidytext.


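library(dplyr)     # %>%, filter(), group_by(), summarise(), joins
library(tidytext)  # get_sentiments(), unnest_tokens()

bing_lex <- get_sentiments("bing")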

tweet1 <- tibble(text = c("Pulling out of the #ParisClimateDeal is reckless and regressive. Instead of handholding, I'll work for a sustainable future for our planet."))

fn_sentiment <- tweet1 %>% unnest_tokens(word,text) %>%  left_join(bing_lex)

fn_sentiment %>% filter(!is.na(sentiment)) %>% group_by(sentiment) %>% summarise(n=n())

afinn_lex <- get_sentiments("afinn")

fn_sentiment <- tweet1 %>% unnest_tokens(word,text) %>%  left_join(afinn_lex)

fn_sentiment %>% filter(!is.na(score)) %>% summarise(mean=mean(score))


tweet2 <- tibble(text=c("The USA is not an ulimited bank account for rich countries pretending to be poor. Pay your fair share for #NATO and the #ParisClimateDeal."))

fn_sentiment <- tweet2 %>% unnest_tokens(word,text) %>%  left_join(bing_lex)

fn_sentiment %>% filter(!is.na(sentiment)) %>% group_by(sentiment) %>% summarise(n=n())

fn_sentiment <- tweet2 %>% unnest_tokens(word,text) %>%  left_join(afinn_lex)

fn_sentiment %>% filter(!is.na(score)) %>% summarise(mean=mean(score))


What is Sentiment analysis?

When we read text we use our understanding of the emotional intent to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust.

We can use the tools of text mining to approach the emotional content of text programmatically.

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.

(Silge and Robinson, 2017)

The sentiments dataset in tidytext


How were these sentiment lexicons put together and validated? They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data. Given this information, we may hesitate to apply these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago. While it is true that using these sentiment lexicons with, for example, Jane Austen’s novels may give us less accurate results than with tweets sent by a contemporary writer, we still can measure the sentiment content for words that are shared across the lexicon and the text. (Silge and Robinson, 2017)

Dictionary-based methods like the ones we are discussing find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text. (Silge and Robinson, 2017)
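To make the "adding up" step concrete, here is a minimal base R sketch. The lexicon `toy_lexicon` and its scores are made up for illustration (they are not the real AFINN or bing values), and `score_text()` is a hypothetical helper, not a tidytext function.

```r
# Toy lexicon with made-up scores (NOT the real AFINN or bing values)
toy_lexicon <- c(reckless = -2, regressive = -2, sustainable = 2, fair = 1)

# Hypothetical helper: total sentiment = sum of the scores of the matched words
score_text <- function(text, lexicon) {
  # lower-case, then split on anything that is not a letter or an apostrophe
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  # keep only the words found in the lexicon and add up their scores
  sum(lexicon[words[words %in% names(lexicon)]])
}

score_text("Reckless and regressive, not a sustainable plan.", toy_lexicon)
# -2 + -2 + 2 = -2
```

Words that are missing from the lexicon contribute nothing to the total, which is the same effect the inner join has in the tidytext workflow.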

Not every English word is in the lexicons because many English words are pretty neutral. It is important to keep in mind that these methods do not take into account qualifiers before a word, such as in “no good” or “not true”; a lexicon-based method like this is based on unigrams only. (Silge and Robinson, 2017)

The sentimentr library

sentimentr attempts to take into account valence shifters (i.e., negators, amplifiers, de-amplifiers, and adversative conjunctions) while maintaining speed. Simply put, sentimentr is an augmented dictionary lookup.
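The idea of an augmented dictionary lookup can be illustrated with a toy negation rule in base R. This is only a sketch of the general idea, not sentimentr's actual algorithm: the lexicon, the negator list, and both helper functions are assumptions made up for this example, and amplifiers, de-amplifiers, and adversative conjunctions are ignored.

```r
# Made-up lexicon and negator list, for illustration only
toy_lexicon <- c(good = 1, true = 1, bad = -1)
negators <- c("no", "not", "never")

# Plain unigram lookup: ignores the words around each match
score_unigram <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  sum(lexicon[words[words %in% names(lexicon)]])
}

# Toy "augmented" lookup: flip the sign of a word that directly follows a negator
score_with_negation <- function(text, lexicon, negators) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  scores <- unname(lexicon[words])  # NA for words that are not in the lexicon
  scores[is.na(scores)] <- 0
  # TRUE where the immediately preceding word is a negator
  negated <- c(FALSE, head(words, -1) %in% negators)[seq_along(words)]
  sum(ifelse(negated, -scores, scores))
}

score_unigram("This is not good", toy_lexicon)                  # 1
score_with_negation("This is not good", toy_lexicon, negators)  # -1
```

The unigram score treats "not good" as positive, while the negation-aware version flips it, which is the kind of correction a valence-shifter approach aims for.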

Sentiment analysis of Yelp reviews

This material is based on a blog post by David Robinson.

The Yelp Dataset

The dataset is from the Yelp Dataset Challenge.


library(readr)     # read_lines(), read_csv()
library(stringr)   # str_c()
library(jsonlite)  # fromJSON(), flatten()

infile <- "~/Downloads/yelp_dataset_challenge_round9/yelp_academic_dataset_review.json"
review_lines <- read_lines(infile, n_max = 200000, progress = FALSE)


# Each line is a JSON object; the fastest way to process them is to combine them
# into a single JSON string and use fromJSON and flatten
reviews_combined <- str_c("[", str_c(review_lines, collapse = ", "), "]")

reviews <- fromJSON(reviews_combined) %>%
  flatten() %>%
  as_tibble()

# Or, if the reviews were already saved as a CSV file, read that instead
reviews <- read_csv("~/Dropbox/Docs/DHSI-2017/day2/yelp.csv")

A data frame with one row per review.

Can we predict the star rating based on the text?

This is an example of a supervised machine learning problem.

Now we will use the unnest_tokens() function to get one row per term per document.

We will also remove stop words and formatting text such as “–”.

review_words <- reviews %>%
  select(review_id, business_id, stars, text) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

Use the AFINN lexicon and perform an inner join.


AFINN <- sentiments %>%
  filter(lexicon == "AFINN") %>%
  select(word, afinn_score = score)

reviews_sentiment <- review_words %>%
  inner_join(AFINN, by = "word") %>%
  group_by(review_id, stars) %>%
  summarize(sentiment = mean(afinn_score))

Now we have an average sentiment for each review with a star rating.


ggplot(reviews_sentiment, aes(stars, sentiment, group = stars)) +
  geom_boxplot() +
  ylab("Average sentiment score")

Case Study 1 - Supervised Machine Learning: Can we use Machine Learning to Predict Yelp Star Rating from Text?

Linear Discriminant Analysis

Now let’s fit a classification model using linear discriminant analysis.
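The code that creates `train` and `train_model` is not shown in these notes. The following is a minimal sketch of how such a fit could look, assuming `MASS::lda()`, a random two-thirds training split, and a simulated stand-in for `reviews_sentiment` (the real Yelp data is not used here, so the numbers will differ from the output shown in class).

```r
library(MASS)  # provides lda()

set.seed(1)

# Simulated stand-in for reviews_sentiment (assumption): reviews with more
# stars tend to have a higher average sentiment score
n <- 1000
stars <- sample(1:5, n, replace = TRUE)
reviews_sentiment <- data.frame(
  stars = factor(stars),
  sentiment = rnorm(n, mean = stars - 3, sd = 1.5)
)

# Row indices of the training set; the remaining third is held out
train <- sample(1:nrow(reviews_sentiment), trunc(nrow(reviews_sentiment) * 2/3))

# Fit LDA on the training rows only
train_model <- lda(stars ~ sentiment, data = reviews_sentiment, subset = train)

# Predict star ratings for the hold-out rows
reviews_sentiment_test <- reviews_sentiment[-train, ]
pred.stars <- predict(train_model, newdata = reviews_sentiment_test)

# Proportion of hold-out reviews whose predicted star rating is correct
mean(pred.stars$class == reviews_sentiment_test$stars)
```

`predict()` returns a list with the predicted class, the posterior probability of each class, and the discriminant scores, which matches the `$class`, `$posterior`, and `$x` components in the output.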

reviews_sentiment_test <- reviews_sentiment[-train,]

# test model on hold-out data

predict(train_model,newdata = data.frame(sentiment=-3))
## $class
## [1] 1
## Levels: 1 2 3 4 5
## $posterior
##           1         2          3          4          5
## 1 0.6929537 0.1554911 0.07724234 0.04457611 0.02973668
## $x
##         LD1
## 1 -2.780143
pred.stars <- predict(train_model, newdata = reviews_sentiment_test)

DATASET <- spam
dim(DATASET)
## [1] 4601   58
# Change Column Names

newColNames <- c("word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d", 
    "word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet", 
    "word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will", 
    "word_freq_people", "word_freq_report", "word_freq_addresses", "word_freq_free", 
    "word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit", 
    "word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money", 
    "word_freq_hp", "word_freq_hpl", "word_freq_george", "word_freq_650", "word_freq_lab", 
    "word_freq_labs", "word_freq_telnet", "word_freq_857", "word_freq_data", 
    "word_freq_415", "word_freq_85", "word_freq_technology", "word_freq_1999", 
    "word_freq_parts", "word_freq_pm", "word_freq_direct", "word_freq_cs", "word_freq_meeting", 
    "word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu", 
    "word_freq_table", "word_freq_conference", "char_freq_ch;", "char_freq_ch(", 
    "char_freq_ch[", "char_freq_ch!", "char_freq_ch$", "char_freq_ch#", "capital_run_length_average", 
    "capital_run_length_longest", "capital_run_length_total", "spam")

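colnames(DATASET) <- newColNames

# Mean of each of the 54 frequency features, separately for email and spam
dataset.email <- sapply(DATASET[which(DATASET$spam == "email"),1:54],mean)
dataset.spam <- sapply(DATASET[which(DATASET$spam == "spam"),1:54],mean)

# Ten features with the highest mean in each class
# (the mean and class columns are reconstructed from the aes() call below)
dataset.email.order <- data.frame(name=names(dataset.email[order(-dataset.email)[1:10]]),
                                  mean=dataset.email[order(-dataset.email)[1:10]], class="email")
dataset.spam.order <- data.frame(name=names(dataset.spam[order(-dataset.spam)[1:10]]),
                                 mean=dataset.spam[order(-dataset.spam)[1:10]], class="spam")

dataset.plot <- rbind(dataset.email.order,dataset.spam.order)

# The geom layer is missing from the source; a dodged bar chart is assumed here
ggplot(dataset.plot,aes(x=name, y=mean,fill=class))+
  geom_bar(stat="identity", position="dodge")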

# training and test sets

index <- 1:nrow(DATASET)
trainIndex <- sample(index, trunc(length(index) * 0.666666666666667))
DATASET.train <- DATASET[trainIndex, ]

DATASET.train %>% group_by(spam) %>% summarise(n=n()) %>% ggplot(aes(x=spam,y=n))+geom_bar(stat="identity")

DATASET.test <- DATASET[-trainIndex, ]

DATASET.test %>% group_by(spam) %>% summarise(n=n()) %>% ggplot(aes(x=spam,y=n))+geom_bar(stat="identity")

# classification tree

model.rpart <- rpart(spam ~ ., method = "class", data = DATASET.train)
draw.tree(model.rpart, cex = 0.5, nodeinfo = TRUE, col = gray(0:8/8))

# Confusion Matrix

prediction.rpart <- predict(model.rpart, newdata = DATASET.test, type = "class")
table(`Actual Class` = DATASET.test$spam, `Predicted Class` = prediction.rpart)
##             Predicted Class
## Actual Class email spam
##        email   876   51
##        spam    110  497
error.rate.rpart <- sum(DATASET.test$spam != prediction.rpart)/nrow(DATASET.test)
# The value printed below is the accuracy, one minus the error rate: (876 + 497)/1534
1 - error.rate.rpart
## [1] 0.8950456
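Accuracy and error rate can be read directly off the confusion matrix above. A minimal base R sketch that re-enters the four counts by hand; the object names (`conf_mat`, `accuracy`, `error_rate`, `spam_recall`, `false_alarm`) are illustrative and do not appear in the original code.

```r
# Confusion matrix from the classification tree (rows = actual, columns = predicted)
conf_mat <- matrix(c(876, 110, 51, 497), nrow = 2,
                   dimnames = list(actual = c("email", "spam"),
                                   predicted = c("email", "spam")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
error_rate <- 1 - accuracy

# Share of actual spam that the tree catches
spam_recall <- conf_mat["spam", "spam"] / sum(conf_mat["spam", ])

# Share of legitimate email that is wrongly flagged as spam
false_alarm <- conf_mat["email", "spam"] / sum(conf_mat["email", ])

round(c(accuracy = accuracy, error_rate = error_rate), 4)
```

Of the 1534 test messages, 161 are misclassified (51 emails flagged as spam and 110 spam messages let through), so the error rate is about 0.105 and the accuracy about 0.895.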

Word and Document Frequency

Term Frequency

The term frequency \(\text{tf}_\text{t,d}\) is the number of times a term \(\text t\) occurs in a document \(\text d\).


Inverse Document Frequency

The inverse document frequency of a term \(\text t\) is

\[\text{idf}_\text{t} = \log\left(\frac{N}{\text{df}_\text{t}}\right).\]


\(N\) is the total number of documents in a collection (or corpus) of documents, and \(\text{df}_\text{t}\) is the number of documents in a collection that contain the term \(\text t\).

Tf-idf Weighting

A weight for each term in each document is given by multiplying term frequency and inverse document frequency.

\[\text{tf-idf}_\text{t,d}= \text{tf}_\text{t,d} \times \log\left(\frac{N}{\text{df}_\text{t}}\right).\]

Some properties of tf-idf (see Manning et al.). The weight \(\text{tf-idf}_\text{t,d}\) is:

  1. highest when \(\text t\) occurs many times within a small number of documents (thus lending high discriminating power to those documents);
  2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
  3. lowest when the term occurs in virtually all documents.
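These properties can be checked by hand on a tiny corpus. The sketch below uses three made-up documents and base R only; it takes term frequency as a raw count and uses the natural logarithm, following the formula above. All object names are illustrative.

```r
# Three made-up documents, already split into words
docs <- list(
  d1 = c("the", "cat", "sat", "on", "the", "mat"),
  d2 = c("the", "dog", "sat"),
  d3 = c("the", "cat", "cat", "cat")
)

N <- length(docs)                     # number of documents in the corpus
terms <- sort(unique(unlist(docs)))   # vocabulary

# tf[t, d]: number of times term t occurs in document d
tf <- sapply(docs, function(d) sapply(terms, function(t) sum(d == t)))

# df[t]: number of documents that contain term t
df <- rowSums(tf > 0)

# idf[t] = log(N / df[t])
idf <- log(N / df)

# tf-idf[t, d] = tf[t, d] * idf[t]; each row of tf is scaled by its term's idf
tf_idf <- tf * idf

round(tf_idf, 3)
```

"the" occurs in every document, so its idf is log(3/3) = 0 and its weight is zero everywhere. "cat" occurs three times in d3 and in only two of the three documents, so it gets the largest weight in the table, 3 log(3/2), which is about 1.216.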

Jane Austen’s novels


book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))

book_words <- left_join(book_words, total_words)
## Joining, by = "book"
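
# freq_by_rank is not defined in the source; it is reconstructed here from the
# columns used in the plot below (rank within each book, and n/total)
freq_by_rank <- book_words %>% 
  group_by(book) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total)
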
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = book)) + 
  geom_line(size = 1.2, alpha = 0.8) + 
  scale_x_log10() +