library(tidytext)
library(dplyr)
library(MASS)
library(stringr)
By the end of today, you should be able to answer: are these tweets related to #ParisClimateDeal positive or negative?
Below is a basic sentiment analysis using tidytext.
library(dplyr)
library(tidytext)
library(sentimentr)
bing_lex <- get_sentiments("bing")
tweet1 <- data_frame(text = c("Pulling out of the #ParisClimateDeal is reckless and regressive. Instead of handholding, I'll work for a sustainable future for our planet."))
fn_sentiment <- tweet1 %>% unnest_tokens(word,text) %>% left_join(bing_lex)
fn_sentiment %>% filter(!is.na(sentiment)) %>% group_by(sentiment) %>% summarise(n=n())
afinn_lex <- get_sentiments("afinn")
fn_sentiment <- tweet1 %>% unnest_tokens(word,text) %>% left_join(afinn_lex)
fn_sentiment %>% filter(!is.na(score)) %>% summarise(mean=mean(score))
sentiment(tweet1$text)
mean(sentiment(tweet1$text)$sentiment)
tweet2 <- data_frame(text=c("The USA is not an unlimited bank account for rich countries pretending to be poor. Pay your fair share for #NATO and the #ParisClimateDeal."))
fn_sentiment <- tweet2 %>% unnest_tokens(word,text) %>% left_join(bing_lex)
fn_sentiment %>% filter(!is.na(sentiment)) %>% group_by(sentiment) %>% summarise(n=n())
fn_sentiment <- tweet2 %>% unnest_tokens(word,text) %>% left_join(afinn_lex)
fn_sentiment %>% filter(!is.na(score)) %>% summarise(mean=mean(score))
sentiment(tweet2$text)
mean(sentiment(tweet2$text)$sentiment)
When we read text we use our understanding of the emotional intent to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust.
We can use the tools of text mining to approach the emotional content of text programmatically.
One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.
(Silge and Robinson, 2017)
The sentiments dataset in tidytext:
library(tidytext)
sentiments
## # A tibble: 23,165 × 4
## word sentiment lexicon score
## <chr> <chr> <chr> <int>
## 1 abacus trust nrc NA
## 2 abandon fear nrc NA
## 3 abandon negative nrc NA
## 4 abandon sadness nrc NA
## 5 abandoned anger nrc NA
## 6 abandoned fear nrc NA
## 7 abandoned negative nrc NA
## 8 abandoned sadness nrc NA
## 9 abandonment anger nrc NA
## 10 abandonment fear nrc NA
## # ... with 23,155 more rows
There are three general-purpose lexicons in the sentiments dataset: AFINN from Finn Årup Nielsen, bing from Bing Liu and collaborators, and nrc from Saif Mohammad and Peter Turney.
Comments:
All three are based on single words (i.e., unigrams).
nrc has sentiment categories (positive, negative) and emotional categories (anger, anticipation, disgust, fear, joy, sadness, surprise, trust).
library(tidytext)
get_sentiments("nrc")
## # A tibble: 13,901 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
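To see how many words the lexicon assigns to each category, we can count by sentiment; a quick sketch:
# count the nrc words in each sentiment/emotion category
get_sentiments("nrc") %>%
  count(sentiment, sort = TRUE)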
bing has sentiment categories (positive, negative).
library(tidytext)
get_sentiments("bing")
## # A tibble: 6,788 × 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
AFINN assigns a score from -5 (negative sentiment) to +5 (positive sentiment).
library(tidytext)
get_sentiments("afinn")
## # A tibble: 2,476 × 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
How were these sentiment lexicons put together and validated? They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data. Given this information, we may hesitate to apply these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago. While it is true that using these sentiment lexicons with, for example, Jane Austen’s novels may give us less accurate results than with tweets sent by a contemporary writer, we still can measure the sentiment content for words that are shared across the lexicon and the text. (Silge and Robinson, 2017)
Dictionary-based methods like the ones we are discussing find the total sentiment of a piece of text by adding up the individual sentiment scores for each word in the text. (Silge and Robinson, 2017)
Not every English word is in the lexicons because many English words are pretty neutral. It is important to keep in mind that these methods do not take into account qualifiers before a word, such as in “no good” or “not true”; a lexicon-based method like this is based on unigrams only. (Silge and Robinson, 2017)
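One way to see this limitation is to tokenize into bigrams and look for negated words; a minimal sketch using tweet2 from above (tidyr is assumed for separate()):
library(tidyr)
# split into bigrams and keep those beginning with the negator "not",
# which a unigram lexicon would score without this context
tweet2 %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not")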
sentimentr
The sentimentr library attempts to take into account valence shifters (i.e., negators, amplifiers, de-amplifiers, and adversative conjunctions) while maintaining speed. Simply put, sentimentr is an augmented dictionary lookup.
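A quick sketch of the difference a negator makes (exact scores depend on the sentimentr version):
library(sentimentr)
# the valence shifter "not" flips the polarity a plain dictionary lookup would assign
sentiment(c("I am happy.", "I am not happy."))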
This material is based on a blog post by David Robinson.
Does sentiment analysis work?
Can you predict positivity or negativity by assigning sentiment scores to words?
The dataset is from the Yelp Dataset Challenge.
library(readr)
library(dplyr)
infile <- "~/Downloads/yelp_dataset_challenge_round9/yelp_academic_dataset_review.json"
review_lines <- read_lines(infile, n_max = 200000, progress = FALSE)
library(stringr)
library(jsonlite)
# Each line is a JSON object; the fastest way to process is to combine into a
# single JSON string and use fromJSON and flatten
reviews_combined <- str_c("[", str_c(review_lines, collapse = ", "), "]")
reviews <- fromJSON(reviews_combined) %>%
flatten() %>%
tbl_df()
write.csv(reviews,"yelp.csv")
library(readr)
reviews <- read_csv("~/Dropbox/Docs/DHSI-2017/day2/yelp.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## review_id = col_character(),
## user_id = col_character(),
## business_id = col_character(),
## stars = col_integer(),
## date = col_date(format = ""),
## text = col_character(),
## useful = col_integer(),
## funny = col_integer(),
## cool = col_integer(),
## type = col_character()
## )
A data frame with one row per review.
reviews
## # A tibble: 200,000 × 11
## X1 review_id user_id
## <int> <chr> <chr>
## 1 1 NxL8SIC5yqOdnlXCg18IBg KpkOkG6RIf4Ra25Lhhxf1A
## 2 2 pXbbIgOXvLuTi_SPs1hQEQ bQ7fQq1otn9hKX-gXRsrgA
## 3 3 wslW2Lu4NYylb1jEapAGsw r1NUhdNmL6yU9Bn-Yx6FTw
## 4 4 GP6YEearUWrzPtQYSF1vVg aW3ix1KNZAvoM8q-WghA3Q
## 5 5 25RlYGq2s5qShi-pn3ufVA YOo-Cip8HqvKp_p9nEGphw
## 6 6 Uf1Ki1yyH_JDKhLvn2e4FQ bgl3j8yJcRO-00NkUYsXGQ
## 7 7 oFmVZh-La7SuvpHrH_Al4Q CWKF9de-nskLYEqDDCfubg
## 8 8 bRvdVt88MJ_YMTlLbjDLxQ GJ7PTY7huYORFKKg3db3Gw
## 9 9 zNUSxqflZKgKD1NQH3jdFA rxqp9eXZj1jYTn0UIsm3Hg
## 10 10 LkP1l7sZIwOV6IKNLqQp_A UU0nHQtHPMAfLidk8tOHTg
## # ... with 199,990 more rows, and 8 more variables: business_id <chr>,
## # stars <int>, date <date>, text <chr>, useful <int>, funny <int>,
## # cool <int>, type <chr>
Can we predict the star rating based on the text?
This is an example of a supervised machine learning problem.
Now we will use the unnest_tokens() function to get one row per term per document. We will also remove stop words and formatting tokens such as “–”.
library(tidytext)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
review_words <- reviews %>%
select(review_id, business_id, stars, text) %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word,
str_detect(word, "^[a-z']+$"))
review_words
## # A tibble: 8,774,699 × 4
## review_id business_id stars word
## <chr> <chr> <int> <chr>
## 1 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 enjoy
## 2 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 service
## 3 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 competent
## 4 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 personable
## 5 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 recommend
## 6 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 corey
## 7 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 kaplan
## 8 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 highly
## 9 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 time
## 10 NxL8SIC5yqOdnlXCg18IBg 2aFiy99vNLklCx3T_tGS9A 5 spent
## # ... with 8,774,689 more rows
Use the AFINN lexicon and do an inner-join operation.
library(tidytext)
library(dplyr)
AFINN <- sentiments %>%
filter(lexicon == "AFINN") %>%
select(word, afinn_score = score)
reviews_sentiment <- review_words %>%
inner_join(AFINN, by = "word") %>%
group_by(review_id, stars) %>%
summarize(sentiment = mean(afinn_score))
reviews_sentiment
## Source: local data frame [188,543 x 3]
## Groups: review_id [?]
##
## review_id stars sentiment
## <chr> <int> <dbl>
## 1 __--ULL5SCcn24lYW0Fdfg 5 1.5000000
## 2 __-A_0YYKrw7j_woYU9gJw 4 -1.5000000
## 3 __-D6nkN0qL5wepgSwxzCg 5 1.3333333
## 4 __0YvJRhBnEkcBVfsg3vFw 3 1.9090909
## 5 __1UMTVq9wFMFYo58p9G-w 4 2.2500000
## 6 __2D3wkMJMFhFMyIf4PEcQ 3 -3.0000000
## 7 __2DnhNHN1rNX0TqB03hHw 1 -0.3333333
## 8 __2S7tqIMAHWIy4dyIFoiw 1 0.0000000
## 9 __3gdV_ALx9QQzdXWHHUew 3 -0.6000000
## 10 __3WDpybnBMoQ8QfMaeMBw 5 -3.0000000
## # ... with 188,533 more rows
Now we have an average sentiment for each review with a star rating.
library(ggplot2)
theme_set(theme_bw())
ggplot(reviews_sentiment, aes(stars, sentiment, group = stars)) +
geom_boxplot() +
ylab("Average sentiment score")
Now let’s fit a classification model using linear discriminant analysis (LDA).
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
set.seed(1)
n <- length(reviews_sentiment$stars)
n_build <- round(2/3*n)
train <- sample(1:n,n_build)
length(train)
## [1] 125695
reviews_sentiment_test <- reviews_sentiment[-train,]
length(reviews_sentiment_test$stars)
## [1] 62848
table(reviews_sentiment$stars[train])
##
## 1 2 3 4 5
## 18960 10994 16921 30849 47971
train_model <- lda(stars~sentiment,data=reviews_sentiment,subset = train)
train_model
## Call:
## lda(stars ~ sentiment, data = reviews_sentiment, subset = train)
##
## Prior probabilities of groups:
## 1 2 3 4 5
## 0.15084132 0.08746569 0.13461952 0.24542742 0.38164605
##
## Group means:
## sentiment
## 1 -0.72891985
## 2 -0.02633536
## 3 0.63755904
## 4 1.20671945
## 5 1.58053129
##
## Coefficients of linear discriminants:
## LD1
## sentiment 0.717838
pred.stars.train <- predict(train_model)$class
table(pred.stars.train,reviews_sentiment[train,]$stars)
##
## pred.stars.train 1 2 3 4 5
## 1 11987 4252 3646 3619 3867
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 6973 6742 13275 27230 44104
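From the training confusion matrix we can compute the apparent (training) accuracy; a quick sketch:
# proportion of training reviews whose star rating is predicted correctly
mean(as.character(pred.stars.train) == as.character(reviews_sentiment[train, ]$stars))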
# example: predict the star rating for a single sentiment score of -3
predict(train_model,newdata = data.frame(sentiment=-3))
## $class
## [1] 1
## Levels: 1 2 3 4 5
##
## $posterior
## 1 2 3 4 5
## 1 0.6929537 0.1554911 0.07724234 0.04457611 0.02973668
##
## $x
## LD1
## 1 -2.780143
# test model on hold-out data
pred.stars <- predict(train_model, newdata = reviews_sentiment_test)
table(pred.stars$class,reviews_sentiment_test$stars)
##
## 1 2 3 4 5
## 1 5983 2095 1812 1829 1958
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 3605 3367 6502 13471 22226
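As before, we can summarize the hold-out confusion matrix as an overall accuracy; a quick check:
# proportion of test reviews whose star rating is predicted correctly
mean(as.character(pred.stars$class) == as.character(reviews_sentiment_test$stars))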
library(ElemStatLearn)
library(rpart)
library(tree)
library(maptree)
## Loading required package: cluster
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(ggplot2)
DATASET <- spam  # the spambase data from the ElemStatLearn package
head(DATASET)
## A.1 A.2 A.3 A.4 A.5 A.6 A.7 A.8 A.9 A.10 A.11 A.12 A.13 A.14
## 1 0.00 0.64 0.64 0 0.32 0.00 0.00 0.00 0.00 0.00 0.00 0.64 0.00 0.00
## 2 0.21 0.28 0.50 0 0.14 0.28 0.21 0.07 0.00 0.94 0.21 0.79 0.65 0.21
## 3 0.06 0.00 0.71 0 1.23 0.19 0.19 0.12 0.64 0.25 0.38 0.45 0.12 0.00
## 4 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00
## 5 0.00 0.00 0.00 0 0.63 0.00 0.31 0.63 0.31 0.63 0.31 0.31 0.31 0.00
## 6 0.00 0.00 0.00 0 1.85 0.00 0.00 1.85 0.00 0.00 0.00 0.00 0.00 0.00
## A.15 A.16 A.17 A.18 A.19 A.20 A.21 A.22 A.23 A.24 A.25 A.26 A.27 A.28
## 1 0.00 0.32 0.00 1.29 1.93 0.00 0.96 0 0.00 0.00 0 0 0 0
## 2 0.14 0.14 0.07 0.28 3.47 0.00 1.59 0 0.43 0.43 0 0 0 0
## 3 1.75 0.06 0.06 1.03 1.36 0.32 0.51 0 1.16 0.06 0 0 0 0
## 4 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0
## 5 0.00 0.31 0.00 0.00 3.18 0.00 0.31 0 0.00 0.00 0 0 0 0
## 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.00 0.00 0 0 0 0
## A.29 A.30 A.31 A.32 A.33 A.34 A.35 A.36 A.37 A.38 A.39 A.40 A.41 A.42
## 1 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0
## 2 0 0 0 0 0 0 0 0 0.07 0 0 0.00 0 0
## 3 0 0 0 0 0 0 0 0 0.00 0 0 0.06 0 0
## 4 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0
## 5 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0
## 6 0 0 0 0 0 0 0 0 0.00 0 0 0.00 0 0
## A.43 A.44 A.45 A.46 A.47 A.48 A.49 A.50 A.51 A.52 A.53 A.54 A.55
## 1 0.00 0 0.00 0.00 0 0 0.00 0.000 0 0.778 0.000 0.000 3.756
## 2 0.00 0 0.00 0.00 0 0 0.00 0.132 0 0.372 0.180 0.048 5.114
## 3 0.12 0 0.06 0.06 0 0 0.01 0.143 0 0.276 0.184 0.010 9.821
## 4 0.00 0 0.00 0.00 0 0 0.00 0.137 0 0.137 0.000 0.000 3.537
## 5 0.00 0 0.00 0.00 0 0 0.00 0.135 0 0.135 0.000 0.000 3.537
## 6 0.00 0 0.00 0.00 0 0 0.00 0.223 0 0.000 0.000 0.000 3.000
## A.56 A.57 spam
## 1 61 278 spam
## 2 101 1028 spam
## 3 485 2259 spam
## 4 40 191 spam
## 5 40 191 spam
## 6 15 54 spam
dim(DATASET)
## [1] 4601 58
nrow(DATASET)
## [1] 4601
ncol(DATASET)
## [1] 58
colnames(DATASET)
## [1] "A.1" "A.2" "A.3" "A.4" "A.5" "A.6" "A.7" "A.8" "A.9" "A.10"
## [11] "A.11" "A.12" "A.13" "A.14" "A.15" "A.16" "A.17" "A.18" "A.19" "A.20"
## [21] "A.21" "A.22" "A.23" "A.24" "A.25" "A.26" "A.27" "A.28" "A.29" "A.30"
## [31] "A.31" "A.32" "A.33" "A.34" "A.35" "A.36" "A.37" "A.38" "A.39" "A.40"
## [41] "A.41" "A.42" "A.43" "A.44" "A.45" "A.46" "A.47" "A.48" "A.49" "A.50"
## [51] "A.51" "A.52" "A.53" "A.54" "A.55" "A.56" "A.57" "spam"
# Change Column Names
newColNames <- c("word_freq_make", "word_freq_address", "word_freq_all", "word_freq_3d",
"word_freq_our", "word_freq_over", "word_freq_remove", "word_freq_internet",
"word_freq_order", "word_freq_mail", "word_freq_receive", "word_freq_will",
"word_freq_people", "word_freq_report", "word_freq_addresses", "word_freq_free",
"word_freq_business", "word_freq_email", "word_freq_you", "word_freq_credit",
"word_freq_your", "word_freq_font", "word_freq_000", "word_freq_money",
"word_freq_hp", "word_freq_hpl", "word_freq_george", "word_freq_650", "word_freq_lab",
"word_freq_labs", "word_freq_telnet", "word_freq_857", "word_freq_data",
"word_freq_415", "word_freq_85", "word_freq_technology", "word_freq_1999",
"word_freq_parts", "word_freq_pm", "word_freq_direct", "word_freq_cs", "word_freq_meeting",
"word_freq_original", "word_freq_project", "word_freq_re", "word_freq_edu",
"word_freq_table", "word_freq_conference", "char_freq_ch;", "char_freq_ch(",
"char_freq_ch[", "char_freq_ch!", "char_freq_ch$", "char_freq_ch#", "capital_run_length_average",
"capital_run_length_longest", "capital_run_length_total", "spam")
colnames(DATASET) <- newColNames
# mean of each of the 54 frequency predictors within each class
dataset.email <- sapply(DATASET[which(DATASET$spam == "email"), 1:54], mean)
dataset.spam <- sapply(DATASET[which(DATASET$spam == "spam"), 1:54], mean)
dataset.email.order <- data.frame(name=names(dataset.email[order(-dataset.email)[1:10]]),
mean=dataset.email[order(-dataset.email)[1:10]],class=rep("email",10))
dataset.spam.order <- data.frame(name=names(dataset.spam[order(-dataset.spam)[1:10]]),
mean=dataset.spam[order(-dataset.spam)[1:10]],class=rep("spam",10))
dataset.plot <-rbind(dataset.email.order,dataset.spam.order)
ggplot(dataset.plot,aes(x=name, y=mean,fill=class))+
geom_bar(stat="identity",position="dodge")+
theme(axis.text.x=element_text(angle=90,hjust=1))
# training and test sets
set.seed(1423)
index <- 1:nrow(DATASET)
trainIndex <- sample(index, trunc(length(index) * 2/3))
DATASET.train <- DATASET[trainIndex, ]
DATASET.train %>% group_by(spam) %>% summarise(n=n()) %>% ggplot(aes(x=spam,y=n))+geom_bar(stat="identity")
DATASET.test <- DATASET[-trainIndex, ]
DATASET.test %>% group_by(spam) %>% summarise(n=n()) %>% ggplot(aes(x=spam,y=n))+geom_bar(stat="identity")
# classification tree
model.rpart <- rpart(spam ~ ., method = "class", data = DATASET.train)
printcp(model.rpart)
##
## Classification tree:
## rpart(formula = spam ~ ., data = DATASET.train, method = "class")
##
## Variables actually used in tree construction:
## [1] capital_run_length_total char_freq_ch!
## [3] char_freq_ch$ word_freq_hp
## [5] word_freq_remove
##
## Root node error: 1206/3067 = 0.39322
##
## n= 3067
##
## CP nsplit rel error xerror xstd
## 1 0.485075 0 1.00000 1.00000 0.022431
## 2 0.141791 1 0.51493 0.55638 0.018985
## 3 0.057214 2 0.37313 0.47761 0.017935
## 4 0.030680 3 0.31592 0.35738 0.015959
## 5 0.025705 4 0.28524 0.32836 0.015399
## 6 0.010000 5 0.25954 0.28690 0.014528
draw.tree(model.rpart, cex = 0.5, nodeinfo = TRUE, col = gray(0:8/8))
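The cp table above also supports pruning: keep the subtree whose complexity parameter minimizes the cross-validated error (xerror). A sketch, not part of the original workflow:
# pick the cp value with the smallest cross-validated error and prune to it
best_cp <- model.rpart$cptable[which.min(model.rpart$cptable[, "xerror"]), "CP"]
model.pruned <- prune(model.rpart, cp = best_cp)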
# Confusion Matrix
prediction.rpart <- predict(model.rpart, newdata = DATASET.test, type = "class")
table(`Actual Class` = DATASET.test$spam, `Predicted Class` = prediction.rpart)
## Predicted Class
## Actual Class email spam
## email 876 51
## spam 110 497
error.rate.rpart <- sum(DATASET.test$spam != prediction.rpart)/nrow(DATASET.test)
#Accuracy
1-error.rate.rpart
## [1] 0.8950456
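Overall accuracy hides the asymmetry between the two kinds of errors. A quick sketch of per-class rates from the confusion matrix above:
conf.matrix <- table(Actual = DATASET.test$spam, Predicted = prediction.rpart)
# proportion of actual spam caught, and of legitimate email kept
conf.matrix["spam", "spam"] / sum(conf.matrix["spam", ])
conf.matrix["email", "email"] / sum(conf.matrix["email", ])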
How can we quantify what a document is about?
One approach is to count the frequency of the terms that make up the document.
But words such as “the” and “is” occur frequently in almost every document.
We could remove stop words, but this approach is limited, especially for commonly used words whose importance varies from document to document.
The term frequency is the number of times a term \(\text t\) occurs in a document \(\text d\),
\[\text{tf}_\text{t,d}.\]
The inverse document frequency of a term \(\text t\) is,
\[\text{idf}_\text{t}=\log\left(\frac{N}{\text{df}_\text{t}}\right).\]
\(N\) is the total number of documents in a collection (or corpus) of documents, and \(\text{df}_\text{t}\) is the number of documents in a collection that contain the term \(\text t\).
A weight for each term in each document is given by multiplying term frequency and inverse document frequency.
\[\text{tf-idf}_\text{t,d}= \text{tf}_\text{t,d} \times \log\left(\frac{N}{\text{df}_\text{t}}\right).\]
Some properties of tf-idf (see Manning et al.):
highest when the term occurs many times within a small number of documents (thus lending high discriminating power to those documents);
lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
lowest when the term occurs in virtually all documents.
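As a toy check of the formula, suppose a term occurs 3 times in a document and appears in 2 of N = 4 documents (made-up numbers):
# tf-idf by hand with hypothetical counts: tf = 3, N = 4, df = 2
tf_td <- 3; N <- 4; df_t <- 2
tf_td * log(N / df_t)  # 3 * log(2), about 2.08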
library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- austen_books() %>%
unnest_tokens(word, text) %>%
count(book, word, sort = TRUE) %>%
ungroup()
total_words <- book_words %>%
group_by(book) %>%
summarize(total = sum(n))
book_words <- left_join(book_words, total_words)
## Joining, by = "book"
book_words
## # A tibble: 40,379 × 4
## book word n total
## <fctr> <chr> <int> <int>
## 1 Mansfield Park the 6206 160460
## 2 Mansfield Park to 5475 160460
## 3 Mansfield Park and 5438 160460
## 4 Emma to 5239 160996
## 5 Emma the 5201 160996
## 6 Emma and 4896 160996
## 7 Mansfield Park of 4778 160460
## 8 Pride & Prejudice the 4331 122204
## 9 Emma of 4291 160996
## 10 Pride & Prejudice to 4162 122204
## # ... with 40,369 more rows
library(ggplot2)
ggplot(book_words, aes(n/total, fill = book)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~book, ncol = 2, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 896 rows containing non-finite values (stat_bin).
freq_by_rank <- book_words %>%
group_by(book) %>%
mutate(rank = row_number(),
`term frequency` = n/total)
freq_by_rank
## Source: local data frame [40,379 x 6]
## Groups: book [6]
##
## book word n total rank `term frequency`
## <fctr> <chr> <int> <int> <int> <dbl>
## 1 Mansfield Park the 6206 160460 1 0.03867631
## 2 Mansfield Park to 5475 160460 2 0.03412065
## 3 Mansfield Park and 5438 160460 3 0.03389007
## 4 Emma to 5239 160996 1 0.03254118
## 5 Emma the 5201 160996 2 0.03230515
## 6 Emma and 4896 160996 3 0.03041069
## 7 Mansfield Park of 4778 160460 4 0.02977689
## 8 Pride & Prejudice the 4331 122204 1 0.03544074
## 9 Emma of 4291 160996 4 0.02665284
## 10 Pride & Prejudice to 4162 122204 2 0.03405780
## # ... with 40,369 more rows
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = book)) +
geom_line(size = 1.2, alpha = 0.8) +
scale_x_log10() +
scale_y_log10()
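tidytext computes this weighting directly with bind_tf_idf(). A minimal sketch using the book_words counts from above, treating each book as a document:
# add tf, idf, and tf-idf columns and sort by the most distinctive words
book_words %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))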