By the end of today: use the unnest_tokens() and count() functions to count frequencies of words in text.

"Everything can be taken from a man but one thing: the last of the human freedoms—to choose one's attitude in any given set of circumstances, to choose one's own way." Viktor Frankl, Man's Search for Meaning
In order to convert raw text into data, it's convenient to break the text into some pre-defined unit such as words or phrases.
A token is a meaningful unit of text, such as a word, that we are interested in using for analysis. Tokenization is the process of splitting text into tokens.
A text is ready for analysis if it has one token per row. A token is often a single word, but it could also be a sentence or a paragraph.
R or other software packages can be used to break text into tokens.
All students should try this example.
# Uncomment this line if the packages are not installed or install via the Packages tab
# install.packages(c("dplyr","tidytext"))
library(dplyr)
library(tidytext)
vfquote <- c("Everything can be taken from a man but one thing: the last of the human freedoms—to choose one’s attitude in any given set of circumstances, to choose one’s own way.")
vf_df <- tibble(vfquote)  # tibble() replaces the deprecated data_frame()
unnest_tokens(vf_df, output = word, input = vfquote)
## # A tibble: 31 × 1
## word
## <chr>
## 1 everything
## 2 can
## 3 be
## 4 taken
## 5 from
## 6 a
## 7 man
## 8 but
## 9 one
## 10 thing
## # ... with 21 more rows
Stop words are words that are not useful for an analysis, typically extremely common words such as "the", "of", "to", and so forth in English.
The tidytext library has a dataset of stop words called stop_words.
library(tidytext)
stop_words
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
For more information on these stop words, look at the help menu or type ?stop_words.
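A common use of stop_words is to remove stop words after tokenizing, via dplyr's anti_join(), which keeps only the rows of its first argument that have no match in the second. A minimal sketch using part of the Frankl quote from above:

```r
library(dplyr)
library(tidytext)

vf_df <- tibble(vfquote = "Everything can be taken from a man but one thing: the last of the human freedoms")

# anti_join() drops every token that also appears in stop_words$word
vf_nostop <- vf_df %>%
  unnest_tokens(word, vfquote) %>%
  anti_join(stop_words, by = "word")
vf_nostop
```

Joining by the word column removes common words like "the" and "of", leaving content words such as "freedoms". This same pattern appears in the clustering example later in these notes.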
The SnowballC library has a function, wordStem(), that stems words.
library(SnowballC)
wordStem(c("win", "winning", "winner"))
## [1] "win" "win" "winner"
More details on the stemming algorithm can be found using the help: ?wordStem.
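The stemmer can also be applied inside a tidy pipeline with mutate(), which is the pattern used in the clustering example below; a small sketch:

```r
library(dplyr)
library(SnowballC)

words_df <- tibble(word = c("win", "winning", "winner"))

# wordStem() is vectorized, so it can stem a whole column at once
stemmed <- words_df %>% mutate(stem = wordStem(word))
stemmed
```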
If the text is in tidy text format and we want to count the number of words, then we use the count() function in the dplyr library.
library(dplyr)
library(tidytext)
vfquote <- c("Everything can be taken from a man but one thing: the last of the human freedoms—to choose one’s attitude in any given set of circumstances, to choose one’s own way.")
vf_df <- tibble(vfquote)  # tibble() replaces the deprecated data_frame()
unnest_tokens(vf_df, output = word, input = vfquote)
## # A tibble: 31 × 1
## word
## <chr>
## 1 everything
## 2 can
## 3 be
## 4 taken
## 5 from
## 6 a
## 7 man
## 8 but
## 9 one
## 10 thing
## # ... with 21 more rows
unnest_tokens(vf_df, output = word, input = vfquote) %>% count(word, sort = TRUE)
## # A tibble: 26 × 2
## word n
## <chr> <int>
## 1 choose 2
## 2 of 2
## 3 one’s 2
## 4 the 2
## 5 to 2
## 6 a 1
## 7 any 1
## 8 attitude 1
## 9 be 1
## 10 but 1
## # ... with 16 more rows
This code takes the data produced by unnest_tokens(vf_df, output = word, input = vfquote) and pipes it into the count() function. The pipe operator %>% comes from the magrittr package and is made available when the dplyr library is loaded.
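To see that the piped and nested forms are equivalent, here is a small sketch comparing the two on the same data:

```r
library(dplyr)
library(tidytext)

vf_df <- tibble(vfquote = "to choose one's attitude, to choose one's own way")

# Nested form: count() wraps the result of unnest_tokens()
nested <- count(unnest_tokens(vf_df, output = word, input = vfquote),
                word, sort = TRUE)

# Piped form: the same steps, reading left to right
piped <- vf_df %>%
  unnest_tokens(output = word, input = vfquote) %>%
  count(word, sort = TRUE)

identical(nested, piped)
```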
Consider the following six books by H. G. Wells, available from Project Gutenberg:
The Time Machine
The War of the Worlds
The Island of Doctor Moreau
Twelve Stories and a Dream
The Invisible Man: A Grotesque Romance
Boon, The Mind of the Race, The Wild Asses of the Devil, and The Last Trump; Being a First Selection from the Literary Remains of George Boon, Appropriate to the Times
Do some of these books have more in common with each other compared to other books?
Hierarchical cluster analysis may be able to give us some insight into this question.
library(gutenbergr)
library(dplyr)
library(tidytext)
library(stringr)
library(quanteda)
library(tm)
library(SnowballC)
library(ggdendro)
# instead could look at all Wells books
# x <- gutenberg_metadata %>% filter(author=="Wells, H. G. (Herbert George)") %>% select(gutenberg_id)
#
hgwells <- gutenberg_download(c(35, 36, 5230, 159, 34962, 1743))
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
# str_extract() is from the library stringr and extracts matching patterns;
# it takes regular expressions
td_wells <- tidy_hgwells %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  mutate(word = wordStem(word)) %>%
  group_by(gutenberg_id, word) %>%
  summarise(n = n())
dtm_wells <- td_wells %>% cast_tdm(gutenberg_id, word, n)
wells_matrix <- as.matrix(dtm_wells)
book_titles <- gutenberg_metadata %>%
  filter(gutenberg_id %in% c(35, 36, 159, 1743, 5230, 34962)) %>%
  select(gutenberg_id, title)
# match titles to matrix rows by id rather than assuming the row orders agree
row.names(wells_matrix) <- book_titles$title[match(rownames(wells_matrix),
                                                   book_titles$gutenberg_id)]
d <- dist(wells_matrix)
groups <- hclust(d)
plot(groups, cex = 0.5, xlab = "Books")
Question: What does it mean for two books to be similar?
If you choose any height along the y-axis of the dendrogram and move across the dendrogram counting the number of lines that you cross, each line represents a group that was identified when objects were joined together into clusters.
The observations in that group are represented by the branches of the dendrogram that spread out below the line.
For example, if we look at a height of 690 and move across the plot at that height, we cross two lines. This defines a two-cluster solution.
Following each line down through all its branches, we can see the names of the books that are included in these two clusters.
The y-axis represents the distance at which observations or clusters were merged. Clusters whose branches are very close together (in terms of the heights at which they were merged) probably aren't very reliable.
If there's a big difference along the y-axis between the last merged cluster and the currently merged one, that indicates that the clusters formed are probably doing a good job of showing us the structure of the data.
The dendrogram shows that there seem to be two distinct groups.
Do these groups make sense? What is the interpretation of these groups?
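To extract cluster memberships programmatically instead of reading them off the dendrogram, base R's cutree() cuts the tree at a chosen number of clusters. A minimal sketch on a small built-in dataset (USArrests), since the Wells matrix requires downloading the books:

```r
# hclust() on the distances between the first six states in USArrests
d <- dist(USArrests[1:6, ])
groups <- hclust(d)

# cutree() returns the cluster label (1 or 2) for each observation
cutree(groups, k = 2)
```

The same call, cutree(groups, k = 2) on the Wells hclust object, would give the two-cluster assignment of the six books.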
See the regular expression cheat sheet (Ref: Peng and Matsui, 2017).
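As a quick illustration of the pattern used in the clustering code, str_extract(word, "[a-z']+") keeps the first run of lowercase letters and apostrophes in each string, dropping digits and other characters:

```r
library(stringr)

# "[a-z']+" matches one or more lowercase letters or apostrophes;
# str_extract() returns the first such match in each string
str_extract(c("don't", "chapter1", "_end_"), "[a-z']+")
```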
Some text from the Bob Dylan song, “Gotta Serve Somebody”
bd_text <- c("You may be an ambassador to England or France
You may like to gamble, you might like to dance
You may be the heavyweight champion of the world
You may be a socialite with a long string of pearls
But you're gonna have to serve somebody, yes
Indeed you're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You might be a rock 'n' roll addict prancing on the stage
You might have drugs at your command, women in a cage
You may be a business man or some high-degree thief
They may call you doctor or they may call you chief
But you're gonna have to serve somebody, yes you are
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You may be a state trooper, you might be a young Turk
You may be the head of some big TV network
You may be rich or poor, you may be blind or lame
You may be living in another country under another name
But you're gonna have to serve somebody, yes you are
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You may be a construction worker working on a home
You may be living in a mansion or you might live in a dome
You might own guns and you might even own tanks
You might be somebody's landlord, you might even own banks
But you're gonna have to serve somebody, yes
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You may be a preacher with your spiritual pride
You may be a city councilman taking bribes on the side
You may be workin' in a barbershop, you may know how to cut hair
You may be somebody's mistress, may be somebody's heir
But you're gonna have to serve somebody, yes
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
Might like to wear cotton, might like to wear silk
Might like to drink whiskey, might like to drink milk
You might like to eat caviar, you might like to eat bread
You may be sleeping on the floor, sleeping in a king-sized bed
But you're gonna have to serve somebody, yes
Indeed you're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You may call me Terry, you may call me Timmy
You may call me Bobby, you may call me Zimmy
You may call me R.J., you may call me Ray
You may call me anything but no matter what you say
Still, you're gonna have to serve somebody, yes
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody")
bd_text
## [1] "You may be an ambassador to England or France\nYou may like to gamble, you might like to dance\nYou may be the heavyweight champion of the world\nYou may be a socialite with a long string of pearls\nBut you're gonna have to serve somebody, yes\nIndeed you're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou might be a rock 'n' roll addict prancing on the stage\nYou might have drugs at your command, women in a cage\nYou may be a business man or some high-degree thief\nThey may call you doctor or they may call you chief\nBut you're gonna have to serve somebody, yes you are\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou may be a state trooper, you might be a young Turk\nYou may be the head of some big TV network\nYou may be rich or poor, you may be blind or lame\nYou may be living in another country under another name\nBut you're gonna have to serve somebody, yes you are\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou may be a construction worker working on a home\nYou may be living in a mansion or you might live in a dome\nYou might own guns and you might even own tanks\nYou might be somebody's landlord, you might even own banks\nBut you're gonna have to serve somebody, yes\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou may be a preacher with your spiritual pride\nYou may be a city councilman taking bribes on the side\nYou may be workin' in a barbershop, you may know how to cut hair\nYou may be somebody's mistress, may be somebody's heir\nBut you're gonna have to serve somebody, yes\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nMight like to wear cotton, might like to wear silk\nMight like to drink whiskey, might like to drink milk\nYou might like to eat caviar, you might like to eat bread\nYou may be sleeping on the floor, sleeping in a king-sized bed\nBut you're gonna have to serve somebody, yes\nIndeed you're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou may call me Terry, you may call me Timmy\nYou may call me Bobby, you may call me Zimmy\nYou may call me R.J., you may call me Ray\nYou may call me anything but no matter what you say\nStill, you're gonna have to serve somebody, yes\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody"
Now let's turn it into a tidy text dataset. First, put the text into a data frame.
library(dplyr)
bd_text_df <- tibble(line = 1, text = bd_text)  # tibble() replaces the deprecated data_frame()
bd_text_df
## # A tibble: 1 × 2
## line
## <dbl>
## 1 1
## # ... with 1 more variables: text <chr>
A token is a meaningful unit of text, such as a word, that we are interested in using for analysis. Tokenization is the process of splitting text into tokens.
A text is ready for analysis if it has one token per row. A token is often a single word, but it could also be a sentence or a paragraph.
The tidytext package provides functionality to convert text to a one-token-per-row format.
We can use unnest_tokens() to break the text into tokens.
library(tidytext)
bd_text_df %>% unnest_tokens(word, text)
## # A tibble: 538 × 2
## line word
## <dbl> <chr>
## 1 1 you
## 2 1 may
## 3 1 be
## 4 1 an
## 5 1 ambassador
## 6 1 to
## 7 1 england
## 8 1 or
## 9 1 france
## 10 1 you
## # ... with 528 more rows
We can use the count() function in the dplyr library to find the most common words in the text.
bd_text_df %>% unnest_tokens(word, text) %>% count(word, sort = TRUE)
## # A tibble: 136 × 2
## word n
## <chr> <int>
## 1 may 42
## 2 you 41
## 3 be 34
## 4 to 31
## 5 have 22
## 6 gonna 21
## 7 serve 21
## 8 somebody 21
## 9 you're 21
## 10 the 20
## # ... with 126 more rows
We can create a plot of these frequencies using the ggplot2 library.
library(ggplot2)
bd_text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()