By the end of today: use the unnest_tokens() and count() functions to count frequencies of words in text.

"Everything can be taken from a man but one thing: the last of the human freedoms—to choose one's attitude in any given set of circumstances, to choose one's own way." Viktor Frankl, Man's Search for Meaning
In order to convert raw text into data, it's convenient to break the text into some pre-defined unit such as words or phrases.
A token is a meaningful unit of text, such as a word, that we are interested in using for analysis. Tokenization is the process of splitting text into tokens.
A text is ready for analysis if it has one token per row. A token is often a single word, but it could also be a sentence or a paragraph.
R or other software packages can be used to break text into tokens.
All students should try this example.
# Uncomment this line if the packages are not installed or install via the Packages tab
# install.packages(c("dplyr","tidytext"))
library(dplyr)
library(tidytext)
vfquote <- c("Everything can be taken from a man but one thing: the last of the human freedoms—to choose one’s attitude in any given set of circumstances, to choose one’s own way.")
vf_df <- tibble(vfquote)  # tibble() replaces the deprecated data_frame()
unnest_tokens(vf_df, output = word, input = vfquote)
## # A tibble: 31 × 1
## word
## <chr>
## 1 everything
## 2 can
## 3 be
## 4 taken
## 5 from
## 6 a
## 7 man
## 8 but
## 9 one
## 10 thing
## # ... with 21 more rows
Stop words are words that are not useful for an analysis, typically extremely common words such as "the", "of", "to", and so forth in English.
The tidytext library has a dataset of stop words called stop_words.
library(tidytext)
stop_words
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
For more information on these stop words, look at the help menu or type ?stop_words.
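A common use of stop_words is to remove stop words after tokenizing, via dplyr's anti_join(), which keeps only the rows of its first argument that have no match in the second. A minimal sketch using part of the Frankl quote from above:

```r
library(dplyr)
library(tidytext)

vf_df <- tibble(vfquote = "Everything can be taken from a man but one thing: the last of the human freedoms")

# anti_join() drops every token that also appears in stop_words$word
vf_nostop <- vf_df %>%
  unnest_tokens(word, vfquote) %>%
  anti_join(stop_words, by = "word")
vf_nostop
```

Joining by the word column removes common words like "the" and "of", leaving content words such as "freedoms". This same pattern appears in the clustering example later in these notes.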
The SnowballC library has a function, wordStem(), that stems words.
library(SnowballC)
wordStem(c("win", "winning", "winner"))
## [1] "win" "win" "winner"
More details on the stemming algorithm can be found using the help: ?wordStem.
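The stemmer can also be applied inside a tidy pipeline with mutate(), which is the pattern used in the clustering example below; a small sketch:

```r
library(dplyr)
library(SnowballC)

words_df <- tibble(word = c("win", "winning", "winner"))

# wordStem() is vectorized, so it can stem a whole column at once
stemmed <- words_df %>% mutate(stem = wordStem(word))
stemmed
```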
If the text is in tidy text format and we want to count the number of words, then we use the count() function in the dplyr library.
library(dplyr)
library(tidytext)
vfquote <- c("Everything can be taken from a man but one thing: the last of the human freedoms—to choose one’s attitude in any given set of circumstances, to choose one’s own way.")
vf_df <- tibble(vfquote)  # tibble() replaces the deprecated data_frame()
unnest_tokens(vf_df, output = word, input = vfquote)
## # A tibble: 31 × 1
## word
## <chr>
## 1 everything
## 2 can
## 3 be
## 4 taken
## 5 from
## 6 a
## 7 man
## 8 but
## 9 one
## 10 thing
## # ... with 21 more rows
unnest_tokens(vf_df, output = word, input = vfquote) %>% count(word, sort = TRUE)
## # A tibble: 26 × 2
## word n
## <chr> <int>
## 1 choose 2
## 2 of 2
## 3 one’s 2
## 4 the 2
## 5 to 2
## 6 a 1
## 7 any 1
## 8 attitude 1
## 9 be 1
## 10 but 1
## # ... with 16 more rows
This code takes the data produced by unnest_tokens(vf_df, output = word, input = vfquote) and pipes it into the count() function. The pipe operator %>% comes from the magrittr package and is made available when the dplyr library is loaded.
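To see that the piped and nested forms are equivalent, here is a small sketch comparing the two on the same data:

```r
library(dplyr)
library(tidytext)

vf_df <- tibble(vfquote = "to choose one's attitude, to choose one's own way")

# Nested form: count() wraps the result of unnest_tokens()
nested <- count(unnest_tokens(vf_df, output = word, input = vfquote),
                word, sort = TRUE)

# Piped form: the same steps, reading left to right
piped <- vf_df %>%
  unnest_tokens(output = word, input = vfquote) %>%
  count(word, sort = TRUE)

identical(nested, piped)
```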
Consider the following six books by H. G. Wells, available from Project Gutenberg:
The Time Machine
The War of the Worlds
The Island of Doctor Moreau
Twelve Stories and a Dream
The Invisible Man: A Grotesque Romance
Boon, The Mind of the Race, The Wild Asses of the Devil, and The Last Trump; Being a First Selection from the Literary Remains of George Boon, Appropriate to the Times
Do some of these books have more in common with each other compared to other books?
Hierarchical cluster analysis may be able to give us some insight into this question.
library(gutenbergr)
library(dplyr)
library(tidytext)
library(stringr)
library(quanteda)
library(tm)
library(SnowballC)
library(ggdendro)
# instead could look at all Wells books
# x <- gutenberg_metadata %>% filter(author=="Wells, H. G. (Herbert George)") %>% select(gutenberg_id)
#
hgwells <- gutenberg_download(c(35, 36, 5230, 159, 34962, 1743))
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
# str_extract() is from the library stringr and extracts matching patterns;
# it takes regular expressions
td_wells <- tidy_hgwells %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  mutate(word = wordStem(word)) %>%
  group_by(gutenberg_id, word) %>%
  summarise(n = n())
dtm_wells <- td_wells %>% cast_tdm(gutenberg_id, word, n)
wells_matrix <- as.matrix(dtm_wells)
book_titles <- gutenberg_metadata %>%
  filter(gutenberg_id %in% c(35, 36, 159, 1743, 5230, 34962)) %>%
  select(gutenberg_id, title)
# match titles to matrix rows by id rather than assuming the row orders agree
row.names(wells_matrix) <- book_titles$title[match(rownames(wells_matrix),
                                                   book_titles$gutenberg_id)]
d <- dist(wells_matrix)
groups <- hclust(d)
plot(groups, cex = 0.5, xlab = "Books")
Question: What does it mean for two books to be similar?
If you choose any height along the y-axis of the dendrogram and move across the dendrogram counting the number of lines that you cross, each line represents a group that was identified when objects were joined together into clusters.
The observations in that group are represented by the branches of the dendrogram that spread out below the line.
For example, if we look at a height of 690 and move across the plot at that height, we cross two lines. This defines a two-cluster solution.
Following each line down through all its branches, we can see the names of the books that are included in these two clusters.
The y-axis represents the distance at which observations or clusters were merged. Clusters whose branches are very close together (in terms of the heights at which they were merged) probably aren't very reliable.
If there's a big difference along the y-axis between the last merged cluster and the currently merged one, that indicates that the clusters formed are probably doing a good job of showing us the structure of the data.
The dendrogram shows that there seem to be two distinct groups.
Do these groups make sense? What is the interpretation of these groups?
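To extract cluster memberships programmatically instead of reading them off the dendrogram, base R's cutree() cuts the tree at a chosen number of clusters. A minimal sketch on a small built-in dataset (USArrests), since the Wells matrix requires downloading the books:

```r
# hclust() on the distances between the first six states in USArrests
d <- dist(USArrests[1:6, ])
groups <- hclust(d)

# cutree() returns the cluster label (1 or 2) for each observation
cutree(groups, k = 2)
```

The same call, cutree(groups, k = 2) on the Wells hclust object, would give the two-cluster assignment of the six books.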
See the regular expression cheat sheet (Ref: Peng and Matsui, 2017).
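As a quick illustration of the pattern used in the clustering code, str_extract(word, "[a-z']+") keeps the first run of lowercase letters and apostrophes in each string, dropping digits and other characters:

```r
library(stringr)

# "[a-z']+" matches one or more lowercase letters or apostrophes;
# str_extract() returns the first such match in each string
str_extract(c("don't", "chapter1", "_end_"), "[a-z']+")
```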
Some text from the Bob Dylan song, “Gotta Serve Somebody”
bd_text <- c("You may be an ambassador to England or France
You may like to gamble, you might like to dance
You may be the heavyweight champion of the world
You may be a socialite with a long string of pearls
But you're gonna have to serve somebody, yes
Indeed you're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You might be a rock 'n' roll addict prancing on the stage
You might have drugs at your command, women in a cage
You may be a business man or some high-degree thief
They may call you doctor or they may call you chief
But you're gonna have to serve somebody, yes you are
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You may be a state trooper, you might be a young Turk
You may be the head of some big TV network
You may be rich or poor, you may be blind or lame
You may be living in another country under another name
But you're gonna have to serve somebody, yes you are
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You may be a construction worker working on a home
You may be living in a mansion or you might live in a dome
You might own guns and you might even own tanks
You might be somebody's landlord, you might even own banks
But you're gonna have to serve somebody, yes
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You may be a preacher with your spiritual pride
You may be a city councilman taking bribes on the side
You may be workin' in a barbershop, you may know how to cut hair
You may be somebody's mistress, may be somebody's heir
But you're gonna have to serve somebody, yes
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
Might like to wear cotton, might like to wear silk
Might like to drink whiskey, might like to drink milk
You might like to eat caviar, you might like to eat bread
You may be sleeping on the floor, sleeping in a king-sized bed
But you're gonna have to serve somebody, yes
Indeed you're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody
You may call me Terry, you may call me Timmy
You may call me Bobby, you may call me Zimmy
You may call me R.J., you may call me Ray
You may call me anything but no matter what you say
Still, you're gonna have to serve somebody, yes
You're gonna have to serve somebody
Well, it may be the devil or it may be the Lord
But you're gonna have to serve somebody")
bd_text
## [1] "You may be an ambassador to England or France\nYou may like to gamble, you might like to dance\nYou may be the heavyweight champion of the world\nYou may be a socialite with a long string of pearls\nBut you're gonna have to serve somebody, yes\nIndeed you're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou might be a rock 'n' roll addict prancing on the stage\nYou might have drugs at your command, women in a cage\nYou may be a business man or some high-degree thief\nThey may call you doctor or they may call you chief\nBut you're gonna have to serve somebody, yes you are\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou may be a state trooper, you might be a young Turk\nYou may be the head of some big TV network\nYou may be rich or poor, you may be blind or lame\nYou may be living in another country under another name\nBut you're gonna have to serve somebody, yes you are\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou may be a construction worker working on a home\nYou may be living in a mansion or you might live in a dome\nYou might own guns and you might even own tanks\nYou might be somebody's landlord, you might even own banks\nBut you're gonna have to serve somebody, yes\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou may be a preacher with your spiritual pride\nYou may be a city councilman taking bribes on the side\nYou may be workin' in a barbershop, you may know how to cut hair\nYou may be somebody's mistress, may be somebody's heir\nBut you're gonna have to serve somebody, yes\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nMight like to wear cotton, might like to wear silk\nMight like to drink whiskey, might like to drink milk\nYou might like to eat caviar, you might like to eat bread\nYou may be sleeping on the floor, sleeping in a king-sized bed\nBut you're gonna have to serve somebody, yes\nIndeed you're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody\nYou may call me Terry, you may call me Timmy\nYou may call me Bobby, you may call me Zimmy\nYou may call me R.J., you may call me Ray\nYou may call me anything but no matter what you say\nStill, you're gonna have to serve somebody, yes\nYou're gonna have to serve somebody\nWell, it may be the devil or it may be the Lord\nBut you're gonna have to serve somebody"
Now let's turn it into a tidy text dataset. First, put the text into a data frame.
library(dplyr)
bd_text_df <- tibble(line = 1, text = bd_text)  # tibble() replaces the deprecated data_frame()
bd_text_df
## # A tibble: 1 × 2
## line
## <dbl>
## 1 1
## # ... with 1 more variables: text <chr>
A token is a meaningful unit of text, such as a word, that we are interested in using for analysis. Tokenization is the process of splitting text into tokens.
A text is ready for analysis if it has one token per row. A token is often a single word, but it could also be a sentence or a paragraph.
The tidytext package provides functionality to convert text to a one-token-per-row format.
We can use unnest_tokens() to break the text into tokens.
library(tidytext)
bd_text_df %>% unnest_tokens(word, text)
## # A tibble: 538 × 2
## line word
## <dbl> <chr>
## 1 1 you
## 2 1 may
## 3 1 be
## 4 1 an
## 5 1 ambassador
## 6 1 to
## 7 1 england
## 8 1 or
## 9 1 france
## 10 1 you
## # ... with 528 more rows
We can use the count() function in the dplyr library to find the most common words in the text.
bd_text_df %>% unnest_tokens(word, text) %>% count(word, sort = TRUE)
## # A tibble: 136 × 2
## word n
## <chr> <int>
## 1 may 42
## 2 you 41
## 3 be 34
## 4 to 31
## 5 have 22
## 6 gonna 21
## 7 serve 21
## 8 somebody 21
## 9 you're 21
## 10 the 20
## # ... with 126 more rows
We can create a plot of these frequencies using the ggplot2 library.
library(ggplot2)
bd_text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()