Today’s class

Topic Modelling - Latent Dirichlet Allocation (LDA)

LDA Assumptions

  • The model assumes that these topics are generated before the documents.

  • Each document in the collection shares the same topics.
  • Each document exhibits the topics in different proportion.
  • Each word in each document belongs to one of the topics.
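These assumptions describe LDA's generative story: topics (word distributions) exist first, each document draws its own topic proportions, and each word is drawn from one topic. A toy base-R sketch of that story; the vocabulary, topic distributions, and Dirichlet parameters below are invented purely for illustration:

```r
set.seed(1)

rdirichlet1 <- function(alpha) {        # one Dirichlet draw via normalized gamma draws
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

vocab  <- c("gene", "dna", "brain", "neuron", "data", "model")
topics <- rbind(                        # topics are fixed before any document is generated
  c(0.40, 0.40, 0.05, 0.05, 0.05, 0.05),   # made-up "genetics" topic
  c(0.05, 0.05, 0.40, 0.40, 0.05, 0.05))   # made-up "neuroscience" topic

# Every document shares the same topics but mixes them in its own proportions
theta <- rdirichlet1(c(1, 1))           # this document's topic proportions

# Each word belongs to one topic: pick a topic, then a word from that topic
doc <- replicate(8, {
  z <- sample(1:2, 1, prob = theta)     # per-word topic assignment
  sample(vocab, 1, prob = topics[z, ])
})
doc
```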

LDA Model

  • The LDA model calculates the conditional probability of a word being generated from a particular topic: the probability of a word given the topic.
  • This is called the per-topic-per-word probability.
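A sketch of extracting these probabilities, assuming the topicmodels and tidytext packages (and the AssociatedPress document-term matrix bundled with topicmodels) are available:

```r
library(topicmodels)
library(tidytext)

data("AssociatedPress")     # a document-term matrix shipped with topicmodels

# Fit a two-topic LDA model (seed fixed for reproducibility)
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

# "beta" holds the per-topic-per-word probabilities:
# for each topic, a probability distribution over the whole vocabulary
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
```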

Unsupervised Learning

Hierarchical Clustering

  • Clustering techniques are used to find subgroups or clusters in a data set.

  • The idea is to partition the observations into groups so that the observations within a group are similar, and the observations in different groups are different from each other.

  • This requires defining what it means for observations to be similar or different. Similarity is usually measured with Euclidean distance.

  • The term hierarchical refers to the fact that clusters obtained by cutting the dendrogram at a given height are necessarily nested within the clusters obtained by cutting the dendrogram at any greater height.
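A minimal base-R sketch of both ideas above: Euclidean distance as the dissimilarity measure, and the nesting of clusters obtained by cutting the dendrogram at different heights. The toy one-dimensional data here is assumed for illustration:

```r
# Toy data (assumed): two tight pairs and one outlier
x <- c(1, 1.1, 5, 5.1, 20)

d  <- dist(x)                          # Euclidean distance is the default method
hc <- hclust(d, method = "complete")   # complete-linkage hierarchical clustering

# Cutting low gives finer clusters; cutting high gives coarser ones,
# and every low-cut cluster sits inside exactly one high-cut cluster
cutree(hc, h = 2)     # 3 clusters: {1, 1.1}, {5, 5.1}, {20}
cutree(hc, h = 10)    # 2 clusters: {1, 1.1, 5, 5.1}, {20}
```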

## # A tibble: 9 × 3
##     obs          X1         X2
##   <int>       <dbl>      <dbl>
## 1     1 -0.89691455 -0.4521053
## 2     2  0.18484918  0.2598286
## 3     3  1.58784533  2.2497441
## 4     4 -1.13037567 -0.7428261
## 5     5 -0.08025176  0.7566374
## 6     6  0.13242028  0.2829217
## 7     7  0.70795473  1.0552270
## 8     8 -0.23969802  0.2490752
## 9     9  1.98447394  2.1337208
##           1         2         3         4         5         6         7         8
## 2 1.3028226
## 3 3.6220645 2.3823302
## 4 1.1532208 1.7564782 3.9244112
## 5 2.0305221 1.2233848 2.2829559 1.8073032
## 6 2.1967349 1.4616399 2.6069190 1.7330061 0.6210401
## 7 3.0538414 2.0446983 2.0471497 2.7155968 1.0931977 1.0002391
## 8 2.7195583 2.2288690 3.1925172 1.9476950 1.2103898 0.8144211 1.2562080
## 9 4.7460913 3.5829700 2.2268867 4.4851071 2.8067419 2.7571601 1.7719305 2.8399718

How to interpret a dendrogram

(Reference: James et al., An Introduction to Statistical Learning)

  • Each leaf represents one of the nine observations.
  • As we move up the tree, some leaves fuse into branches. These correspond to observations that are similar to each other.
  • In fact, this statement can be made precise: for any two observations, we can look for the point in the tree where the branches containing those two observations are first fused. The height of this fusion, as measured on the vertical axis, indicates how different the two observations are.
  • Thus, observations that fuse at the very bottom of the tree are quite similar to each other, whereas observations that fuse close to the top of the tree will tend to be quite different.
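The "height of first fusion" for a pair of observations has a standard name: the cophenetic distance. A small base-R check on toy data (the data is assumed for illustration):

```r
x  <- c(1, 1.1, 5, 5.1)
hc <- hclust(dist(x), method = "complete")

# cophenetic() returns, for every pair of observations, the height at
# which the pair first ends up in the same branch of the dendrogram
cophenetic(hc)
```

For this data, the two tight pairs fuse near the bottom (height 0.1), while any cross-pair observation fuses only at the top (height 4.1, the largest gap under complete linkage).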

library(gutenbergr)
library(tidytext)
library(dplyr)

hgwells <- gutenberg_download(c(35, 36, 5230, 159, 34962, 1743))

book_words <- hgwells %>%
  unnest_tokens(word, text) %>%
  count(gutenberg_id, word, sort = TRUE) %>%
  ungroup()

total_words <- book_words %>% group_by(gutenberg_id) %>% summarize(total = sum(n))

book_words <- left_join(book_words, total_words, by = "gutenberg_id")
book_words
## # A tibble: 38,306 × 4
##    gutenberg_id  word     n total
##           <int> <chr> <int> <int>
## 1            36   the  4793 60513
## 2          1743   the  3696 73932
## 3          5230   the  3323 49168
## 4          1743   and  3145 73932
## 5           159   the  3034 44281
## 6         34962   the  2887 48079
## 7            36   and  2504 60513
## 8            36    of  2301 60513
## 9            35   the  2260 32653
## 10         1743    of  2219 73932
## # ... with 38,296 more rows
book_titles <- gutenberg_metadata %>%
  filter(gutenberg_id %in% c(35, 36, 5230, 159, 34962, 1743)) %>%
  select(gutenberg_id, title)

book_words <- left_join(book_words, book_titles, by = "gutenberg_id")

book_words <- book_words %>% bind_tf_idf(word, title, n)



book_words_high <- book_words %>%
        select(-total) %>%
        arrange(desc(tf_idf)) %>% slice(1:50)

dtm_wells <- book_words_high %>% cast_dtm(title, word, n)

wells_matrix <- as.matrix(dtm_wells)

d <- dist(scale(wells_matrix))

groups <- hclust(d, method = "complete")

plot(groups, cex = 0.5, xlab = "Books")

Questions