Today’s class

Topic Modelling - Latent Dirichlet Allocation (LDA)

LDA Assumptions

  • The model assumes that these topics are generated before the documents.

  • Each document in the collection shares the same topics.
  • Each document exhibits the topics in different proportion.
  • Each word in each document belongs to one of the topics.
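These assumptions describe LDA's generative story: topics (word distributions) exist first, each document draws its own topic proportions, and each word is drawn from one topic. A toy base-R sketch of that story; the vocabulary, topic distributions, and Dirichlet parameters below are invented purely for illustration:

```r
set.seed(1)

rdirichlet1 <- function(alpha) {        # one Dirichlet draw via normalized gamma draws
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

vocab  <- c("gene", "dna", "brain", "neuron", "data", "model")
topics <- rbind(                        # topics are fixed before any document is generated
  c(0.40, 0.40, 0.05, 0.05, 0.05, 0.05),   # made-up "genetics" topic
  c(0.05, 0.05, 0.40, 0.40, 0.05, 0.05))   # made-up "neuroscience" topic

# Every document shares the same topics but mixes them in its own proportions
theta <- rdirichlet1(c(1, 1))           # this document's topic proportions

# Each word belongs to one topic: pick a topic, then a word from that topic
doc <- replicate(8, {
  z <- sample(1:2, 1, prob = theta)     # per-word topic assignment
  sample(vocab, 1, prob = topics[z, ])
})
doc
```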

LDA Model

  • The LDA model calculates the conditional probability of a word being generated from a particular topic: the probability of a word given the topic.
  • This is called the per-topic-per-word probability.
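A sketch of extracting these probabilities, assuming the topicmodels and tidytext packages (and the AssociatedPress document-term matrix bundled with topicmodels) are available:

```r
library(topicmodels)
library(tidytext)

data("AssociatedPress")     # a document-term matrix shipped with topicmodels

# Fit a two-topic LDA model (seed fixed for reproducibility)
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))

# "beta" holds the per-topic-per-word probabilities:
# for each topic, a probability distribution over the whole vocabulary
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
```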

Unsupervised Learning

Hierarchical Clustering

  • Clustering techniques are used to find subgroups or clusters in a data set.

  • The idea is to partition the observations into groups so that the observations within a group are similar, and the observations in different groups are different from each other.

  • This requires defining what it means for observations to be similar or different. Similarity is usually measured with Euclidean distance.

  • The term hierarchical refers to the fact that clusters obtained by cutting the dendrogram at a given height are necessarily nested within the clusters obtained by cutting the dendrogram at any greater height.
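A minimal base-R sketch of both ideas above: Euclidean distance as the dissimilarity measure, and the nesting of clusters obtained by cutting the dendrogram at different heights. The toy one-dimensional data here is assumed for illustration:

```r
# Toy data (assumed): two tight pairs and one outlier
x <- c(1, 1.1, 5, 5.1, 20)

d  <- dist(x)                          # Euclidean distance is the default method
hc <- hclust(d, method = "complete")   # complete-linkage hierarchical clustering

# Cutting low gives finer clusters; cutting high gives coarser ones,
# and every low-cut cluster sits inside exactly one high-cut cluster
cutree(hc, h = 2)     # 3 clusters: {1, 1.1}, {5, 5.1}, {20}
cutree(hc, h = 10)    # 2 clusters: {1, 1.1, 5, 5.1}, {20}
```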

## # A tibble: 9 × 3
##     obs          X1         X2
##   <int>       <dbl>      <dbl>
## 1     1 -0.89691455 -0.4521053
## 2     2  0.18484918  0.2598286
## 3     3  1.58784533  2.2497441
## 4     4 -1.13037567 -0.7428261
## 5     5 -0.08025176  0.7566374
## 6     6  0.13242028  0.2829217
## 7     7  0.70795473  1.0552270
## 8     8 -0.23969802  0.2490752
## 9     9  1.98447394  2.1337208
##           1         2         3         4         5         6         7         8
## 2 1.3028226
## 3 3.6220645 2.3823302
## 4 1.1532208 1.7564782 3.9244112
## 5 2.0305221 1.2233848 2.2829559 1.8073032
## 6 2.1967349 1.4616399 2.6069190 1.7330061 0.6210401
## 7 3.0538414 2.0446983 2.0471497 2.7155968 1.0931977 1.0002391
## 8 2.7195583 2.2288690 3.1925172 1.9476950 1.2103898 0.8144211 1.2562080
## 9 4.7460913 3.5829700 2.2268867 4.4851071 2.8067419 2.7571601 1.7719305 2.8399718

How to interpret a dendrogram

(Reference: James et al., An Introduction to Statistical Learning)

  • Each leaf represents one of the nine observations.
  • As we move up the tree, some leaves fuse into branches. These correspond to observations that are similar to each other.
  • In fact, this statement can be made precise: for any two observations, we can look for the point in the tree where the branches containing those two observations are first fused. The height of this fusion, as measured on the vertical axis, indicates how different the two observations are.
  • Thus, observations that fuse at the very bottom of the tree are quite similar to each other, whereas observations that fuse close to the top of the tree will tend to be quite different.
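The "height of first fusion" for a pair of observations has a standard name: the cophenetic distance. A small base-R check on toy data (the data is assumed for illustration):

```r
x  <- c(1, 1.1, 5, 5.1)
hc <- hclust(dist(x), method = "complete")

# cophenetic() returns, for every pair of observations, the height at
# which the pair first ends up in the same branch of the dendrogram
cophenetic(hc)
```

For this data, the two tight pairs fuse near the bottom (height 0.1), while any cross-pair observation fuses only at the top (height 4.1, the largest gap under complete linkage).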

library(gutenbergr)
library(tidytext)
library(dplyr)

hgwells <- gutenberg_download(c(35, 36, 5230, 159, 34962, 1743))

book_words <- hgwells %>%
  unnest_tokens(word, text) %>%
  count(gutenberg_id, word, sort = TRUE) %>%
  ungroup()

total_words <- book_words %>% group_by(gutenberg_id) %>% summarize(total = sum(n))

book_words <- left_join(book_words, total_words, by = "gutenberg_id")
book_words
## # A tibble: 38,306 × 4
##    gutenberg_id  word     n total
##           <int> <chr> <int> <int>
## 1            36   the  4793 60513
## 2          1743   the  3696 73932
## 3          5230   the  3323 49168
## 4          1743   and  3145 73932
## 5           159   the  3034 44281
## 6         34962   the  2887 48079
## 7            36   and  2504 60513
## 8            36    of  2301 60513
## 9            35   the  2260 32653
## 10         1743    of  2219 73932
## # ... with 38,296 more rows
book_titles <- gutenberg_metadata %>%
  filter(gutenberg_id %in% c(35, 36, 5230, 159, 34962, 1743)) %>%
  select(gutenberg_id, title)

book_words <- left_join(book_words, book_titles, by = "gutenberg_id")

book_words <- book_words %>% bind_tf_idf(word, title, n)



book_words_high <- book_words %>%
        select(-total) %>%
        arrange(desc(tf_idf)) %>% slice(1:50)

dtm_wells <- book_words_high %>% cast_dtm(title, word, n)

wells_matrix <- as.matrix(dtm_wells)

d <- dist(scale(wells_matrix))

groups <- hclust(d, method = "complete")

plot(groups, cex = 0.5, xlab = "Books")

Questions