LDA (Latent Dirichlet Allocation) is one of the “simplest” topic models.
A topic is defined as a distribution over words: words associated with that topic have high probability. For example, a genetics topic assigns high probability to words about genetics.
The model assumes that these topics are generated before the documents.
Each word in each document belongs to one of the topics.
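The notes do not include code for LDA itself, but a minimal sketch with the topicmodels package (using its bundled AssociatedPress document-term matrix, and a subset of documents to keep it fast) looks like this:

library(topicmodels)
library(tidytext)

data("AssociatedPress", package = "topicmodels")

# Fit a two-topic LDA model; the seed makes the fit reproducible
ap_lda <- LDA(AssociatedPress[1:100, ], k = 2, control = list(seed = 1234))

# beta = per-topic word probabilities: each topic is a distribution over words
tidy(ap_lda, matrix = "beta")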
In supervised learning we have a set of features and a response. The goal is to predict the response based on the features.
In unsupervised learning we only have a set of features.
The goal is usually to discover interesting structure in the features, such as an informative visualization or meaningful subgroups among the observations.
Unsupervised learning is much more challenging because the task is more subjective: outside of settings like topic modelling, there is usually no simple goal such as predicting a response.
Clustering techniques are used to find subgroups or clusters in a data set.
The idea is to partition the observations into groups so that the observations within a group are similar, and the observations in different groups are different from each other.
This requires defining what it means for two observations to be similar; similarity is usually measured with Euclidean distance.
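For two observations $x_i, x_j \in \mathbb{R}^p$, the Euclidean distance is $d(x_i, x_j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$.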
The term hierarchical refers to the fact that clusters obtained by cutting the dendrogram at a given height are necessarily nested within the clusters obtained by cutting the dendrogram at any greater height. To illustrate, consider nine observations in two dimensions and their matrix of pairwise Euclidean distances:
## # A tibble: 9 × 3
##     obs          X1         X2
##   <int>       <dbl>      <dbl>
## 1     1 -0.89691455 -0.4521053
## 2     2  0.18484918  0.2598286
## 3     3  1.58784533  2.2497441
## 4     4 -1.13037567 -0.7428261
## 5     5 -0.08025176  0.7566374
## 6     6  0.13242028  0.2829217
## 7     7  0.70795473  1.0552270
## 8     8 -0.23969802  0.2490752
## 9     9  1.98447394  2.1337208
##           1         2         3         4         5         6         7         8
## 2 1.3028226
## 3 3.6220645 2.3823302
## 4 1.1532208 1.7564782 3.9244112
## 5 2.0305221 1.2233848 2.2829559 1.8073032
## 6 2.1967349 1.4616399 2.6069190 1.7330061 0.6210401
## 7 3.0538414 2.0446983 2.0471497 2.7155968 1.0931977 1.0002391
## 8 2.7195583 2.2288690 3.1925172 1.9476950 1.2103898 0.8144211 1.2562080
## 9 4.7460913 3.5829700 2.2268867 4.4851071 2.8067419 2.7571601 1.7719305 2.8399718
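The code that produced this output is not shown; a minimal reconstruction in base R, using the printed values rounded to four decimals, would be:

x <- data.frame(
  X1 = c(-0.8969, 0.1848, 1.5878, -1.1304, -0.0803, 0.1324, 0.7080, -0.2397, 1.9845),
  X2 = c(-0.4521, 0.2598, 2.2497, -0.7428, 0.7566, 0.2829, 1.0552, 0.2491, 2.1337)
)
dist(x)  # pairwise Euclidean distances; lower triangle printed as above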
(Reference: James et al., An Introduction to Statistical Learning)

- Each leaf represents one of the nine observations.
- As we move up the tree, some leaves fuse into branches. These correspond to observations that are similar to each other.
- In fact, this statement can be made precise: for any two observations, we can look for the point in the tree where the branches containing those two observations are first fused. The height of this fusion, as measured on the vertical axis, indicates how different the two observations are. Thus, observations that fuse at the very bottom of the tree are quite similar to each other, whereas observations that fuse close to the top of the tree will tend to be quite different.
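The notes do not show the code behind this dendrogram; a minimal sketch using the reconstructed data x from above (assuming complete linkage, which is also used later) would be:

# Hierarchical clustering of the nine observations and the dendrogram
hc <- hclust(dist(x), method = "complete")
plot(hc)

# Cutting the tree at two different heights gives nested cluster assignments
cutree(hc, h = 2)
cutree(hc, h = 3)

We now apply the same idea to real text data: clustering books by their characteristic words.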
library(gutenbergr)  # gutenberg_download(), gutenberg_metadata
library(tidytext)    # unnest_tokens(), bind_tf_idf(), cast_tdm()
library(dplyr)
# Download six books from Project Gutenberg and count word frequencies per book
hgwells <- gutenberg_download(c(35, 36, 5230, 159, 34962, 1743))
book_words <- hgwells %>%
  unnest_tokens(word, text) %>%
  count(gutenberg_id, word, sort = TRUE) %>%
  ungroup()
total_words <- book_words %>% group_by(gutenberg_id) %>% summarize(total = sum(n))
book_words <- left_join(book_words, total_words, by = "gutenberg_id")
book_words
## # A tibble: 38,306 × 4
##    gutenberg_id word      n total
##           <int> <chr> <int> <int>
##  1           36 the    4793 60513
##  2         1743 the    3696 73932
##  3         5230 the    3323 49168
##  4         1743 and    3145 73932
##  5          159 the    3034 44281
##  6        34962 the    2887 48079
##  7           36 and    2504 60513
##  8           36 of     2301 60513
##  9           35 the    2260 32653
## 10         1743 of     2219 73932
## # ... with 38,296 more rows
book_titles <- gutenberg_metadata %>%
  filter(gutenberg_id %in% c(35, 36, 5230, 159, 34962, 1743)) %>%
  select(gutenberg_id, title)
book_words <- left_join(book_words, book_titles, by = "gutenberg_id")
book_words <- book_words %>% bind_tf_idf(word, title, n)
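As an illustrative aside (not in the original notes), the highest tf-idf words within each title show which words will drive the clustering:

book_words %>%
  group_by(title) %>%
  slice_max(tf_idf, n = 3) %>%
  ungroup() %>%
  arrange(title, desc(tf_idf))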
# Keep the 50 highest tf-idf (word, book) pairs and cast to a term matrix
book_words_high <- book_words %>%
  select(-total) %>%
  arrange(desc(tf_idf)) %>%
  slice(1:50)
dtm_wells <- book_words_high %>% cast_tdm(title, word, n)  # rows = titles, columns = words
wells_matrix <- as.matrix(dtm_wells)
# Complete-linkage hierarchical clustering on scaled Euclidean distances
d <- dist(scale(wells_matrix))
groups <- hclust(d, method = "complete")
plot(groups, xlab = "Books")
If you choose any height along the y-axis of the dendrogram and move horizontally across the plot at that height, each line you cross represents a cluster that was formed when observations or smaller clusters were joined together.
The observations in that cluster are represented by the branches of the dendrogram that spread out below the line.
For example, if we cut at a height of 690 and move across the plot at that height, we cross two lines. This defines a two-cluster solution;
following each line down through all of its branches, we can see the names of the books included in each of the two clusters.
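The same two-cluster solution can be read off the fitted tree directly; cutree() accepts either a cut height or a desired number of groups:

cutree(groups, h = 690)  # cut at the height used in the example above
cutree(groups, k = 2)    # equivalently, ask for two clusters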
The y-axis represents how far apart two clusters were when they were merged; clusters whose branches are very close together (in terms of the heights at which they were merged) probably aren’t very reliable.
If there is a big gap along the y-axis between the last merge and the current one, the clusters formed are probably doing a good job of showing us the structure of the data.
The dendrogram suggests that there are two distinct groups of books.