Tutorial grades will be assigned according to the following marking scheme.
| Component | Mark |
|---|---|
| Attendance for the entire tutorial | 1 |
| Assigned homework completion | 1 |
| In-class exercises | 4 |
| Total | 6 |
A classification tree was built to predict a dependent variable categorized as “Yes”/“No”. 80% of the observations were used to train the classification tree and the remaining 20% were used to test the resulting model. The prediction accuracy was evaluated using the test set. The confusion matrix is below.
| Predicted \ Actual | Yes | No |
|---|---|---|
| Yes | 100 | 30 |
| No | 10 | 37 |
We are given that N_total * 0.2 = N_test, where N_test is the number of observations in the test set, and N_total is the number of observations in the data set.
```r
N_test <- 100 + 10 + 30 + 37
N_total <- N_test / 0.2
N_total
## [1] 885
N_train <- N_total - N_test
N_train
## [1] 708
```

The overall accuracy is \((100 + 37)/177 = 0.7740113\).

The false-negative rate is \(10/(100 + 10) = 0.0909091\).

The false-positive rate is \(30/(30 + 37) = 0.4477612\).
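As a check, the same three quantities can be computed directly in R. This is a minimal sketch, assuming the confusion matrix is laid out as in the table above (rows are predicted classes, columns are actual classes):

```r
# Confusion matrix: rows = predicted class, columns = actual class
m <- matrix(c(100, 10, 30, 37), nrow = 2,
            dimnames = list(predicted = c("Yes", "No"),
                            actual = c("Yes", "No")))
accuracy <- sum(diag(m)) / sum(m)        # (100 + 37) / 177
fnr <- m["No", "Yes"] / sum(m[, "Yes"])  # FN / (TP + FN) = 10 / 110
fpr <- m["Yes", "No"] / sum(m[, "No"])   # FP / (FP + TN) = 30 / 67
c(accuracy = accuracy, fnr = fnr, fpr = fpr)
```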
No. The curve would only have one point. In order to draw an ROC curve we would need confusion matrices, and hence true-positive and false-positive rates, at many different cut-points.
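For illustration only: if the predicted probabilities for the test set were available, the ROCR package (used later in this tutorial) would sweep over the cut-points for us. Here `probs` and `labels` are hypothetical vectors holding the predicted probability of "Yes" and the true class for each test observation:

```r
library(ROCR)

# probs: hypothetical vector of predicted P(Yes) for each test observation
# labels: hypothetical vector of the true "Yes"/"No" classes
pred <- ROCR::prediction(predictions = probs, labels = labels)
perf <- ROCR::performance(pred, "tpr", "fpr")  # TPR and FPR at every cut-point
plot(perf)  # the ROC curve
```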
[Adapted from Modern Data Science with R] The nasaweather package contains data about tropical storms from 1995-2005. There are four types of storms: Extratropical, Hurricane, Tropical Depression, and Tropical Storm. Consider the scatterplot of wind speed against pressure for these storms shown below.
```r
library(mdsr)
library(nasaweather)

ggplot(data = nasaweather::storms, aes(x = pressure, y = wind, color = type)) +
  geom_point(alpha = 0.5)
```

Figure 1: Plot of storm type as a function of wind speed and pressure.
Divide the data into an 80% training set and a 20% test set. Fit a classification tree for storm type as a function of wind speed (wind) and pressure, and plot it. Note: use nasaweather::storms to access the data set for this question (there is another data set called storms in the dplyr library).

```r
library(rpart)
library(partykit)  # for as.party(); attaches grid, which provides gpar()

set.seed(1)
n <- nrow(nasaweather::storms)  # number of observations in the storms data
n
## [1] 2747
# random sample of 20% of the row indexes
test_idx <- sample.int(n, size = round(0.2 * n))
# training data is all observations except the test row indexes
train <- nasaweather::storms[-test_idx, ]
nrow(train)
## [1] 2198
# test data
test <- nasaweather::storms[test_idx, ]
nrow(test)
## [1] 549
rtree_fit <- rpart(type ~ wind + pressure, train)
plot(as.party(rtree_fit), type = "simple", gp = gpar(cex = 0.8))
```

To draw the tree's partition on the scatterplot in ggplot, you can use the following functions:
```r
geom_vline(xintercept)          # draws a vertical line
geom_hline(yintercept)          # draws a horizontal line
geom_segment(x, y, xend, yend)  # (x, y) and (xend, yend) are the two endpoints
geom_text(x, y, label)          # prints text at the specified (x, y) position
```

```r
ggplot(data = train, aes(x = pressure, y = wind, color = type)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 62.5, col = "black", lwd = 2) +
  geom_hline(yintercept = 32.5, col = "black", lwd = 2) +
  geom_segment(x = 985.5, y = 32.5, xend = 985.5, yend = 62.5,
               col = "black", lwd = 2) +
  geom_text(x = 925, y = 75, label = "Hurricane", col = "black") +
  geom_text(x = 925, y = 20, label = "Tropical Depression", col = "black") +
  geom_text(x = 925, y = 50, label = "Extratropical", col = "black") +
  geom_text(x = 1000, y = 50, label = "Tropical Storm", col = "black")
```

The classification tree evaluated the prediction error of all possible splits of the variables wind and pressure using the Gini splitting criterion. Storms with wind speed greater than 62.5 knots are classified as Hurricanes, while those with wind speed lower than 32.5 knots are classified as Tropical Depressions. For storms with wind speeds between 32.5 and 62.5 knots, those with pressure lower than 985.5 millibars are classified as Extratropical and those with pressure greater than 985.5 millibars as Tropical Storms.
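To see the partition in action, we can classify a single hypothetical storm with the fitted tree. The wind and pressure values below are made up for illustration:

```r
# A made-up storm: wind between 32.5 and 62.5 knots, pressure above 985.5 mb
new_storm <- data.frame(wind = 50, pressure = 1000)
predict(rtree_fit, newdata = new_storm, type = "class")
# Based on the splits described above, this observation should fall in the
# "Tropical Storm" terminal node.
```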
```r
pred_tree <- predict(object = rtree_fit, newdata = test, type = "class")
m <- table(pred_tree, test$type)
m
##                      
## pred_tree             Extratropical Hurricane Tropical Depression
##   Extratropical                  20         0                   0
##   Hurricane                       1       184                   0
##   Tropical Depression            21         0                 100
##   Tropical Storm                 37         0                   0
##                      
## pred_tree             Tropical Storm
##   Extratropical                   10
##   Hurricane                       11
##   Tropical Depression              0
##   Tropical Storm                 165
```

The accuracy of our classification tree is \((20 + 184 + 100 + 165)/549 = 0.854\) (note that there are 549 observations in our test data set). Our model predicts that 196 storms are hurricanes, but only 184 of these are actually hurricanes. In other words, \(184/196 \times 100 = 93.9\%\) of storms categorized as hurricanes are actually hurricanes. However, only \(20/30 \times 100 = 66.67\%\) of storms categorized as extratropical storms are actually extratropical storms.
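The per-class figures quoted above are simply the diagonal of the confusion matrix divided by its row sums. A quick sketch, assuming the matrix m computed above:

```r
# Fraction of each predicted class that is actually that class
round(diag(m) / rowSums(m), 4)
# e.g. Hurricane: 184/196 = 0.9388, Extratropical: 20/30 = 0.6667
```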
```r
rtree_fit2 <- rpart(type ~ wind + pressure + seasday + lat + long + year + month + hour,
                    nasaweather::storms)
plot(as.party(rtree_fit2), type = "simple", gp = gpar(cex = 0.8))
pred_tree2 <- predict(object = rtree_fit2, newdata = test, type = "class")
table(pred_tree2, test$type)
##                      
## pred_tree2            Extratropical Hurricane Tropical Depression
##   Extratropical                  60         0                   7
##   Hurricane                       1       184                   0
##   Tropical Depression             7         0                  93
##   Tropical Storm                 11         0                   0
##                      
## pred_tree2            Tropical Storm
##   Extratropical                   14
##   Hurricane                       11
##   Tropical Depression              0
##   Tropical Storm                 161
sum(diag(table(pred_tree2, test$type))) / nrow(test)
## [1] 0.9071038
```

The accuracy of this tree is \((60 + 184 + 93 + 161)/549 = 0.907\). As in the tree from part (a), the wind speed variable (wind) is used to make the first two splits (at 62.5 knots and 32.5 knots, respectively). However, in the tree from (a), storms with wind speeds between 32.5 and 62.5 knots were divided into two terminal nodes based on the value of pressure, while in this tree, splitting on the latitude and longitude variables leads to purer terminal nodes. In other words, geographical location is a better predictor of storm type than pressure.
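To back up the claim about geographic location, the variable-importance scores that rpart accumulates over all splits can be inspected directly (assuming the rtree_fit2 object fitted above):

```r
# Importance scores summed over all splits; larger values = more influential
rtree_fit2$variable.importance
```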
The NHANES data set contains several demographic, medical, and physical variables for a sample of 10,000 individuals. One of the variables in this dataset is Smoke100, and you will build a classification tree to predict which individuals have smoked at least 100 cigarettes in their lifetime.
Remove the observations with missing values of Smoke100. What proportion of individuals have smoked at least 100 cigarettes in their lifetime?

```r
library(NHANES)
library(dplyr)

data <- NHANES %>% filter(!is.na(Smoke100))
sum(data$Smoke100 == "Yes") / nrow(data)
## [1] 0.4438148
```

Approximately 44% of individuals in the NHANES dataset have smoked at least 100 cigarettes in their lifetime.
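Equivalently, the full breakdown of the Smoke100 responses can be read off with prop.table, using the same data object:

```r
# Proportions of "No" and "Yes" responses to Smoke100
prop.table(table(data$Smoke100))
```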
Divide the data into an 80% training set and a 20% test set, and fit a classification tree to predict Smoke100 using the Education and Age variables only. Using cut-points of 0.50 and 0.44, compute the true-positive rate, true-negative rate, false-positive rate, false-negative rate, and accuracy on the test set. Which values change?

```r
set.seed(1)
n <- nrow(data)  # number of observations in the full data set
n
## [1] 7235
# random sample of 20% of the row indexes
test_idx <- sample.int(n, size = round(0.2 * n))
# training data is all observations except the test row indexes
train <- data[-test_idx, ]
nrow(train)
## [1] 5788
# test data
test <- data[test_idx, ]
nrow(test)
## [1] 1447
rtree_fit <- rpart(Smoke100 ~ Education + Age, train)
plot(as.party(rtree_fit), type = "simple", gp = gpar(cex = 0.8))
predicted_tree <- predict(object = rtree_fit, newdata = test, type = "prob")
m50 <- table(predicted_tree[, 2] >= 0.50, test$Smoke100)
m44 <- table(predicted_tree[, 2] >= 0.44, test$Smoke100)

# Calculations for cut-point of 0.50
tp50 <- m50[2, 2] / sum(m50[2, ])
tn50 <- m50[1, 1] / sum(m50[1, ])
fpr50 <- 1 - tn50
fnr50 <- 1 - tp50
accuracy50 <- sum(diag(m50)) / sum(m50)
c(tp50, tn50, fpr50, fnr50, accuracy50)
## [1] 0.5431894 0.6000000 0.4000000 0.4568106 0.5763649

# Calculations for cut-point of 0.44
tp44 <- m44[2, 2] / sum(m44[2, ])
tn44 <- m44[1, 1] / sum(m44[1, ])
fpr44 <- 1 - tn44
fnr44 <- 1 - tp44
accuracy44 <- sum(diag(m44)) / sum(m44)
c(tp44, tn44, fpr44, fnr44, accuracy44)
## [1] 0.5127932 0.6385069 0.3614931 0.4872068 0.5570145
```

All of the values change: moving the cut-point changes how the predicted probabilities are converted into class labels, which alters the confusion matrix and therefore every rate derived from it (the TPR, TNR, FPR, FNR, and accuracy).
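The same comparison can be wrapped in a small helper that evaluates an arbitrary cut-point. Here cutpoint_metrics is a hypothetical name, and the sketch assumes the predicted_tree and test objects from above (and that both TRUE and FALSE occur at the chosen cut-point, so the 2x2 table is complete):

```r
# Hypothetical helper: test-set metrics at an arbitrary cut-point
cutpoint_metrics <- function(cut) {
  m <- table(predicted_tree[, 2] >= cut, test$Smoke100)
  tp <- m[2, 2] / sum(m[2, ])
  tn <- m[1, 1] / sum(m[1, ])
  c(cutpoint = cut, tp = tp, tn = tn,
    fpr = 1 - tn, fnr = 1 - tp,
    accuracy = sum(diag(m)) / sum(m))
}
sapply(c(0.50, 0.44), cutpoint_metrics)  # reproduces the two result vectors above
```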
Fit another classification tree to predict Smoke100, but use all the variables in the NHANES training data (apart from SmokeNow and Smoke100n). Plot the ROC curves for both classification trees predicting Smoke100 on a single plot (see sample code below). How do the ROC curves help you determine which model is more accurate? Hint: the R syntax for using all the variables in the data frame not otherwise in the formula is to use . in place of the independent variables.
Some starter code for this question is below.

```r
train <- train %>% select(-one_of("SmokeNow", "Smoke100n"))
tree_full <- rpart(Smoke100 ~ ., data = train, parms = list(split = "gini"))
# Add your code here ...
plot_dat <- cbind(rbind(perf_df_full, perf_df),
                  model = c(rep("All Vars", nrow(perf_df_full)),
                            rep("Two Vars", nrow(perf_df))))
ggplot(data = plot_dat, aes(x = fpr, y = tpr, colour = model)) +
  geom_line() + geom_abline(intercept = 0, slope = 1, lty = 3) +
  ylab(perf@y.name) +
  xlab(perf@x.name)
```

First, compute the ROC data for the two-variable tree from the previous part:

```r
library(ROCR)

# true-positive and false-positive rates at every cut-point;
# the labels are the true Smoke100 classes in the test set
pred <- ROCR::prediction(predictions = predicted_tree[, 2], test$Smoke100)
perf <- ROCR::performance(pred, 'tpr', 'fpr')
perf_df <- data.frame(perf@x.values, perf@y.values)
names(perf_df) <- c("fpr", "tpr")
```
Next, fit the tree using all remaining variables, compute its ROC data, and plot both curves together:

```r
train <- train %>% select(-one_of("SmokeNow", "Smoke100n"))
tree_full <- rpart(Smoke100 ~ ., train, parms = list(split = "gini"))
predicted_tree_full <- predict(object = tree_full, newdata = test, type = "prob")
confusion_matrix <- table(predicted_tree_full[, 2] >= 0.5, test$Smoke100)
row.names(confusion_matrix) <- c("No", "Yes")
confusion_matrix
##      
##        No Yes
##   No  594 246
##   Yes 188 419
# Accuracy
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# ROC data for the full tree
pred <- ROCR::prediction(predictions = predicted_tree_full[, 2], test$Smoke100)
perf <- ROCR::performance(pred, 'tpr', 'fpr')
perf_df_full <- data.frame(perf@x.values, perf@y.values)
names(perf_df_full) <- c("fpr", "tpr")

# Plot both ROC curves on one set of axes
plot_dat <- cbind(rbind(perf_df_full, perf_df),
                  model = c(rep("All Vars", nrow(perf_df_full)),
                            rep("Two Vars", nrow(perf_df))))
ggplot(data = plot_dat, aes(x = fpr, y = tpr, colour = model)) +
  geom_line() + geom_abline(intercept = 0, slope = 1, lty = 3) +
  ylab(perf@y.name) +
  xlab(perf@x.name)
```

The tree classifier using all the variables in the data set performs much better than the classifier using only Education and Age as predictors. The ROC curve for the model with all variables lies above the curve for the two-variable model, meaning it achieves a higher true-positive rate at every false-positive rate; it is therefore the more accurate model overall.
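The visual comparison can be made quantitative with the area under each ROC curve (AUC). A short sketch, rebuilding the prediction objects from the probabilities computed above:

```r
# AUC for each model: 1.0 is perfect, 0.5 is no better than chance
pred_two  <- ROCR::prediction(predicted_tree[, 2], test$Smoke100)
pred_full <- ROCR::prediction(predicted_tree_full[, 2], test$Smoke100)
auc_two  <- ROCR::performance(pred_two,  measure = "auc")@y.values[[1]]
auc_full <- ROCR::performance(pred_full, measure = "auc")@y.values[[1]]
c(two_vars = auc_two, all_vars = auc_full)
```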