Tutorial grades will be assigned according to the following marking scheme.
Component | Mark
---|---
Attendance for the entire tutorial | 1
Assigned homework completion | 1
In-class exercises | 4
Total | 6
A classification tree was built to predict a dependent variable categorized as “Yes”/“No”. 80% of the observations were used to train the classification tree and the remaining 20% were used to test the resulting model. The prediction accuracy was evaluated using the test set. The confusion matrix is below.
Predicted \ Actual | Yes | No
---|---|---
Yes | 100 | 30
No | 10 | 37
We are given that \(N_{\text{test}} = 0.2 \times N_{\text{total}}\), where \(N_{\text{test}}\) is the number of observations in the test set and \(N_{\text{total}}\) is the number of observations in the full data set.
N_test <- 100 + 10 + 30 + 37
N_total <- N_test / 0.2
N_total
## [1] 885
N_train <- N_total - N_test
N_train
## [1] 708
The overall accuracy is \((100 + 37)/177 \approx 0.774\).
The false-negative rate is \(10/(100 + 10) \approx 0.091\).
The false-positive rate is \(30/(30 + 37) \approx 0.448\).
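The three quantities above can be checked directly in base R by re-entering the confusion matrix (the object name `cm` is just for illustration):

```r
# Confusion matrix from above: rows = predicted, columns = actual
cm <- matrix(c(100, 30,
               10, 37),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("Yes", "No"),
                             actual = c("Yes", "No")))

accuracy <- sum(diag(cm)) / sum(cm)        # (100 + 37) / 177
fnr <- cm["No", "Yes"] / sum(cm[, "Yes"])  # 10 / (100 + 10)
fpr <- cm["Yes", "No"] / sum(cm[, "No"])   # 30 / (30 + 37)
round(c(accuracy = accuracy, fnr = fnr, fpr = fpr), 4)
```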
No. The curve would only have one point. In order to draw an ROC curve we would need confusion matrices at a range of different cut-points.
[Adapted from Modern Data Science with R] The `nasaweather` package contains data about tropical storms from 1995 to 2005. There are four types of storms: Extratropical, Hurricane, Tropical Depression, and Tropical Storm. Consider the scatterplot of the wind speed and pressure of these storms shown below.
library(mdsr)
library(nasaweather)
ggplot(data = nasaweather::storms, aes(x = pressure, y = wind, color = type)) +
geom_point(alpha = 0.5)
Build a classification tree to predict the type of storm using the wind speed (`wind`) and `pressure`, and plot it. Note: use `nasaweather::storms` to access the data set for this question (there is another data set called `storms` in the `dplyr` library).
set.seed(1)
n <- nrow(nasaweather::storms) # number of observations in storms data
n
## [1] 2747
# random sample of 20% of row indexes
test_idx <- sample.int(n, size = round(0.2 * n))
# training data is all observations except those at the test row indexes
train <- nasaweather::storms[-test_idx, ]
nrow(train)
## [1] 2198
# test data
test <- nasaweather::storms[test_idx, ]
nrow(test)
## [1] 549
library(rpart)     # for rpart()
library(partykit)  # for as.party()
rtree_fit <- rpart(type ~ wind + pressure, data = train)
plot(as.party(rtree_fit), type="simple", gp=gpar(cex=0.8))
To draw the partition boundaries on the scatterplot with `ggplot`, you can use the following functions:
geom_vline(xintercept) # draws a vertical line
geom_hline(yintercept) # draws a horizontal line
geom_segment(x, y, xend, yend) # where (x, y) are the coordinates of one endpoint and (xend, yend) are the coordinates of the other endpoint
geom_text(x, y, label) # prints text at the specified (x, y) position
ggplot(data = train, aes(x = pressure, y = wind, color = type)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 62.5, col="black", lwd=2) +
geom_hline(yintercept = 32.5, col="black", lwd=2) +
geom_segment(x=985.5, y=32.5, xend=985.5, yend=62.5,
col="black", lwd=2) +
geom_text(x=925, y=75, label="Hurricane", col="black") +
geom_text(x=925, y=20, label="Tropical Depression", col="black") +
geom_text(x=925, y=50, label="Extratropical", col="black") +
geom_text(x=1000, y=50, label="Tropical Storm", col="black")
The classification tree evaluated the prediction error of all possible splits of the variables `wind` and `pressure` using the Gini splitting criterion. Storms with wind speed greater than 62.5 knots are classified as Hurricanes, while those with wind speed below 32.5 knots are classified as Tropical Depressions. For storms with wind speeds between 32.5 and 62.5 knots, those with pressure below 985.5 millibars are classified as Extratropical and those with pressure above 985.5 millibars as Tropical Storms.
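The decision rules just described can also be written out as a plain R function — a hand-coded restatement of the fitted splits for illustration, not a substitute for `predict()` (the function name `classify_storm` is made up):

```r
# Hand-coded version of the fitted splits described above
classify_storm <- function(wind, pressure) {
  if (wind >= 62.5) "Hurricane"
  else if (wind < 32.5) "Tropical Depression"
  else if (pressure < 985.5) "Extratropical"
  else "Tropical Storm"
}

classify_storm(wind = 80, pressure = 950)   # "Hurricane"
classify_storm(wind = 45, pressure = 1000)  # "Tropical Storm"
```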
pred_tree <- predict(object=rtree_fit,newdata=test,type="class")
m <- table(pred_tree, test$type)
m
##
## pred_tree Extratropical Hurricane Tropical Depression
## Extratropical 20 0 0
## Hurricane 1 184 0
## Tropical Depression 21 0 100
## Tropical Storm 37 0 0
##
## pred_tree Tropical Storm
## Extratropical 10
## Hurricane 11
## Tropical Depression 0
## Tropical Storm 165
The accuracy of our classification tree is \((20 + 184 + 100 + 165)/549 = 0.854\) (note that there are 549 observations in our `test` data set). Our model predicts that 196 storms are hurricanes, but only 184 of these actually are. In other words, \(184/196 \times 100 = 93.9\%\) of storms categorized as hurricanes are actually hurricanes. However, only \(20/30 \times 100 = 66.67\%\) of storms categorized as extratropical storms actually are.
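These per-class figures can be recomputed in base R by re-entering the printed confusion matrix (rows are predicted classes, columns are actual classes; only the numbers shown above are used):

```r
# Confusion matrix from above: rows = predicted, columns = actual
classes <- c("Extratropical", "Hurricane", "Tropical Depression", "Tropical Storm")
m <- matrix(c(20,   0,   0,  10,
               1, 184,   0,  11,
              21,   0, 100,   0,
              37,   0,   0, 165),
            nrow = 4, byrow = TRUE,
            dimnames = list(predicted = classes, actual = classes))

accuracy <- sum(diag(m)) / sum(m)  # (20 + 184 + 100 + 165) / 549
hurricane_prec <- m["Hurricane", "Hurricane"] / sum(m["Hurricane", ])  # 184 / 196
extratropical_prec <- m["Extratropical", "Extratropical"] / sum(m["Extratropical", ])  # 20 / 30
round(c(accuracy, hurricane_prec, extratropical_prec), 4)
```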
rtree_fit2 <- rpart(type ~ wind+pressure+seasday+lat+long+year+month+hour, nasaweather::storms)
plot(as.party(rtree_fit2), type="simple", gp=gpar(cex=0.8))
pred_tree2 <- predict(object=rtree_fit2,newdata=test,type="class")
table(pred_tree2, test$type)
##
## pred_tree2 Extratropical Hurricane Tropical Depression
## Extratropical 60 0 7
## Hurricane 1 184 0
## Tropical Depression 7 0 93
## Tropical Storm 11 0 0
##
## pred_tree2 Tropical Storm
## Extratropical 14
## Hurricane 11
## Tropical Depression 0
## Tropical Storm 161
sum(diag(table(pred_tree2, test$type)))/nrow(test)
## [1] 0.9071038
The accuracy of this tree is \((60 + 184 + 93 + 161)/549 = 0.907\). As in the tree from part (a), the wind speed variable (`wind`) is used to make the first two splits (at 62.5 and 32.5 knots, respectively). However, in the tree from (a), storms with wind speeds between 32.5 and 62.5 knots were divided into two terminal nodes based on the value of `pressure`, while in this tree, splitting on the latitude and longitude variables leads to purer terminal nodes. In other words, geographical location is a better predictor of storm type than pressure.
The `NHANES` data set contains several demographic, medical, and physical variables for a sample of 10,000 individuals. One of the variables in this data set is `Smoke100`, and you will build a classification tree to predict which individuals have smoked at least 100 cigarettes in their lifetime.
First remove the observations with missing values of `Smoke100`. What proportion of individuals have smoked at least 100 cigarettes in their lifetime?
library(NHANES)  # NHANES data
library(dplyr)   # %>% and filter()
data <- NHANES %>% filter(!is.na(Smoke100))
sum(data$Smoke100=="Yes") / nrow(data)
## [1] 0.4438148
Approximately 44% of individuals in the NHANES data set have smoked at least 100 cigarettes in their lifetime.
Build a classification tree to predict `Smoke100` using the `Education` and `Age` variables only.
set.seed(1)
n <- nrow(data) # number of observations in full dataset
n
## [1] 7235
# random sample of 20% of row indexes
test_idx <- sample.int(n, size = round(0.2 * n))
# training data is all observations except from training row indexes
train <- data[-test_idx, ]
nrow(train)
## [1] 5788
# test data
test <- data[test_idx, ]
nrow(test)
## [1] 1447
rtree_fit <- rpart(Smoke100 ~ Education + Age, train)
plot(as.party(rtree_fit), type="simple", gp=gpar(cex=0.8))
predicted_tree <- predict(object = rtree_fit, newdata = test,
type = "prob")
m50 <- table(predicted_tree[,2] >= 0.50,test$Smoke100)
m44 <- table(predicted_tree[,2] >= 0.44,test$Smoke100)
# Calculations for cutpoint of 0.50
tp50 <- m50[2,2] / sum(m50[2,])
tn50 <- m50[1,1] / sum(m50[1,])
fpr50 <- 1 - tn50
fnr50 <- 1 - tp50
accuracy50 <- sum(diag(m50))/sum(m50)
c(tp50, tn50, fpr50, fnr50, accuracy50)
## [1] 0.5431894 0.6000000 0.4000000 0.4568106 0.5763649
# Calculations for cutpoint of 0.44
tp44 <- m44[2,2] / sum(m44[2,])
tn44 <- m44[1,1] / sum(m44[1,])
fpr44 <- 1 - tn44
fnr44 <- 1 - tp44
accuracy44 <- sum(diag(m44))/sum(m44)
c(tp44, tn44, fpr44, fnr44, accuracy44)
## [1] 0.5127932 0.6385069 0.3614931 0.4872068 0.5570145
All of the reported values change. The fitted tree and its predicted probabilities are unchanged; the cut-point only changes how the predicted probabilities are converted into class labels, which alters the TPR, TNR, FPR, FNR, and accuracy.
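The effect of moving the cut-point can be seen with a small toy example (the probabilities and labels below are made up for illustration): lowering the threshold turns more probabilities into "Yes" predictions, which here raises the TPR at the cost of a higher FPR.

```r
# Hypothetical predicted probabilities and true labels (illustration only)
p     <- c(0.10, 0.30, 0.45, 0.47, 0.55, 0.70, 0.85, 0.95)
truth <- c("No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes")

rates <- function(cut) {
  pred <- ifelse(p >= cut, "Yes", "No")
  tpr <- sum(pred == "Yes" & truth == "Yes") / sum(truth == "Yes")
  fpr <- sum(pred == "Yes" & truth == "No")  / sum(truth == "No")
  c(tpr = tpr, fpr = fpr)
}
rbind(cut_0.50 = rates(0.50), cut_0.44 = rates(0.44))
```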
Build another classification tree predicting `Smoke100`, but use all the variables in the NHANES training data (apart from `SmokeNow` and `Smoke100n`). Plot the ROC curves for both classification trees predicting `Smoke100` on a single plot (see sample code below). How do the ROC curves help you determine which model is more accurate? Hint: the R syntax for using all the other variables in the data frame is to put . in place of the explanatory variables in the formula (e.g., `Smoke100 ~ .`).
Some starter code for this question is below.
train <- train %>% select(-one_of("SmokeNow", "Smoke100n"))
tree_full <- rpart(Smoke100 ~ ., data = train, parms = list(split = "gini"))
# Add your code here ...
plot_dat <- cbind(rbind(perf_df_full,perf_df), model = c(rep("All Vars",nrow(perf_df_full)),rep("Two Vars",nrow(perf_df))))
ggplot(data = plot_dat, aes(x = fpr, y = tpr, colour = model)) +
geom_line() + geom_abline(intercept = 0, slope = 1, lty = 3) +
ylab(perf@y.name) +
xlab(perf@x.name)
pred <- ROCR::prediction(predictions = predicted_tree[,2], test$Smoke100)
perf <- ROCR::performance(pred, 'tpr', 'fpr')
perf_df <- data.frame(perf@x.values, perf@y.values)
names(perf_df) <- c("fpr", "tpr")
train <- train %>% select(-one_of("SmokeNow", "Smoke100n"))
tree_full <- rpart(Smoke100 ~ ., train, parms=list(split="gini"))
predicted_tree_full <- predict(object = tree_full, newdata = test, type = "prob")
confusion_matrix <- table(predicted_tree_full[,2] >= 0.5,test$Smoke100)
row.names(confusion_matrix) <- c("No","Yes")
confusion_matrix
##
## No Yes
## No 594 246
## Yes 188 419
# Accuracy
accuracy <- sum(diag(confusion_matrix))/sum(confusion_matrix)
pred <- ROCR::prediction(predictions = predicted_tree_full[,2], test$Smoke100)
perf <- ROCR::performance(pred, 'tpr', 'fpr')
perf_df_full <- data.frame(perf@x.values, perf@y.values)
names(perf_df_full) <- c("fpr", "tpr")
plot_dat <- cbind(rbind(perf_df_full,perf_df), model = c(rep("All Vars",nrow(perf_df_full)),rep("Two Vars",nrow(perf_df))))
ggplot(data = plot_dat, aes(x = fpr, y = tpr, colour = model)) +
geom_line() + geom_abline(intercept = 0, slope = 1, lty = 3) +
ylab(perf@y.name) +
xlab(perf@x.name)
The tree classifier using all the variables in the data set performs much better than the classifier using only `Education` and `Age` as predictors. The plot shows that the ROC curve for the model with all variables lies above the ROC curve for the model with two variables, indicating that the model with all variables achieves a higher true-positive rate at every false-positive rate, i.e., it is more accurate overall.
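For intuition, the (fpr, tpr) points that ROCR computes can be reproduced in base R by sweeping a cut-point over the predicted probabilities. The sketch below uses made-up scores, and `roc_points` is a hypothetical helper written for this illustration, not part of ROCR:

```r
# Compute (fpr, tpr) at every distinct cut-point for a vector of scores
roc_points <- function(scores, labels, positive = "Yes") {
  cuts <- sort(unique(scores), decreasing = TRUE)
  t(sapply(cuts, function(cut) {
    pred <- scores >= cut
    c(fpr = sum(pred & labels != positive) / sum(labels != positive),
      tpr = sum(pred & labels == positive) / sum(labels == positive))
  }))
}

# Toy scores and labels (illustration only)
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
labels <- c("Yes", "Yes", "No", "Yes", "No", "No")
roc_points(scores, labels)
```

Joining these points in order of decreasing cut-point traces out the ROC curve from (0, 0) toward (1, 1), which is exactly what the `geom_line()` call in the plotting code above draws.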