Your answer to question 2 parts (d), (e), (f), (g).
file_url <- "https://raw.githubusercontent.com/ntaback/UofT_STA130/master/week8/Week8PracticeProblems-student.Rmd"
download.file(url = file_url, destfile = "Week8PracticeProblems-student.Rmd")
Look for the file “Week8PracticeProblems-student.Rmd” under the Files tab, then click on it to open it.
Change the subtitle to “Week 8 Practice Problems Solutions” and change the author to your name and student number.
Type your answers below each question. Remember that R code chunks can be inserted directly into the notebook by choosing Insert R from the Insert menu (see Using R Markdown for Class Assignments). In addition, this R Markdown cheatsheet and reference are great resources as you get started with R Markdown.
Tutorial grades will be assigned according to the following marking scheme.
| Component | Mark |
|---|---|
| Attendance for the entire tutorial | 1 |
| Assigned homework completion | 1 |
| In-class exercises | 4 |
| Total | 6 |
A classification tree was built to predict a dependent variable categorized as “Yes” or “No”. 80% of the data set was used to train the classification tree and the remaining 20% was used to test the resulting model. The prediction accuracy was evaluated using the test set. The confusion matrix is below.
| Predicted \ Actual | Yes | No |
|---|---|---|
| Yes | 100 | 30 |
| No | 10 | 37 |
We are given that \(N_{\text{total}} \times 0.2 = N_{\text{test}}\), where \(N_{\text{test}}\) is the number of observations in the test set and \(N_{\text{total}}\) is the number of observations in the data set.
N_test <- 100 + 10 + 30 + 37
N_total <- N_test / 0.2
N_total
## [1] 885
N_train <- N_total - N_test
N_train
## [1] 708
The overall accuracy is (100 + 37)/177 = 0.7740113. The false-negative rate is 10/(100 + 10) = 0.0909091, and the false-positive rate is 30/(30 + 37) = 0.4477612.
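These quantities can be checked directly in R; a minimal sketch, entering the confusion matrix from the table above (rows are predicted classes, columns are actual classes):

```r
# Confusion matrix from the table above: rows = predicted, columns = actual
cm <- matrix(c(100, 10, 30, 37), nrow = 2,
             dimnames = list(predicted = c("Yes", "No"),
                             actual = c("Yes", "No")))
accuracy <- (cm["Yes", "Yes"] + cm["No", "No"]) / sum(cm)  # 137/177
fnr <- cm["No", "Yes"] / sum(cm[, "Yes"])                  # 10/110
fpr <- cm["Yes", "No"] / sum(cm[, "No"])                   # 30/67
c(accuracy = accuracy, fnr = fnr, fpr = fpr)
```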
No. The ROC curve would only have one point. In order to draw an ROC curve we would need confusion matrices at different cut-points.
library(NHANES)
library(tidyverse) # for glimpse(), %>%, and ggplot()
glimpse(NHANES)
Read the description in Exercise 8.1 and answer the following questions.
Use SleepHrsNight and Depressed in the NHANES data to build a classification tree to predict SleepTrouble.
library(NHANES)
library(rpart)
library(partykit)
tree <- rpart(SleepTrouble ~ SleepHrsNight + Depressed, data = NHANES, parms = list(split = "gini"))
tree
## n=7768 (2232 observations deleted due to missingness)
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 7768 1969 No (0.7465242 0.2534758)
## 2) SleepHrsNight>=5.5 6810 1509 No (0.7784141 0.2215859) *
## 3) SleepHrsNight< 5.5 958 460 No (0.5198330 0.4801670)
## 6) Depressed=None 680 278 No (0.5911765 0.4088235) *
## 7) Depressed=Several,Most 278 96 Yes (0.3453237 0.6546763) *
plot(as.party(tree), type = "simple")
The classification tree evaluated the prediction error of all possible splits of the variables SleepHrsNight and Depressed using the Gini splitting criterion. The best splits were: sleeping at least 5.5 hours per night, which resulted in a prediction of no sleep trouble; sleeping less than 5.5 hours per night with no depression, which also resulted in a prediction of no sleep trouble; and sleeping less than 5.5 hours per night while depressed on several or most days, which resulted in a prediction of sleep trouble.
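To see how the fitted tree is used for prediction, here is a minimal sketch for a hypothetical person who sleeps 5 hours per night and is depressed on most days (this observation lands in node 7 above; it assumes Depressed has the levels None/Several/Most as in NHANES):

```r
# Hypothetical new observation: 5 hours of sleep, depressed on most days
new_obs <- data.frame(
  SleepHrsNight = 5,
  Depressed = factor("Most", levels = levels(NHANES$Depressed))
)
predict(tree, newdata = new_obs, type = "prob")
# Node 7 above gives a predicted probability of sleep trouble of about 0.65
```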
NHANES %>%
filter(!is.na(SleepTrouble)) %>%
group_by(SleepTrouble) %>%
summarise(n = n()) %>%
mutate(pct = n/sum(n))
## # A tibble: 2 x 3
## SleepTrouble n pct
## <fct> <int> <dbl>
## 1 No 5799 0.746
## 2 Yes 1973 0.254
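A classifier that ignored the predictors and always predicted No would therefore be correct about 74.6% of the time, so these proportions give a baseline against which to judge the tree's accuracy.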
Which values change and which values are the same for different cut-points? Explain.
library(NHANES)
library(rpart)
library(partykit)
set.seed(364)
n <- nrow(NHANES)
test_idx <- sample.int(n, size = round(0.25 * n))
train <- NHANES[-test_idx, ]
nrow(train)
## [1] 7500
test <- NHANES[test_idx, ] %>% filter(!is.na(SleepTrouble))
nrow(test)
## [1] 1939
tree <- rpart(SleepTrouble ~ SleepHrsNight + Depressed, data = train, parms = list(split = "gini"))
predicted_tree <- predict(object = tree, newdata = test, type = "prob")
# Cut-point using 0.5
# if predicted prob of "Yes" is >= 0.5 then predicted class is "Yes"
# otherwise predicted class is "No"
confusion_matrix <- table(predicted_tree[,2] >= 0.5, test$SleepTrouble)
row.names(confusion_matrix) <- c("No","Yes")
confusion_matrix
##
## No Yes
## No 1425 435
## Yes 24 55
# confusion_matrix[4] = true positives, confusion_matrix[3] = false negatives (column-major indexing)
sensit_50 <- confusion_matrix[4]/(confusion_matrix[4] + confusion_matrix[3])
sensit_50
## [1] 0.1122449
# confusion_matrix[1] = true negatives, confusion_matrix[2] = false positives
specif_50 <- confusion_matrix[1]/(confusion_matrix[1] + confusion_matrix[2])
specif_50
## [1] 0.9834369
fpr_50 <- 1 - specif_50
fpr_50
## [1] 0.01656315
fnr_50 <- 1 - sensit_50
fnr_50
## [1] 0.8877551
accuracy_50 <- (confusion_matrix[1] + confusion_matrix[4])/sum(confusion_matrix)
accuracy_50
## [1] 0.76328
# Cut-point using 0.25
confusion_matrix <- table(predicted_tree[,2] >= 0.25, test$SleepTrouble)
row.names(confusion_matrix) <- c("No","Yes")
confusion_matrix
##
## No Yes
## No 1316 377
## Yes 133 113
sensit <- confusion_matrix[4]/(confusion_matrix[4] + confusion_matrix[3])
sensit
## [1] 0.2306122
specif <- confusion_matrix[1]/(confusion_matrix[1] + confusion_matrix[2])
specif
## [1] 0.9082126
fpr <- 1 - specif
fpr
## [1] 0.09178744
fnr <- 1 - sensit
fnr
## [1] 0.7693878
accuracy <- (confusion_matrix[1] + confusion_matrix[4])/sum(confusion_matrix)
accuracy
## [1] 0.7369778
The sensitivity, specificity, false-positive rate, false-negative rate, and accuracy all change, since the cut-point changes how the predicted probabilities are converted into predicted classes. The predicted probabilities themselves are the same for every cut-point.
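To see this concretely, here is a minimal sketch (reusing predicted_tree and test from above) that recomputes the metrics at several cut-points from the same fixed predicted probabilities:

```r
# Only the classification rule changes; the predicted probabilities are fixed
for (cut in c(0.25, 0.5, 0.75)) {
  pred_yes <- predicted_tree[, 2] >= cut
  # keep both rows even if nothing is classified "Yes" at a high cut-point
  cm <- table(predicted = factor(pred_yes, levels = c(FALSE, TRUE)),
              actual = test$SleepTrouble)
  cat("cut-point", cut,
      "accuracy", round((cm[1, 1] + cm[2, 2]) / sum(cm), 3),
      "FPR", round(cm[2, 1] / sum(cm[, 1]), 3),
      "FNR", round(cm[1, 2] / sum(cm[, 2]), 3), "\n")
}
```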
The overall accuracy in predicting trouble sleeping using only number of hours slept and depression is 0.76 (at the 0.5 cut-point). If a person has trouble sleeping, the model predicts this correctly with probability 0.11 (sensitivity), and if a person does not have trouble sleeping, the model predicts this correctly with probability 0.98 (specificity). The model is therefore much more accurate at predicting no sleep trouble than at predicting sleep trouble, which is reflected in the high false-negative rate of 0.89.
predicted_tree <- predict(object = tree, newdata = test, type = "prob")
pred <- ROCR::prediction(predictions = predicted_tree[,2], test$SleepTrouble)
perf <- ROCR::performance(pred, 'tpr', 'fpr')
perf_df <- data.frame(perf@x.values, perf@y.values)
names(perf_df) <- c("fpr", "tpr")
roc <- ggplot(data = perf_df, aes(x = fpr, y = tpr)) +
geom_line(color = "blue") + geom_abline(intercept = 0, slope = 1, lty = 3) + ylab(perf@y.name) + xlab(perf@x.name)
roc
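The ROC curve can also be summarized by the area under it (AUC); a short sketch using the ROCR prediction object pred created above:

```r
# AUC: the probability that a randomly chosen "Yes" case receives a higher
# predicted probability than a randomly chosen "No" case
ROCR::performance(pred, measure = "auc")@y.values[[1]]
```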
Build a classification tree to predict SleepTrouble, but use all the variables in the NHANES training data. Plot the ROC curves for the Tree Classifier you developed in part (d) and the Tree Classifier you developed using all the variables on a single plot (see sample code below). What is the accuracy of the Tree Classifier using all the variables? Interpret the plot. Can you say which Tree Classifier has higher accuracy?

NB: The R syntax for using all the variables in the data frame not otherwise in the formula is to use . in place of the independent variables.
tree_full <- rpart(SleepTrouble ~ ., data = train, parms = list(split = "gini"))
predicted_tree_full <- predict(object = tree_full, newdata = test, type = "prob")
confusion_matrix <- table(predicted_tree_full[,2] >= 0.5, test$SleepTrouble)
row.names(confusion_matrix) <- c("No","Yes")
confusion_matrix
##
## No Yes
## No 1395 421
## Yes 54 69
#Accuracy
(confusion_matrix[1] + confusion_matrix[4])/sum(confusion_matrix)
## [1] 0.7550284
# Use this code
pred <- ROCR::prediction(predictions = predicted_tree_full[,2], test$SleepTrouble)
perf <- ROCR::performance(pred, 'tpr', 'fpr')
perf_df_full <- data.frame(perf@x.values, perf@y.values)
names(perf_df_full) <- c("fpr", "tpr")
# label each ROC curve by the model that produced it
plot_dat <- cbind(rbind(perf_df_full, perf_df),
                  model = c(rep("All Vars", nrow(perf_df_full)), rep("Two Vars", nrow(perf_df))))
ggplot(data = plot_dat, aes(x = fpr, y = tpr, colour = model)) +
geom_line() + geom_abline(intercept = 0, slope = 1, lty = 3) +
ylab(perf@y.name) +
xlab(perf@x.name)
The tree classifier using all the variables in the data has very similar performance. The accuracy of both models using a cut-point of 0.5 is about 0.76 (0.763 for the two-variable tree versus 0.755 for the full tree). The plot shows that the ROC curve for the model with all the variables lies above the ROC curve for the model with two variables, indicating a higher true-positive rate at each false-positive rate, although the difference does not appear to be large. Based on accuracy at the 0.5 cut-point alone, we cannot say that either tree classifier is clearly more accurate.
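Since accuracy at a single cut-point does not settle the comparison, one way to quantify the difference is to compare the AUCs of the two models; a sketch reusing the predicted probabilities computed above:

```r
# AUC for each model on the same test set
pred_two  <- ROCR::prediction(predicted_tree[, 2], test$SleepTrouble)
pred_full <- ROCR::prediction(predicted_tree_full[, 2], test$SleepTrouble)
c(two_vars = ROCR::performance(pred_two, "auc")@y.values[[1]],
  all_vars = ROCR::performance(pred_full, "auc")@y.values[[1]])
```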
Data were collected on 30 cancer patients to investigate the effectiveness (Yes/No) of a treatment. Two quantitative variables, \(x_i \in (0,1), i=1,2\), are considered to be important predictors of effectiveness. Suppose that the rectangles labelled as nodes in the scatterplot below represent nodes of a classification tree. What is the predicted class of each node? What proportion of observations in each node is correctly classified? Create a confusion matrix, and calculate the overall accuracy.
| Node | Predicted Class | Proportion Correctly Classified |
|---|---|---|
| 1 | No | 7/12 |
| 2 | Yes | 3/5 |
| 3 | No | 3/4 |
| 4 | No | 7/9 |
The confusion matrix is:

| Predicted \ Actual | Yes | No |
|---|---|---|
| Yes | 3 | 2 |
| No | 8 | 17 |
Overall accuracy: (3+17)/30 = 0.67.
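As a check on the arithmetic, a minimal sketch reconstructing the overall accuracy from the node counts in the table above:

```r
# Each node: number of observations and number correctly classified
nodes <- data.frame(
  node    = 1:4,
  size    = c(12, 5, 4, 9),
  correct = c(7, 3, 3, 7)
)
sum(nodes$correct) / sum(nodes$size)  # (7 + 3 + 3 + 7) / 30 = 0.67
```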