Instructions

What should I bring to tutorial on March 9?

Your answer to question 2 parts (d), (e), (f), (g).

First steps to answering these questions.

  • Download this R Notebook directly into RStudio by typing the following code into the RStudio console window.
file_url <- "https://raw.githubusercontent.com/ntaback/UofT_STA130/master/week8/Week8PracticeProblems-student.Rmd"
download.file(url = file_url , destfile = "Week8PracticeProblems-student.Rmd")

Look for the file “Week8PracticeProblems-student.Rmd” under the Files tab then click on it to open.

  • Change the subtitle to “Week 8 Practice Problems Solutions” and change the author to your name and student number.

  • Type your answers below each question. Remember that R code chunks can be inserted directly into the notebook by choosing Insert R from the Insert menu (see Using R Markdown for Class Assignments). In addition, the R Markdown cheatsheet and reference are great resources as you get started with R Markdown.

Tutorial Grading

Tutorial grades will be assigned according to the following marking scheme.

                                       Mark
Attendance for the entire tutorial        1
Assigned homework completion (a)          1
In-class exercises                        4
Total                                     6

  (a) Students must bring answers to the questions that were assigned to bring to tutorial. Answers do not have to be perfect, but a serious attempt at the problem is required for full credit. Your work must be your own.

Practice Problems

Question 1

A classification tree was built to predict a dependent variable categorized as “Yes” or “No”. 80% of the data set was used to train the classification tree and the remaining 20% was used to test the resulting model. The prediction accuracy was evaluated using the test set. The confusion matrix is below.

                Actual
Predicted     Yes     No
Yes           100     30
No             10     37
  (a) How many observations were used to train the model? How many observations were used to test the model?

We are given that N_total * 0.2 = N_test, where N_test is the number of observations in the test set, and N_total is the number of observations in the data set.

N_test <- 100 + 10 + 30 + 37
N_total <- N_test / 0.2
N_total
## [1] 885
N_train <- N_total - N_test
N_train
## [1] 708
  (b) What are the accuracy, false-positive rate, and false-negative rate? Assume that “Yes” is positive and “No” is negative.

The overall accuracy is: (100 + 37)/177 = 0.7740113. The false-negative rate is: 10/ (100 + 10) = 0.0909091. The false-positive rate is: 30 /(30 + 37) = 0.4477612.
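The same three rates can be computed from the matrix in R. This is a minimal sketch: the object name `cm` and the dimension labels are illustrative, and it assumes rows are predicted classes and columns are actual classes, as the answer above does.

```r
# Confusion matrix from Question 1
# Assumption: rows = predicted class, columns = actual class
cm <- matrix(c(100, 10, 30, 37), nrow = 2,
             dimnames = list(predicted = c("Yes", "No"),
                             actual = c("Yes", "No")))

accuracy <- sum(diag(cm)) / sum(cm)         # (TP + TN) / total
fnr <- cm["No", "Yes"] / sum(cm[, "Yes"])   # FN / (TP + FN)
fpr <- cm["Yes", "No"] / sum(cm[, "No"])    # FP / (FP + TN)

round(c(accuracy = accuracy, fnr = fnr, fpr = fpr), 4)
```

Indexing by row and column name makes it harder to mix up the false positives and false negatives than indexing by position.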

  (c) Is it possible to use this table to draw an ROC curve? Explain.

No. The curve would have only one point. In order to draw an ROC curve we would need confusion matrices at different cutpoints.
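To illustrate why several cutpoints are needed, the sketch below computes one (FPR, TPR) pair per cutpoint. The probabilities and classes are made up for illustration and are not from the question; `roc_point` is a hypothetical helper name.

```r
# Hypothetical predicted probabilities of "Yes" and true classes (made up)
prob_yes <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
actual   <- c("Yes", "Yes", "No", "Yes", "No", "Yes", "No", "No")

# Each cutpoint gives its own confusion matrix, and so one point on the ROC curve
roc_point <- function(cut) {
  predicted_yes <- prob_yes >= cut
  c(cutpoint = cut,
    fpr = sum(predicted_yes & actual == "No") / sum(actual == "No"),
    tpr = sum(predicted_yes & actual == "Yes") / sum(actual == "Yes"))
}

# One row per cutpoint
roc_points <- t(sapply(c(0.25, 0.5, 0.75), roc_point))
roc_points
```

A single confusion matrix corresponds to a single row of `roc_points`, which is why the table in this question is not enough to draw the curve.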

Question 2

library(tidyverse)  # provides glimpse(), %>%, and ggplot(), all used below
library(NHANES)
glimpse(NHANES)

Read the description in Exercise 8.1 and answer the following questions.

  (a) Use the variables SleepHrsNight and Depressed in the NHANES data to build a classification tree to predict SleepTrouble.
library(NHANES)
library(rpart) 
library(partykit)
tree <- rpart(SleepTrouble ~ SleepHrsNight + Depressed, data = NHANES, parms = list(split = "gini"))
tree
## n=7768 (2232 observations deleted due to missingness)
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 7768 1969 No (0.7465242 0.2534758)  
##   2) SleepHrsNight>=5.5 6810 1509 No (0.7784141 0.2215859) *
##   3) SleepHrsNight< 5.5 958  460 No (0.5198330 0.4801670)  
##     6) Depressed=None 680  278 No (0.5911765 0.4088235) *
##     7) Depressed=Several,Most 278   96 Yes (0.3453237 0.6546763) *
plot(as.party(tree),type = "simple")

  (b) Summarize the classification tree in part (a) in a four to six sentence paragraph, using complete English sentences. The paragraph should address: how the splits on the variables were selected; which nodes are terminal; and how a new observation would be predicted by this classification tree.

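The classification tree evaluated all possible splits of the variables SleepHrsNight and Depressed using the Gini splitting criterion. The best splits were on SleepHrsNight at 5.5 hours and then, among those sleeping less than 5.5 hours, on Depressed. The three terminal nodes (2, 6, and 7, marked with * in the output) are: sleeping at least 5.5 hours per night, which results in a prediction of no sleep trouble; sleeping less than 5.5 hours with no depression, which also results in a prediction of no sleep trouble; and sleeping less than 5.5 hours while depressed on several or most days, which results in a prediction of sleep trouble. A new observation is predicted by following the splits from the root to a terminal node and assigning the majority class of that node.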
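As a rough illustration of how a Gini-based split is chosen, the sketch below scores a few candidate thresholds by the decrease in Gini impurity. It is a simplification and not rpart's implementation (no priors, losses, or surrogate splits). The toy data and the helper names `gini` and `split_impurity` are hypothetical.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two child nodes produced by a candidate split
split_impurity <- function(labels, goes_left) {
  n <- length(labels)
  sum(goes_left) / n * gini(labels[goes_left]) +
    sum(!goes_left) / n * gini(labels[!goes_left])
}

# Hypothetical toy data (not NHANES): hours slept and sleep trouble
hours   <- c(4, 5, 5, 6, 7, 7, 8, 8)
trouble <- c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No")

# Impurity decrease for each candidate threshold; the largest decrease wins
thresholds <- c(4.5, 5.5, 6.5, 7.5)
decrease <- sapply(thresholds, function(thr) {
  gini(trouble) - split_impurity(trouble, hours < thr)
})
best <- thresholds[which.max(decrease)]
best
```

The same scoring is then repeated within each child node, which is how the second split on Depressed arises in the tree above.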

  (c) What proportion of subjects in the NHANES data have trouble sleeping?
NHANES %>% 
  filter(is.na(SleepTrouble) == F) %>% 
  group_by(SleepTrouble) %>%
  summarise(n = n()) %>%
  mutate(pct = n/sum(n))
## # A tibble: 2 x 3
##   SleepTrouble     n   pct
##   <fct>        <int> <dbl>
## 1 No            5799 0.746
## 2 Yes           1973 0.254
  (d) Separate the NHANES data set uniformly at random into 75% training and 25% testing sets. Use the training set to build the classification tree using the variables in part (a). Use the test set to calculate the confusion matrix for two cut-points: 0.5 and the proportion of subjects in the NHANES data that have trouble sleeping (use the value from part (c)). For each cut-point use the confusion matrix to calculate the following values:
    i. True positive rate (sensitivity)
    ii. True negative rate (specificity)
    iii. False positive rate
    iv. False negative rate
    v. Accuracy

Which values change and which values are the same for different cut-points? Explain.

library(NHANES)
library(rpart) 
library(partykit)
set.seed(364)
n <- nrow(NHANES)
test_idx <- sample.int(n, size = round(0.25 * n)) 
train <- NHANES[-test_idx, ]
nrow(train)
## [1] 7500
test <- NHANES[test_idx, ] %>% filter(is.na(SleepTrouble) == F)
nrow(test)
## [1] 1939
tree <- rpart(SleepTrouble ~ SleepHrsNight + Depressed, data = train, parms = list(split = "gini"))
predicted_tree <- predict(object = tree, newdata = test, type = "prob")

# Cut-point using 0.5

# if predicted prob of "Yes" is >= 0.5 then predicted class is "Yes"
# otherwise predicted class is "No"
confusion_matrix <- table(predicted_tree[,2] >= 0.5,test$SleepTrouble)
row.names(confusion_matrix) <- c("No","Yes")
confusion_matrix
##      
##         No  Yes
##   No  1425  435
##   Yes   24   55
sensit_50 <- confusion_matrix[4]/(confusion_matrix[4] + confusion_matrix[3])
sensit_50
## [1] 0.1122449
specif_50 <- confusion_matrix[1]/(confusion_matrix[1] + confusion_matrix[2])
specif_50
## [1] 0.9834369
fpr_50 <- 1 - specif_50
fpr_50
## [1] 0.01656315
fnr_50 <- 1 - sensit_50
fnr_50
## [1] 0.8877551
accuracy_50 <- (confusion_matrix[1] + confusion_matrix[4])/sum(confusion_matrix)
accuracy_50
## [1] 0.76328
# Cut-point using 0.25

confusion_matrix <- table(predicted_tree[,2] >= 0.25,test$SleepTrouble)
row.names(confusion_matrix) <- c("No","Yes")
confusion_matrix
##      
##         No  Yes
##   No  1316  377
##   Yes  133  113
sensit <- confusion_matrix[4]/(confusion_matrix[4] + confusion_matrix[3])
sensit
## [1] 0.2306122
specif <- confusion_matrix[1]/(confusion_matrix[1] + confusion_matrix[2])
specif
## [1] 0.9082126
fpr <- 1 - specif
fpr
## [1] 0.09178744
fnr <- 1 - sensit
fnr
## [1] 0.7693878
accuracy <- (confusion_matrix[1] + confusion_matrix[4])/sum(confusion_matrix)
accuracy
## [1] 0.7369778

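All of the values change. The cut-point does not change the predicted probabilities; it only changes which of them are classified as “Yes”. Lowering the cut-point from 0.5 to 0.25 classifies more subjects as having sleep trouble, so the sensitivity and the false positive rate increase (from 0.11 to 0.23 and from 0.02 to 0.09), while the specificity, the false negative rate, and the accuracy decrease.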
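The five rates can also be computed with one helper that indexes the confusion matrix by name instead of by position. This is a sketch: `classification_rates` is a hypothetical helper, and the counts are copied from the two confusion matrices printed above.

```r
# Five summary rates from a 2x2 table with rows = predicted, columns = actual
classification_rates <- function(cm) {
  tp <- cm["Yes", "Yes"]; tn <- cm["No", "No"]
  fp <- cm["Yes", "No"];  fn <- cm["No", "Yes"]
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    fpr = fp / (fp + tn),
    fnr = fn / (fn + tp),
    accuracy = (tp + tn) / sum(cm))
}

# Counts copied from the confusion matrices for cut-points 0.5 and 0.25
labels <- list(predicted = c("No", "Yes"), actual = c("No", "Yes"))
cm_50 <- matrix(c(1425, 24, 435, 55), nrow = 2, dimnames = labels)
cm_25 <- matrix(c(1316, 133, 377, 113), nrow = 2, dimnames = labels)

rates <- rbind(cut_0.50 = classification_rates(cm_50),
               cut_0.25 = classification_rates(cm_25))
round(rates, 3)
```

Placing the two cut-points side by side makes it easy to see that every column differs between the rows.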

  (e) In a few sentences, using complete English, interpret i-v in part (d) for the 0.5 cut-point. Be sure to use the dependent variable's meaning in your interpretation.

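The overall accuracy in predicting trouble sleeping using only the number of hours slept and depression is 0.76. If a person has trouble sleeping, the model predicts this correctly with probability 0.11 (sensitivity), and if a person does not have trouble sleeping, the model predicts this correctly with probability 0.98 (specificity). Equivalently, 89% of people with trouble sleeping are wrongly predicted to have no trouble (false negative rate), and 2% of people without trouble sleeping are wrongly predicted to have trouble (false positive rate). The model is therefore far more accurate for people without trouble sleeping than for people with trouble sleeping.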

  (f) Create an ROC curve of the classification tree that you developed in (d). Suggest a cutpoint to use for classifying a person as having sleep trouble. In one to two sentences, using complete English, explain why you chose this cutpoint.
predicted_tree <- predict(object = tree, newdata = test, type = "prob")
pred <- ROCR::prediction(predictions = predicted_tree[,2], test$SleepTrouble)
perf <- ROCR::performance(pred, 'tpr', 'fpr')
perf_df <- data.frame(perf@x.values, perf@y.values) 
names(perf_df) <- c("fpr", "tpr") 
roc <- ggplot(data = perf_df, aes(x = fpr, y = tpr)) +
  geom_line(color = "blue") +
  geom_abline(intercept = 0, slope = 1, lty = 3) +
  ylab(perf@y.name) +
  xlab(perf@x.name)
roc
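One common heuristic for choosing a cutpoint (an assumption here, not prescribed by the exercise) is to maximize Youden's J statistic, sensitivity + specificity - 1. The sketch below applies it only to the two cut-points whose confusion matrices appear in part (d); a fuller comparison would use every point on the ROC curve.

```r
# Sensitivity and specificity at the two cut-points evaluated in part (d)
cutpoints   <- c(0.5, 0.25)
sensitivity <- c(55 / 490, 113 / 490)
specificity <- c(1425 / 1449, 1316 / 1449)

# Youden's J = sensitivity + specificity - 1
# (0 means no better than chance, 1 means perfect classification)
youden_j <- sensitivity + specificity - 1
best_cutpoint <- cutpoints[which.max(youden_j)]
best_cutpoint
```

Other criteria are equally defensible, for example weighting false negatives more heavily if missing a person with sleep trouble is considered the costlier error.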

  (g) Use the training and testing sets that you created in part (d) to create a classification tree to predict SleepTrouble, but use all the variables in the NHANES training data. Plot the ROC curves for the Tree Classifier you developed in part (d) and the Tree Classifier you developed using all the variables on a single plot (see sample code below). What is the accuracy of the Tree Classifier using all the variables? Interpret the plot. Can you say which Tree Classifier has higher accuracy?

NB: In an R formula, . on the right-hand side stands for all the variables in the data frame not otherwise in the formula, so it is used in place of the independent variables.
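A quick way to see what . expands to is to call terms() on the formula together with the data. The toy data frame below is hypothetical and only illustrates the expansion.

```r
# Hypothetical toy data frame (not NHANES)
toy <- data.frame(y = c("No", "Yes", "No"),
                  a = c(1, 2, 3),
                  b = c(0.5, 0.1, 0.9))

# terms() expands "." into every column of `data` not already in the formula
expanded <- terms(y ~ ., data = toy)
predictors <- attr(expanded, "term.labels")
predictors
```

Here the dependent variable y stays on the left-hand side and the remaining columns become the predictors.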

tree_full <- rpart(SleepTrouble ~ ., data = train, parms = list(split = "gini"))
predicted_tree_full <- predict(object = tree_full, newdata = test, type = "prob")
confusion_matrix <- table(predicted_tree_full[,2] >= 0.5,test$SleepTrouble)
row.names(confusion_matrix) <- c("No","Yes")
confusion_matrix
##      
##         No  Yes
##   No  1395  421
##   Yes   54   69
#Accuracy
(confusion_matrix[1] + confusion_matrix[4])/sum(confusion_matrix)
## [1] 0.7550284
# Use this code 
pred <- ROCR::prediction(predictions = predicted_tree_full[,2], test$SleepTrouble)
perf <- ROCR::performance(pred, 'tpr', 'fpr')
perf_df_full <- data.frame(perf@x.values, perf@y.values) 
names(perf_df_full) <- c("fpr", "tpr")

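# Label each row with its model; row counts are taken from the data rather than hard-coded
plot_dat <- cbind(rbind(perf_df_full, perf_df),
                  model = c(rep("All Vars", nrow(perf_df_full)), rep("Two Vars", nrow(perf_df))))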
ggplot(data = plot_dat, aes(x = fpr, y = tpr, colour = model)) + 
  geom_line() + geom_abline(intercept = 0, slope = 1, lty = 3) + 
  ylab(perf@y.name) + 
  xlab(perf@x.name)

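The tree classifier using all the variables in the data has very similar performance to the two-variable tree. Using a cut-point of 0.5, the accuracy of both models rounds to 0.76 (0.755 with all variables versus 0.763 with two variables). The plot shows that the ROC curve for the model with all variables is above the ROC curve for the model with two variables. This indicates that the model with all variables achieves a slightly higher true positive rate for a given false positive rate, although the difference does not appear to be large. Accuracy depends on the cut-point, so the ROC curves alone do not determine which classifier has higher accuracy at a particular cut-point.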
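A single-number summary of an ROC curve is the area under it (AUC). The sketch below approximates the AUC of the two-variable tree with the trapezoid rule. It assumes the curve consists only of the endpoints (0, 0) and (1, 1) plus the two cut-points from part (d), which may not match the plotted curve exactly.

```r
# ROC points for the two-variable tree
# Assumption: the endpoints plus the two cut-points from part (d) are the only points
fpr <- c(0, 24 / 1449, 133 / 1449, 1)
tpr <- c(0, 55 / 490, 113 / 490, 1)

# Trapezoid rule: width of each FPR step times the average TPR over that step
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

An AUC of 0.5 corresponds to the dotted diagonal in the plot, so values only slightly above 0.5 indicate weak discrimination. ROCR can also report this quantity through performance() with the "auc" measure.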

Question 3

Data were collected on 30 cancer patients to investigate the effectiveness (Yes/No) of a treatment. Two quantitative variables, \(x_i \in (0,1), i=1,2\), are considered to be important predictors of effectiveness. Suppose that the rectangles labelled as nodes in the scatterplot below represent nodes of a classification tree. What is the predicted class of each node? What proportion of observations in each node is correctly classified? Create a confusion matrix, and calculate the overall accuracy.

Node   Predicted Class   Proportion Correctly Classified
1      No                7/12
2      Yes               3/5
3      No                3/4
4      No                7/9

The Confusion matrix is:

                Actual
Predicted     Yes     No
Yes             3      2
No              8     17

Overall accuracy: (3+17)/30 = 0.67.
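The confusion matrix and accuracy can be rebuilt from the node table in R. This is a sketch: the data frame `nodes` copies the counts from the table above, and the column names are illustrative.

```r
# Node summary copied from the table above: predicted class, number correct, node size
nodes <- data.frame(node = 1:4,
                    predicted = c("No", "Yes", "No", "No"),
                    correct = c(7, 3, 3, 7),
                    total = c(12, 5, 4, 9))
nodes$wrong <- nodes$total - nodes$correct

is_yes <- nodes$predicted == "Yes"

# Rows = predicted class, columns = actual class (filled column by column)
cm <- matrix(c(sum(nodes$correct[is_yes]), sum(nodes$wrong[!is_yes]),
               sum(nodes$wrong[is_yes]),   sum(nodes$correct[!is_yes])),
             nrow = 2,
             dimnames = list(predicted = c("Yes", "No"), actual = c("Yes", "No")))
cm

accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Observations that a "No" node misclassifies are actually "Yes", which is why they land in the actual "Yes" column of the predicted "No" row.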
