Instructions

What should I bring to tutorial on November 16?

  • R output (e.g., plots and explanations) for Question 2. You can bring either a hard copy or your laptop with the output.

Tutorial Grading

Tutorial grades will be assigned according to the following marking scheme.

Component                              Mark
Attendance for the entire tutorial        1
Assigned homework completion              1
In-class exercises                        4
Total                                     6

Practice Problems

Question 1

A classification tree was built to predict a dependent variable categorized as “Yes”/“No”. 80% of the observations were used to train the classification tree and the remaining 20% were used to test the resulting model. The prediction accuracy was evaluated using the test set. The confusion matrix is below.

                 Predicted
              Yes        No
Actual  Yes   100        30
        No     10        37
  (a) How many observations were used to train the model? How many observations were used to test the model?

  (b) What are the accuracy, false-positive rate, and false-negative rate? Assume that “Yes” is positive and “No” is negative. (The standard formulas are sketched in R after this question.)

  (c) Is it possible to use this table to draw an ROC curve? Explain.
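
For part (b), the standard definitions can be written directly in R. The sketch below is only an illustration of the formulas: the counts tp, fp, fn, and tn are hypothetical placeholders (set to zero here), to be replaced by the corresponding cells of the confusion matrix.

# hypothetical placeholder counts from a 2x2 confusion matrix ("Yes" = positive)
tp <- 0   # actual Yes, predicted Yes (true positives)
fp <- 0   # actual No,  predicted Yes (false positives)
fn <- 0   # actual Yes, predicted No  (false negatives)
tn <- 0   # actual No,  predicted No  (true negatives)

accuracy <- (tp + tn) / (tp + fp + fn + tn)   # proportion of correct predictions
fpr <- fp / (fp + tn)   # false-positive rate: FP out of all actual negatives
fnr <- fn / (fn + tp)   # false-negative rate: FN out of all actual positives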

Question 2

[Adapted from Modern Data Science with R] The nasaweather package contains data about tropical storms from 1995 to 2005. There are four types of storms: Extratropical, Hurricane, Tropical Depression, and Tropical Storm. Consider the scatterplot of wind speed against pressure for these storms, shown in Figure 1 below.

library(mdsr)
library(nasaweather)
ggplot(data = nasaweather::storms, aes(x = pressure, y = wind, color = type)) +
  geom_point(alpha = 0.5)
Figure 1: Plot of storm type as a function of wind speed and pressure

  (a) Divide the data into training (80%) and testing (20%) data sets. Build a classification tree using the training data to predict the type of storm as a function of wind speed (wind) and pressure, and plot it. Note: use nasaweather::storms to access the data set for this question (there is another data set called storms in the dplyr library). A starter sketch for parts (a), (b), and (d) appears after this list.

  (b) Visualize your classification tree in data space by adding lines to the plot in Figure 1 and labelling each region with the type of storm that is predicted by the tree. Hint: if you want to do this using ggplot, you can use the following functions:

 geom_vline(xintercept = ...)   # draws a vertical line
 geom_hline(yintercept = ...)   # draws a horizontal line
 geom_segment(x = ..., y = ..., xend = ..., yend = ...)   # draws a segment from (x, y) to (xend, yend)
 geom_text(x = ..., y = ..., label = ...)   # prints text at the specified (x, y) position
  (c) Summarize the classification tree from part (a) in a four- to six-sentence paragraph (in complete English sentences). The paragraph should address the following: how the splits on each variable were selected, and how a new observation would be predicted by this classification tree. Be sure to use the correct units when referring to values of wind speed and pressure.

  (d) Use the test set to calculate a matrix comparing the true and predicted storm types. This will be similar to the confusion matrices covered in class, but with four rows and four columns, one for each type of storm. Calculate the estimated accuracy for your classification tree. What percentage of storms that are classified as hurricanes are actually hurricanes? What percentage of storms that are classified as extratropical storms are actually extratropical storms?

  (e) Use the training data set from part (a) to create a classification tree to predict storm type, but use any of the following variables as predictors: year, month, hour, lat, long, pressure, wind, seasday. Calculate the accuracy of your new classification tree using the testing data set. Write two to three sentences commenting on any differences you observe in the variables used for splits in this tree compared to the tree you created in part (a).
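
The following is a minimal starter sketch for parts (a), (b), and (d) of this question. It assumes the rpart package (the same package used in the starter code for Question 3); the seed and the object names (train, test, storm_tree) are arbitrary choices, and the split value in the part (b) comment is a hypothetical placeholder, not a value from any fitted tree.

library(rpart)        # classification trees
library(nasaweather)  # storms data
library(ggplot2)

# Part (a): 80/20 train/test split, then a tree for type as a function of wind and pressure
storms <- nasaweather::storms
storms$type <- factor(storms$type)   # make sure the outcome is a factor

set.seed(16)   # arbitrary seed, for a reproducible split
train_ids <- sample(seq_len(nrow(storms)), size = round(0.8 * nrow(storms)))
train <- storms[train_ids, ]
test  <- storms[-train_ids, ]

storm_tree <- rpart(type ~ wind + pressure, data = train)
plot(storm_tree, margin = 0.1)   # plot the tree
text(storm_tree, use.n = TRUE)   # label the splits and node counts

# Part (b): overlay the splits on Figure 1, for example (995 is a hypothetical split value)
# ggplot(data = storms, aes(x = pressure, y = wind, color = type)) +
#   geom_point(alpha = 0.5) +
#   geom_vline(xintercept = 995) +
#   geom_text(x = 960, y = 100, label = "Hurricane")

# Part (d): 4 x 4 table of true vs. predicted storm types on the test set
pred_type <- predict(storm_tree, newdata = test, type = "class")
conf_mat <- table(true = test$type, predicted = pred_type)
conf_mat
sum(diag(conf_mat)) / sum(conf_mat)   # estimated overall accuracy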

Question 3

The NHANES data set contains several demographic, medical, and physical variables for a sample of 10,000 individuals. One of the variables in this data set is Smoke100, and you will build a classification tree to predict which individuals have smoked at least 100 cigarettes in their lifetime.

  (a) Begin by filtering out observations with a missing value for Smoke100. What proportion of individuals have smoked at least 100 cigarettes in their lifetime?

  (b) Divide the data into training (80%) and testing (20%) data sets. Build a classification tree using the training data to predict which individuals have smoked at least 100 cigarettes in their lifetime, based on the Education and Age variables only.

  (c) Use the test set from part (b) to calculate the confusion matrix for two cut-points: 0.5 and the proportion of individuals in the NHANES data that have smoked at least 100 cigarettes in their lifetime (use the value from part (a)). For each cut-point, use the confusion matrix to calculate the following values: (i) true positive rate (sensitivity), (ii) true negative rate (specificity), (iii) false positive rate, (iv) false negative rate, (v) accuracy. Which values stay the same and which change across the two cut-points?

  (d) Use the training and testing sets that you created in part (b) to create a classification tree to predict Smoke100, but use all the variables in the NHANES training data (apart from SmokeNow and Smoke100n). Plot the ROC curves for both classification trees predicting Smoke100 on a single plot (see the sample code below). How do the ROC curves help you determine which model is more accurate?

Hint: the R formula syntax for using all the other variables in the data frame as predictors is to put . on the right-hand side of the formula, in place of listing the independent variables (for example, Smoke100 ~ . in the starter code below).

Some starter code for this question is below; a further sketch covering parts (a)–(c) and the construction of the ROC data frames follows it.

library(dplyr)     # for %>% and select()
library(rpart)     # classification trees
library(ggplot2)

# drop the other smoking variables so that "." does not use them as predictors
train <- train %>% select(-one_of("SmokeNow", "Smoke100n"))
tree_full <- rpart(Smoke100 ~ ., data = train, parms = list(split = "gini"))

# Add your code here ...

# combine the (fpr, tpr) coordinates of both ROC curves and label each by model
plot_dat <- cbind(
  rbind(perf_df_full, perf_df),
  model = c(rep("All Vars", nrow(perf_df_full)), rep("Two Vars", nrow(perf_df)))
)
ggplot(data = plot_dat, aes(x = fpr, y = tpr, colour = model)) +
  geom_line() + geom_abline(intercept = 0, slope = 1, lty = 3) +   # dotted line: random classifier
  ylab(perf@y.name) +
  xlab(perf@x.name)
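
For reference, one possible approach to parts (a)–(c), and one way to construct the perf_df data frames used in the plotting code above, is sketched below. This is only a sketch: it assumes the NHANES package as the source of the data, the ROCR package for the prediction and performance objects (consistent with the perf@x.name and perf@y.name slots in the plotting code), and arbitrary object names (smoke, tree_two, prob_full, and so on). Parts (a)–(c) would be run before the starter code above, which reuses the training set.

library(NHANES)   # assumed source of the NHANES data frame
library(dplyr)
library(rpart)
library(ROCR)     # assumed, based on the performance-object slots used above

# Part (a): drop observations with a missing Smoke100, then the proportion of "Yes"
smoke <- NHANES %>% filter(!is.na(Smoke100))
p_yes <- mean(smoke$Smoke100 == "Yes")

# Part (b): 80/20 split and a tree using Education and Age only
set.seed(16)   # arbitrary seed, for a reproducible split
train_ids <- sample(seq_len(nrow(smoke)), size = round(0.8 * nrow(smoke)))
train <- smoke[train_ids, ]
test  <- smoke[-train_ids, ]
tree_two <- rpart(Smoke100 ~ Education + Age, data = train, parms = list(split = "gini"))

# Part (c): confusion matrix for one cut-point applied to the predicted probability of "Yes"
prob_two <- predict(tree_two, newdata = test, type = "prob")[, "Yes"]
table(true = test$Smoke100, predicted = ifelse(prob_two > 0.5, "Yes", "No"))
# repeat with p_yes from part (a) in place of 0.5

# Part (d): one way to build a data frame of (fpr, tpr) values for an ROC curve
prob_full <- predict(tree_full, newdata = test, type = "prob")[, "Yes"]
pred <- prediction(prob_full, test$Smoke100)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
perf_df_full <- data.frame(fpr = unlist(perf@x.values), tpr = unlist(perf@y.values))
# perf_df is built in the same way from the two-variable tree's predicted probabilities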