Your answer to question 2 parts (d), (e), (f), and (g)
file_url <- "https://raw.githubusercontent.com/ntaback/UofT_STA130/master/week8/Week8PracticeProblems-student1.Rmd"
download.file(url = file_url , destfile = "Week8PracticeProblems-student.Rmd")
Look for the file “Week8PracticeProblems-student.Rmd” under the Files tab then click on it to open.
Change the subtitle to “Week 8 Practice Problems Solutions” and change the author to your name and student number.
Type your answers below each question. Remember that R code chunks can be inserted directly into the notebook by choosing Insert R from the Insert menu (see Using R Markdown for Class Assignments). In addition this R Markdown cheatsheet, and reference are great resources as you get started with R Markdown.
Tutorial grades will be assigned according to the following marking scheme.
Mark | |
---|---|
Attendance for the entire tutorial | 1 |
Assigned homework completiona | 1 |
In-class exercises | 4 |
Total | 6 |
Read through the Final Project information.
A classification tree was built to predict a dependent variable categorized as “Yes”, “No”. 80% of the data set were used to train the classification tree and the remaining 20% was used to test the resulting model. The prediction accuracy was evaluated using the test set. The confusion matrix is below.
Predicted | Yes | No |
---|---|---|
Yes | 100 | 30 |
No | 10 | 37 |
How many observations were used to train the model? How many observations were used to test the model?
What is the accuracy, false-positive rate, and false-negative rate? Assume that “Yes” is positive and “No” is negative.
Is it possible to use this table to draw an ROC curve? Explain.
Read the description in Exercise 8.1 and answer the following questions.
# Have a look at the data
library(NHANES)
glimpse(NHANES)
Use the variables SleepHrsNight
and Depressed
in the NHANES
data to build a classification tree to predict SleepTrouble
.
Summarize the classification tree in part (a) in a four to six sentence paragraph, using complete English sentences. The paragraph should address: how the splits on the variables were selected; which nodes are terminal; and how a new observation would be predicted by this classifciation tree.
What proportion of subjects in the NHANES data have trouble sleeping?
Separate the NHANES data set uniformly at random into 75% training and 25% testing sets. Use the training set to build the classification tree using the variables in part (a). Use the test set to calculate the confusion matrix for two cut-points: 0.5 and proportion of subjects in the NHANES data that have trouble sleeping (use the value from part (c)). For each cut-point use the confusion matrix to calculate the following values:
Which values change and which values are the same for different cut-points? Explain.
In a few sentences, using complete English, interpret i-v in part (d) for the 0.5 cut-point. Be sure to use the dependent variable’s meaning in your interpretation.
Create an ROC curve of the classification tree that you developed in (c). Suggest a cutpoint to use for classifying a person as having sleep trouble. In one to two sentences, using complete English, explain why you chose this cutpoint.
Use the training and testing sets that you created in part (c) to create a classification tree to predict SleepTrouble
, but use all the variables in the NHANES
training data. Plot the ROC curves for the Tree Classifier you developed in part (d) and the Tree Classifier you developed using all the variables on a single plot (see sample code below). What is the accuracy of the Tree Classifier using all the variables? Interpret the plot. Can you say which Tree Classifier has higher accuracy?
NB: The R syntax for using all the variables in the data frame not otherwise in the formula is to use .
in place of the dependent variable.
Starter code for this question is below.
tree_full <- rpart(SleepTrouble ~ ., data = train, parms = list(split = "gini"))
# Add your code here ...
plot_dat <- cbind(rbind(perf_df_full,perf_df), model = c(rep("All Vars",7),rep("Two Vars",4)))
ggplot(data = plot_dat, aes(x = fpr, y = tpr, colour = model)) +
geom_line() + geom_abline(intercept = 0, slope = 1, lty = 3) +
ylab(perf@y.name) +
xlab(perf@x.name)
Data was collected on 30 cancer patients to investigate the effectivness (Yes/No) of a treatment. Two quantitative variables, \(x_i \in (0,1), i=1,2\), are considered to be important predictors of effectiveness. Suppose that the rectangles labelled as nodes in the scatterplot below represent nodes of a classification tree. What is the predicted class of each node? What proportion of observations in each node is correctly classified? Create a confusion matrix, and calculate the overall accuracy.