The final project for this course will be a statistical analysis of data from Geotab on the topic below. You will present your findings in the style of a poster display at a professional scientific conference. You and your team will post your work on a board, give a short oral presentation of your findings when visited by evaluators, and be prepared to answer questions about your work.
Date: April 2, 2018
Time: During the lecture section in which you are registered. If you are registered in LEC0101, the time is 10:10 - 12:00; if you are registered in LEC0201, the time is 14:10 - 16:00. Arrive early enough to post your work before 10 minutes past the hour. Take down your poster at the end of your class time.
A few guidelines for an effective poster display:
At the poster fair you will be asked to give a 5 minute presentation summarizing your work. This time limit is firm and you will be asked to stop when time is up. Each team member must speak during this presentation.
Grade component* | Value |
---|---|
Poster Content | 50% |
Reproducibility of Poster Content | 10% |
Oral Presentation | 40% |
* Students will be evaluated as a team.
The rubric for the poster evaluation is here.
Before the poster fair, you must send your TA the HTML and R Markdown files of your poster. These are due at 9:30 if you are in section LEC0101 and at 13:30 if you are in section LEC0201.
Your TA will attempt to reproduce your poster using the R Markdown (.rmd) files you submit.
If your TA cannot run the .rmd files you submit to reproduce your poster content, your group will receive 0; if the TA has to make minor changes to get them to run, your group will receive 1; and if they run with no changes, your group will receive 2.
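One way a group might check this before submitting is to knit the R Markdown file from a clean start, in a folder that contains only the files the TA will receive. The following is a minimal sketch, not an official procedure: the file name `poster.Rmd` is hypothetical, and it assumes the rmarkdown package is installed.

```r
# Hypothetical file name: replace with your group's actual .Rmd file.
poster_file <- "poster.Rmd"
# The course code reads this file by relative path, so it should sit
# in the same folder as the .Rmd file.
data_file <- "hazardousdriving.csv"

# For a stricter check, restart R first so that no objects or loaded
# packages are left over from your interactive session.
if (file.exists(poster_file) && file.exists(data_file)) {
  # Knit the document; the output is written next to the .Rmd file.
  rmarkdown::render(poster_file)
} else {
  message("Expected to find ", poster_file, " and ", data_file,
          " in ", getwd())
}
```

If the render step stops with an error about a missing object, package, or file, the TA would likely hit the same error, so it is worth fixing before the deadline.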
If the R Markdown and HTML files are submitted after the deadline but no more than 24 hours late, every member of your group will lose 10% of their overall final project mark. The R Markdown and HTML files will not be accepted more than 24 hours after the deadline.
During the poster fair, you will be visited by members of the STA130H1 teaching team. You will give them a 5 minute presentation about your work. The rubric for the oral presentation is here.
Every member of the team is expected to speak as part of the oral presentation.
If a student in a group isn’t present at their group’s presentation, they will need a valid excuse (e.g., UofT illness form); otherwise they will receive 50% of the group mark.
If a student doesn’t speak at all during the presentation and is unable to answer a direct question then they will receive 50% of the group mark. If a student neither speaks nor responds to any questions they will receive 0.
You will carry out a data analysis on data from Geotab using R to address the topic below.
We expect that your analysis may require data wrangling, exploratory data analysis (plots and summary statistics), tests, confidence intervals, classification trees, and regression models. Your project does not need to include all of these statistical methods nor does it need to include all of the variables in the data set. You might also choose not to include all observations, or to make new variables from the data that may be more suitable for answering your questions of interest.
The goal is not to carry out an exhaustive analysis, nor to apply everything you have learned in the course. The goal is to demonstrate that you have learned how to use R, that you can appropriately apply the methods we have covered in class to address a question, and that you can effectively interpret and present the results.
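As an illustration of the wrangling steps mentioned above (keeping a subset of observations, creating a new variable, and computing summary statistics), here is a minimal sketch on a small made-up table. The values are invented for illustration and are not Geotab data. `Country` appears in the course code, but `State` and `SeverityScore` are assumed column names and the 0.25 cutoff is arbitrary, so check the Datasheet for the actual variables.

```r
library(tidyverse)

# Toy data for illustration only: NOT real Geotab values.
# `State` and `SeverityScore` are assumed column names.
toy <- tibble(
  Country       = c("Canada", "Canada", "Canada", "Canada", "USA"),
  State         = c("Ontario", "Ontario", "Nova Scotia", "Nova Scotia", "Ohio"),
  SeverityScore = c(0.10, 0.30, 0.20, 0.60, 0.50)
)

province_summary <- toy %>%
  filter(Country == "Canada") %>%                   # keep a subset of observations
  mutate(high_severity = SeverityScore > 0.25) %>%  # new variable (arbitrary cutoff)
  group_by(State) %>%
  summarise(
    n_records     = n(),                  # number of rows per province
    mean_severity = mean(SeverityScore),  # average of the assumed score column
    prop_high     = mean(high_severity)   # share of rows above the cutoff
  )

province_summary
```

The same pattern could be applied to the real data once the actual column names are confirmed; which variables to summarise depends on the question your group chooses to focus on.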
The red dots on the map below show hazardous driving instances in Canada recorded in Geotab’s hazardous driving data set.
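```r
library(tidyverse)  # read_csv(), filter(), %>%
library(ggmap)      # qmplot()

# Read the data and keep only the Canadian records
hazardousdriving <- read_csv("hazardousdriving.csv")
hazardousdriving_can <- hazardousdriving %>% filter(Country == "Canada")

# Plot each record as a red dot on a map background
qmplot(AvgLongitude, AvgLatitude, data = hazardousdriving_can, maptype = "toner-lite", color = I("red"))
```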
Your group will explore the hazardous driving data set. We have proposed questions for your group to address in your project. There are many ways you can address these questions in the data. Your group will need to focus your project, choosing what you will consider. You do not need to consider every variable in the data set.
The hazardous driving data set is described here. The “Datasheet” link provides an explanation of each of the variables in the data.
Question: Can the severity score be derived from other variables in the data set?
Answer: No. The severity score cannot be derived from other variables in the data set.
Question: Is it correct to say that there is no data on the number of people involved in an incident? So, for example, we can’t assume anything about the number of people involved in a multi-passenger vehicle incident or any other incident.
Answer: Yes.
Question: What is the definition of multi-passenger vehicle?
Answer: MPV, or multi-purpose vehicle, is a classification specified by the car manufacturer. SUVs and vans are also included in this classification.
There is conflicting evidence about which province in Canada has the most dangerous driving conditions. A recent study claimed that Halifax is the most dangerous Canadian city to drive in. Does this mean that Nova Scotia has the most dangerous driving in Canada? According to a report by the Allstate Insurance Company, New Brunswick had the highest collision claims. dangerousroads.org claims that some of the most dangerous highways in Canada are in Manitoba and Nova Scotia.
Use the hazardous driving data to answer the following questions:
To answer questions 1 - 3, you must use the hazardous driving data. It is not necessary to use data from any other source. However, you are allowed to supplement the hazardous driving data with other data if you’d like.
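A comparison between two provinces could, for example, be approached with a simulation-based (permutation) test of the difference in mean severity. The sketch below is only one possible approach, shown on made-up numbers rather than Geotab data. `State` and `SeverityScore` are assumed column names, and the two provinces are chosen arbitrarily.

```r
library(tidyverse)

set.seed(130)  # make the simulation reproducible

# Toy data for illustration only: NOT real Geotab values.
# `State` and `SeverityScore` are assumed column names.
toy <- tibble(
  State         = rep(c("Nova Scotia", "Ontario"), each = 6),
  SeverityScore = c(0.2, 0.5, 0.4, 0.6, 0.3, 0.7,
                    0.1, 0.3, 0.2, 0.4, 0.2, 0.3)
)

# Test statistic: difference in mean score between the two groups.
mean_diff <- function(scores, groups) {
  mean(scores[groups == "Nova Scotia"]) - mean(scores[groups == "Ontario"])
}

obs_diff <- mean_diff(toy$SeverityScore, toy$State)

# Under the null hypothesis of no difference, the province labels are
# exchangeable, so shuffle them and recompute the statistic many times.
n_sims <- 1000
sim_diffs <- replicate(n_sims,
                       mean_diff(toy$SeverityScore, sample(toy$State)))

# Two-sided p-value: share of shuffles at least as extreme as observed.
p_value <- mean(abs(sim_diffs) >= abs(obs_diff))
p_value
```

A small p-value would suggest that the observed difference is unlikely under label shuffling alone. It would not by itself show that one province has more dangerous driving, since the answer also depends on how the data were collected and which variable is taken to measure danger.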