Due: Friday May 15 at 11:59pm on Blackboard

For this project, you will choose an interesting dataset and analyze it using the statistical techniques learned in class. You will present your findings to the class at the end of the semester and describe them on a webpage that can be used as part of a portfolio. All code will be uploaded to GitHub.

The project has been divided into 11 milestones, which are described at the bottom of this page. Each milestone has a suggested due date to keep you on track with the project, and the milestones will be checked off in class (10% of the final project grade). If you miss a due date, a completed milestone can be checked off any time on or before Thurs. May 7.

The Data:

Choose any interesting dataset that meets these criteria:

Sources for finding your dataset:

GitHub:

All code for your project must be uploaded to GitHub, a free, widely used website for storing code that focuses on tracking changes and encouraging collaboration. You can use a free public repository (directory) or, as a student, apply for free private repositories.

Project Presentation: Tuesday 12 May in class (slides due by 10am May 12 on Blackboard)

Each person will give a 2-minute presentation about their project using slides. The presentation should introduce the dataset and describe one or two of the most interesting results. Your slides must be in PDF format and contain:

Project milestones:

The required project milestones are listed below. You may include up to three additional analyses for extra credit by listing them in Blackboard when submitting your final project.

Due Date: Milestone: Details:
#1 Tues. 11 Feb in class Choose dataset Be able to open the dataset in Excel in class.
#2 Feb. 13 on Blackboard Create GitHub account and upload dataset to GitHub Submit your GitHub username on Blackboard. If you wish to keep your code private, please add me (megan073) to the repository or project. If you create a GitHub project for your code, also submit the project name.
#3 Feb. 20 in class Create webpage for your project with title and description of dataset
Easy, free ways to create a webpage:
Description of dataset should include:
  1. why you are interested in it
  2. where you found the dataset (including link)
  3. who (person, government, organization, company, etc.) created the dataset
  4. when the data is from
  5. what kind of information is contained in the dataset
#4 Feb. 27 in class Single variable distribution plots
  1. Choose 4 columns and do the following for each of these columns:
    • Plot the distribution of data in that column using a histogram if the data is quantitative and a bar chart if the data is categorical.
    • Add a title and axis labels to the plot.
    • Add the plot to your webpage and write several sentences telling the reader what they should notice about the plot (ex. shape of distribution, outliers, skew, anything surprising, etc.).
  2. Add your R code for this milestone to your GitHub account.
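
The plotting steps in this milestone can be sketched in base R as follows. This is only an illustration, not a required approach: it uses the built-in mtcars dataset as a stand-in for your own data, and the column choices, titles, and labels are assumptions made for the example.

```r
# mtcars ships with R; it is only a stand-in for your own dataset.
data(mtcars)

# Quantitative column: histogram with a title and axis labels.
mpg_hist <- hist(mtcars$mpg,
                 main = "Distribution of fuel efficiency",
                 xlab = "Miles per gallon",
                 ylab = "Number of cars")

# Categorical column: count the rows in each category, then draw a bar chart.
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts,
        main = "Cars by number of cylinders",
        xlab = "Cylinders",
        ylab = "Number of cars")
```

A column stored as numbers can still be categorical (as the cylinder count is treated here), so decide which plot to use based on what the values mean rather than on how they are stored.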
#5 March 12 in class Missing data and outliers (if applicable)
  1. Based on the distribution plots from the previous milestone, you may see outliers that are so extreme they could dominate the analysis. You can choose to remove these outliers and use the remaining data for the rest of the milestones. You may also wish to remove observations with missing data. If you want to remove any outliers or observations (rows) with missing data:
    • Take a subset of your data to remove the desired observations (rows).
    • Write a few sentences on your webpage describing what data was removed and why (you may want to reference the plots from Milestone 4).
    • Redo any plots from Milestone 4 that changed. Leave the original plot and description on your webpage, and add the redone plot with a few sentences describing any changes.
  2. Add your R code for this milestone to your GitHub account.
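
One possible way to take the subset described in this milestone is sketched below. It uses the built-in airquality dataset as a stand-in for your own data. The 1.5 * IQR rule for flagging outliers is an assumption chosen for illustration. You may justify a different cutoff based on your plots.

```r
# airquality ships with R and has missing values; it is only a stand-in.
data(airquality)

# Keep only the rows with no missing values in any column.
complete_rows <- airquality[complete.cases(airquality), ]

# Flag outliers in one column with the 1.5 * IQR rule (an assumed convention).
q1 <- quantile(complete_rows$Ozone, 0.25, names = FALSE)
q3 <- quantile(complete_rows$Ozone, 0.75, names = FALSE)
fence <- 1.5 * (q3 - q1)
is_outlier <- complete_rows$Ozone < q1 - fence | complete_rows$Ozone > q3 + fence

# Subset of the data without the flagged rows.
cleaned <- complete_rows[!is_outlier, ]

# Row counts at each stage, useful when describing what was removed.
row_counts <- c(original = nrow(airquality),
                complete = nrow(complete_rows),
                cleaned  = nrow(cleaned))
print(row_counts)
```

Keeping the original data frame unchanged and working from a new one (here `cleaned`) makes it easy to redo the Milestone 4 plots on both versions.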
#6 March 19 in class Measures of center and spread
  1. Choose at least 2 quantitative columns, and do the following for each column:
    • Compute the mean, median, variance, and standard deviation of the column data.
    • Add the means, medians, variances, and standard deviations to your webpage.
    • Write a few sentences comparing the corresponding means and medians (ex. are the mean and median different? Why/why not?)
    • Add a few sentences on your webpage giving your interpretation of the standard deviations (ex. are the data close to the mean?)
  2. Add your R code for this milestone to your GitHub account.
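
A minimal sketch of these computations in base R is shown below, again using mtcars as a stand-in dataset. The helper name `summarize_column` and the two chosen columns are hypothetical and only for illustration.

```r
data(mtcars)  # stand-in for your own dataset

# Hypothetical helper: the four summary statistics for one numeric column.
# na.rm = TRUE skips missing values if any remain in the data.
summarize_column <- function(x) {
  c(mean     = mean(x, na.rm = TRUE),
    median   = median(x, na.rm = TRUE),
    variance = var(x, na.rm = TRUE),
    sd       = sd(x, na.rm = TRUE))
}

# Apply the helper to two quantitative columns; the result is a matrix with
# one row per statistic and one column per variable.
summary_table <- sapply(mtcars[c("mpg", "hp")], summarize_column)
print(round(summary_table, 2))
```

Note that `var()` and `sd()` use the sample (n - 1) formulas, so the standard deviation is the square root of the reported variance.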
#7 March 26 in class Scatterplots and correlation
  1. For at least one pair of quantitative columns/variables:
    • Plot a scatterplot of the data in the two columns/variables.
    • Compute the correlation between the columns/variables.
    • Add the scatterplot and correlation to your webpage.
    • Add a few sentences on your webpage interpreting the scatterplot and correlation (ex. how closely are the two variables related? If there is a relationship, does it appear linear?)
  2. Add your R code for this milestone to your GitHub account.
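
The scatterplot and correlation steps might look like the sketch below. It assumes the Pearson correlation (the default for `cor()`), and uses two mtcars columns as stand-ins for your own pair of variables.

```r
data(mtcars)  # stand-in for your own dataset

# Scatterplot of one quantitative variable against another.
plot(mtcars$wt, mtcars$mpg,
     main = "Fuel efficiency vs. weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")

# Pearson correlation between the two columns (always between -1 and 1).
r <- cor(mtcars$wt, mtcars$mpg)
print(r)
```

If either column contains missing values, `cor()` returns NA unless you pass an option such as `use = "complete.obs"`.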
#8 April 2 in class Confidence intervals
  1. Choose at least 2 quantitative columns, and do the following for each column:
    • Compute the 95% confidence interval for the mean.
    • Add the confidence intervals to your webpage (can be near the previously computed means).
    • Add a few sentences on your webpage interpreting the confidence intervals (ex. are the confidence intervals large or small? How much should we trust our estimates of the means?)
  2. Add your R code for this milestone to your GitHub account.
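
One way to compute a 95% confidence interval for a mean is sketched below. It assumes the t-based interval, mean ± t* × s/√n. If a different formula was used in class (for example, one based on the normal distribution), use that instead. The mtcars column is a stand-in for your own data.

```r
data(mtcars)  # stand-in for your own dataset
x <- mtcars$mpg

n      <- length(x)
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # critical value for a two-sided 95% interval

ci <- c(lower = mean(x) - t_crit * se,
        upper = mean(x) + t_crit * se)
print(ci)

# Cross-check: t.test() reports the same interval.
ci_check <- t.test(x, conf.level = 0.95)$conf.int
print(ci_check)
```

Computing the interval by hand and comparing it to `t.test()` is a quick way to confirm the formula was applied correctly.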
#9 April 23 in class Linear regression
  1. Choose a quantitative column to predict using linear regression. Note: Linear regression may not work well for some datasets, so you will be graded on the process of performing linear regression, not on the fit of the model.
    • Compute the linear regression model using one or more of the other data columns as the independent variable(s).
    • Assess the fit of the model by computing R-squared, plotting a histogram of the residuals, and plotting a scatterplot of the observed response values (x-axis) vs. the residuals (y-axis).
    • Add a description of the model (including the equation) and the results of assessing the model fit to your webpage, including the plots.
    • Write a few sentences on your webpage explaining whether or not your linear regression model is a good fit, and why.
  2. Add your R code for this milestone to your GitHub account.
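
The regression steps above can be sketched with `lm()` as follows. This is an illustrative example with a single predictor. The choice of response (`mpg`) and predictor (`wt`) from the stand-in mtcars dataset is an assumption made for the example.

```r
data(mtcars)  # stand-in for your own dataset

# Fit the model mpg = b0 + b1 * wt.
fit <- lm(mpg ~ wt, data = mtcars)

# Intercept (b0) and slope (b1) for writing out the model equation.
print(coef(fit))

# R-squared: the proportion of variance in the response explained by the model.
r_squared <- summary(fit)$r.squared
print(r_squared)

# Residuals: observed response minus fitted value.
res <- resid(fit)

# Histogram of the residuals.
hist(res,
     main = "Residuals of mpg ~ wt",
     xlab = "Residual")

# Observed response (x-axis) vs. residual (y-axis), with a reference line at 0.
plot(mtcars$mpg, res,
     main = "Observed mpg vs. residual",
     xlab = "Observed miles per gallon",
     ylab = "Residual")
abline(h = 0, lty = 2)
```

To use more than one independent variable, add terms to the formula, for example `mpg ~ wt + hp`.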
#10 April 30 in class Hypothesis testing
  1. Come up with two different hypotheses about your data that are testable using the hypothesis tests we covered in class. For each hypothesis, do the following:
    • Conduct the hypothesis test (you can choose alpha).
    • Add the results of the hypothesis test to your webpage.
    • Write a few sentences on your webpage stating the results of the hypothesis test and interpreting it (ex. can you reject the null hypothesis? Why or why not?).
  2. Add your R code for this milestone to your GitHub account.
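
As one illustration of the steps above, the sketch below runs a two-sample t-test. The choice of test is an assumption, so use whichever tests were covered in class and fit your hypotheses. The null hypothesis here is that the mean of `mpg` is the same for both values of `am` in the stand-in mtcars dataset.

```r
data(mtcars)  # stand-in for your own dataset

alpha <- 0.05  # significance level, chosen before running the test

# Two-sample t-test comparing mean mpg between the two groups defined by am.
# By default t.test() does not assume equal variances (Welch's test).
result <- t.test(mpg ~ am, data = mtcars)
print(result)

# Decision rule: reject the null hypothesis when the p-value is below alpha.
p_value   <- result$p.value
reject_h0 <- p_value < alpha
print(reject_h0)
```

Storing `alpha` and the decision in variables keeps the code in line with what you report on the webpage.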
#11 May 7 in class Your choice (MAT 782: perform two other analyses on your data, completing the following steps for each.)
  1. Perform any other analysis on your dataset. You may also perform one of the previous analyses on a different variable.
  2. Add any plots or the results of computations to your webpage.
  3. Write a few sentences on your webpage interpreting the results or plots for the reader.
  4. Add your R code for this milestone to your GitHub account.