|
Due Date: |
Milestone: |
Details: |
#1 |
Sept. 12 in class |
Choose dataset |
Be able to open your dataset in class in Excel or a Jupyter notebook. |
#2 |
Sept. 19 on Blackboard |
Create GitHub account and upload dataset to GitHub |
Submit your GitHub username on Blackboard. If you wish to keep your code private, please add me (megan073) to the repository or project. If you create a GitHub project for your code, also submit the project name. |
#3 |
Sept. 26 in class |
Create webpage for your project with title and description of dataset |
Easy, free way to create a webpage: sites.google.com
Description of dataset should include:
- why you are interested in it
- where you found the dataset (including link)
- who (what person, government, organization, or company) created the dataset
- when the data is from
- what kind of information is contained in the dataset
|
#4 |
Oct. 3 in class |
Missing data and distribution plots |
- Use describe() or any other method (for example, isna().sum()) to check for missing data. Add a sentence or short paragraph to your webpage that either states there is no missing data or describes how much of the data is missing.
- For each of 4 different columns:
  - Plot the distribution of data in that column. The best plot type is probably a bar chart or histogram, but you are welcome to experiment with other plots. The plot should only contain data from that column.
  - Add a title and axis labels to the plot.
  - Add the plot to your webpage, with a few sentences pointing out any interesting features in the plot (ex. does the distribution look normal, are there outliers, what are the most likely data values). You can also include opinions here, such as whether you are surprised by some aspect of the distribution.
- Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
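
One possible way to approach the missing-data check and a single distribution plot is sketched below. It assumes pandas and matplotlib are installed; the DataFrame, its column names (`height_cm`, `city`), and the bin count are hypothetical stand-ins for your own dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data; in practice, load your own file,
# e.g. df = pd.read_csv("your_file.csv").
df = pd.DataFrame({
    "height_cm": [150.0, 162.5, np.nan, 171.0, 180.2, 165.0],
    "city": ["A", "B", "B", None, "A", "B"],
})

# Number of missing values in each column, and the fraction of rows affected.
missing_counts = df.isna().sum()
missing_share = df.isna().mean()
print(missing_counts)
print(missing_share)

# describe() reports "count", the number of non-missing values,
# for each numeric column.
print(df.describe())

# Histogram of one numeric column, with a title and axis labels.
fig, ax = plt.subplots()
ax.hist(df["height_cm"].dropna(), bins=5)
ax.set_title("Distribution of height")
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Number of rows")
```

For a categorical column, a bar chart of `df["city"].value_counts()` would play the same role as the histogram.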
|
#5 |
Oct. 10 in class |
Outliers and multi-variable plots |
- In the last milestone, you may have noticed some outliers in your dataset. If you would like to remove any:
  - filter your dataset to remove the outliers. You can either filter your dataset the same way each time you use it, or you can create a new .csv file using the command: filtered_df.to_csv("new_csv_file name.csv")
  - add a note on your webpage explaining what outliers you removed and why
  - redo any distribution plots from the previous milestone if they will look significantly different, and add the new plots to your webpage
  - add the code for filtering your data to GitHub
- Make three different plots showing relationships between two or more of your columns:
  - use scatterplots or any other plot type
  - add a title and axis labels to each plot
  - add the plots to your webpage, with a few sentences pointing out any interesting features in each plot (ex. does the relationship look linear, what patterns do you see in the plot, etc.). You can also include opinions here, such as whether you are surprised by some aspect of the relationship.
- Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
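
A minimal sketch of filtering outliers and drawing one two-variable plot follows. The 1.5 × IQR cutoff is only one common convention and is an assumption here, not a requirement of this milestone; the data, the column names (`hours`, `score`), and the output file name are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in data with one obvious outlier in "hours".
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 50],
    "score": [52, 55, 61, 64, 70, 73, 79, 80],
})

# Assumed rule: keep rows within 1.5 * IQR of the quartiles.
q1 = df["hours"].quantile(0.25)
q3 = df["hours"].quantile(0.75)
iqr = q3 - q1
keep = df["hours"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered_df = df[keep]

# Save the filtered data so the same rows can be reused later.
# index=False leaves the row index out of the file.
filtered_df.to_csv("filtered_data.csv", index=False)

# Scatterplot of two columns, with a title and axis labels.
fig, ax = plt.subplots()
ax.scatter(filtered_df["hours"], filtered_df["score"])
ax.set_title("Score vs. hours (outliers removed)")
ax.set_xlabel("Hours")
ax.set_ylabel("Score")
```

Whatever rule you choose, the note on your webpage should state the cutoff and how many rows it removed.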
|
#6 |
Oct. 24 in class |
Linear or logistic regression |
Choose a variable to predict: either a quantitative variable (for linear regression) or a categorical variable with two categories (for logistic regression). For that variable:
- Compute the linear or logistic regression model using one or more of the other data columns as the independent variable(s)
- Assess the fit of the model using at least two different techniques (ex. plots of residuals; split data into testing and training data; computation of a measure like R-squared, mean squared error, sensitivity, specificity, etc.)
- Add a description of the model (including the equation) and the results of checking the model fit to your webpage, along with any relevant plots. Include whether you believe the model performs particularly well or poorly on any part of the data.
Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
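
As an illustration of the linear case only, the sketch below fits a model with scikit-learn and checks it on held-out data using mean squared error, R-squared, and residuals. The synthetic data, the 25% test split, and the random seeds are assumptions made for the example; logistic regression would follow a similar pattern with `LogisticRegression` and classification metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=100)

# Hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training rows only.
model = LinearRegression().fit(X_train, y_train)

# The fitted equation, for the webpage description.
print(f"y = {model.coef_[0]:.2f} * x + {model.intercept_:.2f}")

# Two fit checks on the test rows, plus residuals for a residual plot.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
residuals = y_test - y_pred
print(f"MSE = {mse:.3f}, R-squared = {r2:.3f}")
```

Plotting `residuals` against `y_pred` is one way to see whether the model does worse on some part of the data.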
|
#7 |
Nov. 14 in class |
Decision trees |
Use a decision tree classifier or regressor to predict the same categorical or quantitative variable that you predicted in Milestone 6. You should split your data into training and testing data for this milestone.
- Add the graph of the decision tree model based on the training data to your webpage.
- Use the testing data to make predictions and assess your model. If classifying, compute the confusion matrix and at least two of sensitivity, specificity, precision, or accuracy. If regressing, compute the mean squared error and plot the actual value (x axis) vs. the error or predicted value (y axis).
- Write a few sentences on your webpage summarizing the model (what does it appear to be basing its decisions on) and how well it performs.
Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
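
The classification path might look roughly like the sketch below, which uses scikit-learn. The synthetic two-feature data, the feature names `x1` and `x2`, and the depth limit of 3 are assumptions chosen to keep the tree small enough to read.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data: two features and a two-category label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# stratify=y keeps both categories present in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training data only; max_depth=3 is an assumed limit.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Graph of the fitted tree.
fig, ax = plt.subplots(figsize=(10, 6))
plot_tree(tree, feature_names=["x1", "x2"], class_names=["0", "1"],
          filled=True, ax=ax)

# Confusion matrix on the test data, unpacked as tn, fp, fn, tp.
y_pred = tree.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(sensitivity, specificity, accuracy)
```

The features that appear near the top of the plotted tree are the ones the model relies on most, which can inform the summary on your webpage.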
|
#8 |
Nov. 21 in class |
k-nearest neighbors |
Use the k-nearest neighbors classifier or regressor to predict the same categorical or quantitative variable that you predicted in Milestones 6 and 7. You should split your data into training and testing data for this milestone.
- Use the testing data to make predictions and assess your model. If classifying, compute the confusion matrix and at least two of sensitivity, specificity, precision, or accuracy. If regressing, compute the mean squared error and plot the actual value (x axis) vs. the error or predicted value (y axis).
- Write a few sentences on your webpage summarizing the model (what does it appear to be basing its decisions on) and how well it performs.
Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
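
For the regression path, one possible sketch with scikit-learn is shown below. The synthetic data and the choice of k = 5 are assumptions; with several features on different scales, standardizing them first is usually worth considering, since k-nearest neighbors relies on distances.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: a nonlinear relationship plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(150, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# k = 5 is an assumed starting point; try several values of k.
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Mean squared error on the test data.
y_pred = knn.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"MSE = {mse:.4f}")

# Actual value (x axis) vs. predicted value (y axis).
fig, ax = plt.subplots()
ax.scatter(y_test, y_pred)
ax.set_title("k-nearest neighbors: actual vs. predicted")
ax.set_xlabel("Actual value")
ax.set_ylabel("Predicted value")
```

Points close to the diagonal in the plot are predictions that nearly match the actual values.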
|
#9 |
Dec. 5 in class |
Your choice |
Choose any analysis that is appropriate for your data, such as a choropleth map, a contingency table, clustering, principal components analysis, or prediction of a different variable from the one used in previous milestones.
Add the results of your analysis to your webpage, along with a description of what you did and how to interpret the results.
Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
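
As one example of the options listed above, a contingency table can be built with `pandas.crosstab`. This is a sketch only; the data and the column names (`region`, `outcome`) are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in data with two categorical columns.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "north"],
    "outcome": ["yes", "no", "yes", "yes", "no", "yes"],
})

# Counts for each combination of region and outcome.
table = pd.crosstab(df["region"], df["outcome"])
print(table)

# The same table as row proportions (each row sums to 1).
row_share = pd.crosstab(df["region"], df["outcome"], normalize="index")
print(row_share)
```

Row proportions make it easier to compare groups of different sizes than raw counts do.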
|