MAT 328 Project

For this project, you will choose an interesting dataset and analyze it using the data science techniques learned in class, including developing machine learning models. You will present your findings to the class at the end of the semester as well as display them on a webpage that can be used as part of a portfolio. All code will be uploaded to GitHub.

The Dataset:

Choose any interesting dataset that meets these criteria:

in CSV format (or you have put it in this format)
each row represents one observation
at least four columns of data
at least one column of qualitative data (nominal or ordinal) with duplicate values
at least one column of quantitative data (integer or real number data that are measurements, not categorical data where the categories are numbers (ex. zip codes))
at least 30 rows of data
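
One way to check a candidate dataset against these criteria is a short pandas script. The sketch below is only an illustration: the DataFrame contents, the column names (borough, grade, score, inspections), and the helper check_dataset_criteria are made up for the example. It treats every numeric column as quantitative and every non-numeric column as qualitative, so you still need to confirm by eye that numeric columns are real measurements rather than codes such as zip codes.

```python
import pandas as pd


def check_dataset_criteria(df):
    """Return a dict of True/False checks for the dataset criteria above."""
    # Assumption: non-numeric columns are treated as qualitative.
    qualitative = df.select_dtypes(exclude="number")
    # Assumption: numeric columns are treated as quantitative.
    quantitative = df.select_dtypes(include="number")
    return {
        "at_least_4_columns": df.shape[1] >= 4,
        "at_least_30_rows": df.shape[0] >= 30,
        # True if some qualitative column repeats at least one value.
        "qualitative_with_duplicates": any(
            qualitative[col].duplicated().any() for col in qualitative.columns
        ),
        "has_quantitative_column": quantitative.shape[1] >= 1,
    }


# Hypothetical stand-in for pd.read_csv("your_dataset.csv").
example_df = pd.DataFrame(
    {
        "borough": ["Bronx", "Queens", "Brooklyn"] * 12,
        "grade": ["A", "B", "A", "C"] * 9,
        "score": [float(i) for i in range(36)],
        "inspections": list(range(36)),
    }
)

criteria = check_dataset_criteria(example_df)
print(criteria)
```

To use it on your own data, replace example_df with the DataFrame returned by pd.read_csv on your file.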

Sources for finding your dataset:

NYC Open Data (many other cities, states, and countries also provide open data)
DataQuest's list of 19 places to find datasets for projects
Awesome Public Datasets, organized by topic
Kaggle
data that you or a device you own (ex. Fitbit) have collected

GitHub

GitHub is a free, widely used website for storing code, with a focus on tracking changes and encouraging collaboration. Students can apply for free private repositories.
GitHub resources:

GitHub Basic Tutorial

Project Presentation: Tuesday, December 10 in class (slides due by 10am December 10 on Blackboard)

Each person will give a 2-minute presentation about their project using slides. The presentation should introduce the dataset and describe one or two of the most interesting results. You may use 2 or 3 slides in PDF format. The first slide should be a title slide, with the title of your project and your name.

Project milestones:

	Due Date:	Milestone:	Details:
#1	Sept. 12 in class	Choose dataset	Be able to open your dataset in class in Excel or a Jupyter notebook.
#2	Sept. 19 on Blackboard	Create GitHub account and upload dataset to GitHub	Submit your GitHub username on Blackboard. If you wish to keep your code private, please add me (megan073) to the repository or project. If you create a GitHub project for your code, also submit the project name.
#3	Sept. 26 in class	Create webpage for your project with title and description of dataset	Easy, free way to create a webpage: sites.google.com. Description of dataset should include: why you are interested in it; where you found the dataset (including link); who (what person, government, organization, or company) created the dataset; when the data is from; what kind of information is contained in the dataset.
#4	Oct. 3 in class	Missing data and distribution plots	Use describe() or any other method to see if there is missing data. Add a sentence or short paragraph on your webpage that either states there is no missing data, or describes how much of the data is missing. For 4 different columns, do the following for each column: Plot the distribution of data in that column. The best plot type is probably a bar chart or histogram, but you are welcome to experiment with other plots. The plot should only contain data from that column. Add a title and axes labels to the plot. Add the plot to your webpage, with a few sentences pointing out any interesting features in the plot (ex. does the distribution look normal, are there outliers, what are the most likely data values). You can also include opinions here, such as if you are surprised by some aspect of the distribution. Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
#5	Oct. 10 in class	Outliers and multi-variable plots	In the last milestone, you may have noticed some outliers in your dataset. If you would like to remove any: filter your dataset to remove the outliers (you can either filter your dataset the same way each time you use it, or you can create a new .csv file using the command filtered_df.to_csv("new_csv_file_name.csv")); add a note on your webpage explaining what outliers you removed and why; redo any distribution plots from the previous milestone if they will look significantly different, and add the new plots to your webpage; add the code for filtering your data to GitHub. Make three different plots showing relationships between two or more of your columns: Use scatterplots or any other plot type. Add a title and axes labels to each plot. Add the plots to your webpage, with a few sentences pointing out any interesting features in each plot (ex. does the relationship look linear, what patterns do you see in the plot, etc.). You can also include opinions here, such as if you are surprised by some aspect of the relationship. Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
#6	Oct. 24 in class	Linear or logistic regression	Choose a quantitative variable or categorical variable with two categories to predict using linear or logistic regression. For that variable: Compute the linear or logistic regression model using one or more of the other data columns as the independent variable(s). Assess the fit of the model using at least two different techniques (ex. plots of residuals; split data into testing and training data; computation of a measure like R-squared, mean squared error, sensitivity, specificity, etc.). Add a description of the model (including the equation) and the results of checking the model fit to your webpage, along with any relevant plots. Include whether you believe the model performs particularly well or poorly on any part of the data. Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
#7	Nov. 14 in class	Decision trees	Use a decision tree classifier or regressor to predict the same categorical or quantitative variable that you predicted in Milestone 6. You should split your data into training and testing data for this milestone. Add the graph of the decision tree model based on the training data to your webpage. Use the testing data to make predictions and assess your model. If classifying, compute the confusion matrix and at least two of sensitivity, specificity, precision, or accuracy. If regressing, compute the mean squared error and plot the actual value (x axis) vs. the error or predicted value (y axis). Write a few sentences on your webpage summarizing the model (what does it appear to be basing its decisions on) and how well it performs. Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
#8	Nov. 21 in class	k-nearest neighbors	Use the k-nearest neighbor classifier or regressor to predict the same categorical or quantitative variable that you predicted in Milestones 6 and 7. You should split your data into training and testing data for this milestone. Use the testing data to make predictions and assess your model. If classifying, compute the confusion matrix and at least two of sensitivity, specificity, precision, or accuracy. If regressing, compute the mean squared error and plot the actual value (x axis) vs. the error or predicted value (y axis). Write a few sentences on your webpage summarizing the model (what does it appear to be basing its decisions on) and how well it performs. Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
#9	Dec. 5 in class	Your choice	Choose any analysis that is appropriate for your data, like a choropleth map, a contingency table, clustering, principal components analysis, or prediction of a variable other than the one used in previous milestones. Add the results of your analysis to your webpage, along with a description of what you did and how to interpret the results. Add your Jupyter notebook containing your code to GitHub. It is fine if your notebook also contains notes about the results.
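
For Milestones 4 and 5, the missing-data check and the outlier filter might look something like the sketch below. The DataFrame, the column names trip_minutes and borough, and the 1.5 × IQR rule are all assumptions chosen for illustration. The assignment does not prescribe a particular outlier rule, so use whichever rule you can justify on your webpage.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for pd.read_csv("your_dataset.csv").
df = pd.DataFrame(
    {
        "trip_minutes": [12, 15, 14, 13, 16, 15, 14, 300, np.nan, 13],
        "borough": ["Bronx", "Queens", None, "Bronx", "Queens",
                    "Bronx", "Queens", "Bronx", "Queens", "Bronx"],
    }
)

# Count missing values per column (describe() reports only the non-missing count).
missing_counts = df.isna().sum()
print(missing_counts)

# Flag outliers with the 1.5 * IQR rule (one common choice, not the only one).
q1 = df["trip_minutes"].quantile(0.25)
q3 = df["trip_minutes"].quantile(0.75)
iqr = q3 - q1
in_range = df["trip_minutes"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Rows with a missing trip_minutes value fail the between() test,
# so this filter drops them along with the outliers.
filtered_df = df[in_range]
print(len(df), "rows before filtering,", len(filtered_df), "rows after")

# To save the result, uncomment the next line (the file name is hypothetical):
# filtered_df.to_csv("filtered_dataset.csv", index=False)
```

Whatever rule you pick, report on your webpage how many rows were removed and why.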
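
For Milestone 6, a linear regression with two fit checks (R-squared and mean squared error on held-out testing data) could be sketched as follows. This assumes scikit-learn is installed and uses synthetic data generated in the script, not a real dataset. The variable names and the 75/25 split are illustrative choices only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y is roughly 3x + 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=100)

# scikit-learn expects a 2D array of independent variables.
X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training data only.
model = LinearRegression().fit(X_train, y_train)
slope, intercept = model.coef_[0], model.intercept_
print(f"Model equation: y = {slope:.2f} * x + {intercept:.2f}")

# Assess the fit on the testing data with two different measures.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R-squared: {r2:.3f}, mean squared error: {mse:.3f}")

# Residuals (actual minus predicted) can be plotted against x to look for patterns.
residuals = y_test - y_pred
```

For a categorical variable with two categories, the same pattern applies with a logistic regression model and measures such as sensitivity and specificity in place of R-squared.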
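
For Milestones 7 and 8, when classifying, the confusion matrix and the measures derived from it could be computed as in the sketch below. It assumes scikit-learn is installed and uses two synthetic clusters of points instead of a real dataset. The helper name assess, the tree depth of 3, and k = 5 are hypothetical choices for the example, not requirements of the assignment.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (an assumption for illustration): two point clouds.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(-2, 1, size=(100, 2)),  # class 0
    rng.normal(2, 1, size=(100, 2)),   # class 1
])
y = np.array([0] * 100 + [1] * 100)

# stratify=y keeps both classes represented in the training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)


def assess(model):
    """Fit on the training data, then score predictions on the testing data."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # With labels=[0, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_test),
    }


results = {
    "decision tree": assess(DecisionTreeClassifier(max_depth=3, random_state=1)),
    "k-nearest neighbors": assess(KNeighborsClassifier(n_neighbors=5)),
}
for name, scores in results.items():
    print(name, scores)
```

Running both models through the same function makes it easy to compare them on the same testing data.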