HW #5, Data Science at Lehman College, CUNY, Spring 2017

Datasets

This assignment uses the following datasets:

Name data collected by the Social Security Administration.
The weather data for the month of January from Homework #1.
The zipcode collision data collected for Homework #3.
Spam data collected by Apache.

The Social Security Administration tracks the most popular names given each year, both nationally and by state. For this assignment, you will need 21 years of state name data. You can use the nystate.tar file, which covers New York State from 1990 to 2010, or you may download a different state (or time range) from the SSA data page.

Spam Data

This assignment uses data collected and made publicly available by Apache, which can be found at:

http://spamassassin.apache.org/publiccorpus/

For this assignment, you will need to download three different data sets:

20021010_easy_ham.tar.bz2
20021010_hard_ham.tar.bz2
20021010_spam.tar.bz2

(If you are on a Windows machine, you might need a program like 7-Zip to decompress and extract the data files.)
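
If no separate tool is at hand, Python's standard tarfile module can also unpack .tar.bz2 archives on any platform. The following is a minimal sketch, assuming the three archives have already been downloaded into the current directory; the helper name extract_archive and the destination folder spam_data are illustrative choices, not part of the assignment.

```python
import tarfile
from pathlib import Path

def extract_archive(archive_path, dest_dir="spam_data"):
    """Unpack a .tar.bz2 archive into dest_dir and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" opens a bzip2-compressed tar file for reading.
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names

if __name__ == "__main__":
    archives = [
        "20021010_easy_ham.tar.bz2",
        "20021010_hard_ham.tar.bz2",
        "20021010_spam.tar.bz2",
    ]
    for name in archives:
        if Path(name).exists():
            members = extract_archive(name)
            print(name, "->", len(members), "entries extracted")
        else:
            print(name, "not found; download it first")
```

As with any archive tool, only extract archives from a source you trust.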

We will use these data sets in later homework assignments. Since scraping the data takes time, save them so that you can reuse them in future programs.

Assignment

The work to be submitted is the same for both the undergraduate and graduate course.

	CMP 464/788 Homework:
#1-2	Use regular expressions (regex) to search for name occurrences in the Social Security Administration data. Choose a name that can be spelled 3 or more ways (for example, "Megan" has alternative spellings of "Meaghan", "Megyn", "Meghan", etc.). Use regex to combine the totals from the different spellings, and graph the combined totals over the 21 years of state data. #1: Submit your Python program as a .py file. #2: Submit a screen shot of the graphics window containing the plot.
#3-4	Are collisions correlated with temperature? Limit your zipcode data set to dates in January 2017 only (either write a quick filter program or download the data again with limited dates). Using a dictionary structure of your choice, count the number of collisions that occurred in your zipcode on each day in January. On the same plot, plot the number of collisions and the daily temperature against the date (see twoPlots.py for graphing plots with different scales on the same image). #3: Submit your Python program as a .py file. #4: Submit a screen shot of the graphics window containing the plot.
#5-6	Extend the textbook's analysis of the spam data set to count plurals (all words of 4 or more characters that end in a single s) and -est words (all words of 5 or more letters ending in -est) as their base words. See the discussion in the textbook. #5: Submit your Python program as a .py file. #6: Submit a text file with your results and conclusion: how much more spam did you find with this extension? Did this increase the amount of real mail ('ham') that was identified as spam?
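
For problems #1-2, one possible approach is a single pattern that accepts every spelling, with the counts summed per year. The following is a minimal sketch, assuming each record has the comma-separated form state,sex,year,name,count used in the SSA state files (check the files in nystate.tar, whose layout may differ). The pattern, the function name totals_by_year, and the sample rows (with made-up counts) are illustrative only.

```python
import re
from collections import defaultdict

# Accepts Megan, Meghan, Meaghan, Megyn (and a few similar spellings).
NAME_PATTERN = re.compile(r"Mea?gh?[ay]n", re.IGNORECASE)

def totals_by_year(lines, pattern=NAME_PATTERN):
    """Sum the counts of all names fully matching pattern, keyed by year."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.strip().split(",")
        # Skip blank or malformed lines (assumed layout: 5 fields).
        if len(fields) != 5 or not fields[2].isdigit() or not fields[4].isdigit():
            continue
        state, sex, year, name, count = fields
        # fullmatch requires the whole name to match, so "Morgan" is excluded.
        if pattern.fullmatch(name):
            totals[int(year)] += int(count)
    return dict(totals)

# Made-up sample rows, for illustration only.
sample = [
    "NY,F,1990,Megan,100",
    "NY,F,1990,Meghan,40",
    "NY,F,1990,Morgan,75",
    "NY,F,1991,Meaghan,12",
    "NY,F,1991,Megyn,5",
]
print(totals_by_year(sample))  # {1990: 140, 1991: 17}
```

The resulting dictionary maps each year to a combined total, which can then be sorted by year and passed to a plotting function.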
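
For problems #3-4, the per-day tally can be kept in an ordinary dictionary keyed by date. The following is a minimal sketch, assuming the collision file is a CSV with a date column named DATE in MM/DD/YYYY format; both are assumptions, so adjust the column name and format to match your download. The function name collisions_per_day and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import datetime

def collisions_per_day(csv_file, date_column="DATE", year=2017, month=1):
    """Count rows per calendar day, keeping only the given year and month."""
    counts = {}
    for row in csv.DictReader(csv_file):
        day = datetime.strptime(row[date_column], "%m/%d/%Y").date()
        if day.year == year and day.month == month:
            counts[day] = counts.get(day, 0) + 1
    return counts

# Made-up sample data standing in for the downloaded file.
sample = io.StringIO(
    "DATE,ZIP CODE\n"
    "01/03/2017,10468\n"
    "01/03/2017,10468\n"
    "01/15/2017,10468\n"
    "02/01/2017,10468\n"
)
counts = collisions_per_day(sample)
for day in sorted(counts):
    print(day.isoformat(), counts[day])
```

Days with no collisions do not appear in the dictionary, so fill in zeros for the missing dates before plotting the counts against the daily temperatures.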
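
For problems #5-6, the two rules can be applied to each word before it is counted. The following is a minimal sketch of such a normalization step, not the textbook's code: the names normalize_word and tokenize and the word-splitting pattern are assumptions, and "a single s" is read here as "ends in s but not in ss".

```python
import re

def normalize_word(word):
    """Reduce crude plurals and -est forms to a base word."""
    word = word.lower()
    # Words of 5 or more letters ending in -est: drop the suffix.
    if len(word) >= 5 and word.endswith("est"):
        return word[:-3]
    # Words of 4 or more characters ending in a single s: drop the s.
    if len(word) >= 4 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def tokenize(text):
    """Return the set of distinct normalized words in a message."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {normalize_word(w) for w in words}

print(sorted(tokenize("Cheapest offers, best class")))
# ['best', 'cheap', 'class', 'offer']
```

These rules are deliberately crude: "forest" becomes "for", for example, which is worth keeping in mind when interpreting how many ham messages are flagged as spam.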

Homework #5

CMP 464/788:
Topics Course: Data Science
Spring 2017
