Homework #5

CMP 464/788:
Topics Course: Data Science
Spring 2017

Topics: Bayes' Theorem, Simpson's Paradox, & Regular Expressions
Deadline: Thursday, March 9, 2017, 11:59pm

Textbook's Code

This assignment uses the Naive Bayes Spam Filter developed by the textbook's author and available at:

https://github.com/joelgrus/data-science-from-scratch/blob/master/code-python3/naive_bayes.py

Datasets

This assignment uses the following datasets:

The Social Security Administration keeps track of the most popular names given each year, both nationally and by state. For this assignment, you will need 21 years of state name data. You can use the nystate.tar file, which covers New York State from 1990 to 2010, or you may download a different state (or time range) from the SSA data page.

Spam Data

This assignment uses data collected and made publicly available by Apache, which can be found at:

http://spamassassin.apache.org/publiccorpus/

For this assignment, you will need to download three different data sets:

  1. 20021010_easy_ham.tar.bz2
  2. 20021010_hard_ham.tar.bz2
  3. 20021010_spam.tar.bz2
(If you are on a Windows machine, you might need a program like 7-Zip to decompress and extract the data files.)

We will use these data sets for later homework assignments. Since downloading and extracting the data takes time, save these data sets to use again in future programs.

Assignment

The work to be submitted is the same for both the undergraduate and graduate course.

CMP 464/788 Homework:
#1-2 Use regular expressions (regex) to search for name occurrences in the Social Security Administration data. Choose a name that can be spelled 3 or more ways (for example, "Megan" has alternative spellings of "Meaghan", "Megyn", "Meghan", etc.). Use regex to combine the totals from the different spellings, and graph the combined total for each of the 21 years of state data.

#1: Submit your Python program as a .py file.
#2: Submit a screen shot of the graphics window containing the plot.
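
As a starting point for #1, here is a minimal sketch of combining spellings with a single regex. It assumes each line of the state file looks like state,sex,year,name,count; check your own files, since the layout may differ. The pattern, the function name combined_totals, and the sample lines are made up for illustration and are not real SSA counts.

```python
import re
from collections import defaultdict

# One pattern covering Megan, Meagan, Meghan, Meaghan, Megyn (and similar).
# ^ and $ keep it from matching longer names that merely contain these letters.
NAME_PATTERN = re.compile(r"^Mea?gh?[ay]n$", re.IGNORECASE)


def combined_totals(lines):
    """Sum the counts of all matching spellings, keyed by year.

    Assumes each line is 'state,sex,year,name,count' (an assumption
    about the file layout, not something the assignment guarantees).
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        _state, _sex, year, name, count = fields
        if NAME_PATTERN.match(name):
            totals[int(year)] += int(count)
    return dict(totals)


# Made-up sample lines, for illustration only.
sample = [
    "NY,F,1990,Megan,10",
    "NY,F,1990,Meghan,5",
    "NY,F,1990,Meaghan,2",
    "NY,F,1990,Morgan,7",
    "NY,F,1991,Megyn,1",
    "NY,F,1991,Megan,8",
]
totals = combined_totals(sample)
for year in sorted(totals):
    print(year, totals[year])
```

The per-year totals in the dictionary can then be passed to whatever plotting routine you use for #2.
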
#3-4 Are collisions correlated with temperature? Limit your zipcode data set to dates in January 2017 only (either write a quick filter program or download the data again with limited dates). Using a dictionary structure of your choice, count the number of collisions that occurred in your zipcode on each day in January. On the same plot, plot the number of collisions and the daily temperature against the date (see twoPlots.py for graphing plots with different scales on the same image).

#3: Submit your Python program as a .py file.
#4: Submit a screen shot of the graphics window containing the plot.
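
The counting step for #3 might look like the sketch below. It assumes the collision data is a CSV file with a date column named DATE in MM/DD/YYYY format; both the column name and the date format are assumptions, so adjust them to match your download. The function name collisions_per_day and the sample rows are made up. Plotting is left out here; see twoPlots.py for that part.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def collisions_per_day(rows, date_field="DATE", year=2017, month=1):
    """Count rows per calendar day, keeping only the given month.

    `rows` is any iterable of dicts (for example a csv.DictReader).
    The default column name "DATE" and the MM/DD/YYYY format are
    assumptions about the data file.
    """
    counts = Counter()  # a dictionary subclass: day -> number of collisions
    for row in rows:
        day = datetime.strptime(row[date_field], "%m/%d/%Y").date()
        if day.year == year and day.month == month:
            counts[day] += 1
    return counts


# Made-up sample rows, for illustration only.
sample_csv = io.StringIO(
    "DATE,ZIP CODE\n"
    "01/01/2017,10468\n"
    "01/01/2017,10468\n"
    "01/02/2017,10468\n"
    "02/01/2017,10468\n"
)
daily = collisions_per_day(csv.DictReader(sample_csv))
for day in sorted(daily):
    print(day.isoformat(), daily[day])
```

Days with no collisions do not appear as keys, so you may want to fill in zeros for the missing dates before plotting against the temperature series.
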
#5-6 Extend the textbook's analysis of the spam data set to handle plurals and -est words. Treat every word of 4 or more characters that ends in a single s as a plural, and count every word of 5 or more letters that ends in -est as its base word. See the discussion in the textbook.

#5: Submit your Python program as a .py file.
#6: Submit a text file with your results and conclusion: how much more spam did you find with this extension? Did it increase the amount of real mail ('ham') that was identified as spam?
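
One possible way to approach #5 is to normalize each word before it reaches the classifier. The sketch below is only an illustration: the tokenize function is modeled on the one in the textbook's naive_bayes.py (compare it against the actual file), normalize is a hypothetical helper, and stripping the suffix is a crude stand-in for finding the true base word (for example, "request" becomes "requ").

```python
import re


def normalize(word):
    """Map plurals and -est forms to a rough base word.

    Hypothetical helper, not part of the textbook's code.
    """
    # -est rule: words of 5 or more letters ending in "est".
    if len(word) >= 5 and word.isalpha() and word.endswith("est"):
        return word[:-3]
    # Plural rule: 4 or more characters ending in a single s (not "ss").
    if len(word) >= 4 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def tokenize(message):
    """Lowercase the message, extract words, and return the distinct
    normalized words as a set."""
    message = message.lower()
    all_words = re.findall(r"[a-z0-9']+", message)
    return {normalize(word) for word in all_words}


words = tokenize("Cheapest deals, fastest results! No fuss.")
print(sorted(words))
```

Because "cheapest" and "cheap" now count as the same token, their evidence is pooled. Whether that catches more spam or misclassifies more ham is exactly what #6 asks you to measure.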

Submitting Homework

To submit your homework, log on to the Blackboard system, and go to "Homework". For each part of the homework, there is a separate input box. You may submit the homework as many times as you would like before the deadline.