# Homework #6

## CMP 464/788: Topics Course: Data Science Spring 2017

Topics: A/B Testing, Simple Classifiers
Deadline: Thursday, 16 March 2017, 11:59pm

### Textbook's Code

The computer science assignment uses the Naive Bayes Spam Filter developed by the textbook's author and available at:

https://github.com/joelgrus/data-science-from-scratch/blob/master/code-python3/naive_bayes.py

You may also find the hypothesis and inference code useful:

https://github.com/joelgrus/data-science-from-scratch/blob/master/code-python3/hypothesis_and_inference.py

### Datasets

This assignment uses the following dataset:
• https://www.ssa.gov/oact/babynames/names.zip, which contains name data collected by the Social Security Administration. This .zip file contains one text file per year that lists all baby names for that year with at least 5 occurrences, along with the sex of the baby and the number of occurrences. See the readme file NationalReadMe.pdf inside the .zip file for details. We will use the files for 2014 and 2015 (yob2014.txt and yob2015.txt).
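
As a starting point for reading the data, here is a minimal sketch that pulls one yearly file out of the archive. It assumes that names.zip has been downloaded to the working directory and that each line has three comma-separated fields in the order name, sex, count; check NationalReadMe.pdf to confirm the layout. The function name `read_name_records` is made up for this illustration.

```python
import csv
import io
import zipfile


def read_name_records(zip_path, member):
    """Return a list of (name, sex, count) tuples from one yearly file.

    Assumes each line looks like `name,sex,count` (an assumption to verify
    against the readme that ships with the archive).
    """
    records = []
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so the csv module can read text lines.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                if len(row) != 3:
                    continue  # skip blank or malformed lines
                name, sex, count = row
                records.append((name, sex, int(count)))
    return records
```

For example, `read_name_records("names.zip", "yob2014.txt")` would return the 2014 records as a list of tuples.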

### Assignment

The work to be submitted is the same for the undergraduate and graduate course.

CMP 464/788 Homework:
#1 The Department of Transportation (DOT), as part of Vision Zero, wants to reduce accidents and speeding on roadways across the city, and would like to know which sign has a larger effect on speeding. They tested two different messages: the first sign says "Speeding Kills" and the second sign displays the speed at which the car is moving. Data was collected for both signs:
• For the first sign ("Speeding Kills"), 140 out of 1200 cars were observed going the speed limit.
• For the second sign (showing the current speed), 150 out of 1100 cars were observed going the speed limit.
The second seems more effective. Could this have happened by chance? What is the probability that you would see such a difference if the signs were equally effective at slowing traffic? Justify your answer.

#1: Submit a .pdf or .png file of your neatly handwritten or typed answer.
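
For #1, one possible approach is the A/B test statistic from the textbook's hypothesis and inference chapter. The sketch below restates that idea in self-contained form; it uses unpooled standard errors and a normal approximation, which is one reasonable choice but not the only one. The counts in the usage note are illustrative and are deliberately not the counts from the problem.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a normal distribution."""
    return (1 + math.erf((x - mu) / (math.sqrt(2) * sigma))) / 2


def estimated_parameters(N, n):
    """Estimate the success rate and its standard error from n successes in N trials."""
    p = n / N
    sigma = math.sqrt(p * (1 - p) / N)
    return p, sigma


def a_b_test_statistic(N_A, n_A, N_B, n_B):
    """Difference in estimated rates, measured in standard errors."""
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    return (p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2)


def two_sided_p_value(z):
    """Probability of a statistic at least this extreme if the true rates are equal."""
    return 2 * (1 - normal_cdf(abs(z)))


# Illustrative counts only: 200 of 1000 versus 180 of 1000.
z = a_b_test_statistic(1000, 200, 1000, 180)
p_value = two_sided_p_value(z)
```

With these illustrative counts, `z` is about -1.14 and `p_value` is about 0.254, so a difference of that size would not be surprising if the two rates were equal. Your written answer should apply the same reasoning to the counts given in the problem and explain the conclusion.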
#2-7 Write a classifier program that predicts whether a name is a boy's or a girl's name based on the last letters of the name.
#2: Write a Python program that takes as input a Social Security Administration name file (see above for files and format) and outputs three files. The first file should have 26 lines (one for each letter of the alphabet). Each line contains three values: the letter, the fraction of boys' names in the training set (the input file) that end in that letter, and the fraction of girls' names in the training set that end in that letter. For example, a possible file could start:
```
a, 0.023, 0.451
b, 0.010, 0.008
...
```
The second file should have the same format as the first, except that the numbers on each line should be the fractions of boys' and girls' names that have that letter as the second-to-last letter.
The third file should have the same format as the first and second, except that the numbers on each line should be the fractions of boys' and girls' names that have that letter as the third-to-last letter.
#2: Submit your Python program as a .py file.
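
The counting step in #2 can be sketched as follows. This is only one possible reading of the problem: it counts each distinct name once (rather than weighting by the number of babies), treats sex as "M" or "F", and lets names shorter than the requested position contribute to the total but to no letter. The names `letter_fractions` and `format_rows` are hypothetical.

```python
import string
from collections import Counter


def letter_fractions(names_by_sex, position):
    """Return 26 rows of (letter, boys_fraction, girls_fraction).

    names_by_sex is an iterable of (name, sex) pairs; position=1 means the
    last letter, 2 the second-to-last, and 3 the third-to-last.
    """
    counts = {"M": Counter(), "F": Counter()}
    totals = {"M": 0, "F": 0}
    for name, sex in names_by_sex:
        if sex not in counts:
            continue  # ignore unexpected sex codes
        totals[sex] += 1
        if len(name) >= position:
            counts[sex][name[-position].lower()] += 1

    rows = []
    for letter in string.ascii_lowercase:
        boys = counts["M"][letter] / totals["M"] if totals["M"] else 0.0
        girls = counts["F"][letter] / totals["F"] if totals["F"] else 0.0
        rows.append((letter, boys, girls))
    return rows


def format_rows(rows):
    """Format rows as 'letter, boys, girls' lines with three decimal places."""
    return [f"{letter}, {boys:.3f}, {girls:.3f}" for letter, boys, girls in rows]
```

Calling `letter_fractions` three times with `position` set to 1, 2, and 3, and writing the lines from `format_rows` to three separate files, would produce output in the format shown above.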

#3: Run your program with yob2014.txt and submit the first file generated by your program (that is, the fractions for the last letter of the name). Submit your file as a .txt file.

#4: Run your program with yob2014.txt and submit the second file generated by your program (that is, the fractions for the second-to-last letter of the name). Submit your file as a .txt file.

#5: Run your program with yob2014.txt and submit the third file generated by your program (that is, the fractions for the third-to-last letter of the name). Submit your file as a .txt file.

#6: Modify the Naive Bayes spam filter program so that it reads in the three files generated above as well as a fourth file of test data (use another year's data, such as yob2015.txt). Instead of multiplying together word occurrences, your program should multiply the probabilities of the last letter, second-to-last letter, and third-to-last letter that you computed above. Your program should classify each name in the test data (similar to the Naive Bayes filter from the book) and report the percentage of names your program predicted correctly as well as the names it predicted incorrectly.
Submit your Python program as a .py file.
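
The classification step in #6 might look like the sketch below. It is not the textbook's code: it assumes equal prior probabilities for the two classes, sums logarithms instead of multiplying (to avoid underflow, as the textbook's filter also does), and replaces zero fractions with a small floor value, which is an assumption standing in for proper smoothing. All function names are hypothetical.

```python
import math


def load_fractions(lines):
    """Parse 'letter, boys, girls' lines into {letter: (p_boy, p_girl)}."""
    table = {}
    for line in lines:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        letter, boys, girls = parts
        table[letter] = (float(boys), float(girls))
    return table


def classify(name, tables, floor=1e-6):
    """Return 'M' or 'F' for a name.

    tables[0] holds last-letter fractions, tables[1] second-to-last,
    tables[2] third-to-last. Ties are resolved as 'F' (an arbitrary choice).
    """
    log_boy = log_girl = 0.0
    for position, table in enumerate(tables, start=1):
        if len(name) < position:
            continue  # name too short to have a letter at this position
        p_boy, p_girl = table.get(name[-position].lower(), (0.0, 0.0))
        # The floor keeps log() defined when a fraction is zero.
        log_boy += math.log(max(p_boy, floor))
        log_girl += math.log(max(p_girl, floor))
    return "M" if log_boy > log_girl else "F"


def accuracy(labelled_names, tables):
    """Return (fraction correct, list of misclassified names)."""
    wrong = [name for name, sex in labelled_names if classify(name, tables) != sex]
    return 1 - len(wrong) / len(labelled_names), wrong
```

Here `labelled_names` would be the (name, sex) pairs from the test file, and `tables` the three dictionaries loaded from the files produced in #2.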

#7: Submit the output of your program (the percentage you correctly predicted along with the names you predicted incorrectly) as a .txt file.

### Submitting Homework

To submit your homework, log on to the Blackboard system, and go to "Homework". For each part of the homework, there is a separate input box. You may submit the homework as many times as you would like before the deadline.