The computer science assignment uses the Naive Bayes Spam Filter developed by the textbook's author and available at:
https://github.com/joelgrus/data-science-from-scratch/blob/master/code-python3/naive_bayes.py
You may also find the hypothesis and inference code useful:
https://github.com/joelgrus/data-science-from-scratch/blob/master/code-python3/hypothesis_and_inference.py
The work to be submitted is the same for the undergraduate and graduate courses.
CMP 464/788 Homework: | |
---|---|
#1 |
The Department of Transportation (DOT), as part of Vision Zero, is interested in reducing accidents and speeding on roadways across the city. They want to know which signage has the larger effect on speeding. They collected data with two different messages: the first sign says "Speeding Kills" and the second sign displays the speed at which the car is moving. Data is collected for both signs:
#1: Submit a .pdf or .png file of your neatly handwritten or typed answer. |
#2-7 |
Write a classifier program that predicts whether a name is a boy's or a girl's name based on the last letters of the name.
#2: Write a Python program that takes as input a Social Security Administration name file (see above for files and format) and outputs three files.
The first file should have 26 lines (one for each letter of the alphabet). Each line contains three values: the letter, the fraction of boys' names in the training set (the input file) that end in that letter, and the fraction of girls' names in the training set that end in that letter. For example, a possible file could start:
a, 0.023, 0.451
b, 0.010, 0.008
...
The second file should have the same format as the first, except that on each line the numbers should be the fractions of boys' and girls' names that have that letter as the second-to-last letter.
The third file should have the same format as the first and second, except that on each line the numbers should be the fractions of boys' and girls' names that have that letter as the third-to-last letter.
#2: Submit your Python program as a .py file.
#3: Run your program with yob2014.txt and submit the first file generated by your program (that is, the fractions for the last letter of the name). Submit your file as a .txt file.
#4: Run your program with yob2014.txt and submit the second file generated by your program (that is, the fractions for the second-to-last letter of the name). Submit your file as a .txt file.
#5: Run your program with yob2014.txt and submit the third file generated by your program (that is, the fractions for the third-to-last letter of the name). Submit your file as a .txt file.
#6: Modify the Naive Bayes spam filter program to read in the three files generated above as well as a fourth file of test data (use another year's data, such as yob2015.txt). Instead of multiplying together word occurrences, your program should multiply the probabilities of the last letter, second-to-last letter, and third-to-last letter that you computed above. Your program should classify each name in the test data (similar to the Naive Bayes filter from the book) and report the percentage of names it predicted correctly as well as the names it predicted incorrectly. Submit your Python program as a .py file.
#7: Submit the output of your program (the percentage you correctly predicted along with the names you predicted incorrectly) as a .txt file. |
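The steps in #2 through #7 can be sketched roughly as follows. This is only one possible starting point, not the required solution, and it rests on several assumptions that the assignment does not state: each line of the name file has the form name,sex,count with sex given as M or F; every listed name counts once regardless of its count; names too short to have a letter at a given position are skipped for that position; both classes get equal prior weight; and a small constant `k` is added before taking logarithms so that a zero fraction does not break the calculation. The function names (`read_names`, `letter_fractions`, `format_lines`, `classify`, `evaluate`) and the sample names are made up for illustration. Summing logarithms is used in place of multiplying the three probabilities directly, which gives the same comparison.

```python
import csv
import math
from collections import Counter
from string import ascii_lowercase


def read_names(path):
    """Read (name, sex) pairs, assuming lines of the form name,sex,count."""
    with open(path, newline="") as f:
        return [(row[0], row[1]) for row in csv.reader(f) if row]


def letter_fractions(rows, position):
    """Map each letter to (boys' fraction, girls' fraction) at `position`.

    position is -1 for the last letter, -2 for the second to last,
    and -3 for the third to last.
    """
    counts = {"M": Counter(), "F": Counter()}
    totals = Counter()
    for name, sex in rows:
        name = name.lower()
        if len(name) < -position:
            continue  # assumption: skip names too short for this position
        counts[sex][name[position]] += 1
        totals[sex] += 1
    return {
        letter: (counts["M"][letter] / max(totals["M"], 1),
                 counts["F"][letter] / max(totals["F"], 1))
        for letter in ascii_lowercase
    }


def format_lines(table):
    """Render a table as lines such as 'a, 0.023, 0.451'."""
    return [f"{letter}, {boys:.3f}, {girls:.3f}"
            for letter, (boys, girls) in table.items()]


def classify(name, tables, k=1e-6):
    """Return 'M' or 'F' by comparing summed log-probabilities."""
    name = name.lower()
    log_m = log_f = 0.0
    for position, table in tables.items():
        if len(name) < -position:
            continue
        p_m, p_f = table.get(name[position], (0.0, 0.0))
        log_m += math.log(p_m + k)  # k avoids log(0); an assumed smoothing
        log_f += math.log(p_f + k)
    return "M" if log_m > log_f else "F"


def evaluate(test_rows, tables):
    """Return (fraction classified correctly, list of misclassified names)."""
    wrong = [name for name, sex in test_rows if classify(name, tables) != sex]
    return 1 - len(wrong) / len(test_rows), wrong


# Tiny made-up training and test sets standing in for the real name files.
train = [("Emma", "F"), ("Olivia", "F"), ("Sophia", "F"),
         ("Noah", "M"), ("Liam", "M"), ("Mason", "M")]
test = [("Ava", "F"), ("Ethan", "M"), ("Joshua", "M")]

tables = {pos: letter_fractions(train, pos) for pos in (-1, -2, -3)}
accuracy, wrong = evaluate(test, tables)
print(format_lines(tables[-1])[0])
print(f"{accuracy:.1%} correct; misclassified: {wrong}")
```

With real data, `read_names("yob2014.txt")` would replace the `train` list and another year's file would replace `test`; the lines from `format_lines` are what would be written to each of the three output files.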