Instructor: Prof. Megan Owen
E-mail: megan.owen@lehman.cuny.edu
Phone: 718-960-7423
Office hours: Monday and Wednesday, 9:30 - 10:50am, Gillet 137E, or immediately after class on Mondays and 1:15-1:45pm on Wednesdays
Course time: 1:50pm - 3:30pm on Mondays (in Gillet 225) and Wednesdays (in Gillet 231)

Python

Link to Python resources

Textbook:

Data Science from Scratch by Joel Grus. (ISBN 978-1491901427). Approximately $30. Available online from Amazon and O'Reilly

Grading:

Problem Sets 30%
Quizzes 20%
Project 20%
Final exam 30%
You must take and pass the final to pass the course.

Syllabus

Academic Integrity Policy

Note: While I encourage you to help each other on assignments and the project, you should never share your code with another student. If you do and that student submits your code, you will both receive 0 on the assignment/project.

Weekly Problem Sets: see right column of course outline below and Blackboard

Quizzes: see Blackboard

Project

Outline:

Date: Topics: Handouts: Reading: Quiz Topics: HW/Project:
#1
Mon 30 January
First Day Details, Topics Overview, Mean, variance and random variables Syllabus, DS venn diagram,
Gallery: NY density, nearest airport, precincts, citibike, buses vs. subways, transit + census, life spans, ebola, disease, jobs;
Printing (from __future__), Textbook's repo, summaries sometimes hide the big picture, Anscombe's Quartet
Academic Integrity Policy,
Chapters 1-3
#2
Wed 1 February
Lab
Python 2 vs. 3, Python Refresher: basics; Quick look at matplotlib's line and bar charts; More on matplotlib: histograms and scatterplots; Data as vectors: scaling, dot products;
Python Refresher: list comprehensions & zip
basic stats, list comprehension examples, list comprehension tutorial, matplotlib, Textbook's repo, Plotting recipes Chapters 2,4,5
#3
Mon 6 February
Scaling and dot product; correlation & causation
Python Refresher: lists & tuples
weather3.py, lymeScaled.py, book's stats.py (depends on linear_algebra.py), Anscombe's Quartet, correlation guessing Chapters 2,5 #1: Academic Integrity HW #1: Simple graphs with pyplot
#4
Wed 8 February
Lab
Applying correlation, Simpson's Paradox;
Getting Data: CSV Files
Python refresher: lists & tuples, dictionaries
lists vs. tuples, dictionary examples, lymeScaled.py, simple csv example & data, Simpson's paradox wiki, wage growth paradox Chapters 2,6,9
Mon 13 February Lincoln's Birthday - Lehman is closed
15 February Classes follow Monday schedule
#5
Wed 15 February
Monday schedule
Probability: Distributions & Central Limit Theorem;
CSV Files
Simpson's paradox wiki, wage growth paradox, simple csv example & data, normal distribution calculator, rolling dice, Central Limit Theorem Visualized, Matt Nedrich on CLT Chapters 5,6, 9 #2: Python Basics HW #2: Scaling Vector Data
20 February President's Day - Lehman is closed
#6
Wed 22 February
Lab
Causation vs. Correlation, CSV Files
Python Refresher: collections, regular expressions
simple csv example & data, dsWiki.txt (for group work), regex cheat sheet, regex online tester, correlation does not equal causation Chapters 2,9 #3: Vectors, Means, and Variances HW #3: Binning Data & Measuring Dispersion
#7
Mon 27 February
Bayes Theorem; Naive Bayes: Spam Filter Example
regex online tester, book's naive Bayes spam filter, spam dataset Chapters 6,13 #4: Python Lists, Dictionaries, & csv HW #4: Correlations & Distributions
#8
Wed 1 March
Lab
Naive Bayes: Spam Filter Example; Python Refresher: more on matplotlib & sets twoPlots.py, subplots, book's naive Bayes spam filter, spam dataset Chapters 2,7
#9
Mon 6 March
Hypothesis & Inference: Confidence Intervals; More on Confidence Intervals, A/B Testing; Khan Academy on confidence intervals, Khan Academy on hypothesis testing, normal distribution calculator, numpy, plotting revisited Chapters 7,25 #5: Correlation & Regular Expressions HW #5: Bayes Theorem, Simpson's Paradox, & Regular Expressions
#10
Wed 8 March
Lab
Hypothesis & Inference: Confidence Intervals; More on Confidence Intervals, A/B Testing continued scipy lecture notes on arrays, arrays & images, 3d surface example code, mplot3d tutorial, matplotlib colormaps Chapters 8,9,25
#11
Mon 13 March
Gradient descent, Linear Algebra Refresher: Eigenvalues & Eigenvectors Example: Simple Linear Regression Matt Nedrich's intro to gradient descent & example, Quinn Liu's gradient descent image, Andrew Ng's linear regression notes;
Eigenvectors & eigenvalues, visually, linear transformations example
Chapters 2,8,9 #6: Bayes Theorem HW #6: A/B Testing
#12
Wed 15 March
Lab
Manipulating image files with numpy
Python Refresher: numpy
numpy: plotting revisited, detailed numpy tutorial, numpy cheatsheet;
scipy lecture notes on arrays, arrays & images;
regression and GitHub classwork
Chapters 9,10
#13
Mon 20 March
Eigenvectors and eigenvalues; review: gradient descent and linear regression Matt Nedrich's intro to gradient descent & example;
Eigenvectors & eigenvalues, visually
Chapters 2,10,25 #7: Hypothesis & Inference HW #7: Gradient Descent & Images
#14
Wed 22 March
Lab
Using github; using Pandas and Seaborn for correlation and regression;
github for beginners, github Hello World, github student pack, github cheat sheet;
regression and GitHub classwork;
Folium classwork, Folium tutorial
Chapters 5,25
#15
Mon 27 March
Computing eigenvalues and eigenvectors; Working with Multidimensional Data: Rescaling, Principal Components Analysis Example of using numpy to compute eigenvalues and eigenvectors;
PCA, explained visually, Lindsay Smith's computing PCA, Sebastian Raschka's PCA overview and implementation in Python;
scipy, sklearn's PCA, pca on iris dataset, NY Fed's unemployment rates and by major
Chapters 2,10,25 #8: Gradient Descent & numpy HW #8: Mapping Data

#16
Wed 29 March
Lab
Principal Components Analysis via scikit-learn; JSON and geoJSON; choropleth maps ESRI's shapefiles, shapefile wikipage, JSON, KML, summary & comparison;
geometric interpretation of covariance matrix, PCA explained in greater and greater detail (first answer), sample PCA code, PCA method in scikit-learn, PCA on the iris dataset;
geoJSON and choropleth Lab, geoJSON specifications, geoJSON editor
Chapters 2,11,12
#17
Mon 3 April
Nearest Neighbors & Voronoi Diagrams;
Clustering: k-means
nearest airport, precincts' Voronoi diagram, Voronoi diagrams from triangulations, scipy Voronoi module
k-means (wiki), k-means image example, scikit-learn clustering,
Chapters 12,19 #9: Eigenvectors & eigenvalues HW #9: Shading Maps & PCA
Project: Proposal
#18
Wed 5 April
Lab
Scraping webpages: Beautiful Soup; k-Nearest Neighbors beautifulSoup, soup documentation, where's beautifulSoup?, Frances Zlotnick's tutorial, DOM tutorial, book's code;
k-nearest neighbors tutorial
Chapters 10,19
10-18 April Spring recess: no classes
19 April Last day to withdraw from class with a grade of W
#19
Wed 19 April
Lab
k-Nearest Neighbors
book's code;
k-nearest neighbors tutorial
Chapters 14-15 #10: Using github & beautifulSoup HW #10: Nearest Neighbors

Project: Timeline
20 April Classes follow Monday schedule
#20
Thurs 20 April
Voronoi Diagrams, Clustering: k-means nearest airport, precincts' Voronoi diagram, Voronoi diagrams from triangulations, scipy Voronoi module
k-means (wiki), k-means image example, scikit-learn clustering
Chapter 16
#21
Mon 24 April
k-means continued; hierarchical clustering; Multi-dimensional Scaling (MDS) k-means (wiki), k-means image example, k-means example, k-nearest-neighbor versus k-means, scikit-learn clustering;
hierarchical clustering;
Noel O'Boyle's map example, Zachary Nichols' NYC scaled to commute time and part 2
Chapters 16,20 #11: PCA HW #11: k-Nearest Neighbors and Voronoi Diagrams

Project: Data Collection
#22
Wed 26 April
Lab
Voronoi Diagrams and Clustering Labs Voronoi Diagram Lab, Voronoi function in Scipy; scikit-learn clustering, k-means image example Chapters 17,20
#23
Mon 1 May
Refresher: Trees & Graphs;
Network Analysis
networkx tutorial, Cambridge tutorial, graph review Chapter 21 #12: Nearest Neighbors & Clustering HW #12: MDS & Regression

Project: Analysis
#24
Wed 3 May
Lab
Regression Cont'd regression recap, logistic regression wiki, Marcel Caraciolo's university entrance example, dummies on iris data set, sklearn logistic regression, sklearn logistic regression example, 311 Requests (filter for Descriptor = "Pothole"),
sklearn's MDS, middle school data
Chapters 18, 22
#25
Mon 8 May
MapReduce & PageRank PageRank as applied lin. alg. (SIAM Review 2006) Chapter 23 #13: Regression & NLP Project: Visualization & Draft Slide
#26
Wed 10 May
Lab
Crash Course in SQL Khan Academy on SQL, sqlitebrowser, sqlite, SQL lab Chapter 24
#27
Mon 15 May
Not from scratch: IPython (Jupyter), pandas, and seaborn Thomas Wiecki's modern guide to data science, OpenTechSchool IPython tutorial,
pandas cookbook, cheat sheet,
seaborn, elevator data
Chapter 25 Complete Project

Project: Sneak Preview Slide
#28
Wed 17 May
Lab
Project Presentations
Wed 24 May Final exam 1:30pm - 3:30pm