Instructor: Prof. Megan Owen
Phone: 718-960-7423
Office hours: Monday and Wednesday, 9:30 - 10:50am, Gillet 137E, or immediately after class on Mondays and 1:15-1:45pm on Wednesdays
Course time: 1:50pm - 3:30pm on Mondays (in Gillet 225) and Wednesdays (in Gillet 231)


Link to Python resources


Data Science from Scratch by Joel Grus (ISBN 978-1491901427). Approximately $30. Available online from Amazon and O'Reilly.


Problem Sets 30%
Quizzes 20%
Project 20%
Final exam 30%
You must take and pass the final to pass the course.


Academic Integrity Policy

Note: While I encourage you to help each other on assignments and the project, you should never share your code with another student. If you do and that student submits your code, you will both receive 0 on the assignment/project.

Weekly Problem Sets: see right column of course outline below and Blackboard

Quizzes: see Blackboard



Date: Topics: Handouts: Reading: Quiz Topics: HW/Project:
Mon 30 January
First Day Details, Topics Overview, Mean, variance and random variables Syllabus, DS Venn diagram,
Gallery: NY density, nearest airport, precincts, citibike, buses vs. subways, transit + census, life spans, ebola, disease, jobs;
Printing (from __future__), Textbook's repo, summaries sometimes hide the big picture, Anscombe's Quartet
Academic Integrity Policy,
Chapters 1-3
Wed 1 February
Python 2 vs. 3, Python Refresher: basics; Quick look at matplotlib's line and bar charts; More on matplotlib: histograms and scatterplots; Data as vectors: scaling, dot products;
Python Refresher: list comprehensions & zip
basic stats, list comprehension examples, list comprehension tutorial, matplotlib, Textbook's repo, Plotting recipes Chapters 2,4,5
Mon 6 February
Scaling and dot product; correlation & causation
Python Refresher: lists & tuples, book's (depends on, Anscombe's Quartet, correlation guessing Chapters 2,5 #1: Academic Integrity HW #1: Simple graphs with pyplot
Wed 8 February
Applying correlation, Simpson's Paradox;
Getting Data: CSV Files
Python refresher: lists & tuples, dictionaries
lists vs. tuples, dictionary examples, simple csv example & data, Simpson's paradox wiki, wage growth paradox Chapters 2,6,9
Mon 13 February Lincoln's Birthday - Lehman is closed
15 February Classes follow Monday schedule
Wed 15 February
Monday schedule
Probability: Distributions & Central Limit Theorem;
CSV Files
Simpson's paradox wiki, wage growth paradox, simple csv example & data, normal distribution calculator, rolling dice, Central Limit Theorem Visualized, Matt Nedrich on CLT Chapters 5,6,9 #2: Python Basics HW #2: Scaling Vector Data
20 February President's Day - Lehman is closed
Wed 22 February
Causation vs. Correlation, CSV Files
Python Refresher: collections, regular expressions
simple csv example & data, dsWiki.txt (for group work), regex cheat sheet, regex online tester, correlation does not equal causation Chapters 2,9 #3: Vectors, Means, and Variances HW #3: Binning Data & Measuring Dispersion
Mon 27 February
Bayes Theorem; Naive Bayes: Spam Filter Example
regex online tester, book's naive Bayes spam filter, spam dataset Chapters 6,13 #4: Python Lists, Dictionaries, & csv HW #4: Correlations & Distributions
Wed 1 March
Naive Bayes: Spam Filter Example; Python Refresher: more on matplotlib & sets, subplots, book's naive Bayes spam filter, spam dataset Chapters 2,7
Mon 6 March
Hypothesis & Inference: Confidence Intervals; More on Confidence Intervals, A/B Testing; Khan Academy on confidence intervals, Khan Academy on hypothesis testing, normal distribution calculator, numpy, plotting revisited Chapters 7,25 #5: Correlation & Regular Expressions HW #5: Bayes Theorem, Simpson's Paradox, & Regular Expressions
Wed 8 March
Hypothesis & Inference: Confidence Intervals; More on Confidence Intervals, A/B Testing continued scipy lecture notes on arrays, arrays & images, 3d surface example code, mplot3d tutorial, matplotlib colormaps Chapters 8,9,25
Mon 13 March
Gradient descent, Linear Algebra Refresher: Eigenvalues & Eigenvectors Example: Simple Linear Regression Matt Nedrich's intro to gradient descent & example, Quinn Liu's gradient descent image, Andrew Ng's linear regression notes;
Eigenvectors & eigenvalues, visually, linear transformations example
Chapters 2,8,9 #6: Bayes Theorem HW #6: A/B Testing
Wed 15 March
Manipulating image files with numpy
Python Refresher: numpy
numpy: plotting revisited, detailed numpy tutorial, numpy cheatsheet;
scipy lecture notes on arrays, arrays & images;
regression and GitHub classwork
Chapters 9,10
Mon 20 March
Eigenvectors and eigenvalues; review: gradient descent and linear regression Matt Nedrich's intro to gradient descent & example;
Eigenvectors & eigenvalues, visually
Chapters 2,10,25 #7: Hypothesis & Inference HW #7: Gradient Descent & Images
Wed 22 March
Using github; using Pandas and Seaborn for correlation and regression;
github for beginners, github Hello World, github student pack, github cheat sheet;
regression and GitHub classwork;
Folium classwork, Folium tutorial
Chapters 5,25
Mon 27 March
Computing eigenvalues and eigenvectors; Working with Multidimensional Data: Rescaling, Principal Components Analysis Example of using numpy to compute eigenvalues and eigenvectors;
PCA, explained visually, Lindsay Smith's computing PCA, Sebastian Raschka's PCA overview and implementing in Python;
scipy, sklearn's PCA, pca on iris dataset, NY Fed's unemployment rates and by major
Chapters 2,10,25 #8: Gradient Descent & numpy HW #8: Mapping Data

Wed 29 March
Principal Components Analysis via scikit-learn; JSON and geoJSON; choropleth maps Esri's shapefiles, shapefile wikipage, JSON, KML, summary & comparison;
geometric interpretation of covariance matrix, PCA explained in greater and greater detail (first answer), sample PCA code, PCA method in scikit-learn, PCA on the iris dataset;
geoJSON and choropleth Lab, geoJSON specifications, geoJSON editor
Chapters 2,11,12
Mon 3 April
Nearest Neighbors & Voronoi Diagrams;
Clustering: k-means
nearest airport, precincts' Voronoi diagram, Voronoi diagrams from triangulations, scipy Voronoi module
k-means (wiki), k-means image example, scikit-learn clustering
Chapters 12,19 #9: Eigenvectors & eigenvalues HW #9: Shading Maps & PCA
Project: Proposal
Wed 5 April
Scraping webpages: Beautiful Soup; k-Nearest Neighbors beautifulSoup, soup documentation, where's beautifulSoup?, Frances Zlotnick's tutorial, DOM tutorial, book's code;
k-nearest neighbors tutorial
Chapters 10,19
10-18 April Spring recess: no classes
19 April Last day to withdraw from class with a grade of W
Wed 19 April
k-Nearest Neighbors
book's code;
k-nearest neighbors tutorial
Chapters 14-15 #10: Using github & beautifulSoup HW #10: Nearest Neighbors

Project: Timeline
20 April Classes follow Monday schedule
Thurs 20 April
Voronoi Diagrams, Clustering: k-means nearest airport, precincts' Voronoi diagram, Voronoi diagrams from triangulations, scipy Voronoi module
k-means (wiki), k-means image example, scikit-learn clustering
Chapter 16
Mon 24 April
k-means continued; hierarchical clustering; Multi-dimensional Scaling (MDS) k-means (wiki), k-means image example, k-means example, k-nearest-neighbor versus k-means, scikit-learn clustering;
hierarchical clustering;
Noel O'Boyle's map example, Zachary Nichols' NYC scaled to commute time and part 2
Chapters 16,20 #11: PCA HW #11: k-Nearest Neighbors and Voronoi Diagrams

Project: Data Collection
Wed 26 April
Voronoi Diagrams and Clustering Labs Voronoi Diagram Lab, Voronoi function in Scipy; scikit-learn clustering, k-means image example Chapters 17,20
Mon 1 May
Refresher: Trees & Graphs;
Network Analysis
networkx tutorial, Cambridge tutorial, graph review Chapter 21 #12: Nearest Neighbors & Clustering HW #12: MDS & Regression

Project: Analysis
Wed 3 May
Regression Cont'd regression recap, logistic regression wiki, Marcel Caraciolo's university entrance example, dummies on iris data set, sklearn logistic regression, sklearn logistic regression example, 311 Requests (filter for Descriptor = "Pothole"),
sklearn's MDS, middle school data
Chapters 18, 22
Mon 8 May
MapReduce & PageRank PageRank as applied lin. alg. (SIAM Review 2006) Chapter 23 #13: Regression & NLP Project: Visualization & Draft Slide
Wed 10 May
Crash Course in SQL Khan Academy on SQL, sqlitebrowser, sqlite, SQL lab Chapter 24
Mon 15 May
Not from scratch: iPython (jupyter), pandas, and seaborn Thomas Wiecki's modern guide to data science, OpenTechSchool iPython tutorial,
pandas cookbook, cheat sheet,
seaborn, elevator data
Chapter 25 Complete Project

Project: Sneak Preview Slide
Wed 17 May
Project Presentations
Wed 24 May Final exam 1:30pm - 3:30pm