Instructor: Prof. Megan Owen (she/her)
E-mail: megan.owen@lehman.cuny.edu
Office: Gillet 137E
Office hours: 2:40 - 3:20pm, 3:35 - 4:15pm on Tuesdays and Thursdays (Gillet 231 or 137E), or by appointment

Course time: 4:15pm - 5:55pm on Tuesdays and Thursdays, Gillet 231


Python and Jupyter:


See Reading and References column in the schedule below.

Assignments: see Blackboard



Schedule columns: Date | Topics | Lab and Handouts | Reading and References | Project deadline
Tues 27 August
Syllabus; review of loading CSV files, processing data, plots, and filtering Syllabus

weather data (use KNYC.csv)

Package installation commands

Pre-class, empty: Lab 1 - Review of dates and plots
Pre-class, completed: Lab 1 - Review of dates and plots
From class: Lab 1 - Review of dates and plots
Academic Integrity Policy
FiveThirtyEight article using the weather data
Timedate components
Thurs 29 August
Review: bar charts, filtering Lab 2 - Review of bar charts and filtering
From class: Lab 2 - Review of bar charts and filtering
Bar chart in Pandas
Filtering in Pandas
Condensed filtering examples
Mon 2 September CUNY: No classes (Labor Day)
Tues 3 Sept
Groupby, Seaborn plots Lab 3 - groupby and more plots
From class: Lab 3 - groupby and more plots
Pandas groupby tutorial (intro)
Pandas groupby tutorial (medium)
Pandas' user guide to groupby (detailed)

Seaborn package: gallery and tutorials
Another Seaborn tutorial
Thurs 5 Sept Classes follow a Monday schedule
Tues 10 Sept
Normal and exponential probability distributions Lab 4 - Probability distributions
From class: Lab 4 - Probability distributions
Sampling with numpy
Introduction to Normal distribution
Normal distribution in Scipy
Exponential distribution in Scipy
Thurs 12 Sept
Non-parametric distributions, confidence intervals, bootstrap, comparing means Lab 5 - Non-parametric distributions and bootstrap
From class: Lab 5 - Non-parametric distributions and bootstrap

Introduction to GitHub
Parametric vs. non-parametric data (first two sections)
Parametric and non-parametric bootstrap (starting at section "The notion of a Sampling Distribution")
Central Limit Theorem
Milestone 1: find dataset
Tues 17 Sept
Review of linear regression Introduction to GitHub
Lab 6 - Review of linear regression
From class: Lab 6 - Review of linear regression
Another tutorial on linear regression using the Boston housing data
Online stats book: linear regression
Thurs 19 Sept
Linear regression continued: r-squared, predictions, dummy variables Lab 7 - Linear regression continued (empty)
From class: Lab 7 - Linear regression continued
Introduction to Linear regression tutorial
More detailed introduction to linear regression
Insurance data set on Kaggle (click on kernels to see how others have analyzed it)
Milestone 2: GitHub account and upload data
Tues 24 Sept
Linear regression continued: more on dummy variables, mean square error, validation Lab 8 - Mean Squared Error and validation
From class: Lab 8 - Mean Squared Error and validation
Dummy variables
Training and test data, cross-validation in Python
Thurs 26 Sept
Overfitting, underfitting, cross-validation Lab 9 - Overfitting and underfitting, fitting polynomials, k-fold cross validation
From class: Lab 9
Anscombe's Quartet
Over- and under-fitting, cross-validation in Python
Milestone 3: webpage and data description
30 September - 1 October CUNY: No classes
Thurs 3 October
Logistic Regression Lab 9b - 2-fold cross validation
From class: Lab 9b - 2-fold cross validation

Lab 10 - Logistic Regression
From class: Lab 10 - Logistic Regression
Logistic regression tutorial
Milestone 4: missing data and column distributions
8-9 October CUNY: No classes
Thurs 10 October
Logistic Regression Continued: multiple independent variables, accuracy and precision Lab 11 - Logistic Regression Continued
From class: Lab 11
Lab 11b - Classwork
Precision, recall, sensitivity, specificity
Methods for evaluating binary classification
Milestone 5: outliers and multi-column relationships
Tues 15 October
Decision trees: Classification Lab 12 - Decision trees
From class: Lab 12
A visual introduction to machine learning via decision trees
Introduction to decision trees in scikit-learn
Sklearn: decision trees
Thurs 17 October
Decision trees: Regression Lab 13 - Decision trees for regression
From class: Lab 13
Detailed explanation of decision trees
Another detailed explanation of decision trees
Gini impurity
Tues 22 October
Choropleth maps Lab 14 - Mapping data
Folium tutorial
Thurs 24 October
Review for midterm
Milestone 6: Linear or logistic regression
Tues 29 October
Thurs 31 October
Cross tabulation (contingency tables) and more probability Lab 17 - Cross tabulation
Cross tabulation in Pandas
5 November Last day to withdraw from class with a grade of W
Tues 5 November
Introduction to vectors and distances
Thurs 7 November
K-nearest neighbors Lab 19 - k-nearest neighbors
From class: Lab 19 - k-nearest neighbors
k-nearest neighbors using scikit-learn
k-nearest neighbors concept
Tues 12 November
Hierarchical clustering Lab 20 - Hierarchical clustering
From class: Lab 20 - Hierarchical clustering
Labor market data
Hierarchical clustering
Scikit-learn: hierarchical clustering
Thurs 14 November
k-means clustering Lab 21 - k-means clustering
From class: Lab 21 - k-means clustering
Interactive visualization of k-means clustering
Another interactive visualization of k-means clustering
Visualization of k-means clustering algorithm
k-means clustering in depth
Limitations of k-means clustering
Images of the digits
Milestone 7: Decision trees
Tues 19 November
Determining the number of clusters: elbow method and silhouette score Lab 22 - Determining the number of clusters
From class: Lab 22 - Determining the number of clusters
Starbucks dataset
Estimating k with the elbow method
Silhouette analysis
Thurs 21 November
Principal Components Analysis Lab 23 - Silhouette Score revisited and Principal Components Analysis
From class: Lab 23 - Silhouette Score revisited and Principal Components Analysis
Sklearn: Selecting the number of clusters with silhouette analysis
PCA Explained Visually
Milestone 8: k-nearest neighbors
Tues 26 November
PCA continued Lab 24 - Simulated clusters
From class: Lab 24 - Simulated clusters
28 November - 1 December Thanksgiving Recess: College Closed
Tues 3 December
More Hypothesis testing: Testing with multiple categories Lab 25 - Hypothesis testing with multiple categories
From class: Lab 25 - Hypothesis testing with multiple categories
Lab 25 - Part 2
From class: Lab 25 - Part 2
Hypothesis testing for multiple categories
Steps in hypothesis testing
Thurs 5 December
Hypothesis testing: Testing means of groups with permutation testing Lab 26 - Permutation tests
From class: Lab 26 - Permutation tests
Hypothesis testing to compare two samples
Milestone 9: your choice
Tues 10 December
Project presentations
Thurs 12 December
Review for final exam
Tues 17 December Final exam 3:45pm - 5:45pm, Gillet 231