(Created with wordle with text from wiki)
Instructor: Prof. Megan Owen (she/her)
E-mail: megan.owen@lehman.cuny.edu
Office: Gillet 137E
Office hours: 2:40 - 3:20pm, 3:35 - 4:15pm on Tuesdays and Thursdays (Gillet 231 or 137E), or by appointment

Course time: 4:15pm - 5:55pm on Tuesdays and Thursdays, Gillet 231

Syllabus

Python and Jupyter:

Textbooks:

See Reading and References column in the schedule below.

Assignments: see Blackboard

Project

Outline:

Date: Topics: Lab and Handouts: Reading and References: Project deadline:
#1
Tues 27 August
Syllabus; review of loading CSV files, processing data, plots, and filtering Syllabus

weather data (use KNYC.csv)

Package installation commands

Pre-class, empty: Lab 1 - Review of dates and plots
Pre-class, completed: Lab 1 - Review of dates and plots
From class: Lab 1 - Review of dates and plots
Academic Integrity Policy
FiveThirtyEight article using the weather data
Time/date components
#2
Thurs 29 August
Review: bar charts, filtering Lab 2 - Review of bar charts and filtering
From class: Lab 2 - Review of bar charts and filtering
Bar chart in Pandas
Filtering in Pandas
Condensed filtering examples
Mon 2 September CUNY: No classes (Labor Day)
#3
Tues 3 Sept
Groupby, Seaborn plots Lab 3 - groupby and more plots
From class: Lab 3 - groupby and more plots
Pandas groupby tutorial (intro)
Pandas groupby tutorial (medium)
Pandas' user guide to groupby (detailed)

Seaborn package: gallery and tutorials
Another Seaborn tutorial
Thurs 5 Sept Classes follow a Monday schedule
#4
Tues 10 Sept
Normal and exponential probability distributions Lab 4 - Probability distributions
From class: Lab 4 - Probability distributions
babyboom.dat.txt
Sampling with numpy
Introduction to Normal distribution
Normal distribution in Scipy
Exponential distribution in Scipy
#5
Thurs 12 Sept
Non-parametric distributions, confidence intervals, bootstrap, comparing means Lab 5 - Non-parametric distributions and bootstrap
From class: Lab 5 - Non-parametric distributions and bootstrap
DOHMH_New_York_City_Restaurant_Inspection_Results.csv

Introduction to GitHub
Parametric vs. non-parametric data (first two sections)
Parametric and non-parametric bootstrap (starting at section "The notion of a Sampling Distribution")
Central Limit Theorem
Milestone 1: find dataset
#6
Tues 17 Sept
Review of linear regression Introduction to GitHub
Lab 6 - Review of linear regression
From class: Lab 6 - Review of linear regression
Another tutorial on linear regression using the Boston housing data
Online stats book: linear regression
#7
Thurs 19 Sept
Linear regression continued: r-squared, predictions, dummy variables Lab 7 - Linear regression continued (empty)
From class: Lab 7 - Linear regression continued
Introduction to Linear regression tutorial
More detailed introduction to linear regression
Insurance data set on Kaggle (click on kernels to see how others have analyzed it)
Milestone 2: GitHub account and upload data
#8
Tues 24 Sept
Linear regression continued: more on dummy variables, mean square error, validation Lab 8 - Mean Squared Error and validation
From class: Lab 8 - Mean Squared Error and validation
insurance.csv
Dummy variables
Training and test data, cross-validation in Python
#9
Thurs 26 Sept
Overfitting, underfitting, cross-validation Lab 9 - Overfitting and underfitting, fitting polynomials, k-fold cross validation
From class: Lab 9
Anscombe's Quartet
Over- and under-fitting, cross-validation in Python
Milestone 3: webpage and data description
30 September - 1 October CUNY: No classes
#10
Thurs 3 October
Logistic Regression Lab 9b - 2-fold cross validation
From class: Lab 9b - 2-fold cross validation

Lab 10 - Logistic Regression
From class: Lab 10 - Logistic Regression
Logistic regression tutorial
Milestone 4: missing data and column distributions
8-9 October CUNY: No classes
#11
Thurs 10 October
Logistic Regression Continued: multiple independent variables, accuracy and precision Lab 11 - Logistic Regression Continued
From class: Lab 11
Lab 11b - Classwork
Precision, recall, sensitivity, specificity
Methods for evaluating binary classification
Milestone 5: outliers and multi-column relationships
#12
Tues 15 October
Decision trees: Classification Lab 12 - Decision trees
From class: Lab 12
A visual introduction to machine learning via decision trees
Introduction to decision trees in scikit-learn
Sklearn: decision trees
#13
Thurs 17 October
Decision trees: Regression Lab 13 - Decision trees for regression
From class: Lab 13
Detailed explanation of decision trees
Another detailed explanation of decision trees
Gini impurity
#14
Tues 22 October
Choropleth maps Lab 14 - Mapping data
nyc_school_districts.json
math_district.csv
Folium tutorial
#15
Thurs 24 October
Review for midterm
Milestone 6: Linear or logistic regression
#16
Tues 29 October
Midterm
#17
Thurs 31 October
Cross tabulation (contingency tables) and more probability Lab 17 - Cross tabulation
Cross tabulation in Pandas
5 November Last day to withdraw from class with a grade of W
#18
Tues 5 November
Introduction to vectors and distances
#19
Thurs 7 November
K-nearest neighbors Lab 19 - k-nearest neighbors
From class: Lab 19 - k-nearest neighbors
k-nearest neighbors using scikit-learn
k-nearest neighbors concept
#20
Tues 12 November
Hierarchical clustering Lab 20 - Hierarchical clustering
From class: Lab 20 - Hierarchical clustering
Labor market data
Hierarchical clustering
Scikit-learn: hierarchical clustering
#21
Thurs 14 November
k-means clustering Lab 21 - k-means clustering
From class: Lab 21 - k-means clustering
Interactive visualization of k-means clustering
Another interactive visualization of k-means clustering
Visualization of k-means clustering algorithm
k-means clustering in depth
Limitations of k-means clustering
images of the digits
Milestone 7: Decision trees
#22
Tues 19 November
Determining the number of clusters: elbow method and silhouette score Lab 22 - Determining the number of clusters
From class: Lab 22 - Determining the number of clusters
Starbucks dataset
Estimating k with the elbow method
Silhouette analysis
#23
Thurs 21 November
Principal Components Analysis Lab 23 - Silhouette Score revisited and Principal Components Analysis
From class: Lab 23 - Silhouette Score revisited and Principal Components Analysis
Sklearn: Selecting the number of clusters with silhouette analysis
PCA Explained Visually
Milestone 8: k-nearest neighbors
#24
Tues 26 November
PCA continued Lab 24 - Simulated clusters
From class: Lab 24 - Simulated clusters
28 November - 1 December Thanksgiving Recess: College Closed
#25
Tues 3 December
More Hypothesis testing: Testing with multiple categories Lab 25 - Hypothesis testing with multiple categories
From class: Lab 25 - Hypothesis testing with multiple categories
Lab 25 - Part 2
From class: Lab 25 - Part 2
Mar3_4_2019_311_Service_Requests.csv
Hypothesis testing for multiple categories
Steps in hypothesis testing
#26
Thurs 5 December
Hypothesis testing: Testing means of groups with permutation testing Lab 26 - Permutation tests
From class: Lab 26 - Permutation tests
Hypothesis testing to compare two samples
Milestone 9: your choice
#27
Tues 10 December
Project presentations
#28
Thurs 12 December
Review for final exam
Tues 17 December Final exam 3:45pm - 5:45pm, Gillet 231