Instructor: Prof. Megan Owen
E-mail: megan.owen@lehman.cuny.edu
Phone: 718-960-7423
Office hours (as of March 22): 2:40 - 3:30pm on Tuesdays (Gillet 137E), 12 - 1pm on Wednesdays (Gillet 137E), 12:00 - 12:50pm on Thursdays
Course time: 1:00pm - 2:40pm on Tuesdays and Thursdays, Gillet 231


Student mentor: Jona Kerluku
E-mail: jonakerluku@gmail.com
Office hours (as of March 22): 11:45am - 12:45pm on Tuesdays and 2:45 - 3:45pm on Thursdays (both Gillet 233-B)

About this course:

Python

R

Free Online Textbooks:

How to Think Like a Computer Scientist (Python 3) by Jeffrey Elkner, Allen B. Downey, and Chris Meyers
Think Stats by Allen B. Downey
Online Statistics: An Interactive Multimedia Course of Study Project leader: David M. Lane
R for Data Science by Garrett Grolemund and Hadley Wickham
List of free online books related to data science (various levels)

Grading:

Assignments 30%
Classwork (quizzes and group work) 35%
Final exam 35%
You must take and pass the final exam to pass the course.

Syllabus

Academic Integrity Policy

While I encourage you to help each other on assignments, you should never share your code with another student. If you do and that student submits your code or a section of it, you will both receive a 0 on the assignment.

Assignments: see Blackboard

In-class Quizzes: see Blackboard

Outline:

Date: Topics: Handouts: Reading: Classwork & Quiz Topics:
#1
Tues 30 January
First Day Details; What is Data Science; Introduction to Python: printing, variables; plotting with Pandas Syllabus, DS venn diagram,
Gallery: NY density, nearest airport, precincts, citibike, buses vs. subways, ebola, disease Data Science Process
Academic Integrity Policy,
Think CS: Chapter 1 & Chapter 2
Online Stats: Variables & Line graphs
Academic Integrity
#2
Thurs 1 February
Introduction to plotting and Pandas; types of statistical variables; Lab 1 Online Stats: Variables Groups: variables in statistics
#3
Tues 6 February
More plotting and column operations Lab 2 Groups:
#4
Thurs 8 February
Histograms; Mean, median, mode Lab 3 Online Stats: Median and mean
Non-technical overview: mean, median, mode
Quiz: printing, variables, plotting, types of variables
Mon 12 February Lincoln's Birthday - Lehman is closed
#5
Tues 13 February
Variance and boxplots Lab 4, lab4.py
Anscombe's quartet
Online Stats: percentiles, boxplots, variance Quiz: printing and (computer) variables, lab 2, types of statistical variables
#6
Thurs 15 February
Sample and Population Means and Variances Lab 5 Online Stats: http://onlinestatbook.com/2/summarizing_distributions/variability.html Quiz: lab 3, variance, review
19 February Presidents' Day - Lehman is closed
20 February Classes follow Monday schedule
#7
Thurs 22 February
Hypotheses, selecting rows in a dataframe Lab 6 Selecting pandas Dataframe rows based on conditions Classwork: making and checking a hypothesis
#8
Tues 27 February
Introduction to probability, probability mass functions, sample vs. distribution Lab 7 Code Think CS: Generating random numbers
Online stats: Introduction to Probability
Quiz: Lab 6 and review
#9
Thurs 1 March
Computing probabilities, bar plots, counting unique values of data Lab 8 Online stats: Probability Basic Concepts, Bar charts
Quiz: boxplots and review
#10
Tues 6 March
Probability density distributions, uniform distribution, estimating probabilities continued Lab 9 Code Paper quiz: Homework 1-8; 1 sheet of paper (8" x 11") with handwritten notes on both sides is allowed
#11
Thurs 8 March
Normal distribution, date and time in pandas Lab 10 Code (normal distribution),
Lab 10 Code (rodent complaints)
Online stats: Normal distribution
Visualizing normal distributions
Quiz: probabilities and review
#12
Tues 13 March
Central Limit Theorem in action Lab 11
Lab 11 code from class
Visualizing the Central Limit Theorem 1
Visualizing the Central Limit Theorem 2
Online stats: Introduction to sampling distributions
Sampling distribution of the mean
Quiz: Lab 9 and review
#13
Thurs 15 March
Review IMDb dataset
Lab 12 code from class
#14
Tues 20 March
Confidence Intervals Lab 13 code from class Online stats: confidence intervals Paper quiz: Homework 9-16; 1 sheet of paper (8" x 11") with handwritten notes on both sides is allowed
#15
Thurs 22 March
Correlation and causation, scatter plots, heatmaps Labor market data,
Lab 14 Code
Spurious Correlation
Correlation Guessing Game
Online stats: correlation 1
correlation 2
#16
Tues 27 March
Regression: Simple Linear Regression Labor market data
Lab 15 Code
Introduction to Linear Regression
Online stats: Introduction to Linear Regression
Think stats: Statsmodel
A more comprehensive example of using linear regression
Quiz: Central Limit Theorem and review
#17
Thurs 29 March
Regression continued: R-squared and Multiple Linear Regression Lab 16 (partial answers)
Lab 16 code from class
Online stats: Multiple linear regression, R-squared
Picture illustrating R-squared in section 6
Classwork: Multiple Linear Regression
30 March - 8 April Spring recess: no classes
#18
Tues 10 April
Introduction to hypothesis testing Lab 17 code from class
Background for lab at top - the code is different
Online stats: Introduction to Hypothesis Testing Paper quiz: Homework 17-24; 1 sheet of paper (8" x 11") with handwritten notes on both sides is allowed
11 April Classes follow Friday schedule
#19
Thurs 12 April
Hypothesis testing Lab 18, Code from class Steps for hypothesis testing Classwork
16 April Last day to withdraw from class with a grade of W
#20
Tues 17 April
Hypothesis testing continued Lab 18, Lab 18 Details
Code from class
Classwork: hypothesis testing
#21
Thurs 19 April
Introduction to R; vectors; plotting in R DataCamp Introduction to R
Lab 19
Try R Paper quiz: Estimating probabilities; 1 sheet of paper (8" x 11") with handwritten notes on both sides is allowed
#22
Tues 24 April
Dataframes in R DataCamp Introduction to R
Lab 20
Try R
#23
Thurs 26 April
ggplot2 - fancy plotting in R Buzzfeed's cleaned FBI NICS Firearm Background Check Data
Lab 21 code from class
ggplot2 cheatsheet
NY Times article based on NICS data
DataCamp Introduction to ggplot2 Paper quiz: Confidence intervals, correlation, heatmaps; 1 sheet of paper (8" x 11") with handwritten notes on both sides is allowed
#24
Tues 1 May
Intro to Machine Learning: understanding the data Kaggle: Titanic: Machine Learning from Disaster titanic_train.csv
titanic_test.csv
Classwork: understanding the titanic dataset
#25
Thurs 3 May
Guest talk by Violet Fredericks, continuation of machine learning Titanic tutorial part 1
Titanic tutorial part 2
Titanic tutorial part 3
Code from class
#26
Tues 8 May
Continuation of machine learning introduction Code from class
#27
Thurs 10 May
Review Exam review 1
#28
Tues 15 May
Review Sample Final Exam (answers)
original NBA dataset (see Blackboard for how to clean)
Thurs 24 May Final exam 1:30pm - 3:30pm, Gillet 231