Classwork: correlation & regression (pandas & seaborn), and github

CMP 464/788:
Data Science
Spring 2017

Original version by Katherine St. John.

Useful Packages: Pandas & Seaborn

This lab looks at popular packages to manage and visualize data. Before starting the next section, check to see if the following are installed, by typing at the Python shell (in spyder, idle, or your favorite Python interface):

	import pandas as pd

If you get an error that the library is not found, open up a terminal and use conda to install it:

	conda install pandas

Pandas, Python Data Analysis Library, is an elegant, open-source package for extracting, manipulating, and analyzing data, especially those stored in 2D arrays (like spreadsheets). It incorporates most of the Python constructs and libraries that we have seen thus far.

Next, check if seaborn is installed:

	import seaborn as sns

If you get an error that the library is not found, open up a terminal and use conda to install it:

	conda install seaborn

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing statistical graphics.
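
If you would rather check for both packages in one step, the following is a minimal sketch using only the standard library. The helper name is_installed is our own choice for illustration, not part of pandas or seaborn.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


# Report on each package this lab needs.
for name in ("pandas", "seaborn"):
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing (try: conda install " + name + ")")
```

This only checks that the package can be found; it does not import it or check its version.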

Regression & Correlation

In class, we discussed the uses of regression and correlation. Let's now apply those to a data set of the NY Fed's labor trends for recent college graduates, labor.csv.

Open the data set and look at the fields. For each major, it gives information (unemployment rate, underemployment rate, median wage, etc.) about the labor market for recent graduates with that degree.

Our goal is to see the correlation between the underemployment rate and the median wage, and to fit a linear regression to it.

Seaborn uses the data structures in pandas by default, and since they are easy to use, we will use them too. The basic structure is a DataFrame, which stores data in a rectangular grid.

Let's use this to visualize the labor data. First, start your file with the standard import statements:

	import numpy as np
	import pandas as pd
	import matplotlib as mpl
	import matplotlib.pyplot as plt
	import seaborn as sns

Next, let's read in the NY Fed data (this assumes that the file is called labor.csv and located in the same directory as your Python program):

	labor = pd.read_csv('labor.csv', skiprows=13)

Open labor.csv and notice that the first 13 lines do not contain any data. The read_csv() function has an option, skiprows, to skip rows that don't contain data. We have now stored the data from labor.csv in the DataFrame labor with a single line of code (instead of the multiple lines it took with regular Python file I/O or the csv library).
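
To see what skiprows does without opening labor.csv, here is a small sketch that reads a made-up, in-memory file with three descriptive lines before the header. The text, column values, and the variable names raw_text and sample are invented for illustration and are not taken from the NY Fed file.

```python
import io

import pandas as pd

# Made-up stand-in for a CSV file: 3 descriptive lines, then header and data.
raw_text = """Illustrative notes line one
Illustrative notes line two
Illustrative notes line three
Major,Unemployment Rate,Underemployment Rate
Major A,4.0,50.0
Major B,6.0,30.0
"""

# skiprows=3 tells read_csv to ignore the first three lines,
# so the fourth line is used as the header row.
sample = pd.read_csv(io.StringIO(raw_text), skiprows=3)

print(sample.columns.tolist())
print(sample.shape)
```

If the number passed to skiprows is too small, the descriptive lines are treated as data (or as the header), so it is worth printing the column names after reading to confirm they look right.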

To see if this works, try to print the column of majors:

	print("The majors are:", labor["Major"])

To compute the correlation between two columns, we select the columns by position (labor.iloc[:,[2,3]]) and then apply the pandas correlation function, corr():

	print( labor.iloc[:,[2,3]].corr() )

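If we want to compute the correlations between all columns, we can apply the function to the whole DataFrame: labor.corr(). In recent versions of pandas, non-numeric columns (such as Major) have to be excluded first, for example with labor.corr(numeric_only=True).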
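
The sketch below shows both forms of corr() on a tiny DataFrame. The numbers are made up so that the relationships are easy to see, and the names toy, pair_corr, and all_corr are ours. It uses select_dtypes to drop the text column, which is one way (an assumption on our part, not the only option) to keep corr() working across pandas versions.

```python
import pandas as pd

# Made-up numbers, not taken from labor.csv.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Unemployment Rate": [3.0, 4.0, 5.0, 6.0],
    "Underemployment Rate": [20.0, 30.0, 40.0, 50.0],
    "Median Wage": [60000, 50000, 45000, 30000],
})

# Select two columns by position (columns 2 and 3) and correlate them.
pair_corr = toy.iloc[:, [2, 3]].corr()
print(pair_corr)

# Correlate every numeric column; the text column "Major" is dropped first.
all_corr = toy.select_dtypes(include="number").corr()
print(all_corr)
```

The result of corr() is itself a DataFrame: a square table with 1.0 on the diagonal, where the entry in row i and column j is the correlation between columns i and j.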

In seaborn, making a regression plot is very straightforward:

	sns.regplot(x="Underemployment Rate", y="Median Wage Early Career", data=labor)

Note that we specified the columns by the names that were used in the original CSV file.
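
The straight line in a regression plot is a least-squares fit. As a rough sketch of that calculation, the code below fits a line with numpy on made-up values that lie exactly on a line. The arrays and their numbers are invented for illustration and do not come from the labor data.

```python
import numpy as np

# Made-up values standing in for two columns of a data set.
underemployment = np.array([20.0, 30.0, 40.0, 50.0])
median_wage = np.array([60000.0, 50000.0, 40000.0, 30000.0])

# Fit median_wage = slope * underemployment + intercept by least squares.
slope, intercept = np.polyfit(underemployment, median_wage, deg=1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(underemployment, median_wage)[0, 1]

print("slope:", slope)
print("intercept:", intercept)
print("r:", r)
```

A negative slope together with a correlation near -1 says that, in these toy numbers, a higher underemployment rate goes with a lower median wage. Real data will be noisier, so the points scatter around the line instead of sitting on it.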

If you are using Idle, then you have to add plt.show() to show the plot (older versions of seaborn also accepted sns.plt.show(), but that shortcut has since been removed). This command opens the plot in a new window.

Additional Challenges

github

github is a widely used platform for sharing and collaborating on code. It functions much as Google Docs does for documents. The second part of today's classwork is to get started on github:
  1. If you do not already have an account, create an account on github.
  2. Work through the github Hello World tutorial.
  3. If you are interested in using github from the command line, work through the github for beginners tutorial.