# Classwork: correlation & regression (pandas & seaborn), and github

## CMP 464/788: Data Science Spring 2017

Original version by Katherine St. John.

### Useful Packages: Pandas & Seaborn

This lab looks at popular packages to manage and visualize data. Before starting the next section, check whether the following are installed by typing at the Python shell (in Spyder, IDLE, or your favorite Python interface):

```
import pandas as pd
```

If you get an error that the library is not found, open up a terminal and use conda to install it:

```
conda install pandas
```
Pandas, Python Data Analysis Library, is an elegant, open-source package for extracting, manipulating, and analyzing data, especially those stored in 2D arrays (like spreadsheets). It incorporates most of the Python constructs and libraries that we have seen thus far.

Next, check if seaborn is installed:

```
import seaborn as sns
```

If you get an error that the library is not found, open up a terminal and use conda to install it:

```
conda install seaborn
```
Seaborn is a Python visualization library based on matplotlib. It provides beautiful statistical graphics.
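Both checks can also be done in one go. The sketch below is only an illustration: the helper name `check_packages` is made up for this example and is not part of pandas or seaborn. It tries to import each package and reports the version it finds, or a reminder to install it.

```python
import importlib

def check_packages(names):
    """Return {package name: version string, or None if the import fails}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Most packages expose __version__; fall back to "unknown" if not.
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None
    return versions

for name, version in check_packages(["pandas", "seaborn"]).items():
    if version is None:
        print(name, "is not installed; try: conda install", name)
    else:
        print(name, "is installed, version", version)
```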

### Regression & Correlation

In class, we discussed the uses of regression and correlation. Let's now apply those to a data set of the NY Fed's labor trends for recent college graduates, labor.csv.

Open the data set and look at the fields. For each major, it gives information (unemployment rate, underemployment rate, median wage, etc.) about the labor market for recent graduates with that degree.

Our goal is to see the correlation between the underemployment rate and the median wage, and to fit a linear regression to it.

Seaborn uses the data structures in pandas as its default, and given how easy they are to use, we will too. The basic structure is a DataFrame, which stores data in a rectangular grid.
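To get a feel for a DataFrame before loading the real file, here is a minimal sketch that builds one by hand. The values and the name `toy` are made up for illustration and are not taken from labor.csv.

```python
import pandas as pd

# Made-up values for illustration only; not taken from labor.csv.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Unemployment Rate": [4.0, 5.5, 3.2],
    "Median Wage": [40000, 35000, 52000],
})

print(toy.shape)              # (number of rows, number of columns)
print(toy.columns.tolist())   # the column names
print(toy["Major"])           # one column, selected by name
print(toy.iloc[:, [1, 2]])    # columns selected by position (0-based)
```

Each column has a name and each row has an index, which is what makes the grid feel like a spreadsheet.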

Let's use this to visualize the labor data. First, start your file with the standard import statements:

```
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
```

Next, let's read in the NY Fed data (this assumes that the file is called labor.csv and located in the same directory as your Python program):

```
labor = pd.read_csv('labor.csv', skiprows=13)
```
Open labor.csv and notice that the first 13 lines do not contain any data. The read_csv() function has an option, skiprows, to skip rows that don't contain data. We have now stored the data from labor.csv in the DataFrame labor with a single line of code (instead of the multiple lines it took with regular Python file I/O or the csv library).
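To see what skiprows does without touching the real file, the sketch below reads a small made-up CSV from a string. The text in `raw` is an assumption for illustration (two note lines instead of 13) and does not reproduce the contents of labor.csv.

```python
import io
import pandas as pd

# A made-up stand-in for a CSV file with two lines of notes before the header.
raw = """Source: example notes line 1
Notes: example notes line 2
Major,Unemployment Rate
A,4.0
B,5.5
"""

# Skip the two note lines so the third line is read as the header row.
df = pd.read_csv(io.StringIO(raw), skiprows=2)
print(df.columns.tolist())
print(df.head())
```

If the column names look wrong after reading your own file, the skiprows count is the first thing to check.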

To see if this works, try to print the column of majors:

```
print("The majors are:", labor["Major"])
```
• How would you print out the unemployment rates?

To compute the correlation between two columns, we select the columns by position (labor.iloc[:,[2,3]]) and then apply the pandas correlation function, corr():

```
print( labor.iloc[:,[2,3]].corr() )
```

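If we wanted to compute the correlations between all columns, we could just apply the function to the whole DataFrame: labor.corr(). (In recent pandas versions, text columns such as Major have to be excluded first, for example with labor.corr(numeric_only=True).)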
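The output of corr() is itself a small table: one row and one column per input column, with the correlation coefficient in each cell. Here is a sketch on made-up data (the names `toy`, `x`, `y`, and `z` are hypothetical) where the answers are easy to predict.

```python
import pandas as pd

# Made-up data: y rises exactly with x, z falls exactly with x.
toy = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [8.0, 6.0, 4.0, 2.0],
})

# Correlation between two columns selected by position (columns 1 and 2).
pair = toy.iloc[:, [1, 2]].corr()
print(pair)

# Correlation between all numeric columns; the text column is dropped first.
numeric = toy.select_dtypes(include="number").corr()
print(numeric)

# A single coefficient between two columns selected by name.
r_xy = toy["x"].corr(toy["y"])
print(r_xy)
```

The diagonal is always 1, since every column is perfectly correlated with itself, and the table is symmetric.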

In seaborn, making a regression plot is very straightforward:

```
sns.regplot(x="Underemployment Rate", y="Median Wage Early Career", data=labor)
```

Note that we specified the columns by the names that were used in the original CSV file.
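The plot shows the fitted line but not its equation. One way to get the slope and intercept of a least-squares line is NumPy's polyfit with degree 1; this is a separate calculation, not something regplot returns, though for a plain linear fit it should describe the same line. The numbers below are made up so the answer is known in advance.

```python
import numpy as np

# Made-up data lying exactly on the line y = 70000 - 500 * x.
x = np.array([30.0, 40.0, 50.0, 60.0])
y = 70000.0 - 500.0 * x

# Degree-1 polynomial fit: returns the coefficients, highest power first.
slope, intercept = np.polyfit(x, y, 1)
print("slope:", slope)
print("intercept:", intercept)

# Predicted y value at x = 45, using the fitted line.
print("prediction at 45:", slope * 45 + intercept)
```

With the labor data, you would pass the two columns used in the plot in place of x and y.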

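If you are using IDLE, then you have to add plt.show() to show the plot (plt is the matplotlib.pyplot module imported above; the older form sns.plt.show() is no longer available in recent versions of seaborn). This command opens the plot in a new window.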