 # Homework #2

## CMP 464/788: Topics Course: Data Science Spring 2017

Topics: Data as vectors, more on matplotlib & Weather Underground data
Deadline: Thursday, 16 February 2017, 11:59pm

### Weather Data and matplotlib

This assignment relies on data collected in Homework #1. See it for directions on getting started with matplotlib and scraping the Weather Underground website.

Built into Python are functions for downloading pages ('scraping data') directly from the web. We will use the urllib library to download historical weather data, which we then plot.

### Messy Data

Real data can be messy. For example, in the weather data question below, you are asked to scale the size of the "bubbles" in the scatter plot to reflect the snow depth on the ground that day. As you scrape this data, you will see many values of "0", but then a "T" pops up halfway through the month. One example is the Weather Underground page for January 19, 2016. The raw HTML behind its last 2 snow rows ("Since 1 July snowfall" and "Snow Depth") looks like:

```
<td class="indent"><span>Since 1 July snowfall</span></td>
<td>0.6</td>
<td>10.0</td>
<td> </td>
</tr>
<tr>
<td class="indent"><span>Snow Depth</span>
<td>
<span class="wx-data"><span class="wx-value">T</span> in</span></span>
</td>
<td>
<td>
</tr>
```
• What does this mean?
The "T" refers to a trace amount of snow on the ground. For the purpose of our scatter plot, this is very similar to no snow on the ground, so treating "T" as a value of 0 is a reasonable interpretation in this case. (In other cases, it might not be. For example, imagine you are trying to measure the effect of snow cover on traffic collisions. There, trace amounts of snow on the ground could make the roadway slippery, and you may want to treat "T" as a separate category.)
• Do I need to understand HTML?
Just the very, very basics that we discussed in class on February 8: we are looking for patterns in the HTML source of the webpage that we can use to automate extracting data. The important pattern here is that Weather Underground stores the value for snow depth 2 lines after the "Snow Depth" line.
• How do you handle this in an automated fashion?
We could just go through by hand and set every "T" to 0. While that would work for this relatively small data set, it just won't work for many data sets (for example, the next homework looks at collisions in NYC, which number in the hundreds to thousands per day).

A good approach is to run your program and, if you discover non-numeric data where you expect numbers, examine the data (as we did above) to decide whether it is a coding error or an unexpected value. Let's look at the code from weather3.py:

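```
import re
import urllib.request

def getTempFromWeb(kind, url):
    # Download the page and split it into a list of lines (as bytes)
    page = urllib.request.urlopen(url)
    lines = page.readlines()
    # Find the first line that mentions, e.g., "Min Temperature"
    for i in range(len(lines)):
        if lines[i].decode("utf8").find(kind + " Temperature") >= 0:
            m = i
            break
    # The value sits 2 lines later; note there is no check that a number was found
    searchObj = re.search(r'\d+', lines[m+2].decode("utf8"))
    return int(searchObj.group(0))
```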
What does this code do? (Again, we will discuss it on 2/8, but here are the notes as a reminder.) It opens the url and reads through the lines until it finds kind+" Temperature", and then searches the line 2 lines later for a number ('\d+' is the regular expression for a number of 1 or more digits). If a number is found, re.search returns a match object (stored in searchObj). What does it do if there is no number on that line? It returns None, Python's default I-don't-know-what-to-say value. But the code above assumes that searchObj contains a match and continues processing. Instead, there should be a test here to make sure searchObj is not None, so the data can be processed appropriately.
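To see both outcomes concretely, here is a minimal sketch that runs the same regular expression on two sample lines. The first is the "T" line from the HTML excerpt above; the second is a made-up line with a numeric value, included only for illustration.

```python
import re

# Line copied from the Snow Depth excerpt above: the value is "T", not a number
trace_line = '<span class="wx-data"><span class="wx-value">T</span> in</span></span>'
# Hypothetical line with a numeric value (made up for this illustration)
numeric_line = '<span class="wx-data"><span class="wx-value">3</span> in</span></span>'

for line in (trace_line, numeric_line):
    searchObj = re.search(r'\d+', line)
    if searchObj is None:
        # No digits on this line, so there is nothing to convert
        print("no number found")
    else:
        print("found", int(searchObj.group(0)))
```

The first line prints "no number found" and the second prints "found 3". Calling searchObj.group(0) on the first line without the test would raise an AttributeError, since None has no group method.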

How can we do this? Here's the pseudocode for a function that looks for the snow depth and returns the number given or 0 if trace amounts are reported:

def getSnowDepth(url):

1. Open up the url and store the page source in lines.
2. Go through line by line looking for the string "Snow Depth" and store line number in m.
3. Use a regular expression to look for a number 2 lines later (i.e. on line m+2).
4. If none is found (i.e. if searchObj == None:), return a value of 0.
5. Otherwise return the first number found, as an integer
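The steps above can be sketched in Python as follows. This is one possible reading of the pseudocode, not a reference solution: the helper name snowDepthFromLines is hypothetical and exists only so the search logic can be tried on a list of strings without a network connection. The sketch also returns 0 if "Snow Depth" never appears on the page, which is an added assumption.

```python
import re
import urllib.request

def snowDepthFromLines(lines):
    """Hypothetical helper: find the snow depth in a list of text lines."""
    # Step 2: find the first line containing "Snow Depth"
    m = None
    for i in range(len(lines)):
        if lines[i].find("Snow Depth") >= 0:
            m = i
            break
    # Assumption: a page with no "Snow Depth" row (or a truncated page) counts as 0
    if m is None or m + 2 >= len(lines):
        return 0
    # Step 3: look for a number 2 lines later
    searchObj = re.search(r'\d+', lines[m + 2])
    # Step 4: no number (e.g. "T" for trace) means a depth of 0
    if searchObj is None:
        return 0
    # Step 5: otherwise return the first number found, as an integer
    return int(searchObj.group(0))

def getSnowDepth(url):
    # Step 1: open the url and store the page source as decoded lines
    page = urllib.request.urlopen(url)
    lines = [line.decode("utf8") for line in page.readlines()]
    return snowDepthFromLines(lines)

# Try the helper on the excerpt shown earlier (trace amount reported)
sample = [
    '<td class="indent"><span>Snow Depth</span>',
    '<td>',
    '<span class="wx-data"><span class="wx-value">T</span> in</span></span>',
]
print(snowDepthFromLines(sample))  # prints 0
```

Separating the search from the download is a design choice for this sketch: it lets you check the parsing on a few saved lines before pointing getSnowDepth at a live page.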

### Assignment

CMP 464/788 Homework:
#1-2 Using the data you collected for Homework #1, #5, use matplotlib to produce a plot that shows the fluctuation of the daily min temperature with respect to the month's average. That is, first compute the average min temperature of the 31 daily min temperatures and then scale each daily min temperature to reflect its percentage of the average min temperature. For an example, see lymeScaled.py which does a similar (but not identical) scaling to this problem. Make sure to change the title of your plot to reflect the information plotted.

#1: Submit your Python program as a .py file.
#2: Submit a screen shot of the graphics window containing the plot.
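As a rough illustration of the scaling in #1-2, the sketch below expresses each value as a percentage of the list's average. The function name scaleToAverage and the four sample temperatures are made up for this example; your program would use the 31 values from Homework #1 and pass the result to matplotlib.

```python
def scaleToAverage(temps):
    """Return each temperature as a percentage of the average of temps."""
    average = sum(temps) / len(temps)
    # Multiply before dividing so whole-number inputs give clean results
    return [t * 100 / average for t in temps]

# Made-up sample values, not real weather data
sampleTemps = [20, 30, 25, 25]
print(scaleToAverage(sampleTemps))  # prints [80.0, 120.0, 100.0, 100.0]
```

A value of 100 means the day matched the average. Note that this sketch would fail with a division by zero if the average were exactly 0.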
#3-4 For the January minimum temperature data, compute and display the 5-day running average of the temperatures. That is, for each day, display the average temperature over the 5 days ending on that day (if fewer than 5 days of data exist, average as many as do exist).
For example, if the temperatures were 10,20,10,20,15,35,30,... :
• The first day has no previous values, so would be 10.
• The second day is (10+20)/2 = 15.
• The third day is (10+20+10)/3 ≈ 13.3.
• The fourth day is (10+20+10+20)/4 = 15.
• The fifth day: we now have enough to do the running average of a full 5 days: (10+20+10+20+15)/5 = 75/5 = 15.
• The sixth day uses the most recent 5 days: (20+10+20+15+35)/5 = 100/5 = 20...
#3: Submit your Python program as a .py file.
#4: Submit a screen shot of the graphics window containing the plot.
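One way to compute the running average in #3-4 is with list slices. The sketch below is only an illustration: the function name runningAverage and its window parameter are assumptions, and the input is the first six sample temperatures from the example above.

```python
def runningAverage(values, window=5):
    """Average each value with up to window-1 values before it."""
    averages = []
    for i in range(len(values)):
        # The slice starts at 0 until a full window of days is available
        start = max(0, i - window + 1)
        recent = values[start:i + 1]
        averages.append(sum(recent) / len(recent))
    return averages

temps = [10, 20, 10, 20, 15, 35]
# Round to one decimal place for display only
print([round(a, 1) for a in runningAverage(temps)])
# prints [10.0, 15.0, 13.3, 15.0, 15.0, 20.0]
```

Because the slice is clamped at index 0, the first four days automatically use as many values as exist, with no special cases needed.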
#5-6 Collect the snow depths for January 2017. Display the January minimum temperatures (collected in Homework #1) as a scatter plot (of day versus temperature) with the size of each "bubble" proportional to the snow depth on that day (see scatter_plot.py for a sample of varying "bubble" sizes).

#5: Submit your Python program as a .py file.
#6: Submit a screen shot of the graphics window containing the plot.
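For #5-6, matplotlib's scatter function accepts a list of marker sizes through its s parameter (marker area in points squared). The sketch below shows one way to turn snow depths into sizes; the function names and the base and scale values are assumptions chosen for illustration, and this is not the contents of scatter_plot.py.

```python
def bubbleSizes(depths, base=20, scale=40):
    """Map snow depths (inches) to marker areas for a scatter plot."""
    # base keeps days with no snow visible; scale grows the bubble with depth
    return [base + scale * d for d in depths]

def plotBubbles(days, temps, depths):
    # Imported here so bubbleSizes can be tried without matplotlib installed
    import matplotlib.pyplot as plt
    plt.scatter(days, temps, s=bubbleSizes(depths))
    plt.xlabel("Day of month")
    plt.ylabel("Minimum temperature")
    plt.title("January minimum temperatures (bubble size shows snow depth)")
    plt.show()

print(bubbleSizes([0, 0, 2]))  # prints [20, 20, 100]
```

With base=0 the sizes are strictly proportional to depth, but days with no snow (or a trace treated as 0) would then disappear from the plot. Keeping a small base size is a design choice you may or may not want.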
#7-8 Plot the percentage of New York City's population that lives in each borough. The raw historical population data for New York City from 1790 to 2010 is available here. Your plot should not display the raw population numbers, but instead give the percentages. For example, in 1790, 31,131 people lived in Manhattan out of the 49,447 that lived in New York City overall. The displayed value for Manhattan in 1790 would be 31,131/49,447 * 100 ≈ 63 percent.

#7: Submit your Python program as a .py file.
#8: Submit a screen shot of the graphics window containing the plot.
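The percentage calculation in #7-8 can be sketched as below, using only the two 1790 figures quoted above. The function name boroughPercentages and the combined "Other boroughs" entry are assumptions for illustration; the real data has one column per borough.

```python
def boroughPercentages(counts):
    """Convert a dict of borough -> population into borough -> percent."""
    total = sum(counts.values())
    return {name: count * 100 / total for name, count in counts.items()}

# 1790 figures from the text; "Other boroughs" is the remainder 49,447 - 31,131
counts1790 = {"Manhattan": 31131, "Other boroughs": 49447 - 31131}
percents = boroughPercentages(counts1790)
print(round(percents["Manhattan"]))  # prints 63
```

Applying the same function to each year's row gives one percentage per borough per year, and the percentages for each year add up to 100.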

### Submitting Homework

To submit your homework, log on to the Blackboard system, and go to "Homework". For each part of the homework, there is a separate input box. You may submit the homework as many times as you would like before the deadline.