Commit 566dda97 by Sanjay Krishnan

new homework 4

parent 57a6a71f
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Homework 4: Python Dask Lab\n",
"\n",
"*Due May 7th, 2021 11:59 PM*\n",
"\n",
"Dask is an open source library for parallel computing written in Python. We will use Dask over the next few weeks to illustrate the basics of parallel and distributed computation. This homework assignment will walk you through some of the basic syntax of Dask. \n",
"\n",
"It is your job to read the documentation and figure out how to do each step on your own. You are responsible for adding code in every \"FILL IN HERE\" statement below.\n",
"\n",
"## Installing Dask\n",
"To get started, you need to install the Dask packages. If you are using `pip`:\n",
"```\n",
"pip install dask\n",
"pip install \"dask[distributed]\"\n",
"```\n",
"If you are using `conda`:\n",
"```\n",
"conda install numpy pandas h5py pillow matplotlib scipy toolz pytables snakeviz scikit-image dask distributed -c conda-forge\n",
"```\n",
"Let us know if you have any difficulties installing Dask.\n",
"\n",
"\n",
"## Exercise 1. Loading Data Sets\n",
"\n",
"We've given you a sample dataset of flights departing from New York City airports (arrival and departure times, delays, etc.). Like Pandas, Dask exposes a DataFrame interface. Write code below to load the data in `nycflights.csv` into a Dask DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import dask.dataframe as dd\n",
"df = #FILL IN HERE"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2013</td>\n",
" <td>6</td>\n",
" <td>30</td>\n",
" <td>940</td>\n",
" <td>15</td>\n",
" <td>1216</td>\n",
" <td>-4</td>\n",
" <td>VX</td>\n",
" <td>N626VA</td>\n",
" <td>407</td>\n",
" <td>JFK</td>\n",
" <td>LAX</td>\n",
" <td>313</td>\n",
" <td>2475</td>\n",
" <td>9</td>\n",
" <td>40</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2013</td>\n",
" <td>5</td>\n",
" <td>7</td>\n",
" <td>1657</td>\n",
" <td>-3</td>\n",
" <td>2104</td>\n",
" <td>10</td>\n",
" <td>DL</td>\n",
" <td>N3760C</td>\n",
" <td>329</td>\n",
" <td>JFK</td>\n",
" <td>SJU</td>\n",
" <td>216</td>\n",
" <td>1598</td>\n",
" <td>16</td>\n",
" <td>57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2013</td>\n",
" <td>12</td>\n",
" <td>8</td>\n",
" <td>859</td>\n",
" <td>-1</td>\n",
" <td>1238</td>\n",
" <td>11</td>\n",
" <td>DL</td>\n",
" <td>N712TW</td>\n",
" <td>422</td>\n",
" <td>JFK</td>\n",
" <td>LAX</td>\n",
" <td>376</td>\n",
" <td>2475</td>\n",
" <td>8</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2013</td>\n",
" <td>5</td>\n",
" <td>14</td>\n",
" <td>1841</td>\n",
" <td>-4</td>\n",
" <td>2122</td>\n",
" <td>-34</td>\n",
" <td>DL</td>\n",
" <td>N914DL</td>\n",
" <td>2391</td>\n",
" <td>JFK</td>\n",
" <td>TPA</td>\n",
" <td>135</td>\n",
" <td>1005</td>\n",
" <td>18</td>\n",
" <td>41</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2013</td>\n",
" <td>7</td>\n",
" <td>21</td>\n",
" <td>1102</td>\n",
" <td>-3</td>\n",
" <td>1230</td>\n",
" <td>-8</td>\n",
" <td>9E</td>\n",
" <td>N823AY</td>\n",
" <td>3652</td>\n",
" <td>LGA</td>\n",
" <td>ORF</td>\n",
" <td>50</td>\n",
" <td>296</td>\n",
" <td>11</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" year month day dep_time dep_delay arr_time arr_delay carrier tailnum \\\n",
"0 2013 6 30 940 15 1216 -4 VX N626VA \n",
"1 2013 5 7 1657 -3 2104 10 DL N3760C \n",
"2 2013 12 8 859 -1 1238 11 DL N712TW \n",
"3 2013 5 14 1841 -4 2122 -34 DL N914DL \n",
"4 2013 7 21 1102 -3 1230 -8 9E N823AY \n",
"\n",
" flight origin dest air_time distance hour minute \n",
"0 407 JFK LAX 313 2475 9 40 \n",
"1 329 JFK SJU 216 1598 16 57 \n",
"2 422 JFK LAX 376 2475 8 59 \n",
"3 2391 JFK TPA 135 1005 18 41 \n",
"4 3652 LGA ORF 50 296 11 2 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#If your solution above is correct you should see 5 rows of the table printed out by running this code\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2. Slicing and Lazy Evaluation\n",
"Dask looks really similar to Pandas! Let's try to see how it's different. Write code that slices the above DataFrame to extract only the flights that have delayed arrivals (arr_delay > 0):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"sliced = #FILL IN HERE"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dask DataFrame Structure:\n",
" year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute\n",
"npartitions=1 \n",
" int64 int64 int64 int64 int64 int64 int64 object object int64 object object int64 int64 int64 int64\n",
" ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n",
"Dask Name: getitem, 4 tasks\n"
]
}
],
"source": [
"#If your code above is correct, this cell should print \"Dask DataFrame Structure:...\" and no data\n",
"print(sliced)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So why doesn't `sliced` return any data? Dask is a lazy execution framework (as we discussed in class!). You need to explicitly call `compute()` (to get all rows) or `head()` (to get the first few rows) to materialize the result."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>year</th>\n",
" <th>month</th>\n",
" <th>day</th>\n",
" <th>dep_time</th>\n",
" <th>dep_delay</th>\n",
" <th>arr_time</th>\n",
" <th>arr_delay</th>\n",
" <th>carrier</th>\n",
" <th>tailnum</th>\n",
" <th>flight</th>\n",
" <th>origin</th>\n",
" <th>dest</th>\n",
" <th>air_time</th>\n",
" <th>distance</th>\n",
" <th>hour</th>\n",
" <th>minute</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2013</td>\n",
" <td>5</td>\n",
" <td>7</td>\n",
" <td>1657</td>\n",
" <td>-3</td>\n",
" <td>2104</td>\n",
" <td>10</td>\n",
" <td>DL</td>\n",
" <td>N3760C</td>\n",
" <td>329</td>\n",
" <td>JFK</td>\n",
" <td>SJU</td>\n",
" <td>216</td>\n",
" <td>1598</td>\n",
" <td>16</td>\n",
" <td>57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2013</td>\n",
" <td>12</td>\n",
" <td>8</td>\n",
" <td>859</td>\n",
" <td>-1</td>\n",
" <td>1238</td>\n",
" <td>11</td>\n",
" <td>DL</td>\n",
" <td>N712TW</td>\n",
" <td>422</td>\n",
" <td>JFK</td>\n",
" <td>LAX</td>\n",
" <td>376</td>\n",
" <td>2475</td>\n",
" <td>8</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2013</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>1817</td>\n",
" <td>-3</td>\n",
" <td>2008</td>\n",
" <td>3</td>\n",
" <td>AA</td>\n",
" <td>N3AXAA</td>\n",
" <td>353</td>\n",
" <td>LGA</td>\n",
" <td>ORD</td>\n",
" <td>138</td>\n",
" <td>733</td>\n",
" <td>18</td>\n",
" <td>17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>2013</td>\n",
" <td>12</td>\n",
" <td>9</td>\n",
" <td>1259</td>\n",
" <td>14</td>\n",
" <td>1617</td>\n",
" <td>22</td>\n",
" <td>WN</td>\n",
" <td>N218WN</td>\n",
" <td>1428</td>\n",
" <td>EWR</td>\n",
" <td>HOU</td>\n",
" <td>240</td>\n",
" <td>1411</td>\n",
" <td>12</td>\n",
" <td>59</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>2013</td>\n",
" <td>8</td>\n",
" <td>13</td>\n",
" <td>1920</td>\n",
" <td>85</td>\n",
" <td>2032</td>\n",
" <td>71</td>\n",
" <td>B6</td>\n",
" <td>N284JB</td>\n",
" <td>1407</td>\n",
" <td>JFK</td>\n",
" <td>IAD</td>\n",
" <td>48</td>\n",
" <td>228</td>\n",
" <td>19</td>\n",
" <td>20</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32726</th>\n",
" <td>2013</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>1558</td>\n",
" <td>-2</td>\n",
" <td>1854</td>\n",
" <td>4</td>\n",
" <td>DL</td>\n",
" <td>N3737C</td>\n",
" <td>1331</td>\n",
" <td>JFK</td>\n",
" <td>DEN</td>\n",
" <td>238</td>\n",
" <td>1626</td>\n",
" <td>15</td>\n",
" <td>58</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32728</th>\n",
" <td>2013</td>\n",
" <td>7</td>\n",
" <td>13</td>\n",
" <td>1923</td>\n",
" <td>18</td>\n",
" <td>2124</td>\n",
" <td>18</td>\n",
" <td>9E</td>\n",
" <td>N922XJ</td>\n",
" <td>3525</td>\n",
" <td>JFK</td>\n",
" <td>ORD</td>\n",
" <td>107</td>\n",
" <td>740</td>\n",
" <td>19</td>\n",
" <td>23</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32729</th>\n",
" <td>2013</td>\n",
" <td>1</td>\n",
" <td>28</td>\n",
" <td>706</td>\n",
" <td>36</td>\n",
" <td>909</td>\n",
" <td>22</td>\n",
" <td>EV</td>\n",
" <td>N13914</td>\n",
" <td>4419</td>\n",
" <td>EWR</td>\n",
" <td>IND</td>\n",
" <td>105</td>\n",
" <td>645</td>\n",
" <td>7</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32731</th>\n",
" <td>2013</td>\n",
" <td>7</td>\n",
" <td>7</td>\n",
" <td>812</td>\n",
" <td>-3</td>\n",
" <td>1043</td>\n",
" <td>8</td>\n",
" <td>DL</td>\n",
" <td>N6713Y</td>\n",
" <td>1429</td>\n",
" <td>JFK</td>\n",
" <td>LAS</td>\n",
" <td>286</td>\n",
" <td>2248</td>\n",
" <td>8</td>\n",
" <td>12</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32733</th>\n",
" <td>2013</td>\n",
" <td>10</td>\n",
" <td>15</td>\n",
" <td>844</td>\n",
" <td>56</td>\n",
" <td>1045</td>\n",
" <td>60</td>\n",
" <td>B6</td>\n",
" <td>N258JB</td>\n",
" <td>1273</td>\n",
" <td>JFK</td>\n",
" <td>CHS</td>\n",
" <td>93</td>\n",
" <td>636</td>\n",
" <td>8</td>\n",
" <td>44</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>13462 rows × 16 columns</p>\n",
"</div>"
],
"text/plain": [
" year month day dep_time dep_delay arr_time arr_delay carrier \\\n",
"1 2013 5 7 1657 -3 2104 10 DL \n",
"2 2013 12 8 859 -1 1238 11 DL \n",
"5 2013 1 1 1817 -3 2008 3 AA \n",
"6 2013 12 9 1259 14 1617 22 WN \n",
"7 2013 8 13 1920 85 2032 71 B6 \n",
"... ... ... ... ... ... ... ... ... \n",
"32726 2013 2 4 1558 -2 1854 4 DL \n",
"32728 2013 7 13 1923 18 2124 18 9E \n",
"32729 2013 1 28 706 36 909 22 EV \n",
"32731 2013 7 7 812 -3 1043 8 DL \n",
"32733 2013 10 15 844 56 1045 60 B6 \n",
"\n",
" tailnum flight origin dest air_time distance hour minute \n",
"1 N3760C 329 JFK SJU 216 1598 16 57 \n",
"2 N712TW 422 JFK LAX 376 2475 8 59 \n",
"5 N3AXAA 353 LGA ORD 138 733 18 17 \n",
"6 N218WN 1428 EWR HOU 240 1411 12 59 \n",
"7 N284JB 1407 JFK IAD 48 228 19 20 \n",
"... ... ... ... ... ... ... ... ... \n",
"32726 N3737C 1331 JFK DEN 238 1626 15 58 \n",
"32728 N922XJ 3525 JFK ORD 107 740 19 23 \n",
"32729 N13914 4419 EWR IND 105 645 7 6 \n",
"32731 N6713Y 1429 JFK LAS 286 2248 8 12 \n",
"32733 N258JB 1273 JFK CHS 93 636 8 44 \n",
"\n",
"[13462 rows x 16 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sliced.compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 3. Aggregation\n",
"Now that you have an initial idea of how to program with Dask, write the following code snippets.\n",
"\n",
"* Calculate the average distance flown for flights that are delayed more than 10 minutes on arrival."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1019.3765413757243"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#FILL IN HERE\n",
"#If your code is correct, the result should be a number a little over 1000"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Calculate the average departure delay for each departure hour"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"hour\n",
"0 128.747126\n",
"1 202.360000\n",
"2 193.666667\n",
"3 286.500000\n",
"4 -5.384615\n",
"5 -4.478571\n",
"6 -1.005655\n",
"7 0.895131\n",
"8 0.919722\n",
"9 4.259020\n",
"10 5.277160\n",
"11 4.861985\n",
"12 7.487395\n",
"13 10.540070\n",
"14 7.545012\n",
"15 10.218016\n",
"16 14.008253\n",
"17 16.630182\n",
"18 18.576611\n",
"19 22.026598\n",
"20 29.882601\n",
"21 41.240824\n",
"22 66.673874\n",
"23 91.680297\n",
"24 49.000000\n",
"Name: dep_delay, dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#FILL IN HERE\n",
"#If your code is correct, you should see an average for each departure hour"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Calculate the average distance flown\n",
"* Calculate the number of flights that have a distance larger than the average distance\n",
"* Write your program so that Dask shares the average distance computation between both queries"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1046.244050710249 12579\n"
]
}
],
"source": [
"#FILL IN HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}