{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 4: Python Dask Lab\n", "\n", "*Due May 7th, 2021 11:59 PM*\n", "\n", "Dask is an open-source library for parallel computing written in Python. We will use Dask over the next few weeks to illustrate the basics of parallel and distributed computation. This homework assignment will walk you through some of the basic syntax of Dask.\n", "\n", "It is your job to read the documentation and figure out how to do each step on your own. You are responsible for adding code at every \"FILL IN HERE\" placeholder below.\n", "\n", "## Installing Dask\n", "To get started, you need to install the Dask packages. If you are using `pip`:\n", "```\n", "pip install dask\n", "pip install \"dask[distributed]\"\n", "```\n", "If you are using `conda`:\n", "```\n", "conda install numpy pandas h5py pillow matplotlib scipy toolz pytables snakeviz scikit-image dask distributed -c conda-forge\n", "```\n", "Let us know if you have any difficulties installing Dask.\n", "\n", "\n", "## Exercise 1. Loading Data Sets\n", "\n", "We've given you a sample dataset of flights out of the New York City airports (JFK, LGA, and EWR), with arrival times, departure times, delays, etc. Like Pandas, Dask exposes a DataFrame interface. Write code below to load the data in `nycflights.csv` into a Dask DataFrame." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import dask.dataframe as dd\n", "df = #FILL IN HERE" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
" <thead>\n",
" <tr><th></th><th>year</th><th>month</th><th>day</th><th>dep_time</th><th>dep_delay</th><th>arr_time</th><th>arr_delay</th><th>carrier</th><th>tailnum</th><th>flight</th><th>origin</th><th>dest</th><th>air_time</th><th>distance</th><th>hour</th><th>minute</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>2013</td><td>6</td><td>30</td><td>940</td><td>15</td><td>1216</td><td>-4</td><td>VX</td><td>N626VA</td><td>407</td><td>JFK</td><td>LAX</td><td>313</td><td>2475</td><td>9</td><td>40</td></tr>\n",
" <tr><th>1</th><td>2013</td><td>5</td><td>7</td><td>1657</td><td>-3</td><td>2104</td><td>10</td><td>DL</td><td>N3760C</td><td>329</td><td>JFK</td><td>SJU</td><td>216</td><td>1598</td><td>16</td><td>57</td></tr>\n",
" <tr><th>2</th><td>2013</td><td>12</td><td>8</td><td>859</td><td>-1</td><td>1238</td><td>11</td><td>DL</td><td>N712TW</td><td>422</td><td>JFK</td><td>LAX</td><td>376</td><td>2475</td><td>8</td><td>59</td></tr>\n",
" <tr><th>3</th><td>2013</td><td>5</td><td>14</td><td>1841</td><td>-4</td><td>2122</td><td>-34</td><td>DL</td><td>N914DL</td><td>2391</td><td>JFK</td><td>TPA</td><td>135</td><td>1005</td><td>18</td><td>41</td></tr>\n",
" <tr><th>4</th><td>2013</td><td>7</td><td>21</td><td>1102</td><td>-3</td><td>1230</td><td>-8</td><td>9E</td><td>N823AY</td><td>3652</td><td>LGA</td><td>ORF</td><td>50</td><td>296</td><td>11</td><td>2</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " year month day dep_time dep_delay arr_time arr_delay carrier tailnum \\\n", "0 2013 6 30 940 15 1216 -4 VX N626VA \n", "1 2013 5 7 1657 -3 2104 10 DL N3760C \n", "2 2013 12 8 859 -1 1238 11 DL N712TW \n", "3 2013 5 14 1841 -4 2122 -34 DL N914DL \n", "4 2013 7 21 1102 -3 1230 -8 9E N823AY \n", "\n", " flight origin dest air_time distance hour minute \n", "0 407 JFK LAX 313 2475 9 40 \n", "1 329 JFK SJU 216 1598 16 57 \n", "2 422 JFK LAX 376 2475 8 59 \n", "3 2391 JFK TPA 135 1005 18 41 \n", "4 3652 LGA ORF 50 296 11 2 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#If your solution above is correct you should see 5 rows of the table printed out by running this code\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2. Slicing and Lazy Evaluation\n", "Dask looks really similar to Pandas! Let's try to see how it's different. Write code that slices the above DataFrame to extract only the flights that have delayed arrivals (arr_delay > 0):" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sliced = #FILL IN HERE" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dask DataFrame Structure:\n", " year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute\n", "npartitions=1 \n", " int64 int64 int64 int64 int64 int64 int64 object object int64 object object int64 int64 int64 int64\n", " ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", "Dask Name: getitem, 4 tasks\n" ] } ], "source": [ "#If your code above is correct, the output of this cell should return \"Dask DataFrame Structure:...\" and no data\n", "print(sliced)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So why doesn't `sliced` return any data? 
Dask is a lazy execution framework (as we discussed in class!): slicing only builds a task graph and does not run it. You need to explicitly call `compute()` (to get all rows) or `head()` (to get the first few rows) to materialize the result." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
" <thead>\n",
" <tr><th></th><th>year</th><th>month</th><th>day</th><th>dep_time</th><th>dep_delay</th><th>arr_time</th><th>arr_delay</th><th>carrier</th><th>tailnum</th><th>flight</th><th>origin</th><th>dest</th><th>air_time</th><th>distance</th><th>hour</th><th>minute</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>1</th><td>2013</td><td>5</td><td>7</td><td>1657</td><td>-3</td><td>2104</td><td>10</td><td>DL</td><td>N3760C</td><td>329</td><td>JFK</td><td>SJU</td><td>216</td><td>1598</td><td>16</td><td>57</td></tr>\n",
" <tr><th>2</th><td>2013</td><td>12</td><td>8</td><td>859</td><td>-1</td><td>1238</td><td>11</td><td>DL</td><td>N712TW</td><td>422</td><td>JFK</td><td>LAX</td><td>376</td><td>2475</td><td>8</td><td>59</td></tr>\n",
" <tr><th>5</th><td>2013</td><td>1</td><td>1</td><td>1817</td><td>-3</td><td>2008</td><td>3</td><td>AA</td><td>N3AXAA</td><td>353</td><td>LGA</td><td>ORD</td><td>138</td><td>733</td><td>18</td><td>17</td></tr>\n",
" <tr><th>6</th><td>2013</td><td>12</td><td>9</td><td>1259</td><td>14</td><td>1617</td><td>22</td><td>WN</td><td>N218WN</td><td>1428</td><td>EWR</td><td>HOU</td><td>240</td><td>1411</td><td>12</td><td>59</td></tr>\n",
" <tr><th>7</th><td>2013</td><td>8</td><td>13</td><td>1920</td><td>85</td><td>2032</td><td>71</td><td>B6</td><td>N284JB</td><td>1407</td><td>JFK</td><td>IAD</td><td>48</td><td>228</td><td>19</td><td>20</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>32726</th><td>2013</td><td>2</td><td>4</td><td>1558</td><td>-2</td><td>1854</td><td>4</td><td>DL</td><td>N3737C</td><td>1331</td><td>JFK</td><td>DEN</td><td>238</td><td>1626</td><td>15</td><td>58</td></tr>\n",
" <tr><th>32728</th><td>2013</td><td>7</td><td>13</td><td>1923</td><td>18</td><td>2124</td><td>18</td><td>9E</td><td>N922XJ</td><td>3525</td><td>JFK</td><td>ORD</td><td>107</td><td>740</td><td>19</td><td>23</td></tr>\n",
" <tr><th>32729</th><td>2013</td><td>1</td><td>28</td><td>706</td><td>36</td><td>909</td><td>22</td><td>EV</td><td>N13914</td><td>4419</td><td>EWR</td><td>IND</td><td>105</td><td>645</td><td>7</td><td>6</td></tr>\n",
" <tr><th>32731</th><td>2013</td><td>7</td><td>7</td><td>812</td><td>-3</td><td>1043</td><td>8</td><td>DL</td><td>N6713Y</td><td>1429</td><td>JFK</td><td>LAS</td><td>286</td><td>2248</td><td>8</td><td>12</td></tr>\n",
" <tr><th>32733</th><td>2013</td><td>10</td><td>15</td><td>844</td><td>56</td><td>1045</td><td>60</td><td>B6</td><td>N258JB</td><td>1273</td><td>JFK</td><td>CHS</td><td>93</td><td>636</td><td>8</td><td>44</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>13462 rows × 16 columns</p>\n",
"</div>
" ], "text/plain": [ " year month day dep_time dep_delay arr_time arr_delay carrier \\\n", "1 2013 5 7 1657 -3 2104 10 DL \n", "2 2013 12 8 859 -1 1238 11 DL \n", "5 2013 1 1 1817 -3 2008 3 AA \n", "6 2013 12 9 1259 14 1617 22 WN \n", "7 2013 8 13 1920 85 2032 71 B6 \n", "... ... ... ... ... ... ... ... ... \n", "32726 2013 2 4 1558 -2 1854 4 DL \n", "32728 2013 7 13 1923 18 2124 18 9E \n", "32729 2013 1 28 706 36 909 22 EV \n", "32731 2013 7 7 812 -3 1043 8 DL \n", "32733 2013 10 15 844 56 1045 60 B6 \n", "\n", " tailnum flight origin dest air_time distance hour minute \n", "1 N3760C 329 JFK SJU 216 1598 16 57 \n", "2 N712TW 422 JFK LAX 376 2475 8 59 \n", "5 N3AXAA 353 LGA ORD 138 733 18 17 \n", "6 N218WN 1428 EWR HOU 240 1411 12 59 \n", "7 N284JB 1407 JFK IAD 48 228 19 20 \n", "... ... ... ... ... ... ... ... ... \n", "32726 N3737C 1331 JFK DEN 238 1626 15 58 \n", "32728 N922XJ 3525 JFK ORD 107 740 19 23 \n", "32729 N13914 4419 EWR IND 105 645 7 6 \n", "32731 N6713Y 1429 JFK LAS 286 2248 8 12 \n", "32733 N258JB 1273 JFK CHS 93 636 8 44 \n", "\n", "[13462 rows x 16 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sliced.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise 3. Aggregation\n", "Now that you have an inital idea on how to program with Dask, write the following code snippets. \n", "\n", "* Calculate the average distance flown for flights that are delayed moree than 10 minutes on arrival." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1019.3765413757243" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#FILL IN HERE\n", "#If your code is correct, the result should be a number in the low 1000s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Calculate the average departure delay for each departure hour" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "hour\n", "0 128.747126\n", "1 202.360000\n", "2 193.666667\n", "3 286.500000\n", "4 -5.384615\n", "5 -4.478571\n", "6 -1.005655\n", "7 0.895131\n", "8 0.919722\n", "9 4.259020\n", "10 5.277160\n", "11 4.861985\n", "12 7.487395\n", "13 10.540070\n", "14 7.545012\n", "15 10.218016\n", "16 14.008253\n", "17 16.630182\n", "18 18.576611\n", "19 22.026598\n", "20 29.882601\n", "21 41.240824\n", "22 66.673874\n", "23 91.680297\n", "24 49.000000\n", "Name: dep_delay, dtype: float64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#FILL IN HERE\n", "#If your code is correct, you should see an average delay for each departure hour" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Calculate the average distance flown\n", "* Calculate the number of flights that have a distance larger than the average distance\n", "* Write your program so that Dask shares the average distance computation between the two queries" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1046.244050710249 12579\n" ] } ], "source": [ "#FILL IN HERE" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, 
"file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }