"How do we compute something over an infinitely big input data structure? The answer to this question underpins the design of programming languages and frameworks for working with large datasets. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Countable Collections"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mathematical notion of infinity is complex, and oddly enough, there are *degrees* of infinity. We are interested in objects that are infinitely big. Consider the following mathematical sets:\n",
"* The set of real numbers $\\mathbb{R}$ \n",
"* The set of natural numbers $\\mathbb{N}$ \n",
"* The set of numbers divisible by 3 and 17 $\\{51, 72, 153,...\\}$\n",
"\n",
"What do all of these sets have in common? First, all of these sets are infinitely big. Next, they are all ordered, i.e., for any two elements $e_1$ and $e_2$ there is a precise $>,<,=$ relationship. \n",
"\n",
"However, we intuitively feel that $\\mathbb{R}$ is bigger than the other two sets. One aspect that is different between $\\mathbb{R}$ and the other two sets is a clear definition of the \"next\" element. For example, if we take a real number $1.01$ what is the next real number? Is it $1.010001$ or $1.01000001$? We can go smaller *ad infinitum* making this process a fool's errand. \n",
"\n",
"This intuition gets at the notion of *countability*. A countable set is one that can be laid out in a precise sequence, where every element is assigned a location (e.g., the 5th element, or the 131st element):\n",
"$$[0,1,2,3,4,...]$$\n",
"$$[51, 72, 153,...]$$\n",
"In this sense, every countable set is indexed by the natural numbers. There is a unique correspondence between a position (described by a natural number) and an element. In summary, countability requires two conditions, a unique first element and for each element a unique next element. By induction, this creates a sequence like the ones above. \n",
"\n",
"Fun facts (prove them on your own):\n",
"* Every infinite countable set has the same cardinality (size) as $\\mathbb{N}$.\n",
"* The set of rational numbers is countable\n",
"* All finite sets are countable"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Iterators\n",
"An iterator is a programatic way of interacting with countably \"infinite\" data. Infinite is in quotes because we know that any realistic dataset will not be infinite but *you are programming as if the dataset were infinite*. Rather than loading the entire input into a program, we process the input element-by-element. The design of iterators in programming languages mirrors the mathematical definitions of countability described above. \n",
"\n",
"Let's start with some simple examples in Python. The iteration paradigm that most are used to is the \"for each\" loop. For example, suppose I have a Python list:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a\n",
"b\n",
"c\n"
]
}
],
"source": [
"lst = ['a', 'b', 'c']\n",
"for elem in lst:\n",
" print(elem)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python further allows one to explicitly track the indices of each fetched element of the list. Notice, how this definition mirrors the mathematical description above. We are assigning a natural number to each element of the list."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 a\n",
"1 b\n",
"2 c\n"
]
}
],
"source": [
"lst = ['a', 'b', 'c']\n",
"for index, elem in enumerate(lst):\n",
" print(index, elem)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The list above is *materialized* that means it is explicitly stored in memory with all of its elements. Suppose, we wanted to enumerate an infinite set, say the Fibonacci sequence, we could not do that by hand. To be able to write programs that do this, we will have to get under the hood of how Python handles iteration and enumeration. \n",
"\n",
"As you have probably already noticed, we can only apply `for` loops to certain objects in Python. This is because only some classes in Python are *iterable*. An iterable class defines two properties `__iter__` and `__next__`, which initialize an iteration and return the next element---exactly mirroring the discussion about countability above! Let's first revist iterating over `lst` with the iterator API: "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a\n"
]
}
],
"source": [
"lst = ['a', 'b', 'c']\n",
"lst_iter = iter(lst) #return the iterator object\n",
"\n",
"\n",
"print(next(lst_iter)) #'a'"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b\n"
]
}
],
"source": [
"print(next(lst_iter)) #'b'"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"c\n"
]
}
],
"source": [
"print(next(lst_iter)) #'c'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When an iterator runs out of elements it raise a `StopIteration` exception:"
"\u001b[0;32m<ipython-input-6-e3cc24e2a13f>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnext\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlst_iter\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m#'error'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mStopIteration\u001b[0m: "
]
}
],
"source": [
"print(next(lst_iter)) #'error'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Re-running `iter()` allows you to reset an iteration sequence:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a\n",
"a\n",
"a\n"
]
}
],
"source": [
"lst = ['a', 'b', 'c']\n",
"lst_iter = iter(lst) #return the iterator object\n",
"print(next(lst_iter)) #'a'\n",
"lst_iter = iter(lst)\n",
"print(next(lst_iter)) #'b'\n",
"lst_iter = iter(lst) #return the iterator object\n",
"print(next(lst_iter)) #'a'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing your own iterators\n",
"This API is not that interesting for a materialized list, but is very powerful when we have to define complex sequences to iterate over. We do so by writing a class with the properties `__iter__` and `__next__`. Consider the following class that returns a sequence of every number divisible by 3."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"class ThreeMultIterator():\n",
" '''\n",
" A class that iterates over multiples of three\n",
" '''\n",
" \n",
" def __init__(self):\n",
" '''\n",
" Constructor, nothing here for now\n",
" '''\n",
" pass\n",
" \n",
" \n",
" def __iter__(self):\n",
" '''\n",
" Initialize the iterator with the first element\n",
" '''\n",
" self.current = 0\n",
" return self #have to return an object here!!\n",
" \n",
" def __next__(self):\n",
" '''\n",
" Returns the next element \n",
" '''\n",
" \n",
" elem = self.current #pin the current element\n",
" self.current += 3 #update the current element\n",
" \n",
" return elem #return pin"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ThreeMultIterator` class defines an infinite sequence of multiples of 3. To fetch the first 10 elements, we can write the following code:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"3\n",
"6\n",
"9\n",
"12\n",
"15\n",
"18\n",
"21\n",
"24\n",
"27\n"
]
}
],
"source": [
"threeMults = ThreeMultIterator()\n",
"three_iter = iter(threeMults)\n",
"\n",
"\n",
"limit = 10\n",
"\n",
"while limit > 0:\n",
" print(next(three_iter)) #print the next element\n",
" limit -= 1 #decrement limit"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fortunately, you don't have to write code like this by hand. Python has a number of useful library functions to manipulate iterators. You can find these functions in the `itertools` module:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[30, 33, 36, 39, 42, 45, 48, 51, 54, 57]\n"
]
}
],
"source": [
"import itertools\n",
"threeMults = ThreeMultIterator()\n",
"three_iter = iter(threeMults)\n",
"print(list(itertools.islice(three_iter, 10,20))) #take the first 10 elements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The important point to note is that `ThreeMultIterator` is an infinite sequence, however it can run just fine in Python with the iterator API. This is because the entire sequence is NOT materialized. We process just as much data as we need at each instant. Iterators can be more complex than simple numerical sequences, let's write an iterator that returns the Fibonacci series:\n",
"$$S_0 = 0$$\n",
"$$S_1 = 1$$\n",
"$$S_{i} = S_{i-1} + S_{i-2} \\text{ for i > 1}$$"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"class Fib():\n",
" '''\n",
" A class that iterates over fibonacci sequence\n",
" '''\n",
" \n",
" def __init__(self):\n",
" '''\n",
" Constructor, nothing here for now\n",
" '''\n",
" pass\n",
" \n",
" \n",
" def __iter__(self):\n",
" '''\n",
" Initialize the iterator with the first TWO elements\n",
" '''\n",
" self.i1 = 0\n",
" self.i2 = 1\n",
" self.current = 0\n",
" \n",
" return self #have to return an object here!!\n",
" \n",
" def __next__(self):\n",
" '''\n",
" Returns the next element \n",
" '''\n",
" \n",
" elem = self.current #pin the current element\n",
" \n",
" self.current = self.i1 + self.i2 #update the current element\n",
" \n",
" self.i2 = self.i1 #update the prev 2 elements\n",
" self.i1 = self.current\n",
" \n",
" return elem #return pin"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can run similar code to that above to iterate over the infinite Fibonacci series"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0, 1, 1, 2, 3, 5, 8, 13, 21, 34]\n"
]
}
],
"source": [
"import itertools\n",
"fib = Fib()\n",
"fib_iter = iter(fib)\n",
"print(list(itertools.islice(fib, 10))) #take the first 10 elements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Fibonacci example is interesting because the iterator needs to \"remember\" its previous return values. It does so through class variables. The class variables initialized and updated in `__iter__` and `__next__` are called state variables. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practical Usage in Data Engineering\n",
"Going beyond the conceptual examples with numbers, let's discuss a concrete data engineering scenario. Often times we will be presented with a dataset that does not fit into main memory (but is stored as a file on a much bigger hard disk). For example, we might want to compute the total value (sum) of a very big file of numbers. We can use iterators to solve this problem. The file `my_file` contains a list of numbers. Let's write an iterator that \"scans\" this file:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"class FileScan:\n",
" \"\"\"Loads a large file into the\n",
" program line-by-line\"\"\"\n",
"\n",
" def __init__(self, filename):\n",
" self.filename = filename\n",
"\n",
" def __iter__(self):\n",
" self.file = open(self.filename, 'r')\n",
" self.line = self.file.readline()\n",
" return self\n",
"\n",
" def __next__(self):\n",
" if self.line != \"\":\n",
" result = int(self.line)\n",
" self.line = self.file.readline()\n",
" return result\n",
" else:\n",
" self.file.close()\n",
" raise StopIteration"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can write code to sum over all of the numbers:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"42\n"
]
}
],
"source": [
"scan = FileScan(\"my_file\")\n",
"scan_iter = iter(scan)\n",
"\n",
"total = 0\n",
"while True:\n",
" try:\n",
" first = next(scan_iter)\n",
"\n",
" except StopIteration:\n",
" break\n",
"\n",
"print(total)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use a for loop as shorthand for the above code snippet:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"42\n"
]
}
],
"source": [
"total = 0\n",
"\n",
"for i in FileScan(\"my_file\"):\n",
" total += i\n",
" \n",
"print(total)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the takeaway messages? Even though this file is very big the amount of \"state\" that we need to compute the sum is really only two numbers (the running total and the current number). "