# Sketching
*Due 6/11/19 11:59 PM (for all students)*
*Due 6/6/19 11:59 PM (for all graduating seniors)*
In this assignment, you will revisit the IMDB dataset and explore approximate ways of counting distinct values.
## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Then, copy the `hw4` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `hyperloglog.py` this is the only file that you will modify. Finally, add `hyperloglog.py` using `git add`:
```
$ git add hyperloglog.py
$ git commit -m'initialized homework'
```
Now, you will need to fetch the data used in this assignment. Download title.csv put it in the hw4 folder:
https://www.dropbox.com/s/zl7yt8cl0lvajxg/title.csv?dl=0
DO NOT ADD title.csv to the git repo! After downloading the
dataset, there is a python module provided for you called `core.py`, which reads the dataset. This module loads the data in as
an iterator in two functions `imdb_years()` and `imdb_title_words()`:
```
>> for i in imdb_years():
... print(i)
1992
1986
```
Play around with both `imdb_years()` and `imdb_title_words()` to get a feel for how the data works. You will also find it useful to install NumPy for this project:
```
$ pip3 install numpy
```
## Helper Functions
You will first write a series of helper functions to manipulate hash codes.
### hashcode(s)
The first function to implement is `hashcode(s)`. This function takes in a string and returns a a 32-bit unsigned integer. You will use the python function `hash(obj)` to implement this function. Hint: Python uses 64 bit integers and you will find the numpy library useful to converting the hash code into a 32 bit number.
### code2bin(code)
The next function to implement is `code2bin(code)`. This function takes in a 32-bit unsigned code from the previous example and turns it into a binary list of 32 1/0 values.
The list must be ordered from most significant to least significant bit, i.e., the most significant bit is the first element of the list.
### bin2code(lst)
This function takes in a 32-bit binary list and returns a unsigned 32-bit integer; assuming most significant bit first as before. This should be able to retrive your original code value!
### rho(lst)
The last helper function you will write is `rho(lst)` which returns the position of the left-most bit that is one. If there is no bit with the value one, just return the length of lst.
## HyperLogLog Algorithm
The gist of HyperLogLog is that it maintains a compressed array (`self.M`) that summarizes the data. For each record in your dataset, it updates this array. `self.b` determines how big this compressed array is. At the end it tries to reconstruct the distinct count from this array. You will write the `update(s)` function for HyperLogLog. The update function takes an input string `s` and updates self.M accordingly:
1. Calculate the hash code of s
2. Calculate the binary representation of the hashcode (call it c)
3. Split this binary representation into two sublists b- (the first b elements) and b+ (the remaining elements)
4. Use your helper functions to calculate j: the integer representation of (b-), and w: the position of the left-most bit in b+.
5. Update with `self.M[j] = max(self.M[j], w)`
## Testing and Submission
We have provided a basic wrapper in `test.py`, the test is incomplete and is not meant to comprehensively grade your assignment. If you follow directions closely you should usually be within 50 of the true count for `imdb_years()`. After you finish the assignment you can submit your code with:
```
$ git push
```