README.md 3.59 KB
Newer Older
Sanjay Krishnan committed
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
# Sketching

*Due 6/11/19 11:59 PM (for all students)*
*Due 6/6/19 11:59 PM (for all graduating seniors)*

In this assignment, you will revisit the IMDB dataset and explore approximate ways of counting distinct values. 

## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Then, copy the `hw4` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `hyperloglog.py` this is the only file that you will modify. Finally, add `hyperloglog.py` using `git add`:
```
$ git add hyperloglog.py
$ git commit -m'initialized homework'
```

Now, you will need to fetch the data used in this assignment. Download title.csv put it in the hw4 folder:

https://www.dropbox.com/s/zl7yt8cl0lvajxg/title.csv?dl=0

DO NOT ADD title.csv to the git repo! After downloading the 
dataset, there is a python module provided for you called `core.py`, which reads the dataset. This module loads the data in as
an iterator in two functions `imdb_years()` and `imdb_title_words()`:
```
>> for i in imdb_years():
... print(i)

1992
1986
<so on>
```
Play around with both `imdb_years()` and `imdb_title_words()` to get a feel for how the data works. You will also find it useful to install NumPy for this project:
```
$ pip3 install numpy
```
## Helper Functions
You will first write a series of helper functions to manipulate hash codes.

### hashcode(s)
The first function to implement is `hashcode(s)`. This function takes in a string and returns a a 32-bit unsigned integer. You will use the python function `hash(obj)` to implement this function. Hint: Python uses 64 bit integers and you will find the numpy library useful to converting the hash code into a 32 bit number.

### code2bin(code)
The next function to implement is `code2bin(code)`. This function takes in a 32-bit unsigned code from the previous example and turns it into a binary list of 32 1/0 values.
The list must be ordered from most significant to least significant bit, i.e., the most significant bit is the first element of the list. 

### bin2code(lst)
This function takes in a 32-bit binary list and returns a unsigned 32-bit integer; assuming most significant bit first as before. This should be able to retrive your original code value!

### rho(lst)
The last helper function you will write is `rho(lst)` which returns the position of the left-most bit that is one. If there is no bit with the value one, just return the length of lst.

## HyperLogLog Algorithm
The gist of HyperLogLog is that it maintains a compressed array (`self.M`) that summarizes the data. For each record in your dataset, it updates this array. `self.b` determines how big this compressed array is. At the end it tries to reconstruct the distinct count from this array. You will write the `update(s)` function for HyperLogLog. The update function takes an input string `s` and updates self.M accordingly:
1. Calculate the hash code of s
2. Calculate the binary representation of the hashcode (call it c)
3. Split this binary representation into two sublists b- (the first b elements) and b+ (the remaining elements)
4. Use your helper functions to calculate j: the integer representation of (b-), and w: the position of the left-most bit in b+.
5. Update with `self.M[j] = max(self.M[j], w)`

## Testing and Submission
 We have provided a basic wrapper in `test.py`, the test is incomplete and is not meant to comprehensively grade your assignment. If you follow directions closely you should usually be within 50 of the true count for `imdb_years()`. After you finish the assignment you can submit your code with:
```
$ git push
```