README.md 4.05 KB
Newer Older
1 2 3
# Homework 2. Bloom Filter
This homework assignment introduces an advanced use of hashing called a Bloom filter. 

Krishnan Sanjay committed
4
Due Date: *Friday May 1st, 2020 11:59 pm*
5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64

## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assingment you should sync your cloned repository with the online one:
```
$ cd cmsc13600-materials
$ git pull
```

Copy the folder ``hw2`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw2`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add 'bloom.py' to git:
```
$ git add bloom.py 
```
After adding your files, to submit your code you must run:
```
$ git commit -m"My submission"
$ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]

## Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*.

Here's how the basic Bloom filter works:

### Initialization
* An empty Bloom filter is initialized with an array of *m* elements each with value 0.
* Generate *k* independent hash functions whose output domain are integers {0,...,m}.

### Adding An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).

### Contains An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not have possibly been inserted in the past.

## TODO 1. Generate K independent Hash Functions
Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions each with a range from 0 to *m*. Here are some hints that will help you write this function.

* Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the domain of this hash function to be with 0 to m?

* Step 2. Generate k of such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works.

* Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take. 

## TODO 2. Put
Write a function that uses the algorithm listed above to add a string to the bloom filter. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* Set the ith element of the array to 1

## TODO 3. Get
Write a function that uses the algorithm listed above to test whether the bloom filter possibly contains the string. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* if the ith element is 0, return false
* if all code-indices are 1, return true

## Testing
We've provided an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the test do and why they work.