README.md 4.05 KB
Newer Older
1 2
# Homework 2. Bloom Filter
This homework assignment introduces an advanced use of hashing called a Bloom filter. 
Sanjay Krishnan committed
3

4
Due Date: *Friday May 1st, 2020 11:59 pm*
Sanjay Krishnan committed
5

6 7
## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assingment you should sync your cloned repository with the online one:
Sanjay Krishnan committed
8
```
9
$ cd cmsc13600-materials
Sanjay Krishnan committed
10 11
$ git pull
```
12 13

Copy the folder ``hw2`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw2`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add 'bloom.py' to git:
Sanjay Krishnan committed
14
```
15
$ git add bloom.py 
Sanjay Krishnan committed
16
```
17
After adding your files, to submit your code you must run:
Sanjay Krishnan committed
18
```
19 20
$ git commit -m"My submission"
$ git push
Sanjay Krishnan committed
21
```
22
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
Sanjay Krishnan committed
23

24 25
## Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*.
Sanjay Krishnan committed
26

27
Here's how the basic Bloom filter works:
Sanjay Krishnan committed
28

29 30 31
### Initialization
* An empty Bloom filter is initialized with an array of *m* elements each with value 0.
* Generate *k* independent hash functions whose output domain are integers {0,...,m}.
Sanjay Krishnan committed
32

33 34 35
### Adding An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).
Sanjay Krishnan committed
36

37 38 39
### Contains An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not have possibly been inserted in the past.
Sanjay Krishnan committed
40

41 42
## TODO 1. Generate K independent Hash Functions
Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions each with a range from 0 to *m*. Here are some hints that will help you write this function.
Sanjay Krishnan committed
43

44
* Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the domain of this hash function to be with 0 to m?
Sanjay Krishnan committed
45

46
* Step 2. Generate k of such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works.
Sanjay Krishnan committed
47

48
* Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take. 
Sanjay Krishnan committed
49

50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## TODO 2. Put
Write a function that uses the algorithm listed above to add a string to the bloom filter. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* Set the ith element of the array to 1

## TODO 3. Get
Write a function that uses the algorithm listed above to test whether the bloom filter possibly contains the string. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* if the ith element is 0, return false
* if all code-indices are 1, return true

## Testing
We've provided an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the test do and why they work.