Commit c5746212 by Sanjay Krishnan

updated

parent 566dda97
# Homework 4. Bloom Filter (WILL BE UPDATED, DON'T START ON THIS)
This homework assignment introduces an advanced use of hashing called a Bloom filter.
Due Date: *Friday May 7, 11:59 pm*
## Initial Setup
Before you start an assignment, you should sync your cloned repository with the online one:
```
$ cd cmsc13600-materials
$ git pull
```
Copy the folder ``hw4`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw4`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add ``bloom.py`` to git:
```
$ git add bloom.py
```
After adding your files, to submit your code you must run:
```
$ git commit -m"My submission"
$ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the [web interface](https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr).
## Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*.
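To get a feel for how quickly false positives grow as items are added, the sketch below evaluates the standard approximation `(1 - e^(-k*n/m))^k` for a filter with *m* array slots, *k* hash functions, and *n* inserted items. The function name `false_positive_rate` is hypothetical and not part of `bloom.py`, and the formula assumes ideal, independent hash functions, so treat the numbers as rough estimates only.

```python
import math


def false_positive_rate(m, k, n):
    """Approximate false-positive probability, assuming ideal hash functions."""
    return (1.0 - math.exp(-k * n / m)) ** k


# The estimate grows as more items are inserted into a fixed-size filter.
for n in (10, 50, 100):
    print(n, round(false_positive_rate(100, 5, n), 4))
```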
Here's how the basic Bloom filter works:
### Initialization
* An empty Bloom filter is initialized with an array of *m* elements each with value 0.
* Generate *k* independent hash functions whose outputs are integers in {0,...,m-1}, so that every output is a valid index into the array.
### Adding An Item e
* For each hash function, calculate the hash value of the item "e" (it should be a number from 0 to m-1).
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).
### Contains An Item e
* For each hash function, calculate the hash value of the item "e" (it should be a number from 0 to m-1).
* Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not possibly have been inserted in the past. If all of the values are 1, "e" is possibly in the set.
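The three steps above can be illustrated with a toy filter. This is a minimal sketch, not the assignment's intended solution: the class `ToyBloom` and its method names are hypothetical, and it derives its *k* indices from salted SHA-256 digests rather than the linear hash functions this homework asks for.

```python
import hashlib


class ToyBloom:
    """Illustrative Bloom filter; uses salted SHA-256, not linear hashes."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # empty filter: m slots, all 0

    def _indices(self, item):
        # One index per "hash function"; the salt makes each one different.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for i in self._indices(item):
            self.bits[i] = 1  # already-set slots simply stay at 1

    def contains(self, item):
        # Any 0 means "definitely not in set"; all 1s means "possibly in set".
        return all(self.bits[i] == 1 for i in self._indices(item))


toy = ToyBloom(m=64, k=3)
toy.add("chicago")
print(toy.contains("chicago"))   # True: an added item is never reported missing
print(toy.contains("evanston"))  # normally False, but a false positive is possible
```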
## TODO 1. Generate K independent Hash Functions
Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions, each with a range from 0 to *m*-1. Here are some hints that will help you write this function.
* Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the output of this hash function to lie within 0 to m-1?
* Step 2. Generate k such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works.
* Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take.
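As a rough illustration of the higher-order pattern in these hints, here is a sketch under several assumptions that the handout does not confirm: that the lecture's linear hash has the form `(A * x + B) mod m`, that strings are first converted to an integer with `binascii.crc32`, and that A and B are drawn with `random.randint` from the ranges shown. The names `make_linear_hash` and `sketch_generate_hashes` are hypothetical. Because the autograder's expected values depend on exactly which random calls are made, this sketch may not reproduce them.

```python
import binascii
import random


def make_linear_hash(a, b, m):
    """Return h(s) = (a * x + b) mod m, where x is an integer derived from s."""
    def h(s):
        # Assumption: CRC32 is used to turn the string into an integer.
        x = binascii.crc32(s.encode("utf-8"))
        return (a * x + b) % m  # the modulo keeps the result in 0..m-1
    return h


def sketch_generate_hashes(m, k, seed):
    random.seed(seed)  # same seed, same sequence of A and B values
    hashes = []
    for _ in range(k):
        a = random.randint(1, 2**31 - 1)  # hypothetical parameter range
        b = random.randint(0, 2**31 - 1)  # hypothetical parameter range
        hashes.append(make_linear_hash(a, b, m))
    return hashes


hs = sketch_generate_hashes(100, 3, seed=0)
print([h("chicago") for h in hs])
```

Building each function through a helper gives every closure its own `a` and `b`. Writing `lambda s: (a * x + b) % m` directly inside the loop would make all *k* functions share the loop's final values.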
## TODO 2. Put
Write a function that uses the algorithm listed above to add a string to the Bloom filter. In pseudo-code:
* For each of the k hash functions:
  * Compute the hash code of the string, and store the code in i
  * Set the ith element of the array to 1
## TODO 3. Get
Write a function that uses the algorithm listed above to test whether the Bloom filter possibly contains the string. In pseudo-code:
* For each of the k hash functions:
  * Compute the hash code of the string, and store the code in i
  * If the ith element is 0, return false
* If the element at every computed index is 1, return true
## Testing
We've provided an autograder script `autograder.py` which runs a number of tests. The autograder is not comprehensive, but it is a good start. It's up to you to figure out what the tests do and why they work.
import random
import string
from bloom import *

def generate_random_string(seed=True):
    chars = string.ascii_uppercase + string.digits
    size = 10
    return ''.join(random.choice(chars) for x in range(size))

def test_hash_generation():
    b = Bloom(5,10)
    try:
        assert(len(b.hashes) == 10)
    except:
        print('[#1] Failure the number of generated hashes is wrong')
    try:
        for h in b.hashes:
            h(generate_random_string())
    except:
        print('[#2] The hashes are not properly represented as a lambda')
    s = generate_random_string()
    try:
        for h in b.hashes:
            assert(h(s) == h(s))
    except:
        print('[#3] Hashes are not deterministic')
    try:
        b = Bloom(100,10)
        b1h = b.hashes[0](s)
        b = Bloom(100,10)
        b2h = b.hashes[0](s)
        assert(b1h == b2h)
    except:
        print('[#4] Seeds are not properly set')
    try:
        b = Bloom(100,10)
        for h in b.hashes:
            for i in range(10):
                assert( h(generate_random_string())< 100 )
    except:
        print('[#5] Hash exceeds range')
    try:
        b = Bloom(1000,2)
        s = generate_random_string()
        bh1 = b.hashes[0](s)
        bh2 = b.hashes[1](s)
        assert(bh1 != bh2)
    except:
        print('[#6] Hashes generated are not independent')

def test_put():
    b = Bloom(100,10,seed=0)
    b.put('the')
    b.put('university')
    b.put('of')
    b.put('chicago')
    try:
        assert(sum(b.array) == 30)
    except:
        print('[#7] Unexpected Put() Result')

def test_put_get():
    b = Bloom(100,5,seed=0)
    b.put('the')
    b.put('quick')
    b.put('brown')
    b.put('fox')
    b.put('jumped')
    b.put('over')
    b.put('the')
    b.put('lazy')
    b.put('dog')
    results = [b.contains('the'),\
               b.contains('cow'), \
               b.contains('jumped'), \
               b.contains('over'),\
               b.contains('the'), \
               b.contains('moon')]
    try:
        assert(results == [True, False, True, True, True, False])
    except:
        print('[#8] Unexpected contains result')

test_hash_generation()
test_put()
test_put_get()
'''bloom.py defines a bloom filter which is an
approximate set membership data structure. You
will implement a full bloom filter in this module
'''
import array
import binascii
import random

class Bloom(object):

    def __init__(self, m,k, seed=0):
        '''Creates a bloom filter of size m with k
           independent hash functions.
        '''
        self.array = array.array('B', [0] * m)
        self.hashes = self.generate_hashes(m,k,seed)
        #TODO

    def generate_hashes(self, m, k, seed):
        '''Generate *k* independent linear hash functions
           each with the range 0,...,m-1.
           m: the size of the range of the hash functions
           k: the number of hash functions
           seed: a random seed that controls which A/B linear
                 parameters are used.
           The output of this function should be a list of functions
        '''
        random.seed(seed)
        #TODO
        raise ValueError('Not Implemented')

    def put(self, item):
        '''Add a string to the bloom filter, returns void
        '''
        #TODO

    def contains(self, item):
        '''Test to see if the bloom filter could possibly
           contain the string: returns True (possibly present)
           or False (definitely absent).
        '''
        #TODO