Commit 59bfe986 by Sanjay Krishnan

oops

parent f938fff8
# Homework 2. Bloom Filter # Homework 1. Introduction to Python and File I/O
This homework assignment introduces an advanced use of hashing called a Bloom filter. This homework assignment is meant to be an introduction to Python programming and introduces some basic concepts of encoding and decoding.
Due Date: *Friday May 1st, 2020 11:59 pm* Due Date: *Friday April 17, 2020 11:59 pm*
## Initial Setup ## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one: These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one:
...@@ -10,9 +10,9 @@ $ cd cmsc13600-materials ...@@ -10,9 +10,9 @@ $ cd cmsc13600-materials
$ git pull $ git pull
``` ```
Copy the folder ``hw2`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw2`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add 'bloom.py' to git: Copy the folder ``hw1`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw1`` folder. In this homework assignment, you will only modify ``encoding.py``. Once you are done, you must add 'encoding.py' to git:
``` ```
$ git add bloom.py $ git add encoding.py
``` ```
After adding your files, to submit your code you must run: After adding your files, to submit your code you must run:
``` ```
...@@ -21,44 +21,65 @@ $ git push ...@@ -21,44 +21,65 @@ $ git push
``` ```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr] We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
## Bloom filter ## Delta Encoding
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*. Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files.
In this first assignment, you will implement a delta encoding module in python.
The module will:
* Load a file of integers
* Delta encode them
* Write back a file in binary form
Here's how the basic Bloom filter works: The instructions in this assignment are purposefully incomplete for you to read Python's API and to understand how the different functions work. All of the necessary parts that you need to write are marked with *TODO*.
### Initialization ## TODO 1. Loading the data file
* An empty Bloom filter is initialized with an array of *m* elements each with value 0. In `encoding.py`, your first task is to write `load_orig_file`. This function reads from a specified filename and returns a list of integers in the file. You may assume the file is formatted like ``data.txt`` provided with the code, where each line contains a single integer number. The input of this function is a filename and the output is a list of numbers. If the file does not exist you must raise an exception.
* Generate *k* independent hash functions whose outputs are integers in {0,...,m-1}.
### Adding An Item e ## TODO 2. Compute the basic encoding
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m-1). In `encoding.py`, your next task is to write `delta_encoding`. This function takes a list of numbers and computes the delta encoding. The delta encoding encodes the list in terms of successive differences from the previous element. The first element is kept as is in the encoding.
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).
### Contains An Item e For example:
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m-1). ```
* Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not have possibly been inserted in the past. > data = [1,3,4,3]
> enc = delta_encoding(data)
1,2,1,-1
```
## TODO 1. Generate K independent Hash Functions Or,
Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions each with a range from 0 to *m*-1. Here are some hints that will help you write this function. ```
> data = [1,0,6,1]
> enc = delta_encoding(data)
1,-1,6,-5
```
Your job is to write a function that computes this encoding. Pay close attention to how Python passes around references and where you make copies of lists vs. modify a list in place.
* Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the range of this hash function to lie within 0 to m-1? ## TODO 3. Integer Shifting
When we write this data to a file, we will want to represent each encoded value as an unsigned 8-bit integer (a single byte of data, which can hold values from 0 to 255). To do so, we have to "shift" all of the values upwards so there are no negatives. You will write a function `shift` that adds a pre-specified offset to each value.
* Step 2. Generate k of such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works. ## TODO 4. Write Encoding
Now, we are ready to write the encoded data to disk. In the function `write_encoding`, you will do the following steps:
* Open the specified filename in the function arguments for writing
* Convert the encoded list of numbers into a bytearray
* Write the bytearray to the file
* Close the file
* Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take. Reading from such a file is a little tricky, so we've provided that function for you.
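As a rough illustration of the shifting and writing steps described above, here is a minimal sketch. It assumes the argument order used by the autograder calls (`shift(values, offset)`, `write_encoding(values, filename)`); `read_back` is a hypothetical helper used only for this demonstration and is not the course-provided `read_encoding`.

```python
import os
import tempfile


def shift(values, offset):
    """Return a new list with offset added to each value (input is not modified)."""
    return [v + offset for v in values]


def unshift(values, offset):
    """Undo shift by subtracting the same offset from each value."""
    return [v - offset for v in values]


def write_encoding(values, filename):
    """Write each value as one byte; bytearray raises ValueError outside 0..255."""
    with open(filename, 'wb') as f:  # the with-block closes the file for us
        f.write(bytearray(values))


def read_back(filename):
    """Hypothetical helper: read the file and return its bytes as a list of ints."""
    with open(filename, 'rb') as f:
        return list(f.read())


# Round trip through a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'enc.bin')
encoded = [1, 2, 1, -1]
write_encoding(shift(encoded, 4), path)
restored = unshift(read_back(path), 4)
```

The offset only works if every shifted value lands in the range 0 to 255, which is why the shift must be at least as large as the most negative delta.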
## TODO 2. Put ## TODO 5. Delta Decoding
Write a function that uses the algorithm listed above to add a string to the bloom filter. In pseudo-code: Finally, you will write a function that takes a delta encoded list and recovers the original data. This should do the opposite of what you did before. Don't forget to unshift the data when you are testing!
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* Set the ith element of the array to 1
## TODO 3. Get For example:
Write a function that uses the algorithm listed above to test whether the bloom filter possibly contains the string. In pseudo-code: ```
* For each of the k hash functions: > enc = [1,2,1,-1]
* Compute the hash code of the string, and store the code in i > data = delta_decoding(enc)
* if the ith element is 0, return false 1,3,4,3
* if all code-indices are 1, return true ```
Or,
```
> enc = [1,-1,6,-5]
> data = delta_decoding(enc)
1,0,6,1
```
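The delta encoding and decoding examples above can be reproduced with a short sketch like the one below. This is one possible approach under the assumption that both functions take and return plain Python lists; it is not necessarily the intended solution, and it deliberately builds new lists rather than modifying the input in place.

```python
def delta_encoding(data):
    """Keep the first element, then store each difference from the previous element."""
    if not data:
        return []
    encoded = [data[0]]
    for prev, cur in zip(data, data[1:]):
        encoded.append(cur - prev)
    return encoded


def delta_decoding(encoded):
    """Recover the original list by taking a running sum of the deltas."""
    decoded = []
    total = 0
    for delta in encoded:
        total += delta
        decoded.append(total)
    return decoded
```

Because decoding is a running sum, the sum of all encoded values equals the last element of the original data, which is one of the properties the autograder checks.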
## Testing ## Testing
We've provided an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work. We've provided a sample dataset ``data.txt`` which can be used to test your code as well as an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
import random import random
import string from encoding import *
from bloom import * def test_load():
data = load_orig_file('data.txt')
def generate_random_string(seed=True):
chars = string.ascii_uppercase + string.digits
size = 10
return ''.join(random.choice(chars) for x in range(size))
def test_hash_generation():
b = Bloom(5,10)
try:
assert(len(b.hashes) == 10)
except:
print('[#1] Failure the number of generated hashes is wrong')
try: try:
assert(sum(data) == 1778744)
except AssertionError:
print('TODO 1. Failure check your load_orig_file function')
for h in b.hashes: def test_encoding():
h(generate_random_string()) data = load_orig_file('data.txt')
encoded = delta_encoding(data)
except:
print('[#2] The hashes are not properly represented as a lambda')
s = generate_random_string()
try: try:
for h in b.hashes: assert(sum(encoded) == data[-1])
assert(h(s) == h(s)) assert(sum(encoded) == 26)
except: assert(len(data) == len(encoded))
print('[#3] Hashes are not deterministic') except AssertionError:
print('TODO 2. Failure check your delta_encoding function')
def test_shift():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
N = len(data)
try: try:
b = Bloom(100,10) assert(sum(shift(data, 10)) == N*10 + sum(data))
b1h = b.hashes[0](s) assert(all([d >=0 for d in shift(encoded,4)]))
b = Bloom(100,10) except AssertionError:
b2h = b.hashes[0](s) print('TODO 3. Failure check your shift function')
assert(b1h == b2h)
except:
print('[#4] Seeds are not properly set')
def test_decoding():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
sencoded = shift(encoded ,4)
data_p = delta_decoding(unshift(sencoded,4))
try: try:
b = Bloom(100,10) assert(data == data_p)
except AssertionError:
for h in b.hashes: print('TODO 5. Cannot recover data with delta_decoding')
for i in range(10):
assert( h(generate_random_string())< 100 )
except: def generate_file(size, seed):
print('[#5] Hash exceeds range') FILE_NAME = 'data.gen.txt'
f = open(FILE_NAME,'w')
try: initial = seed
b = Bloom(1000,2) for i in range(size):
s = generate_random_string() f.write(str(initial) + '\n')
bh1 = b.hashes[0](s) initial += random.randint(-4, 4)
bh2 = b.hashes[1](s)
assert(bh1 != bh2) def generate_random_tests():
SIZES = (1,1000,16,99)
SEEDS = (240,-3, 9, 1)
except: cnt = 0
print('[#6] Hashes generated are not independent') for trials in range(10):
generate_file(random.choice(SIZES), random.choice(SEEDS))
def test_put(): data = load_orig_file('data.gen.txt')
b = Bloom(100,10,seed=0) encoded = delta_encoding(data)
b.put('the') sencoded = shift(encoded ,4)
b.put('university') write_encoding(sencoded, 'data_out.txt')
b.put('of')
b.put('chicago')
try: loaded = unshift(read_encoding('data_out.txt'),4)
assert(sum(b.array) == 30) decoded = delta_decoding(loaded)
except:
print('[#7] Unexpected Put() Result') cnt += (decoded == data)
def test_put_get():
b = Bloom(100,5,seed=0)
b.put('the')
b.put('quick')
b.put('brown')
b.put('fox')
b.put('jumped')
b.put('over')
b.put('the')
b.put('lazy')
b.put('dog')
results = [b.contains('the'),\
b.contains('cow'), \
b.contains('jumped'), \
b.contains('over'),\
b.contains('the'), \
b.contains('moon')]
try: try:
assert(results == [True, False, True, True, True, False]) assert(cnt == 10)
except: except AssertionError:
print('[#8] Unexpected contains result') print('Failed Random Tests', str(10-cnt), 'out of 10')
test_hash_generation() test_load()
test_put() test_encoding()
test_put_get() test_shift()
test_decoding()
generate_random_tests()
\ No newline at end of file
# Homework 1. Introduction to Python and File I/O # Homework 2. Bloom Filter
This homework assignment is meant to be an introduction to Python programming and introduces some basic concepts of encoding and decoding. This homework assignment introduces an advanced use of hashing called a Bloom filter.
Due Date: *Friday April 17, 2020 11:59 pm* Due Date: *Friday May 1st, 2020 11:59 pm*
## Initial Setup ## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one: These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one:
...@@ -10,9 +10,9 @@ $ cd cmsc13600-materials ...@@ -10,9 +10,9 @@ $ cd cmsc13600-materials
$ git pull $ git pull
``` ```
Copy the folder ``hw1`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw1`` folder. In this homework assignment, you will only modify ``encoding.py``. Once you are done, you must add 'encoding.py' to git: Copy the folder ``hw2`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw2`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add 'bloom.py' to git:
``` ```
$ git add encoding.py $ git add bloom.py
``` ```
After adding your files, to submit your code you must run: After adding your files, to submit your code you must run:
``` ```
...@@ -21,65 +21,44 @@ $ git push ...@@ -21,65 +21,44 @@ $ git push
``` ```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr] We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
## Delta Encoding ## Bloom filter
Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files. A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*.
In this first assignment, you will implement a delta encoding module in python.
The module will:
* Load a file of integers
* Delta encode them
* Write back a file in binary form
The instructions in this assignment are purposefully incomplete for you to read Python's API and to understand how the different functions work. All of the necessary parts that you need to write are marked with *TODO*. Here's how the basic Bloom filter works:
## TODO 1. Loading the data file ### Initialization
In `encoding.py`, your first task is to write `load_orig_file`. This function reads from a specified filename and returns a list of integers in the file. You may assume the file is formatted like ``data.txt`` provided with the code, where each line contains a single integer number. The input of this function is a filename and the output is a list of numbers. If the file does not exist you must raise an exception. * An empty Bloom filter is initialized with an array of *m* elements each with value 0.
* Generate *k* independent hash functions whose outputs are integers in {0,...,m-1}.
## TODO 2. Compute the basic encoding ### Adding An Item e
In `encoding.py`, your next task is to write `delta_encoding`. This function takes a list of numbers and computes the delta encoding. The delta encoding encodes the list in terms of successive differences from the previous element. The first element is kept as is in the encoding. * For each hash function calculate the hash value of the item "e" (should be a number from 0 to m-1).
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).
For example: ### Contains An Item e
``` * For each hash function calculate the hash value of the item "e" (should be a number from 0 to m-1).
> data = [1,3,4,3] * Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not have possibly been inserted in the past.
> enc = delta_encoding(data)
1,2,1,-1
```
Or, ## TODO 1. Generate K independent Hash Functions
``` Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions each with a range from 0 to *m*-1. Here are some hints that will help you write this function.
> data = [1,0,6,1]
> enc = delta_encoding(data)
1,-1,6,-5
```
Your job is to write a function that computes this encoding. Pay close attention to how Python passes around references and where you make copies of lists vs. modify a list in place.
## TODO 3. Integer Shifting * Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the range of this hash function to lie within 0 to m-1?
When we write this data to a file, we will want to represent each encoded value as an unsigned 8-bit integer (a single byte of data, which can hold values from 0 to 255). To do so, we have to "shift" all of the values upwards so there are no negatives. You will write a function `shift` that adds a pre-specified offset to each value.
## TODO 4. Write Encoding * Step 2. Generate k of such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works.
Now, we are ready to write the encoded data to disk. In the function `write_encoding`, you will do the following steps:
* Open the specified filename in the function arguments for writing
* Convert the encoded list of numbers into a bytearray
* Write the bytearray to the file
* Close the file
Reading from such a file is a little tricky, so we've provided that function for you. * Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take.
## TODO 5. Delta Decoding ## TODO 2. Put
Finally, you will write a function that takes a delta encoded list and recovers the original data. This should do the opposite of what you did before. Don't forget to unshift the data when you are testing! Write a function that uses the algorithm listed above to add a string to the bloom filter. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* Set the ith element of the array to 1
For example: ## TODO 3. Get
``` Write a function that uses the algorithm listed above to test whether the bloom filter possibly contains the string. In pseudo-code:
> enc = [1,2,1,-1] * For each of the k hash functions:
> data = delta_decoding(enc) * Compute the hash code of the string, and store the code in i
1,3,4,3 * if the ith element is 0, return false
``` * if all code-indices are 1, return true
Or,
```
> enc = [1,-1,6,-5]
> data = delta_decoding(enc)
1,0,6,1
```
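As a rough illustration of the Put and Get pseudo-code above, here is a minimal Bloom filter sketch. Several details are assumptions made for this example only: the class name `BloomSketch`, the helper `make_linear_hash`, the way a string is converted to an integer, and the simplified linear form `(A*x + B) mod m`. The version described in lecture may differ, and this sketch is not expected to reproduce the autograder's exact expected values.

```python
import random


def make_linear_hash(a, b, m):
    """Hypothetical helper: build h(item) = (a*x + b) mod m for a fixed a and b."""
    def h(item):
        # Assumption: turn the string into an integer from its UTF-8 bytes,
        # which is deterministic across runs (unlike Python's built-in hash()).
        x = int.from_bytes(item.encode('utf-8'), 'big')
        return (a * x + b) % m  # always in 0..m-1
    return h


class BloomSketch:
    def __init__(self, m, k, seed=0):
        rng = random.Random(seed)  # private generator, so the seed is reproducible
        self.array = [0] * m
        self.hashes = []
        for _ in range(k):
            a = rng.randint(1, 2**31 - 1)
            b = rng.randint(0, 2**31 - 1)
            self.hashes.append(make_linear_hash(a, b, m))

    def put(self, item):
        """Set the array position chosen by each hash function to 1."""
        for h in self.hashes:
            self.array[h(item)] = 1

    def contains(self, item):
        """Return False if any position is 0; True means 'possibly in set'."""
        return all(self.array[h(item)] == 1 for h in self.hashes)


b = BloomSketch(100, 5, seed=0)
for word in ['the', 'quick', 'brown', 'fox']:
    b.put(word)
```

Every inserted word is reported as present, since a put sets exactly the positions that contains later checks. A word that was never inserted may still be reported as present if its positions happen to collide with earlier insertions.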
## Testing ## Testing
We've provided a sample dataset ``data.txt`` which can be used to test your code as well as an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work. We've provided an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
import random import random
from encoding import * import string
def test_load(): from bloom import *
data = load_orig_file('data.txt')
def generate_random_string(seed=True):
chars = string.ascii_uppercase + string.digits
size = 10
return ''.join(random.choice(chars) for x in range(size))
def test_hash_generation():
b = Bloom(5,10)
try:
assert(len(b.hashes) == 10)
except:
print('[#1] Failure the number of generated hashes is wrong')
try: try:
assert(sum(data) == 1778744)
except AssertionError:
print('TODO 1. Failure check your load_orig_file function')
def test_encoding(): for h in b.hashes:
data = load_orig_file('data.txt') h(generate_random_string())
encoded = delta_encoding(data)
except:
print('[#2] The hashes are not properly represented as a lambda')
s = generate_random_string()
try: try:
assert(sum(encoded) == data[-1]) for h in b.hashes:
assert(sum(encoded) == 26) assert(h(s) == h(s))
assert(len(data) == len(encoded)) except:
except AssertionError: print('[#3] Hashes are not deterministic')
print('TODO 2. Failure check your delta_encoding function')
def test_shift():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
N = len(data)
try: try:
assert(sum(shift(data, 10)) == N*10 + sum(data)) b = Bloom(100,10)
assert(all([d >=0 for d in shift(encoded,4)])) b1h = b.hashes[0](s)
except AssertionError: b = Bloom(100,10)
print('TODO 3. Failure check your shift function') b2h = b.hashes[0](s)
assert(b1h == b2h)
except:
print('[#4] Seeds are not properly set')
def test_decoding():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
sencoded = shift(encoded ,4)
data_p = delta_decoding(unshift(sencoded,4))
try: try:
assert(data == data_p) b = Bloom(100,10)
except AssertionError:
print('TODO 5. Cannot recover data with delta_decoding') for h in b.hashes:
for i in range(10):
assert( h(generate_random_string())< 100 )
def generate_file(size, seed): except:
FILE_NAME = 'data.gen.txt' print('[#5] Hash exceeds range')
f = open(FILE_NAME,'w')
initial = seed try:
for i in range(size): b = Bloom(1000,2)
f.write(str(initial) + '\n') s = generate_random_string()
initial += random.randint(-4, 4) bh1 = b.hashes[0](s)
bh2 = b.hashes[1](s)
def generate_random_tests(): assert(bh1 != bh2)
SIZES = (1,1000,16,99)
SEEDS = (240,-3, 9, 1)
cnt = 0 except:
for trials in range(10): print('[#6] Hashes generated are not independent')
generate_file(random.choice(SIZES), random.choice(SEEDS))
data = load_orig_file('data.gen.txt') def test_put():
encoded = delta_encoding(data) b = Bloom(100,10,seed=0)
sencoded = shift(encoded ,4) b.put('the')
write_encoding(sencoded, 'data_out.txt') b.put('university')
b.put('of')
b.put('chicago')
loaded = unshift(read_encoding('data_out.txt'),4) try:
decoded = delta_decoding(loaded) assert(sum(b.array) == 30)
except:
cnt += (decoded == data) print('[#7] Unexpected Put() Result')
def test_put_get():
b = Bloom(100,5,seed=0)
b.put('the')
b.put('quick')
b.put('brown')
b.put('fox')
b.put('jumped')
b.put('over')
b.put('the')
b.put('lazy')
b.put('dog')
results = [b.contains('the'),\
b.contains('cow'), \
b.contains('jumped'), \
b.contains('over'),\
b.contains('the'), \
b.contains('moon')]
try: try:
assert(cnt == 10) assert(results == [True, False, True, True, True, False])
except AssertionError: except:
print('Failed Random Tests', str(10-cnt), 'out of 10') print('[#8] Unexpected contains result')
test_load() test_hash_generation()
test_encoding() test_put()
test_shift() test_put_get()
test_decoding()
generate_random_tests()
\ No newline at end of file