Commit 7da0fcb6 by Sanjay Krishnan

Clean up for the new quarter

parent d4e2575b
@@ -19,11 +19,11 @@ Git is installed on all of the CSIL computers, and to install git on your machine
[https://git-scm.com/book/en/v2/Getting-Started-Installing-Git]
Every student in the class has a git repository (a place where you can store completed assignments). This git repository can be accessed from:
[https://mit.cs.uchicago.edu/cmsc13600-spr-20/<your cnetid>.git]
[https://mit.cs.uchicago.edu/cmsc13600-spr-21/<your cnetid>.git]
The first thing to do is to open your terminal application, and ``clone`` this repository (NOTE skr is ME, replace it with your CNET id!!!):
```
$ git clone https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr.git cmsc13600-submit
$ git clone https://mit.cs.uchicago.edu/cmsc13600-spr-21/skr.git cmsc13600-submit
```
Your username and password are your CNET id and CNET password. This will create a new, empty folder titled cmsc13600-submit. There is similarly a course repository where all of the homework materials will be stored. You should clone this repository as well:
```
@@ -45,4 +45,4 @@ After adding your files, to submit your code you must run:
$ git commit -m"My submission"
$ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-19/skr]
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-21/skr]
# Homework 1. Introduction to Python and File I/O
This homework assignment is meant to be an introduction to Python programming and introduces some basic concepts of encoding and decoding.
# Homework 1. Introduction to Data Extraction
In this assignment, you will extract meaningful information from unstructured data.
Due Date: *Friday April 17, 2020 11:59 pm*
Due Date: *Friday April 9, 2021 11:59 pm*
## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one:
@@ -10,9 +10,9 @@ $ cd cmsc13600-materials
$ git pull
```
Copy the folder ``hw1`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw1`` folder. In this homework assignment, you will only modify ``encoding.py``. Once you are done, you must add 'encoding.py' to git:
Copy the folder ``hw1`` to your submission repository. Enter that repository from the command line and enter the copied ``hw1`` folder. In this homework assignment, you will only modify ``extract.py``. Once you are done, you must add 'extract.py' to git:
```
$ git add encoding.py
$ git add extract.py
```
After adding your files, to submit your code you must run:
```
@@ -21,65 +21,88 @@ $ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
## Delta Encoding
Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files.
In this first assignment, you will implement a delta encoding module in python.
The module will:
* Load a file of integers
* Delta encode them
* Write back a file in binary form
## Background
RSS is a web standard that allows users and applications to access updates to websites in a standardized, computer-readable format. These feeds can, for example, allow a user to keep track of many different websites in a single news aggregator. The news aggregator will automatically check the RSS feed for new content, which can then be passed automatically from website to website or from website to user.
Recent events have shown how important tracking online media is for financial markets.
The instructions in this assignment are purposefully incomplete so that you read Python's API documentation and understand how the different functions work. All of the necessary parts that you need to write are marked with *TODO*.
## TODO 1. Loading the data file
In `encoding.py`, your first task is to write `load_orig_file`. This function reads from a specified filename and returns a list of the integers in the file. You may assume the file is formatted like ``data.txt`` provided with the code, where each line contains a single integer. The input of this function is a filename and the output is a list of numbers. If the file does not exist, you must raise an exception.
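For illustration, a minimal sketch of `load_orig_file` (assuming one integer per line, as in ``data.txt``) could look like this:
```
def load_orig_file(filename):
    # open() raises an exception (FileNotFoundError) if the file does not exist
    with open(filename, 'r') as f:
        return [int(line) for line in f if line.strip()]
```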
## TODO 2. Compute the basic encoding
In `encoding.py`, your next task is to write `delta_encoding`. This function takes a list of numbers and computes the delta encoding. The delta encoding encodes the list in terms of successive differences from the previous element. The first element is kept as is in the encoding.
For example:
In this project, you will scan through a series of reddit posts and count the frequency with which certain stock ticker symbols are mentioned. You're essentially implementing one important function that scans through the posts in an XML file and returns the number of posts in which each ticker symbol occurs:
```
> data = [1,3,4,3]
> enc = delta_encoding(data)
1,2,1,-1
>>> count_ticker('reddit.xml')
{'$HITI': 1, '$GME': 3, '$MSFT': 1, '$ISWH': 1, '$ARBKF': 1, '$HCANF': 1, '$AMC': 1, '$OZOP': 1, '$VMNT': 2, '$CLIS': 1, '$EEENF': 2, '$GTII': 1}
```
However, before we get there, we will break the implementation up into a few smaller parts.
Or,
## Data Files
You are given a data file to process, `reddit.xml`. This file contains an RSS feed taken from a few Reddit pages covering stocks. RSS feeds are stored in a semi-structured format called XML. XML defines a tree of elements, where you have items and subitems (which can be named). This example is shamelessly taken from this link: https://stackabuse.com/reading-and-writing-xml-files-in-python/
```
> data = [1,0,6,1]
> enc = delta_encoding(data)
1,-1,6,-5
<data>
<items>
<item name="item1">item1abc</item>
<item name="item2">item2abc</item>
</items>
</data>
```
Your job is to write a function that computes this encoding. Pay close attention to how Python passes around references and where you make copies of lists vs. modify a list in place.
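For reference, a minimal sketch of `delta_encoding` that builds a new list rather than modifying the input in place:
```
def delta_encoding(data):
    # keep the first element, then store successive differences
    encoded = [data[0]]
    for i in range(1, len(data)):
        encoded.append(data[i] - data[i-1])
    return encoded
```
On the example above, `delta_encoding([1,3,4,3])` returns `[1, 2, 1, -1]`.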
These tags can be extracted using built-in modules in most programming languages. Let's see what happens when we process this with Python. I stored the above data in a test file (also included), test.xml. We can first try to extract the *item* tags.
```
from xml.dom import minidom
## TODO 3. Integer Shifting
When we write this data to a file, we will want to represent each encoded value as an unsigned integer that fits in a single byte. To do so, we have to "shift" all of the values upwards so there are no negatives. You will write a function `shift` that adds a pre-specified offset to each value.
mydoc = minidom.parse('test.xml')
items = mydoc.getElementsByTagName('item')
for elem in items:
print(elem.firstChild.data)
```
The code above: (1) gets all the tags labeled *item*, (2) iterates over those items, and (3) gets the child data, i.e., the data contained between `> * </`. The output is:
```
item1abc
item2abc
```
If I wanted to grab the names instead, I could write the following code:
```
from xml.dom import minidom
## TODO 4. Write Encoding
Now, we are ready to write the encoded data to disk. In the function `write_encoding`, you will do the following steps:
* Open the specified filename in the function arguments for writing
* Convert the encoded list of numbers into a bytearray
* Write the bytearray to the file
* Close the file
mydoc = minidom.parse('test.xml')
items = mydoc.getElementsByTagName('item')
for elem in items:
print(elem.attributes['name'].value)
```
The code above: (1) gets all the tags labeled *item*, (2) iterates over those items, and (3) gets the attribute data, i.e., the data contained in `< * attr=value > `.
Reading from such a file is a little tricky, so we've provided that function for you.
## TODO 5. Delta Decoding
Finally, you will write a function that takes a delta encoded list and recovers the original data. This should do the opposite of what you did before. Don't forget to unshift the data when you are testing!
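A minimal sketch of `delta_decoding`, which rebuilds each value as a running sum:
```
def delta_decoding(encoded):
    data = [encoded[0]]
    for diff in encoded[1:]:
        data.append(data[-1] + diff)
    return data
```
For example, `delta_decoding([1, 2, 1, -1])` returns `[1, 3, 4, 3]`.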
### TODO 1. Extract Title, Links, and Post Times
Your first todo will be to use the examples above to extract *titles*,
```
<title>$EEENF Share Price Valuation Model | Low Range Estimate increase of 1700% in Current Share Price Equaling $0.49 Per Share| Average Range Estimate Increase in Current Share Price of 3600% Equaling $1.05 Per Share</title>
```
then extract *links* (only extract the URL),
```
<link href="https://www.reddit.com/r/pennystocks/comments/mdexsk/eeenf_share_price_valuation_model_low_range/" />
```
and *post times* for each reddit post in the RSS feed:
```
<updated>2021-03-26T02:44:10+00:00</updated>
```
It is up to you to read the documentation for the Python xml module if you are unsure how to use it. You must write a helper function:
```
def _reddit_extract(file)
```
That returns a Pandas DataFrame with three columns (*title*, *link*, *updated*). On `reddit.xml`, your output should be a 25-row, 3-column DataFrame.
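For illustration, here is one possible sketch of `_reddit_extract` that builds on the `minidom` examples above. It assumes each post in the feed is wrapped in an `<entry>` element; check `reddit.xml` to confirm the actual tag names.
```
import pandas as pd
from xml.dom import minidom

def _reddit_extract(file):
    doc = minidom.parse(file)
    rows = []
    # assumption: each reddit post is an <entry> with <title>, <link>, <updated> children
    for entry in doc.getElementsByTagName('entry'):
        title = entry.getElementsByTagName('title')[0].firstChild.data
        link = entry.getElementsByTagName('link')[0].attributes['href'].value
        updated = entry.getElementsByTagName('updated')[0].firstChild.data
        rows.append({'title': title, 'link': link, 'updated': updated})
    return pd.DataFrame(rows, columns=['title', 'link', 'updated'])
```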
For example:
### TODO 2. Extract Ticker Symbols
Each title of a reddit post might mention a stock of interest, and most posts use a consistent format to denote a ticker symbol (starting with a dollar sign). For example: "$ISWH Takes Center Stage at Crypto Conference". You will now write a function called `_ticker_extract` which, given a single title, extracts all of the ticker symbols present in the title:
```
> enc = [1,2,1,-1]
> data = delta_decoding(enc)
1,3,4,3
def _ticker_extract(title)
```
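A minimal sketch of `_ticker_extract` using a regular expression, assuming a ticker symbol is a dollar sign followed by one or more capital letters:
```
import re

def _ticker_extract(title):
    # returns a set so a ticker repeated within one title counts only once per post
    return set(re.findall(r'\$[A-Z]+', title))
```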
Or,
## TODO 3. Count Ticker Frequency
Finally, using the two helper functions you defined above, you will count the frequency (the number of posts) in which each ticker symbol occurs.
```
def count_ticker(file)
```
Your result should be a dictionary mapping ticker to count and should look as follows:
```
> data = [1,-1,6,-5]
> data = delta_decoding(enc)
1,0,6,1
>>> count_ticker('reddit.xml')
{'$HITI': 1, '$GME': 3, '$MSFT': 1, '$ISWH': 1, '$ARBKF': 1, '$HCANF': 1, '$AMC': 1, '$OZOP': 1, '$VMNT': 2, '$CLIS': 1, '$EEENF': 2, '$GTII': 1}
```
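A minimal sketch of `count_ticker` built from the two helper functions above; it counts the number of posts whose title mentions each ticker:
```
def count_ticker(file):
    df = _reddit_extract(file)
    counts = {}
    for title in df['title']:
        for ticker in _ticker_extract(title):
            counts[ticker] = counts.get(ticker, 0) + 1
    return counts
```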
## Testing
We've provided a sample dataset ``data.txt`` which can be used to test your code, as well as an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
We've provided a sample dataset ``reddit.xml`` which can be used to test your code by seeing if you can reproduce the output above. We have also provided an autograder that will check for some basic issues.
import random
from encoding import *
def test_load():
data = load_orig_file('data.txt')
try:
assert(sum(data) == 1778744)
except AssertionError:
print('TODO 1. Failure check your load_orig_file function')
def test_encoding():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
try:
assert(sum(encoded) == data[-1])
assert(sum(encoded) == 26)
assert(len(data) == len(encoded))
except AssertionError:
print('TODO 2. Failure check your delta_encoding function')
def test_shift():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
N = len(data)
try:
assert(sum(shift(data, 10)) == N*10 + sum(data))
assert(all([d >=0 for d in shift(encoded,4)]))
except AssertionError:
print('TODO 3. Failure check your shift function')
def test_decoding():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
sencoded = shift(encoded ,4)
data_p = delta_decoding(unshift(sencoded,4))
try:
assert(data == data_p)
except AssertionError:
print('TODO 5. Cannot recover data with delta_decoding')
def generate_file(size, seed):
FILE_NAME = 'data.gen.txt'
f = open(FILE_NAME,'w')
initial = seed
for i in range(size):
f.write(str(initial) + '\n')
initial += random.randint(-4, 4)
def generate_random_tests():
SIZES = (1,1000,16,99)
SEEDS = (240,-3, 9, 1)
cnt = 0
for trials in range(10):
generate_file(random.choice(SIZES), random.choice(SEEDS))
data = load_orig_file('data.gen.txt')
encoded = delta_encoding(data)
sencoded = shift(encoded ,4)
write_encoding(sencoded, 'data_out.txt')
loaded = unshift(read_encoding('data_out.txt'),4)
decoded = delta_decoding(loaded)
cnt += (decoded == data)
try:
assert(cnt == 10)
except AssertionError:
print('Failed Random Tests', str(10-cnt), 'out of 10')
test_load()
test_encoding()
test_shift()
test_decoding()
generate_random_tests()
\ No newline at end of file
from extract_sol import _reddit_extract, \
_ticker_extract, \
count_ticker
def testExtraction():
df = _reddit_extract('reddit.xml')
try:
cols = list(df.columns.values)
cols.sort()
except:
return "[ERROR] The output of _reddit_extract doesn't look like a dataframe"
if cols != ['link', 'title', 'updated']:
return "[ERROR] Expected ['link', 'title', 'updated'], got = " + str(cols)
N = len(df)
if N != 25:
return "[ERROR] Seems like you have too many rows"
return "[PASSED] testExtraction()"
def testTicker():
ticker = "The quick $BRWN fox jumped over the $LZY $DAWG"
val = _ticker_extract(ticker)
expected = set(['$BRWN','$LZY','$DAWG'])
if val != expected:
return "[ERROR] Expected set(['$BRWN','$LZY','$DAWG']), got = " + str(val)
return "[PASSED] testTicker()"
def testAll():
expected = {'$HITI': 1, '$GME': 3, '$MSFT': 1, '$ISWH': 1, \
'$ARBKF': 1, '$HCANF': 1, '$AMC': 1, '$OZOP': 1, \
'$VMNT': 2, '$CLIS': 1, '$EEENF': 2, '$GTII': 1}
val = count_ticker('reddit.xml')
if expected != val:
return "[ERROR] Expected " + str(expected) + " , got = " + str(val)
return "[PASSED] testAll()"
print(testExtraction())
print(testTicker())
print(testAll())
'''extract.py
In this first assignment, you will learn the basics of python
data manipulation. You will process an XML file of reddit posts
relating to stocks and count the frequency of certain ticker
symbols appearing
'''
#Two libraries that we will need for the helper functions
import pandas as pd
from xml.dom import minidom
'''count_ticker is the main function that you will implement.
Input: filename of a reddit RSS feed in XML format
Output: dictionary mapping
ticker symbols => frequency of occurrence in the post titles
Example Usage:
>>> count_ticker('reddit.xml')
{'$HITI': 1, '$GME': 3, '$MSFT': 1,
'$ISWH': 1, '$ARBKF': 1, '$HCANF': 1,
'$AMC': 1, '$OZOP': 1, '$VMNT': 2,
'$CLIS': 1, '$EEENF': 2, '$GTII': 1}
'''
def count_ticker(file):
raise ValueError('Count Ticker Not Implemented')
# TODO1 Helper Function to Extract XML
'''_reddit_extract is a helper function that extracts
the post title, timestamp, and link into a pandas dataframe.
Input: filename of a reddit RSS feed in XML format
Output: 3 col pandas dataframe ('title', 'updated', 'link')
with each row a reddit post from the RSS XML file.
'''
def _reddit_extract(file):
raise ValueError('Reddit Extract Not Implemented')
#TODO2 Helper Function to Extract Tickers
'''_ticker_extract is a helper function that extracts
the mentioned ticker symbols in each title.
Input: string representing a post title
Output: set of ticker symbols mentioned each in consistent
notation $XYZ
'''
def _ticker_extract(title):
raise ValueError('Ticker Extract Not Implemented')
<data>
<items>
<item name="item1">item1abc</item>
<item name="item2">item2abc</item>
</items>
</data>
\ No newline at end of file
# Homework 2. Bloom Filter
This homework assignment introduces an advanced use of hashing called a Bloom filter.
# Homework 1. Introduction to Python and File I/O
This homework assignment is meant to be an introduction to Python programming and introduces some basic concepts of encoding and decoding.
Due Date: *Friday May 1st, 2020 11:59 pm*
Due Date: *Friday April 17, 2020 11:59 pm*
## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one:
@@ -10,9 +10,9 @@ $ cd cmsc13600-materials
$ git pull
```
Copy the folder ``hw2`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw2`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add 'bloom.py' to git:
Copy the folder ``hw1`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw1`` folder. In this homework assignment, you will only modify ``encoding.py``. Once you are done, you must add 'encoding.py' to git:
```
$ git add bloom.py
$ git add encoding.py
```
After adding your files, to submit your code you must run:
```
@@ -21,44 +21,65 @@ $ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
## Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*.
## Delta Encoding
Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files.
In this first assignment, you will implement a delta encoding module in python.
The module will:
* Load a file of integers
* Delta encode them
* Write back a file in binary form
Here's how the basic Bloom filter works:
The instructions in this assignment are purposefully incomplete so that you read Python's API documentation and understand how the different functions work. All of the necessary parts that you need to write are marked with *TODO*.
### Initialization
* An empty Bloom filter is initialized with an array of *m* elements each with value 0.
* Generate *k* independent hash functions whose outputs are integers in {0,...,m}.
## TODO 1. Loading the data file
In `encoding.py`, your first task is to write `load_orig_file`. This function reads from a specified filename and returns a list of the integers in the file. You may assume the file is formatted like ``data.txt`` provided with the code, where each line contains a single integer. The input of this function is a filename and the output is a list of numbers. If the file does not exist, you must raise an exception.
### Adding An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).
## TODO 2. Compute the basic encoding
In `encoding.py`, your next task is to write `delta_encoding`. This function takes a list of numbers and computes the delta encoding. The delta encoding encodes the list in terms of successive differences from the previous element. The first element is kept as is in the encoding.
### Contains An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not have possibly been inserted in the past.
For example:
```
> data = [1,3,4,3]
> enc = delta_encoding(data)
1,2,1,-1
```
## TODO 1. Generate K independent Hash Functions
Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions each with a range from 0 to *m*. Here are some hints that will help you write this function.
Or,
```
> data = [1,0,6,1]
> enc = delta_encoding(data)
1,-1,6,-5
```
Your job is to write a function that computes this encoding. Pay close attention to how Python passes around references and where you make copies of lists vs. modify a list in place.
* Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the range of this hash function to be within 0 to m?
## TODO 3. Integer Shifting
When we write this data to a file, we will want to represent each encoded value as an unsigned integer that fits in a single byte. To do so, we have to "shift" all of the values upwards so there are no negatives. You will write a function `shift` that adds a pre-specified offset to each value.
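A minimal sketch of `shift` (along with `unshift`, its inverse, which the autograder also uses):
```
def shift(data, offset):
    # add a fixed offset to every value, returning a new list
    return [x + offset for x in data]

def unshift(data, offset):
    # undo the shift by subtracting the same offset
    return [x - offset for x in data]
```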
* Step 2. Generate k of such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works.
## TODO 4. Write Encoding
Now, we are ready to write the encoded data to disk. In the function `write_encoding`, you will do the following steps:
* Open the specified filename in the function arguments for writing
* Convert the encoded list of numbers into a bytearray
* Write the bytearray to the file
* Close the file
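A minimal sketch of `write_encoding` following these steps, assuming every shifted value fits in a single byte (0 to 255):
```
def write_encoding(encoded, filename):
    # 'wb' opens the file for writing in binary mode
    with open(filename, 'wb') as f:
        # bytearray() accepts a list of integers in the range 0..255
        f.write(bytearray(encoded))
```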
* Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take.
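Here is a rough sketch of `generate_hashes` under one set of assumptions: it uses the "linear" hash family h(s) = (A*hash(s) + B) mod m from lecture, converts strings to integers with Python's built-in `hash`, and takes an explicit seed argument. The exact signature your code needs may differ, so check the autograder.
```
import random

def generate_hashes(k, m, seed=0):
    # seeding the random generator makes the choices of A and B reproducible
    random.seed(seed)
    hashes = []
    for _ in range(k):
        A = random.randint(1, 2**31 - 1)
        B = random.randint(0, 2**31 - 1)
        # A=A, B=B freezes the current values inside each lambda (avoids late binding)
        hashes.append(lambda s, A=A, B=B: (A * hash(s) + B) % m)
    return hashes
```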
Reading from such a file is a little tricky, so we've provided that function for you.
## TODO 2. Put
Write a function that uses the algorithm listed above to add a string to the bloom filter. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* Set the ith element of the array to 1
## TODO 5. Delta Decoding
Finally, you will write a function that takes a delta encoded list and recovers the original data. This should do the opposite of what you did before. Don't forget to unshift the data when you are testing!
## TODO 3. Get
Write a function that uses the algorithm listed above to test whether the bloom filter possibly contains the string. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* if the ith element is 0, return false
* if all code-indices are 1, return true
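Putting the pseudo-code for put and contains together, a minimal sketch of the `Bloom` class (assuming the `generate_hashes` sketch above and the `array`/`hashes` attributes that the autograder inspects) might look like:
```
class Bloom:
    def __init__(self, m, k, seed=0):
        self.array = [0] * m
        self.hashes = generate_hashes(k, m, seed)

    def put(self, s):
        for h in self.hashes:
            self.array[h(s)] = 1       # set each hashed position to 1

    def contains(self, s):
        # if any hashed position is still 0, s was definitely never added
        return all(self.array[h(s)] == 1 for h in self.hashes)
```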
For example:
```
> enc = [1,2,1,-1]
> data = delta_decoding(enc)
1,3,4,3
```
Or,
```
> data = [1,-1,6,-5]
> data = delta_decoding(enc)
1,0,6,1
```
## Testing
We've provided an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
We've provided a sample dataset ``data.txt`` which can be used to test your code, as well as an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
import random
import string
from encoding import *
from bloom import *
def generate_random_string(seed=True):
chars = string.ascii_uppercase + string.digits
size = 10
return ''.join(random.choice(chars) for x in range(size))
def test_hash_generation():
b = Bloom(5,10)
try:
assert(len(b.hashes) == 10)
except:
print('[#1] Failure the number of generated hashes is wrong')
def test_load():
data = load_orig_file('data.txt')
try:
assert(sum(data) == 1778744)
except AssertionError:
print('TODO 1. Failure check your load_orig_file function')
for h in b.hashes:
h(generate_random_string())
except:
print('[#2] The hashes are not properly represented as a lambda')
s = generate_random_string()
def test_encoding():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
try:
for h in b.hashes:
assert(h(s) == h(s))
except:
print('[#3] Hashes are not deterministic')
assert(sum(encoded) == data[-1])
assert(sum(encoded) == 26)
assert(len(data) == len(encoded))
except AssertionError:
print('TODO 2. Failure check your delta_encoding function')
def test_shift():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
N = len(data)
try:
b = Bloom(100,10)
b1h = b.hashes[0](s)
b = Bloom(100,10)
b2h = b.hashes[0](s)
assert(b1h == b2h)
except:
print('[#4] Seeds are not properly set')
assert(sum(shift(data, 10)) == N*10 + sum(data))
assert(all([d >=0 for d in shift(encoded,4)]))
except AssertionError:
print('TODO 3. Failure check your shift function')
def test_decoding():
data = load_orig_file('data.txt')
encoded = delta_encoding(data)
sencoded = shift(encoded ,4)
data_p = delta_decoding(unshift(sencoded,4))
try:
b = Bloom(100,10)
assert(data == data_p)
except AssertionError:
print('TODO 5. Cannot recover data with delta_decoding')
for h in b.hashes:
for i in range(10):
assert( h(generate_random_string())< 100 )
def generate_file(size, seed):
FILE_NAME = 'data.gen.txt'
except:
print('[#5] Hash exceeds range')
f = open(FILE_NAME,'w')
initial = seed
for i in range(size):
f.write(str(initial) + '\n')
initial += random.randint(-4, 4)
try:
b = Bloom(1000,2)
s = generate_random_string()
bh1 = b.hashes[0](s)
bh2 = b.hashes[1](s)
def generate_random_tests():
SIZES = (1,1000,16,99)
SEEDS = (240,-3, 9, 1)
assert(bh1 != bh2)
cnt = 0
for trials in range(10):
generate_file(random.choice(SIZES), random.choice(SEEDS))
except:
print('[#6] Hashes generated are not independent')
data = load_orig_file('data.gen.txt')
encoded = delta_encoding(data)
sencoded = shift(encoded ,4)
write_encoding(sencoded, 'data_out.txt')
def test_put():
b = Bloom(100,10,seed=0)
b.put('the')
b.put('university')
b.put('of')
b.put('chicago')
loaded = unshift(read_encoding('data_out.txt'),4)
decoded = delta_decoding(loaded)
try:
assert(sum(b.array) == 30)
except:
print('[#7] Unexpected Put() Result')
def test_put_get():
b = Bloom(100,5,seed=0)
b.put('the')
b.put('quick')
b.put('brown')
b.put('fox')
b.put('jumped')
b.put('over')
b.put('the')
b.put('lazy')
b.put('dog')
results = [b.contains('the'),\
b.contains('cow'), \
b.contains('jumped'), \
b.contains('over'),\
b.contains('the'), \
b.contains('moon')]
cnt += (decoded == data)
try:
assert(results == [True, False, True, True, True, False])
except:
print('[#8] Unexpected contains result')
test_hash_generation()
test_put()
test_put_get()
assert(cnt == 10)
except AssertionError:
print('Failed Random Tests', str(10-cnt), 'out of 10')
test_load()
test_encoding()
test_shift()
test_decoding()
generate_random_tests()
\ No newline at end of file
# HW3 String Matching
# Homework 2. Bloom Filter
This homework assignment introduces an advanced use of hashing called a Bloom filter.
*Due 5/14/20 11:59 PM*
Entity Resolution is the task of disambiguating manifestations of real world entities in various records or mentions by linking and grouping. For example, there could be different ways of addressing the same person in text, different addresses for businesses, or photos of a particular object. In this assignment, you will link two product catalogs.
Due Date: *Friday May 1st, 2020 11:59 pm*
## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one:
```
$ cd cmsc13600-materials
$ git pull
```
Then, copy the `hw3` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `analyze.py`. You can add the files to the repository using `git add`:
Copy the folder ``hw2`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw2`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add 'bloom.py' to git:
```
$ git add analyze.py
$ git commit -m'initialized homework'
$ git add bloom.py
```
You will also need to fetch the datasets used in this homework assignment:
After adding your files, to submit your code you must run:
```
https://www.dropbox.com/s/vq5dyl5hwfhbw98/Amazon.csv?dl=0
https://www.dropbox.com/s/fbys7cqnbl3ch1s/Amzon_GoogleProducts_perfectMapping.csv?dl=0
https://www.dropbox.com/s/o6rqmscmv38rn1v/GoogleProducts.csv?dl=0
$ git commit -m"My submission"
$ git push
```
Download each of the files and put them into your `hw3` folder.
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface[https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
Before we can get started, let us understand the main APIs in this project. We have provided a file named `core.py` for you. This file loads and processes the data that you've just downloaded. For example, you can load the Amazon catalog with the `amazon_catalog()` function. This returns an iterator over data tuples in the Amazon catalog. The fields are id, title, description, mfg (manufacturer), and price if any:
```
>>>for a in amazon_catalog():
... print(a)
... break
## Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*.
{'id': 'b000jz4hqo', 'title': 'clickart 950 000 - premier image pack (dvd-rom)', 'description': '', 'mfg': 'broderbund', 'price': '0'}
```
You can similarly, do the same for the Google catalog:
```
>>>for a in google_catalog():
... print(a)
... break
Here's how the basic Bloom filter works:
{'id': 'http://www.google.com/base/feeds/snippets/11125907881740407428', 'title': 'learning quickbooks 2007', 'description': 'learning quickbooks 2007', 'mfg': 'intuit', 'price': '38.99'}
```
A matching is a pairing between id's in the Google catalog and the Amazon catalog that refer to the same product. The ground truth is listed in the file `Amzon_GoogleProducts_perfectMapping.csv`. Your job is to construct a list of pairs (or iterator of pairs) of `(amazon.id, google.id)`. These matchings can be evaluated for accuracy using the `eval_matching` function:
```
>>> my_matching = [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'), ...]
>>> eval_matching(my_matching)
{'false positive': 0.9768566493955095, 'false negative': 0.43351268255188313, 'accuracy': 0.04446992095577143}
```
False positive refers to the false positive rate, false negative refers to the false negative rate, and accuracy refers to the overall accuracy.
### Initialization
* An empty Bloom filter is initialized with an array of *m* elements each with value 0.
* Generate *k* independent hash functions whose outputs are integers in {0,...,m}.
## Assignment
Your job is to write the `match` function in `analyze.py`. You can run your code by running:
```
python3 auto_grader.py
```
Running the code will print out a result report as follows (accuracy, precision, and recall):
```
----Accuracy----
0.5088062622309197 0.6998654104979811 0.3996925441967717
---- Timing ----
168.670348 seconds
### Adding An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).
```
*For full credit, you must write a program that achieves at least 50% accuracy in less than 5 mins on a standard laptop.*
### Contains An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not have possibly been inserted in the past.
The project is completely unstructured and it is up to you to figure out how to make this happen. Here are some hints:
## TODO 1. Generate K independent Hash Functions
Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions each with a range from 0 to *m*. Here are some hints that will help you write this function.
* The Amazon product database is redundant (it contains multiple entries for the same product), while the Google database is essentially unique.
* Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the range of this hash function to be within 0 to m?
* Jaccard similarity will be useful but you may have to consider "n-grams" of words (look at the lecture notes!) and "cleaning" up the strings to strip formatting and punctuation.
* Step 2. Generate k of such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works.
* Price and manufacturer will also be important attributes to use.
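As a starting point for the similarity hint, here is a small sketch of Jaccard similarity over word n-grams. The helpers `clean_ngrams` and `jaccard` are hypothetical and are not provided in `core.py`:
```
import re

def clean_ngrams(text, n=2):
    # lowercase, strip punctuation, and form word n-grams
    words = re.sub(r'[^a-z0-9 ]', ' ', text.lower()).split()
    return set(tuple(words[i:i+n]) for i in range(len(words) - n + 1))

def jaccard(a, b):
    # Jaccard similarity: size of the intersection over size of the union
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```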
* Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take.
## Submission
After you finish the assignment you can submit your code with:
```
$ git push
```
## TODO 2. Put
Write a function that uses the algorithm listed above to add a string to the bloom filter. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* Set the ith element of the array to 1
## TODO 3. Get
Write a function that uses the algorithm listed above to test whether the bloom filter possibly contains the string. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* if the ith element is 0, return false
* if all code-indices are 1, return true
## Testing
We've provided an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
import datetime
import csv
from analyze import match
def eval_matching(your_matching):
f = open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding = "ISO-8859-1")
reader = csv.reader(f, delimiter=',', quotechar='"')
matches = set()
proposed_matches = set()
tp = set()
fp = set()
fn = set()
tn = set()
for row in reader:
matches.add((row[0],row[1]))
#print((row[0],row[1]))
for m in your_matching:
proposed_matches.add(m)
if m in matches:
tp.add(m)
else:
fp.add(m)
for m in matches:
if m not in proposed_matches:
fn.add(m)
if len(your_matching) == 0:
prec = 1.0
else:
prec = len(tp)/(len(tp) + len(fp))
rec = len(tp)/(len(tp) + len(fn))
return {'precision': prec,
'recall': rec,
'accuracy': 2*(prec*rec)/(prec+rec) }
#prints out the accuracy
now = datetime.datetime.now()
out = eval_matching(match())
timing = (datetime.datetime.now()-now).total_seconds()
print("----Accuracy----")
print(out['accuracy'], out['precision'] ,out['recall'])
print("---- Timing ----")
print(timing,"seconds")
import random
import string
from bloom import *
def generate_random_string(seed=True):
chars = string.ascii_uppercase + string.digits
size = 10
return ''.join(random.choice(chars) for x in range(size))
def test_hash_generation():
b = Bloom(5,10)
try:
assert(len(b.hashes) == 10)
except:
print('[#1] Failure the number of generated hashes is wrong')
try:
for h in b.hashes:
h(generate_random_string())
except:
print('[#2] The hashes are not properly represented as a lambda')
s = generate_random_string()
try:
for h in b.hashes:
assert(h(s) == h(s))
except:
print('[#3] Hashes are not deterministic')
try:
b = Bloom(100,10)
b1h = b.hashes[0](s)
b = Bloom(100,10)
b2h = b.hashes[0](s)
assert(b1h == b2h)
except:
print('[#4] Seeds are not properly set')
try:
b = Bloom(100,10)
for h in b.hashes:
for i in range(10):
assert( h(generate_random_string())< 100 )
except:
print('[#5] Hash exceeds range')
try:
b = Bloom(1000,2)
s = generate_random_string()
bh1 = b.hashes[0](s)
bh2 = b.hashes[1](s)
assert(bh1 != bh2)
except:
print('[#6] Hashes generated are not independent')
def test_put():
b = Bloom(100,10,seed=0)
b.put('the')
b.put('university')
b.put('of')
b.put('chicago')
try:
assert(sum(b.array) == 30)
except:
print('[#7] Unexpected Put() Result')
def test_put_get():
b = Bloom(100,5,seed=0)
b.put('the')
b.put('quick')
b.put('brown')
b.put('fox')
b.put('jumped')
b.put('over')
b.put('the')
b.put('lazy')
b.put('dog')
results = [b.contains('the'),\
b.contains('cow'), \
b.contains('jumped'), \
b.contains('over'),\
b.contains('the'), \
b.contains('moon')]
try:
assert(results == [True, False, True, True, True, False])
except:
print('[#8] Unexpected contains result')
test_hash_generation()
test_put()
test_put_get()
import datetime
import csv
from analyze import match
def eval_matching(your_matching):
f = open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding = "ISO-8859-1")
reader = csv.reader(f, delimiter=',', quotechar='"')
matches = set()
proposed_matches = set()
tp = set()
fp = set()
fn = set()
tn = set()
for row in reader:
matches.add((row[0],row[1]))
#print((row[0],row[1]))
for m in your_matching:
proposed_matches.add(m)
if m in matches:
tp.add(m)
else:
fp.add(m)
for m in matches:
if m not in proposed_matches:
fn.add(m)
if len(your_matching) == 0:
prec = 1.0
else:
prec = len(tp)/(len(tp) + len(fp))
rec = len(tp)/(len(tp) + len(fn))
return {'precision': prec,
'recall': rec,
'accuracy': 2*(prec*rec)/(prec+rec) }
#prints out the accuracy
now = datetime.datetime.now()
out = eval_matching(match())
timing = (datetime.datetime.now()-now).total_seconds()
print("----Accuracy----")
print(out['accuracy'], out['precision'] ,out['recall'])
print("---- Timing ----")
print(timing,"seconds")
# Out-of-Core Group By Aggregate
*Graduating Seniors: Due 6/5/20 11:59 PM*
*Everyone else: Due 6/8/20 11:59 PM*
In this assignment, you will implement an out-of-core
version of the group by aggregate (aggregation by key)
seen in lecture. You will have a set memory limit and
you will have to count the number of times a string shows
up in an iterator. Your program should work for any limit
greater than 20.
## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Then, copy the `hw5` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `countD.py`; this is the only file that you will modify. Finally, add `countD.py` using `git add`:
```
$ git add countD.py
$ git commit -m'initialized homework'
```
Now, you will need to fetch the data used in this assignment. Download title.csv and put it in the hw5 folder:
https://www.dropbox.com/s/zl7yt8cl0lvajxg/title.csv?dl=0
DO NOT ADD title.csv to the git repo! After you download the dataset, a Python module provided for you called `core.py` reads it. This module exposes the data as iterators through two functions, `imdb_years()` and `imdb_title_words()`:
```
>> for i in imdb_years():
... print(i)
1992
1986
<so on>
```
Play around with both `imdb_years()` and `imdb_title_words()` to get a feel for how the data works.
## MemoryLimitedHashMap
In this project, the main data structure is the `MemoryLimitedHashMap`. This is a hash map that has an explicit limit on the number of keys it can store. To create one of these data structures, you can import it from the core module:
```
from core import *
#create a memory limited hash map
m = MemoryLimitedHashMap()
```
To find out what the limit of this hash map is, you can:
```
print("The max size of m is: ", m.limit)
```
The data structure can be constructed with an explicit limit (the default is 1000), e.g., `MemoryLimitedHashMap(limit=10)`.
Adding data to this hash map is like you've probably seen before in a data structure class. There is a `put` function that takes in a key and assigns that key a value:
```
# put some keys
m.put('a', 1)
m.put('b', 45)
print("The size of m is: ", m.size())
```
You can fetch the data using the `get` function and `keys` function:
```
# get keys
for k in m.keys():
print("The value at key=", k, 'is', m.get(k))
# You can test to see if a key exists
print('Does m contain a?', m.contains('a'))
print('Does m contain c?', m.contains('c'))
```
When a key does not exist in the data structure the `get` function will raise an error:
```
#This gives an error:
m.get('c')
```
Similarly, if you assign too many unique keys (more than the limit) you will get an error:
```
for i in range(0,1001):
m.put(str(i), i)
```
The `MemoryLimitedHashMap` allows you to manage this limited storage with a `flushKey` function that allows you to persist a key and its assigned value to disk. When you flush a key, it is removed from the data structure, freeing up space under the limit. `flushKey` takes a key as a parameter.
```
m.flushKey('a')
print("The size of m is: ", m.size())
```
Note that the disk is not intelligent! If you flush a key multiple times it simply appends the flushed value to a file on disk:
```
m.flushKey('a')
<some work...>
m.flushKey('a')
```
Once a key has been flushed it can be read back using the `load` function (which takes a key as a parameter). This loads back *all* of the flushed values:
```
#You can also load values from disk
for k,v in m.load('a'):
print(k,v)
```
If you try to load a key that has not been flushed, you will get an error:
```
#Error!!
for k,v in m.load('d'):
print(k,v)
```
If you want multiple flushes of the same key to be differentiated, you can set a *subkey*:
```
#first flush
m.flushKey('a', '0')
<some work...>
#second flush
m.flushKey('a', '1')
```
The `load` function allows you to selectively pull
certain subkeys:
```
# pull only the first flush
m.load('a', '0')
```
We can similarly iterate over all of the flushed data (which optionally takes a subkey as well!):
```
for k,v in m.loadAll():
print(k,v)
```
It also takes an optional parameter that includes the in-memory keys as well:
```
for k,v in m.loadAll(subkey='myskey', inMemory=True):
print(k,v)
```
Since there are some keys in memory and some flushed to disk there are two commands to iterate over keys.
```
m.keys() #returns all keys that are in memory
```
There is also a way to iterate over all of the flushed keys (will strip out any subkeys):
```
m.flushed() #return keys that are flushed.
```
## Count Per Group
In this assignment, you will implement an out-of-core count operator which, for all distinct strings in an iterator, returns
the number of times each appears (in no particular order).
For example,
```
In: "the", "cow", "jumped", "over", "the", "moon"
Out: ("the",2), ("cow",1), ("jumped",1), ("over",1), ("moon",1)
```
Or,
```
In: "a", "b", "b", "a", "c"
Out: ("c",1),("b",2), ("a", 2)
```
The catch is that you CANNOT use a list, dictionary, or set from
Python. We provide a general purpose data structure called a MemoryLimitedHashMap (see ooc.py). You must maintain the iterator
state using this data structure.
The class that you will implement is called Count (in countD.py).
The constructor is written for you, and it takes in an input iterator and a MemoryLimitedHashMap. You will use these objects
in your implementation. You will have to implement `__next__` and `__iter__`. Any solution using a list, dictionary, or set inside `Count` will receive 0 points.
The hint is to do this in multiple passes and use a subkey to track keys flushed between different passes.
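To make the interface concrete, here is a bare skeleton of `Count` showing only the iterator protocol. The constructor argument names are illustrative, and the actual flushing and merging logic (the assignment itself) is omitted:
```
class Count:
    def __init__(self, input_iterator, memory_limited_hashmap):
        # the constructor is provided in countD.py; names here are illustrative
        self.input = input_iterator
        self.map = memory_limited_hashmap

    def __iter__(self):
        return self

    def __next__(self):
        # should return the next (string, count) pair and raise
        # StopIteration once every distinct string has been emitted
        raise NotImplementedError
```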
## Testing and Submission
We have provided a series of basic tests in `auto_grader.py`; these tests are incomplete and are not meant to comprehensively grade your assignment. There is a file `years.json` with an expected output. After you finish the assignment you can submit your code with:
```
$ git push
```