Commit 958870a3 by Sanjay Krishnan
parents 2f2c443c 293bd130
Every student in the class has a git repository (a place where you can store completed assignments). This git repository can be accessed from:
[https://mit.cs.uchicago.edu/cmsc13600-spr-21/<your cnetid>.git]
The first thing to do is to open your terminal application, and ``clone`` this repository (NOTE: replace ``<your cnet id>`` below with your CNET ID!):
```
$ git clone https://mit.cs.uchicago.edu/cmsc13600-spr-21/<your cnet id>.git cmsc13600-submit
```
Your username and password are your CNET ID and CNET password. This will create a new, empty folder titled ``cmsc13600-submit``. There is similarly a course repository where all of the homework materials will be stored. You should clone this repository as well:
```
# Homework 1. Introduction to Data Extraction
In this assignment, you will extract meaningful information from unstructured data.
Due Date: *Friday April 9, 2021 11:59 pm*
## Initial Setup
These initial setup instructions assume you've done ``hw0``. Before you start an assignment you should sync your cloned repository with the online one:
```
def _reddit_extract(file):
```
That returns a Pandas DataFrame with three columns (*title*, *link*, *updated*). On `reddit.xml` your output should be a 25-row, 3-column Pandas DataFrame.
Hint: if you are getting 26 rows, you are probably extracting the first dummy header row as well--this can be safely skipped.
### TODO 2. Extract Ticker Symbols
Each title of a reddit post might mention a stock of interest and most use a consistent format to denote a ticker symbol (starting with a dollar sign). For example: "$ISWH Takes Center Stage at Crypto Conference". You will now write a function called `extract_ticker` which, given a single title, extracts all of the ticker symbols present in the title:
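One way to approach this is with a regular expression. The exact pattern below (a dollar sign followed by one to five letters) and the choice to uppercase the result are only guesses at the format; tune them against the titles you actually extracted:

```python
import re

# Assumed format: '$' followed by 1-5 letters, ending at a word boundary.
TICKER_RE = re.compile(r'\$([A-Za-z]{1,5})\b')

def extract_ticker(title):
    """Return every ticker symbol mentioned in a post title."""
    return ['$' + m.upper() for m in TICKER_RE.findall(title)]
```

For example, `extract_ticker('$ISWH Takes Center Stage at Crypto Conference')` should yield a single-element list.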
# HW2 String Matching
Due Date: *Friday April 23, 2021 11:59 PM*
Entity Resolution is the task of disambiguating manifestations of real world entities in various records or mentions by linking and grouping. For example, there could be different ways of addressing the same person in text, different addresses for businesses, or photos of a particular object. In this assignment, you will link two product catalogs.
## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ cd cmsc13600-materials
$ git pull
```
Then, copy the `hw2` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `analyze.py`. You can add the files to the repository using `git add`:
```
$ git add analyze.py
$ git commit -m'initialized homework'
```
You will also need to fetch the datasets used in this homework assignment:
```
https://www.dropbox.com/s/vq5dyl5hwfhbw98/Amazon.csv?dl=0
https://www.dropbox.com/s/fbys7cqnbl3ch1s/Amzon_GoogleProducts_perfectMapping.csv?dl=0
https://www.dropbox.com/s/o6rqmscmv38rn1v/GoogleProducts.csv?dl=0
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface: [https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
Before we can get started, let us understand the main APIs in this project. We have provided a file named `core.py` for you. This file loads and processes the data that you've just downloaded. For example, you can load the Amazon catalog with the `amazon_catalog()` function. This returns an iterator over data tuples in the Amazon catalog. The fields are id, title, description, mfg (manufacturer), and price if any:
```
>>>for a in amazon_catalog():
... print(a)
... break
{'id': 'b000jz4hqo', 'title': 'clickart 950 000 - premier image pack (dvd-rom)', 'description': '', 'mfg': 'broderbund', 'price': '0'}
```
You can similarly do the same for the Google catalog:
```
>>>for a in google_catalog():
... print(a)
... break
{'id': 'http://www.google.com/base/feeds/snippets/11125907881740407428', 'title': 'learning quickbooks 2007', 'description': 'learning quickbooks 2007', 'mfg': 'intuit', 'price': '38.99'}
```
A matching is a pairing between id's in the Google catalog and the Amazon catalog that refer to the same product. The ground truth is listed in the file `Amzon_GoogleProducts_perfectMapping.csv`. Your job is to construct a list of pairs (or iterator of pairs) of `(amazon.id, google.id)`. These matchings can be evaluated for accuracy using the `eval_matching` function:
```
>>> my_matching = [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'), ...]
>>> eval_matching(my_matching)
{'false positive': 0.9768566493955095, 'false negative': 0.43351268255188313, 'accuracy': 0.04446992095577143}
```
False positive refers to the false positive rate, false negative refers to the false negative rate, and accuracy refers to the overall accuracy.
## Assignment
Your job is to write the `match` function in `analyze.py`. You can run your code by running:
```
python3 auto_grader.py
```
Running the code will print out a result report as follows (accuracy, precision, and recall):
```
----Accuracy----
0.5088062622309197 0.6998654104979811 0.3996925441967717
---- Timing ----
168.670348 seconds
```
*For full credit, you must write a program that achieves at least 50% accuracy in less than 5 mins on a standard laptop.*
The project is completely unstructured, and it is up to you to figure out how to make this happen. Here are some hints:
* The Amazon product database is redundant (it contains multiple entries for the same product); the Google database is essentially unique.
* Jaccard similarity will be useful but you may have to consider "n-grams" of words (look at the lecture notes!) and "cleaning" up the strings to strip formatting and punctuation.
* Price and manufacturer will also be important attributes to use.
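To make the hints concrete, here is a minimal sketch of Jaccard similarity over word bigrams with basic string cleaning. The helper names and the choice of n=2 are illustrative, not required:

```python
import re

def clean(s):
    """Lowercase and strip punctuation so formatting differences don't matter."""
    return re.sub(r'[^a-z0-9 ]', ' ', s.lower())

def ngrams(s, n=2):
    """Set of word n-grams from a cleaned string; falls back to single words."""
    words = clean(s).split()
    if len(words) < n:
        return set(words)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=2):
    """|A intersect B| / |A union B| over the two n-gram sets."""
    sa, sb = ngrams(a, n), ngrams(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A plain all-pairs comparison of the two catalogs with this function will be slow; combining it with cheaper filters (price, manufacturer) is one way to stay under the time budget.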
## Submission
After you finish the assignment you can submit your code with:
```
$ git push
```
## Testing
import datetime
import csv
from analyze import match

def eval_matching(your_matching):
    f = open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding="ISO-8859-1")
    reader = csv.reader(f, delimiter=',', quotechar='"')
    matches = set()
    proposed_matches = set()
    tp = set()
    fp = set()
    fn = set()
    tn = set()
    for row in reader:
        matches.add((row[0], row[1]))
        #print((row[0],row[1]))
    for m in your_matching:
        proposed_matches.add(m)
        if m in matches:
            tp.add(m)
        else:
            fp.add(m)
    for m in matches:
        if m not in proposed_matches:
            fn.add(m)
    if len(your_matching) == 0:
        prec = 1.0
    else:
        prec = len(tp)/(len(tp) + len(fp))
    rec = len(tp)/(len(tp) + len(fn))
    return {'precision': prec,
            'recall': rec,
            'accuracy': 2*(prec*rec)/(prec+rec)}

#prints out the accuracy
now = datetime.datetime.now()
out = eval_matching(match())
timing = (datetime.datetime.now()-now).total_seconds()
print("----Accuracy----")
print(out['accuracy'], out['precision'], out['recall'])
print("---- Timing ----")
print(timing, "seconds")
# Homework 3. Introduction to Python and File I/O
This homework assignment is meant to be an introduction to Python programming and introduces some basic concepts of encoding and decoding.
Due Date: *Friday April 30, 2021 11:59 pm*
## Initial Setup
Before you start an assignment you should sync your cloned repository with the online one:
```
$ cd cmsc13600-materials
$ git pull
```
Copy the folder ``hw3`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw3`` folder. In this homework assignment, you will only modify ``encoding.py``. Once you are done, you must add 'encoding.py' to git:
```
$ git add encoding.py
```
After adding your files, to submit your code you must run:
```
$ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface: [https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
## Delta Encoding
Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files.
In this first assignment, you will implement a delta encoding module in python.
The module will:
* Load a file of integers
* Delta encode them
* Write back a file in binary form
The instructions in this assignment are purposefully incomplete for you to read Python's API and to understand how the different functions work. All of the necessary parts that you need to write are marked with *TODO*.
## TODO 1. Loading the data file
In `encoding.py`, your first task is to write `load_orig_file`. This function reads from a specified filename and returns a list of integers in the file. You may assume the file is formatted like ``data.txt`` provided with the code, where each line contains a single integer number. The input of this function is a filename and the output is a list of numbers. If the file does not exist you must raise an exception.
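A minimal sketch of this function (note that `open` already raises `FileNotFoundError` for a missing file, which satisfies the exception requirement):

```python
def load_orig_file(fname):
    """Read one integer per line; open() raises FileNotFoundError if fname is missing."""
    with open(fname) as f:
        return [int(line) for line in f if line.strip()]
```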
## TODO 2. Compute the basic encoding
In `encoding.py`, your next task is to write `delta_encoding`. This function takes a list of numbers and computes the delta encoding. The delta encoding encodes the list in terms of successive differences from the previous element. The first element is kept as is in the encoding.
For example:
```
> data = [1,3,4,3]
> enc = delta_encoding(data)
1,2,1,-1
```
Or,
```
> data = [1,0,6,1]
> enc = delta_encoding(data)
1,-1,6,-5
```
Your job is to write a function that computes this encoding. Pay close attention to how python passes around references and where you make copies of lists v.s. modify a list in place.
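For instance, one way to write `delta_encoding` so that it builds a new list instead of modifying its argument in place:

```python
def delta_encoding(numbers):
    """Delta encode without mutating the caller's list: append into a new one."""
    out = []
    prev = 0
    for i, x in enumerate(numbers):
        out.append(x if i == 0 else x - prev)  # first element kept as-is
        prev = x
    return out
```

Because `out` is a fresh list, the original data is still intact afterwards, which the later decoding tests rely on.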
## TODO 3. Integer Shifting
When we write this data to a file, we will want to represent each encoded value as a single unsigned byte. To do so, we have to "shift" all of the values upwards so there are no negatives. You will write a function `shift` that adds a pre-specified offset to each value.
## TODO 4. Write Encoding
Now, we are ready to write the encoded data to disk. In the function `write_encoding`, you will do the following steps:
* Open the specified filename in the function arguments for writing
* Convert the encoded list of numbers into a bytearray
* Write the bytearray to the file
* Close the file
Reading from such a file is a little tricky, so we've provided that function for you.
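The steps above can be sketched as follows; note that each value must already be shifted into the range 0–255 before it can be stored as a single byte:

```python
def write_encoding(encoded, fname):
    """Write each (already shifted, non-negative) value as one byte."""
    with open(fname, 'wb') as f:        # binary mode for raw bytes
        f.write(bytearray(encoded))     # bytearray() requires ints in 0..255
```

The `with` block closes the file automatically, which covers the open/write/close steps.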
## TODO 5. Delta Decoding
Finally, you will write a function that takes a delta encoded list and recovers the original data. This should do the opposite of what you did before. Don't forget to unshift the data when you are testing!
For example:
```
> enc = [1,2,1,-1]
> data = delta_decoding(enc)
1,3,4,3
```
Or,
```
> data = [1,-1,6,-5]
> data = delta_decoding(enc)
1,0,6,1
```
## Testing
We've provided a sample dataset ``data.txt`` which can be used to test your code as well as an autograder script `autograder.py` which runs a bunch of interesting tests. The autograder is not comprehensive but it is a good start. It's up to you to figure out what the tests do and why they work.
import random
from encoding import *

def test_load():
    data = load_orig_file('data.txt')
    try:
        assert(sum(data) == 1778744)
    except AssertionError:
        print('TODO 1. Failure check your load_orig_file function')

def test_encoding():
    data = load_orig_file('data.txt')
    encoded = delta_encoding(data)
    try:
        assert(sum(encoded) == data[-1])
        assert(sum(encoded) == 26)
        assert(len(data) == len(encoded))
    except AssertionError:
        print('TODO 2. Failure check your delta_encoding function')

def test_shift():
    data = load_orig_file('data.txt')
    encoded = delta_encoding(data)
    N = len(data)
    try:
        assert(sum(shift(data, 10)) == N*10 + sum(data))
        assert(all([d >= 0 for d in shift(encoded, 4)]))
    except AssertionError:
        print('TODO 3. Failure check your shift function')

def test_decoding():
    data = load_orig_file('data.txt')
    encoded = delta_encoding(data)
    sencoded = shift(encoded, 4)
    data_p = delta_decoding(unshift(sencoded, 4))
    try:
        assert(data == data_p)
    except AssertionError:
        print('TODO 5. Cannot recover data with delta_decoding')

def generate_file(size, seed):
    FILE_NAME = 'data.gen.txt'
    f = open(FILE_NAME, 'w')
    initial = seed
    for i in range(size):
        f.write(str(initial) + '\n')
        initial += random.randint(-4, 4)
    f.close()

def generate_random_tests():
    SIZES = (1, 1000, 16, 99)
    SEEDS = (240, -3, 9, 1)
    cnt = 0
    for trials in range(10):
        generate_file(random.choice(SIZES), random.choice(SEEDS))
        data = load_orig_file('data.gen.txt')
        encoded = delta_encoding(data)
        sencoded = shift(encoded, 4)
        write_encoding(sencoded, 'data_out.txt')
        loaded = unshift(read_encoding('data_out.txt'), 4)
        decoded = delta_decoding(loaded)
        cnt += (decoded == data)
    try:
        assert(cnt == 10)
    except AssertionError:
        print('Failed Random Tests', str(10-cnt), 'out of 10')

test_load()
test_encoding()
test_shift()
test_decoding()
generate_random_tests()
# Homework 4. Bloom Filter
This homework assignment introduces an advanced use of hashing called a Bloom filter.
Due Date: *Friday May 7, 2021 11:59 pm*
## Initial Setup
Before you start an assignment you should sync your cloned repository with the online one:
```
$ cd cmsc13600-materials
$ git pull
```
Copy the folder ``hw4`` to your newly cloned submission repository. Enter that repository from the command line and enter the copied ``hw4`` folder. In this homework assignment, you will only modify ``bloom.py``. Once you are done, you must add 'bloom.py' to git:
```
$ git add bloom.py
```
After adding your files, to submit your code you must run:
```
$ git commit -m"My submission"
$ git push
```
We will NOT grade any code that is not added, committed, and pushed to your submission repository. You can confirm your submission by visiting the web interface: [https://mit.cs.uchicago.edu/cmsc13600-spr-20/skr]
## Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set." Elements can be added to the set, but not removed (though this can be addressed with the counting Bloom filter variant); the more items added, the larger the probability of false positives. All of the necessary parts that you need to write are marked with *TODO*.
Here's how the basic Bloom filter works:
### Initialization
* An empty Bloom filter is initialized with an array of *m* elements each with value 0.
* Generate *k* independent hash functions whose output domain are integers {0,...,m}.
### Adding An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and set each corresponding index in the array to 1 (if it is already 1 from a previous addition keep it as is).
### Contains An Item e
* For each hash function calculate the hash value of the item "e" (should be a number from 0 to m).
* Treat those calculated hash values as indices for the array and retrieve the array value for each corresponding index. If any of the values is 0, we know that "e" could not have possibly been inserted in the past.
## TODO 1. Generate K independent Hash Functions
Your first task is to write the function `generate_hashes`. This function is a higher-order function that returns a list of *k* random hash functions each with a range from 0 to *m*. Here are some hints that will help you write this function.
* Step 1. Review the "linear" hash function described in lecture and write a helper function that generates such a hash function for a pre-defined A and B. How would you restrict the domain of this hash function to be with 0 to m?
* Step 2. Generate k of such functions with different random settings of A and B. Pay close attention to how many times you call "random.x" because of how the seeded random variable works.
* Step 3. Return the functions themselves so they can be applied to data. Look at the autograder to understand what inputs these functions should take.
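Putting the three steps together, a sketch might look like the following. The `(A*x + B) % m` form follows the lecture's linear hash; mapping a string to an integer by summing character codes is just one simple choice (anagrams will collide), and the `seed` parameter is an assumption about how you might make the functions reproducible:

```python
import random

def generate_hashes(m, k, seed=0):
    """Return k 'linear' hash functions h(s) = (A*x + B) % m with random A, B."""
    random.seed(seed)               # same seed -> same A, B pairs every run
    fns = []
    for _ in range(k):
        a = random.randint(1, 2**31 - 1)
        b = random.randint(0, 2**31 - 1)
        # default arguments capture a and b at definition time,
        # so each function keeps its own pair
        def h(s, a=a, b=b):
            x = sum(ord(c) for c in s)   # one simple string-to-int mapping
            return (a * x + b) % m
        fns.append(h)
    return fns
```

Returning the functions themselves (Step 3) is what lets the rest of the filter apply them to arbitrary strings later.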
## TODO 2. Put
Write a function that uses the algorithm listed above to add a string to the bloom filter. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* Set the ith element of the array to 1
## TODO 3. Get
Write a function that uses the algorithm listed above to test whether the bloom filter possibly contains the string. In pseudo-code:
* For each of the k hash functions:
* Compute the hash code of the string, and store the code in i
* if the ith element is 0, return false
* if all code-indices are 1, return true
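Taken together, the two pseudo-code listings above are only a few lines each. Here is a self-contained sketch; the class name, constructor signature, and the index-salted hash family are illustrative assumptions, not the interface your `bloom.py` must expose.

```python
class BloomSketch:
    """Toy Bloom filter: m array cells, k hash functions (illustrative only)."""
    def __init__(self, m, k):
        self.array = [0] * m
        # Hypothetical hash family: salt Python's built-in hash with the index.
        self.hashes = [lambda s, i=i: hash((i, s)) % m for i in range(k)]

    def put(self, s):
        # TODO 2: set the array cell at every hash position to 1.
        for h in self.hashes:
            self.array[h(s)] = 1

    def contains(self, s):
        # TODO 3: a single 0 cell proves s was never inserted;
        # all 1s means "possibly inserted" (false positives are possible).
        return all(self.array[h(s)] == 1 for h in self.hashes)
```

For example, after `b = BloomSketch(1000, 5); b.put('chicago')`, the call `b.contains('chicago')` returns True, while a never-inserted string is almost certainly (but not provably) rejected.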
## Testing
We've provided an autograder script `autograder.py`, which runs a bunch of interesting tests. The autograder is not comprehensive, but it is a good start. It's up to you to figure out what the tests do and why they work.
import datetime
import csv

from analyze import match


def eval_matching(your_matching):
    f = open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding="ISO-8859-1")
    reader = csv.reader(f, delimiter=',', quotechar='"')

    matches = set()
    proposed_matches = set()

    tp = set()
    fp = set()
    fn = set()
    tn = set()

    for row in reader:
        matches.add((row[0], row[1]))
        #print((row[0], row[1]))

    for m in your_matching:
        proposed_matches.add(m)

        if m in matches:
            tp.add(m)
        else:
            fp.add(m)

    for m in matches:
        if m not in proposed_matches:
            fn.add(m)

    if len(your_matching) == 0:
        prec = 1.0
    else:
        prec = len(tp) / (len(tp) + len(fp))
    rec = len(tp) / (len(tp) + len(fn))

    # Note: the 'accuracy' reported here is actually the F1 score,
    # the harmonic mean of precision and recall.
    return {'precision': prec,
            'recall': rec,
            'accuracy': 2 * (prec * rec) / (prec + rec)}


# prints out the accuracy
now = datetime.datetime.now()
out = eval_matching(match())
timing = (datetime.datetime.now() - now).total_seconds()

print("----Accuracy----")
print(out['accuracy'], out['precision'], out['recall'])
print("---- Timing ----")
print(timing, "seconds")
import random
import string

from bloom import *


def generate_random_string(seed=True):
    chars = string.ascii_uppercase + string.digits
    size = 10
    return ''.join(random.choice(chars) for x in range(size))


def test_hash_generation():
    b = Bloom(5, 10)

    try:
        assert(len(b.hashes) == 10)
    except:
        print('[#1] Failure the number of generated hashes is wrong')

    try:
        for h in b.hashes:
            h(generate_random_string())
    except:
        print('[#2] The hashes are not properly represented as a lambda')

    s = generate_random_string()
    try:
        for h in b.hashes:
            assert(h(s) == h(s))
    except:
        print('[#3] Hashes are not deterministic')

    try:
        b = Bloom(100, 10)
        b1h = b.hashes[0](s)

        b = Bloom(100, 10)
        b2h = b.hashes[0](s)

        assert(b1h == b2h)
    except:
        print('[#4] Seeds are not properly set')

    try:
        b = Bloom(100, 10)
        for h in b.hashes:
            for i in range(10):
                assert(h(generate_random_string()) < 100)
    except:
        print('[#5] Hash exceeds range')

    try:
        b = Bloom(1000, 2)
        s = generate_random_string()
        bh1 = b.hashes[0](s)
        bh2 = b.hashes[1](s)
        assert(bh1 != bh2)
    except:
        print('[#6] Hashes generated are not independent')


def test_put():
    b = Bloom(100, 10, seed=0)
    b.put('the')
    b.put('university')
    b.put('of')
    b.put('chicago')

    try:
        assert(sum(b.array) == 30)
    except:
        print('[#7] Unexpected Put() Result')


def test_put_get():
    b = Bloom(100, 5, seed=0)
    b.put('the')
    b.put('quick')
    b.put('brown')
    b.put('fox')
    b.put('jumped')
    b.put('over')
    b.put('the')
    b.put('lazy')
    b.put('dog')

    results = [b.contains('the'),
               b.contains('cow'),
               b.contains('jumped'),
               b.contains('over'),
               b.contains('the'),
               b.contains('moon')]

    try:
        assert(results == [True, False, True, True, True, False])
    except:
        print('[#8] Unexpected contains result')


test_hash_generation()
test_put()
test_put_get()
# Extract-Transform-Load
*Due Friday 5/22/20 11:59 PM*
*Extra Credit Assignment*
Extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s). In this project, you will write some of the core primitives in an ETL system.
......
# Out-of-Core Group By Aggregate
*Graduating Seniors: Due 6/5/20 11:59 PM*
*Everyone else: Due 6/8/20 11:59 PM*
*Due Friday May 21, 11:59 pm*
In this assignment, you will implement an out-of-core
version of the group by aggregate (aggregation by key)
......
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Python Document Search Engine\n",
"\n",
"Now, we will start to put together all of the topics that we have studied so far into a series of \"Python Recipes\"---coding examples that illustrate the power of thinking hard about how data is organized and structured. In the first example, we will consider a \"Python Search Engine\" that will identify relevant items given a query string.\n",
"\n",
"We're going to start with a dataset of tweets about airlines:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number: 14640 \t Size: 2.54372 MB \t Bytes per tweet: 173.7513661202186\n"
]
}
],
"source": [
"import csv\n",
"\n",
"def load_data(filename):\n",
" \n",
" rtn = []\n",
" #open the file with the csv reader\n",
" with open(filename, newline='') as csvfile:\n",
" tweets = csv.reader(csvfile, delimiter=',', quotechar='\"')\n",
" \n",
" next(tweets)#skip the header\n",
" \n",
" for row in tweets:\n",
" rtn.append(row[10])\n",
" \n",
" return rtn\n",
"\n",
"tweets = load_data('Tweets.csv')\n",
"\n",
"#figure out how much data we have\n",
"size = sum([i.__sizeof__() for i in tweets]) + tweets.__sizeof__()\n",
"\n",
"print('Number: ', len(tweets), '\\t Size:', size/1e6,'MB','\\t Bytes per tweet:', size/len(tweets))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset contains a large list of tweets represented as strings. We want to be able to search for phrases in these tweets. Of course, the first thing that we can do is write a simple naive search routine that scans through the entire dataset.\n",
"\n",
"## Naive Search\n",
"Suppose we wanted to find a substring in this collection of tweets. We could write the following code, which iterates through each tweet and searches for a substring:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Find() elapsed time: 0.002858\n",
"Find() elapsed time: 0.0032\n"
]
},
{
"data": {
"text/plain": [
"[\"@VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo\",\n",
" '@VirginAmerica LAX to EWR - Middle seat on a red eye. Such a noob maneuver. #sendambien #andchexmix',\n",
" '@VirginAmerica help, left expensive headphones on flight 89 IAD to LAX today. Seat 2A. No one answering L&amp;F number at LAX!',\n",
" '@VirginAmerica plz help me win my bid upgrade for my flight 2/27 LAX---&gt;SEA!!! 🍷👍💺✈️',\n",
" '@VirginAmerica just landed in LAX, an hour after I should of been here. Your no Late Flight bag check is not business travel friendly #nomorevirgin',\n",
" '@VirginAmerica trying to add my boy Prince to my ressie. SF this Thursday @VirginAmerica from LAX http://t.co/GsB2J3c4gM',\n",
" '@VirginAmerica Can you find us a flt out of LAX that is sooner than midnight on Monday? That would be great customer service 😃',\n",
" '@VirginAmerica congrats, you just got all my business from EWR to SFO/LAX. Fuck you @united fl1289 SFO/EWR was the clincher...',\n",
" '@VirginAmerica nervous about my flight from DC to LAX getting Cancelled Flightled tomorrow! Just sent you a DM to help me!',\n",
" '@VirginAmerica @VirginAtlantic I have just checked in flight to SFO from LAX &amp; been told as Atlantic Flying Club Gold I get no benefits?!',\n",
" '@VirginAmerica I applied for a position in @flyLAXairport ,and I was wondering if you guys received my application.',\n",
" \"@VirginAmerica lost my luggage 4 days ago on flight VX 112 from LAX to IAD &amp; I'm calling every day, no response.Please give me back my stuff\",\n",
" '@VirginAmerica Nice, Lofty View @flyLAXairport. #SilverStatus http://t.co/F4Tp0dAwbd',\n",
" '@VirginAmerica you should have 39 dollar LAX-Las fares!!!',\n",
" '@VirginAmerica Thanks for making my flight from LAX to JFK a nightmare by forcing me to check my carry on bag at the gate. (1)',\n",
" '@VirginAmerica Flying LAX to SFO and after looking at the awesome movie lineup I actually wish I was on a long haul.',\n",
" '@united our travel booked thru United group dept. Okc ticket agent less than willing to help with our connection in LAX.',\n",
" '@united I took the exact same aircraft in to LAX 3 days ago. It fit, no problem. The agent today told some nonsense about a policy change',\n",
" '@united the person is currently bettween gates 71A and 73 in LAX',\n",
" '@united I need the phone number to baggage claim in LAX, my mom left her phone and someone called saying they would put it there but on',\n",
" \"@united is doing musicians real dirty at LAX. I've never been blocked from getting on a flight with my bass.\",\n",
" '@united Hi, Im flying SFO-LAX-SAL-CLO. My connecting time in LAX is 1h45m. Is it enough time? Do I have to collect my bag and recheck on AV?',\n",
" '@united Maybe be hiring your own ground staff at LAX when multiple gate agents tell you your baggage is loaded you expect it to be. HOPELESS',\n",
" '@united SF crew lack a lot of customer service, LAX employees are a lot better. Wonder why...',\n",
" \"@united UA flight 1247. SFO to LAX took my carry on at gate. I'm Group 2, overhead bins are empty\",\n",
" '@united maybemange the airline alittlebetter. Arrived at LAX and no GATE! #howisthatpossible always the same thing w/u',\n",
" '@united #LAX #sunrise UAL212 LAX-JFK',\n",
" '@united Never had a flight delayed an hour due to an unbalanced load. And more delays at @flyLAXairport. Great job idiots.',\n",
" '@United is offering to reroute my SFO flight to LAX. Might be geography class time.',\n",
" \"@united I missed my connection already. Then I missed the next flight they put me on. Now I'm going to LAX instead of Hawaii. :(\",\n",
" '@united thanks for leaving our 3 year old in his own row flight 360 LAX-IAD',\n",
" \"@united - you sure missed the mark on tonight's redeye from LAX to Chicago. What a mess! You can do better!\",\n",
" '@united Flight has been delayed for another hour so only have 24 mins to transit at LAX... Extremely unlikely I will make it!',\n",
" '@united really fucked my day up Hilo to LAX 2hr30min delay because of software? missed connection, getting home 8hrs Late Flightr no upgrade nothin',\n",
" \"@united DEN-PHX flight tomorrow Cancelled Flighted. Asked for overnight 2nite in LAX/SNA. Told not without paying. That's wrong\",\n",
" \"@united Hi have a question re future Flight Booking Problems. DUB-JAC 29/9 JAC-LAX 8/10 LAX-DUB 13/10. I'm *G. What is checked bag allowance for JAC-LAX?\",\n",
" '@united now arrives LAX @ 8:03 am',\n",
" '@united I do I was on UA 495 LAX TO DEN - we are scheduled to land LAX @ 7:38 am - please rebook to Denver - best flight',\n",
" \"@united any chance you'll ever do CPUs on your JFK-LAX like @AmericanAir?\",\n",
" '@united Old school ride home to LAX from Houston #flyingRetro http://t.co/6asuwx3Kv0',\n",
" \"@united I think this is the best first class I have ever gotten!! Denver to LAX and it's wonderful!!!\",\n",
" '@united they did on a delta flight out of LAX which is why I should be compensated for my rental car there.',\n",
" '@united care less about the person - although he walked away while I was complaining. A man at 10p at LAX club. More.....',\n",
" '@united There is only one club at LAX - in terminal 7 across from gate 71',\n",
" '@united Joni did a great job on flight 5653 to LAX. Thanks for a great flight.',\n",
" '@united KOA-LAX should have fresh food service, right?',\n",
" \"@united I forgot that Intl flights out of LAX don't go from Intl Terminal! Easiest re-check in ever! woo!\",\n",
" '@united you should tell that to the staff at LAX then. I boarded group 5 at the very end of the queue as a Gold member. Thanks for nothing.',\n",
" '@united are my bags here yet? They were at Palm Springs airport. I was at LAX. How come I beat my bags here.',\n",
" \"@united Where are my bags!!! They weren't in LAX like your promised. 9 out of 10 things today were a mess today because of you.\",\n",
" '@united Ugh. My bags were sent to Palm Springs and not to LAX as promised. They better be at the hotel when I get there.',\n",
" '@united 12/13EWR-LAX UA1151 my seat/armrest broken discover after takeoff. flight full.FA filed report, who to chat with for partial refund?',\n",
" \"@united I'm on one of your 757-300 between JFK and LAX.When r u upgrading planes?Plane has no Screens,lousy seats And that's in UnitedFirst.\",\n",
" \"@United I'm hoping we don't miss our LAX - ITO connection. Not looking forward to being stuck at LAX overnight with our team....AGAIN!\",\n",
" '@united flight 86 LAX-IAD, back rows NOT CLEANED prior to boarding. How gross is that to find used tissues in your seat? Please.',\n",
" \"@united of course I need help. I've been DMing you ladies and gents all day. Your only solution is hope for the best and LAX.\",\n",
" '@SouthwestAir Flight 1700. (PHX TO LAX) Wheels stop. Glad to be home! Thanks to the professionals both up front and in the cabin!!!',\n",
" '@SouthwestAir another great trip! LAX 823 - LAS 3075- BNA. Thanks so much!!!',\n",
" '@SouthwestAir Seriously? FOUR DELAYS? Only takes 42 minutes to get to Vegas from @flyLAXairport &amp; I have a connecting flight. #ridiculous',\n",
" '@SouthwestAir last week I flew from DAL to LAX. You got us in almost an hour early. Thank You.',\n",
" \"@SouthwestAir I start a new job tomorrow &amp; you Cancelled Flight my flight (1629 BWI-LAX) and you really can't get me on another flight today ?!\",\n",
" '@SouthwestAir @ LAX is almost a mess. For some reason the express bag drop is slower than the full service line. http://t.co/ORY89eEGek',\n",
" '@SouthwestAir appreciate the reply, hopefully those LAX agents get the memo. Cheers!',\n",
" '@SouthwestAir any idea if there will be any \"spring sales\" soon for travel from Late Flight August to early September? Going from PA to LAX.',\n",
" '@SouthwestAir if you are giving tix to #DestinationDragons show would appreciate one or two for LA😄Flying from PHL to LAX on Friday',\n",
" '@southwestair Amazing view on the approach to LAX tonight. http://t.co/a68d5fULmH',\n",
" '@SouthwestAir So I am flying Chicago-LAX-PHX just to go spotting at LAX and PHX airports, then I am flying back to Chicago :)',\n",
" '@SouthwestAir think flight 1945 from BNA to LAX will get off the ground tomorrow??? #please #snowbama',\n",
" '@SouthwestAir any chance of adding LAX-&gt;JFK direct any time in the future?',\n",
" \"@JetBlue ....you haven't got me just yet\\n\\nCan a 1 way LAX-NYC cost me under 190?\",\n",
" '@JetBlue Or...\\n\\n....how about a 1way LAX-NYC(area) under 190?! Is this possible ?',\n",
" \"@JetBlue wondering if it's possible for my colleague and I to get on an earlier flight LAX&gt;JFK tomorrow. Can you help?\",\n",
" \"@JetBlue I'm #MakingLoveOutofNothingAtAll on my #brandloveaffair to #LAX https://t.co/kdHRUF54sW\",\n",
" \"@JetBlue gr8 #Mint crew on #flight 123 to #LAX they're #Mintalicious #TrueBlueLove #ShelleyandMarcRock #travel #air\",\n",
" '@JetBlue service for baggage at JFK is incomprehensible. No employee knows where our luggage is frM flight 424 LAX to NYC ITS AN HR WAIT NOW',\n",
" '@jetblue rqstd upgrade to mint at LAX and was told no because i used points! Why turn down $1600 bcz I used points? #trueblue',\n",
" '@JetBlue received horrible customer service at LAX on 2/11. Reservation Cancelled Flighted without notification, despite having confirmation number.',\n",
" '@USAirways honest question - how is a 1-way ticket from Charlotte (your hub) to LAX (2nd biggest city in USA) almost $600?????',\n",
" '@USAirways Another great flight #FunFlightAttendants. Thanks for showing my dad wonderful customer service. #flt635 #LAX #PHX #SundayFunday',\n",
" '@usairways @AmericanAir LAX connect from term 6 to term 4. 55min layover due to delay. US755-AA2595. Is this realistic? Can 2595 be held?',\n",
" '@USAirways Everyone on Flight 669 from LAX to RDU enjoyed waiting an hour &amp; a half in baggage claim for their bags just now',\n",
" '@USAirways your staff at LAX really messed up on this one. Failing to scan my suitcase tag.',\n",
" \"@USAirways doesn't seem likely bc your team failed to scan my bag in LAX and you recycle bagtag numbers that doesn't help #usairwaysfail\",\n",
" \"@USAirways what's happening with 1217 Phl to LAX? Now 3 hr delay. Poor communication!\",\n",
" \"@USAirways I have been doing that all day. Can't find my bag anywhere bc they're saying it was never scanned &amp; technically never left LAX.\",\n",
" \"@USAirways Hey! I booked a flight (Isabelle Gramp, Boston to LAX), and it said that it charged my credit card but the transaction didn't go\",\n",
" '@USAirways I will be traveling from LAX to CLT to HTS, I have been rebooked for tomorrow due to the travel advisory.',\n",
" '@americanair thanks for no fresh food on my cross country flight and for making my connection so close No time to eat. TPA-DFW-LAX',\n",
" '@AmericanAir - Please find my bag!! In Singapore for three days already without my bag. Last known destination LAX Tag: 580815 Please help.',\n",
" '@AmericanAir None of the #LAX flights into #DFW have been Cancelled Flightled. Those landing before and after ours are fine. Completely arbitrary.',\n",
" '@americanair thanks for no fresh food on my cross country flight and for making my connection so close No time to eat. TPA-DFW-LAX',\n",
" '@AmericanAir - Please find my bag!! In Singapore for three days already without my bag. Last known destination LAX Tag: 580815 Please help.',\n",
" '@AmericanAir None of the #LAX flights into #DFW have been Cancelled Flightled. Those landing before and after ours are fine. Completely arbitrary.',\n",
" '@AmericanAir Yes I am. 2495/1170. RNO departure at 1229 on 2/25 w/connection at DFW to LGA. I can do the 1120am to LAX and then to JFK',\n",
" '@AmericanAir @USAirways #Boo! Wack ass terminal 6 @flyLAXairport. No food. No lounge. No Bueno!! Never again!!!',\n",
" \"@AmericanAir again, no special meal catered for me in F JFK-LAX. thankfully i'm on qantas the rest of way-i fear what youd NOT cater on that\",\n",
" '@AmericanAir LAX-OGG-LAX using hard earned miles and was given lousy service, faulty seats on both legs and damaged bag. Your customer care',\n",
" '@AmericanAir they were no where to be found at Midnight Last Night! I would think the agent in LAX could have relayed that info.Bag on flt',\n",
" \"@AmericanAir yes it is in Dulles and I need it delivered to the Embassy Suites in Herndon, VA. I'm still in Chicago from the fiasco in LAX\",\n",
" \"@AmericanAir another generic response guys? Cmon. You're terrible. How about an actual helpful person? Not all the rude employees at LAX\",\n",
" '@AmericanAir Cancelled Flightled flight from fresno then rebooked for LAX now flight Cancelled Flightled again and its midnight with no more hotel available???',\n",
" '@AmericanAir If SNA curfew causes diversion, do you provide transportation from LAX? On AA1237 now, pilot not sure if we have time.',\n",
" \"@AmericanAir - you broke my sick wife's luggage handle going from JFK to LAX...she had to drag her bag thru the airport! #customerservice\",\n",
" '@AmericanAir originating at SFO and going to LAX.',\n",
" \"@AmericanAir because your plane's toilet wasn't working and they needed gas. This is flight 1081 leaving Dulles going to LAX. Do some...\",\n",
" \"@AmericanAir it's always nice coming home but I wish you'd fly LAX-MAD and keep me away from Iberia 😜✈️ #GoingForGreat\",\n",
" '@AmericanAir SJC-&gt;LAX. After the fourth time, I gave up!',\n",
" '@AmericanAir the most stressful morning and still had to pay to check a bag. LAX is a madhouse with a lot of angry customers. Yikes',\n",
" '@AmericanAir thanks for the DM rescheduling. Unfortunately your operations process at LAX is chaos &amp; the reps refused to print the ticket',\n",
" \"@AmericanAir at LAX &amp; just got off the phone w/reservations. Every flight that'd get me to BOS before 11 am tmrw is apparently unavailable 😐\",\n",
" \"@AmericanAir at LAX and your service reps just hand out the 800 number to call. So that's not helpful.\",\n",
" '@AmericanAir Hi. I have KOA-LAX-PHL-ORD booked as a 1-way savr awrd. If I called to chnge it to KOA-LAX-PHX-ORD would I have to pay any fees',\n",
" \"@AmericanAir it was my friend. She shouldn't have been scheduled so close together at LAX then not reimbursed. It's costing her 12 hrs &amp;$250\",\n",
" \"@AmericanAir when will tomorrow's flight Cancelled Flightlations at Dfw for AA flights be posted? We are on 2424 at 7am from LAX!\",\n",
" \"@AmericanAir usually raving about the service to LAX. Your nbr1 and helper can't figure out how to hang a coat and serve a drink. 6F\",\n",
" '@AmericanAir at JFK flight to Boston delayed 20 min waiting for catering they are boarding flight to LAX waiting for over and hour WTF?',\n",
" '@AmericanAir Flight AA1691 LAX to LAS closes too early and gate agents give us hassles #PatheticCX',\n",
" '@AmericanAir please help my flight appears Cancelled Flight 1503 from ft. Lauderdale to Dallas to LAX. Is there anything leaving from Mia or FLL? ThxU',\n",
" '@AmericanAir no one met flight 1081 to LAX to tell passengers where to go or what flights they were rebooked on #badmgmt #AmericanAirlines',\n",
" \"@AmericanAir flight 1081 from IAD to LAX sat for more than 3hrs because ground crew couldn't drive in snow #badmgmt #AmericanAirlines\",\n",
" '@AmericanAir @dotnetnate Two hour wait for EXPs as I sit on a JFK PHX flight because US computers are down. Any shot at an LAX flight?',\n",
" '@AmericanAir flight Cancelled Flighted out of LAX for tomorrow due to connection in DFW. Help please? we can go out of orange or burbank',\n",
" \"@AmericanAir I've been trying to change frm AA 2401 to LAX at 6:50am MONDAY morning then AA 2586 from LAX to FAT to flight AA 1359?#helpAA\",\n",
" '@AmericanAir a friend is having flight Cancelled Flightlations out of LAX to CMH on Feb 23. Anyway to help her? 800 number has been no help',\n",
" '@AmericanAir Love the new planes for the JFK-LAX run. Maybe one day I will be on one where the amenities all function. #NoCharge #Ever']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import datetime\n",
"\n",
"def find(phrase, tweets):\n",
" #Naive full scan approach\n",
" \n",
" start = datetime.datetime.now()\n",
" \n",
" rtn = []\n",
" \n",
" for t in tweets:\n",
" if phrase in t:\n",
" rtn.append(t)\n",
" \n",
" \n",
" print('Find() elapsed time: ', (datetime.datetime.now()-start).total_seconds())\n",
" \n",
" return rtn\n",
"\n",
"\n",
"find('choppy landing', tweets)\n",
"\n",
"find('LAX', tweets)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's pretty fast (about 3 ms!) But imagine if you had to run a million such lookups: that would be 3000 seconds! At scale, small overheads add up. \n",
"\n",
"Now, we use our \"inverted indexing\" trick to make such searches faster.\n",
"\n",
"## Inverted Index\n",
"Next, we will try to do the same search with an inverted index. The indexing structure that we will use is a python dictionary."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"build_index() elapsed time: 0.220968\n"
]
}
],
"source": [
"import string \n",
"\n",
"def build_index(tweets):\n",
" start = datetime.datetime.now()\n",
" \n",
" index = {}\n",
" \n",
" #some code to deal with punctuation\n",
" table = str.maketrans('', '', string.punctuation)\n",
"\n",
" for i,t in enumerate(tweets):\n",
" \n",
" words = t.translate(table).split() \n",
" \n",
" for w in words:\n",
" \n",
" if w not in index:\n",
" index[w] = set()\n",
" \n",
" index[w].add(i) #add a pointer to the relevant tweet\n",
" \n",
" print('build_index() elapsed time: ', (datetime.datetime.now()-start).total_seconds())\n",
" \n",
" return index\n",
"\n",
"index = build_index(tweets)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that build_index is about 100x slower than a single query. What does this mean? Basically, indexing is only valuable if you run a lot of queries! \n",
"\n",
"The next challenge is how to use an inverted index to answer general substring queries. In class, we showed how to do exact keyword lookups, but the phrase 'choppy landing' is actually two words. This is not a problem: we can use the inverted index to retrieve a set of candidates and then run the naive find method over just those candidates.\n",
"\n",
"So, let's write a new find function that can use this index:\n",
"* It splits the phrase into its constituent words\n",
"* Searches each word in the inverted index, finds a set of possibly relevant tweets (that match on a single word)\n",
"* Then double checks that set."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Find() elapsed time: 4e-06\n",
"Find() elapsed time: 2.9e-05\n"
]
},
{
"data": {
"text/plain": [
"[\"@united is doing musicians real dirty at LAX. I've never been blocked from getting on a flight with my bass.\",\n",
" \"@United I'm hoping we don't miss our LAX - ITO connection. Not looking forward to being stuck at LAX overnight with our team....AGAIN!\",\n",
" \"@united I forgot that Intl flights out of LAX don't go from Intl Terminal! Easiest re-check in ever! woo!\",\n",
" \"@united - you sure missed the mark on tonight's redeye from LAX to Chicago. What a mess! You can do better!\",\n",
" \"@VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo\",\n",
" '@VirginAmerica LAX to EWR - Middle seat on a red eye. Such a noob maneuver. #sendambien #andchexmix',\n",
" '@VirginAmerica help, left expensive headphones on flight 89 IAD to LAX today. Seat 2A. No one answering L&amp;F number at LAX!',\n",
" \"@USAirways I have been doing that all day. Can't find my bag anywhere bc they're saying it was never scanned &amp; technically never left LAX.\",\n",
" '@JetBlue received horrible customer service at LAX on 2/11. Reservation Cancelled Flighted without notification, despite having confirmation number.',\n",
" '@united #LAX #sunrise UAL212 LAX-JFK']"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def find_index(phrase, tweets, index):\n",
" start = datetime.datetime.now()\n",
" \n",
" words = phrase.split()\n",
" \n",
" #find tweets that contain all words\n",
" candidates = None\n",
" for w in words: #for each words in the phrase\n",
" try:\n",
" \n",
" if candidates is None:\n",
" candidates = index[w] #return the set of tweets for w\n",
" else:\n",
" candidates = candidates.intersection(index[w])\n",
" \n",
" except KeyError:\n",
" return []\n",
" \n",
" candidate_tweets = [tweets[ref] for ref in candidates]\n",
" print('find_index() elapsed time: ', (datetime.datetime.now()-start).total_seconds())\n",
" \n",
" return find(phrase, candidate_tweets)\n",
"\n",
"find_index('choppy landing', tweets, index)\n",
"find_index('LAX', tweets, index)[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In essence, you are paying a small upfront cost for greatly improved find performance (nearly 1000x faster!). Speed is only one aspect of search engine performance, though. We would also like to support situations where a user mistypes a phrase. For example, if we mistype choppy landing:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"find_index('chopy landing', tweets, index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our system returns nothing. Can we write a fast suggestion utility that quickly identifies typos?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Did you mean? \n",
"So now we are going to write a utility that can identify misspellings and typos and suggest potential alternatives. Let's start off with a naive approach that simply finds the closest word in the index in terms of edit distance:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import distance\n",
"\n",
"#distance.jaccard('a b', 'b c')\n",
"#distance.levenshtein('a b','b c') #also called \"edit distance\""
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"did_you_mean_naive() elapsed time: 0.992237\n"
]
},
{
"data": {
"text/plain": [
"'choppy'"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def did_you_mean_naive(word, index):\n",
" start = datetime.datetime.now()\n",
" \n",
" if word in index:\n",
" return word\n",
" \n",
" else:\n",
" \n",
" distances = [(distance.levenshtein(word, iw), iw) for iw in index]\n",
" distances.sort()\n",
" \n",
" print('did_you_mean_naive() elapsed time: ', (datetime.datetime.now()-start).total_seconds())\n",
" \n",
" return distances[0][1]\n",
" \n",
" \n",
"\n",
"did_you_mean_naive('chopy', index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The suggestion utility runs much slower than the actual query!!! How do we fix this? We can use the same trick as before: a fast algorithm to find reasonable candidates and a slower algorithm to refine those candidates.\n",
"\n",
"In fact, we will use an inverted index again, just this time over sub-sequences of letters rather than words. The first thing that we are going to do is calculate n-grams: these are contiguous sub-sequences of letters."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('d', 'a'), ('a', 'v'), ('v', 'e')]"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#ngram\n",
"def find_ngrams(word, n):\n",
" return list(zip(*[word[i:] for i in range(n)]))\n",
"\n",
"find_ngrams('dave', 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we are going to build a \"word\" index, an indexing structure that maps ngrams to words that contain them."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"build_word_index() elapsed time: 0.195927\n"
]
}
],
"source": [
"def build_word_index(index, n):\n",
" start = datetime.datetime.now()\n",
" \n",
" word_index = {}\n",
" \n",
" for word in index:\n",
" ngrams = find_ngrams(word, n)\n",
" \n",
" for subseq in ngrams:\n",
" \n",
" if subseq not in word_index:\n",
" word_index[subseq] = set()\n",
" \n",
" word_index[subseq].add(word) #add a pointer to the relevant word\n",
" \n",
" print('build_word_index() elapsed time: ', (datetime.datetime.now()-start).total_seconds())\n",
" \n",
" return word_index\n",
"\n",
"word_index = build_word_index(index, 3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use this word index to build a more sophisticated search:\n",
"* Only consider words that share a minimum number of ngrams with the lookup word"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"did_you_mean_better() elapsed time: 0.003581\n"
]
},
{
"data": {
"text/plain": [
"'choppy'"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def did_you_mean_better(word, word_index, n, thresh=1):\n",
" start = datetime.datetime.now()\n",
" \n",
" candidate_words = {}\n",
" ngrams = find_ngrams(word, n)\n",
" \n",
" for ngram in ngrams:\n",
" candidates = word_index.get(ngram, set())\n",
" \n",
" for candidate in candidates:\n",
" candidate_words[candidate] = candidate_words.get(candidate,0) + 1\n",
    "    \n",
    "    # keep only candidates that share at least `thresh` ngrams with the query word\n",
    "    distances = [(distance.levenshtein(word, iw), iw) for iw in candidate_words if candidate_words[iw] >= thresh]\n",
    "    distances.sort()\n",
    "    \n",
    "    print('did_you_mean_better() elapsed time: ', (datetime.datetime.now()-start).total_seconds())\n",
    "    \n",
    "    if not distances: # no candidate shared enough ngrams; fall back to the query itself\n",
    "        return word\n",
    "    \n",
    "    return distances[0][1]\n",
" \n",
"\n",
"did_you_mean_better('chopy', word_index, 3)"
]
},
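{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``thresh`` parameter controls how aggressively candidates are pruned: a word is only considered if it shares at least ``thresh`` ngrams with the query, so raising it speeds up the Levenshtein step but risks discarding the true correction. For example (the output depends on the corpus):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# with a higher threshold, fewer candidates survive to the edit-distance step\n",
"did_you_mean_better('chopy', word_index, 3, thresh=2)"
]
},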
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how much faster this approach is: 0.992237 seconds vs. 0.003581 seconds."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Putting it all together\n",
"\n",
"Now, let's write the full program and try out some queries."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Searching for...choppy landing in 14640 tweets\n",
"Find() elapsed time: 8e-06\n",
"Found 1 matches\n",
"['@VirginAmerica pilot says we expect a choppy landing in NYC due to some gusty winds w/a temperature of about 5 degrees &amp; w/the windchill -8']\n"
]
}
],
"source": [
"def find_final(phrase,\n",
"               tweets,\n",
"               index,\n",
"               word_index,\n",
"               n=3,\n",
"               thresh=1):\n",
" print('Searching for...' + phrase + \" in \" + str(len(tweets)) + \" tweets\")\n",
" out = find_index(phrase, tweets, index)\n",
" print('Found ' + str(len(out)) + ' matches')\n",
" \n",
" if len(out) == 0:\n",
" for word in phrase.split():\n",
" if word not in index:\n",
" print('Did you mean: ' + did_you_mean_better(word, word_index, n, thresh) + ' instead of ' + word + '?')\n",
" else:\n",
" print(out)\n",
"\n",
"find_final('choppy landing', tweets, index, word_index)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Searching for...chopy landing in 14640 tweets\n",
"Found 0 matches\n",
"did_you_mean_better() elapsed time: 0.002837\n",
"Did you mean: choppy instead of chopy?\n"
]
}
],
"source": [
"find_final('chopy landing', tweets, index, word_index)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Searching for...choppy landig in 14640 tweets\n",
"Found 0 matches\n",
"did_you_mean_better() elapsed time: 0.032908\n",
"Did you mean: landing instead of landig?\n"
]
}
],
"source": [
"find_final('choppy landig', tweets, index, word_index)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Searching for...LAX in 14640 tweets\n",
"Find() elapsed time: 2.8e-05\n",
"Found 95 matches\n",
"[\"@united is doing musicians real dirty at LAX. I've never been blocked from getting on a flight with my bass.\", \"@United I'm hoping we don't miss our LAX - ITO connection. Not looking forward to being stuck at LAX overnight with our team....AGAIN!\", \"@united I forgot that Intl flights out of LAX don't go from Intl Terminal! Easiest re-check in ever! woo!\", \"@united - you sure missed the mark on tonight's redeye from LAX to Chicago. What a mess! You can do better!\", \"@VirginAmerica So excited for my first cross country flight LAX to MCO I've heard nothing but great things about Virgin America. #29DaysToGo\", '@VirginAmerica LAX to EWR - Middle seat on a red eye. Such a noob maneuver. #sendambien #andchexmix', '@VirginAmerica help, left expensive headphones on flight 89 IAD to LAX today. Seat 2A. No one answering L&amp;F number at LAX!', \"@USAirways I have been doing that all day. Can't find my bag anywhere bc they're saying it was never scanned &amp; technically never left LAX.\", '@JetBlue received horrible customer service at LAX on 2/11. Reservation Cancelled Flighted without notification, despite having confirmation number.', '@united #LAX #sunrise UAL212 LAX-JFK', '@united our travel booked thru United group dept. Okc ticket agent less than willing to help with our connection in LAX.', '@USAirways Another great flight #FunFlightAttendants. Thanks for showing my dad wonderful customer service. #flt635 #LAX #PHX #SundayFunday', '@SouthwestAir another great trip! LAX 823 - LAS 3075- BNA. Thanks so much!!!', '@united Hi, Im flying SFO-LAX-SAL-CLO. My connecting time in LAX is 1h45m. Is it enough time? Do I have to collect my bag and recheck on AV?', \"@AmericanAir at LAX &amp; just got off the phone w/reservations. Every flight that'd get me to BOS before 11 am tmrw is apparently unavailable 😐\", '@AmericanAir originating at SFO and going to LAX.', '@VirginAmerica just landed in LAX, an hour after I should of been here. 
Your no Late Flight bag check is not business travel friendly #nomorevirgin', '@USAirways your staff at LAX really messed up on this one. Failing to scan my suitcase tag.', '@VirginAmerica trying to add my boy Prince to my ressie. SF this Thursday @VirginAmerica from LAX http://t.co/GsB2J3c4gM', \"@AmericanAir at LAX and your service reps just hand out the 800 number to call. So that's not helpful.\", '@united they did on a delta flight out of LAX which is why I should be compensated for my rental car there.', '@USAirways honest question - how is a 1-way ticket from Charlotte (your hub) to LAX (2nd biggest city in USA) almost $600?????', '@VirginAmerica Can you find us a flt out of LAX that is sooner than midnight on Monday? That would be great customer service 😃', '@AmericanAir flight Cancelled Flighted out of LAX for tomorrow due to connection in DFW. Help please? we can go out of orange or burbank', '@united you should tell that to the staff at LAX then. I boarded group 5 at the very end of the queue as a Gold member. Thanks for nothing.', '@SouthwestAir appreciate the reply, hopefully those LAX agents get the memo. Cheers!', '@SouthwestAir @ LAX is almost a mess. For some reason the express bag drop is slower than the full service line. http://t.co/ORY89eEGek', \"@AmericanAir I've been trying to change frm AA 2401 to LAX at 6:50am MONDAY morning then AA 2586 from LAX to FAT to flight AA 1359?#helpAA\", '@AmericanAir - Please find my bag!! In Singapore for three days already without my bag. Last known destination LAX Tag: 580815 Please help.', '@AmericanAir they were no where to be found at Midnight Last Night! I would think the agent in LAX could have relayed that info.Bag on flt', \"@united of course I need help. I've been DMing you ladies and gents all day. Your only solution is hope for the best and LAX.\", '@usairways @AmericanAir LAX connect from term 6 to term 4. 55min layover due to delay. US755-AA2595. Is this realistic? 
Can 2595 be held?', \"@AmericanAir yes it is in Dulles and I need it delivered to the Embassy Suites in Herndon, VA. I'm still in Chicago from the fiasco in LAX\", \"@AmericanAir usually raving about the service to LAX. Your nbr1 and helper can't figure out how to hang a coat and serve a drink. 6F\", \"@USAirways doesn't seem likely bc your team failed to scan my bag in LAX and you recycle bagtag numbers that doesn't help #usairwaysfail\", '@USAirways Everyone on Flight 669 from LAX to RDU enjoyed waiting an hour &amp; a half in baggage claim for their bags just now', '@VirginAmerica nervous about my flight from DC to LAX getting Cancelled Flightled tomorrow! Just sent you a DM to help me!', '@united Flight has been delayed for another hour so only have 24 mins to transit at LAX... Extremely unlikely I will make it!', '@United is offering to reroute my SFO flight to LAX. Might be geography class time.', '@AmericanAir None of the #LAX flights into #DFW have been Cancelled Flightled. Those landing before and after ours are fine. Completely arbitrary.', \"@AmericanAir another generic response guys? Cmon. You're terrible. How about an actual helpful person? Not all the rude employees at LAX\", '@jetblue rqstd upgrade to mint at LAX and was told no because i used points! Why turn down $1600 bcz I used points? #trueblue', '@united care less about the person - although he walked away while I was complaining. A man at 10p at LAX club. More.....', \"@USAirways Hey! I booked a flight (Isabelle Gramp, Boston to LAX), and it said that it charged my credit card but the transaction didn't go\", '@AmericanAir at JFK flight to Boston delayed 20 min waiting for catering they are boarding flight to LAX waiting for over and hour WTF?', \"@united I missed my connection already. Then I missed the next flight they put me on. Now I'm going to LAX instead of Hawaii. 
:(\", '@AmericanAir Cancelled Flightled flight from fresno then rebooked for LAX now flight Cancelled Flightled again and its midnight with no more hotel available???', '@SouthwestAir any idea if there will be any \"spring sales\" soon for travel from Late Flight August to early September? Going from PA to LAX.', '@AmericanAir If SNA curfew causes diversion, do you provide transportation from LAX? On AA1237 now, pilot not sure if we have time.', '@AmericanAir a friend is having flight Cancelled Flightlations out of LAX to CMH on Feb 23. Anyway to help her? 800 number has been no help', '@united Old school ride home to LAX from Houston #flyingRetro http://t.co/6asuwx3Kv0', '@VirginAmerica @VirginAtlantic I have just checked in flight to SFO from LAX &amp; been told as Atlantic Flying Club Gold I get no benefits?!', \"@VirginAmerica lost my luggage 4 days ago on flight VX 112 from LAX to IAD &amp; I'm calling every day, no response.Please give me back my stuff\", '@united Maybe be hiring your own ground staff at LAX when multiple gate agents tell you your baggage is loaded you expect it to be. HOPELESS', '@AmericanAir - Please find my bag!! In Singapore for three days already without my bag. Last known destination LAX Tag: 580815 Please help.', '@SouthwestAir think flight 1945 from BNA to LAX will get off the ground tomorrow??? #please #snowbama', '@united There is only one club at LAX - in terminal 7 across from gate 71', \"@united I think this is the best first class I have ever gotten!! Denver to LAX and it's wonderful!!!\", '@united Joni did a great job on flight 5653 to LAX. Thanks for a great flight.', '@united SF crew lack a lot of customer service, LAX employees are a lot better. Wonder why...', '@united are my bags here yet? They were at Palm Springs airport. I was at LAX. How come I beat my bags here.', '@AmericanAir Flight AA1691 LAX to LAS closes too early and gate agents give us hassles #PatheticCX', \"@united Where are my bags!!! 
They weren't in LAX like your promised. 9 out of 10 things today were a mess today because of you.\", \"@united UA flight 1247. SFO to LAX took my carry on at gate. I'm Group 2, overhead bins are empty\", '@AmericanAir None of the #LAX flights into #DFW have been Cancelled Flightled. Those landing before and after ours are fine. Completely arbitrary.', '@united maybemange the airline alittlebetter. Arrived at LAX and no GATE! #howisthatpossible always the same thing w/u', '@united Ugh. My bags were sent to Palm Springs and not to LAX as promised. They better be at the hotel when I get there.', \"@JetBlue I'm #MakingLoveOutofNothingAtAll on my #brandloveaffair to #LAX https://t.co/kdHRUF54sW\", '@united really fucked my day up Hilo to LAX 2hr30min delay because of software? missed connection, getting home 8hrs Late Flightr no upgrade nothin', \"@AmericanAir because your plane's toilet wasn't working and they needed gas. This is flight 1081 leaving Dulles going to LAX. Do some...\", '@SouthwestAir last week I flew from DAL to LAX. You got us in almost an hour early. Thank You.', \"@JetBlue gr8 #Mint crew on #flight 123 to #LAX they're #Mintalicious #TrueBlueLove #ShelleyandMarcRock #travel #air\", \"@AmericanAir it was my friend. She shouldn't have been scheduled so close together at LAX then not reimbursed. It's costing her 12 hrs &amp;$250\", \"@AmericanAir when will tomorrow's flight Cancelled Flightlations at Dfw for AA flights be posted? We are on 2424 at 7am from LAX!\", '@united I took the exact same aircraft in to LAX 3 days ago. It fit, no problem. The agent today told some nonsense about a policy change', '@USAirways I will be traveling from LAX to CLT to HTS, I have been rebooked for tomorrow due to the travel advisory.', '@AmericanAir please help my flight appears Cancelled Flight 1503 from ft. Lauderdale to Dallas to LAX. Is there anything leaving from Mia or FLL? ThxU', \"@USAirways what's happening with 1217 Phl to LAX? Now 3 hr delay. 
Poor communication!\", '@united the person is currently bettween gates 71A and 73 in LAX', '@VirginAmerica Thanks for making my flight from LAX to JFK a nightmare by forcing me to check my carry on bag at the gate. (1)', '@united I need the phone number to baggage claim in LAX, my mom left her phone and someone called saying they would put it there but on', '@SouthwestAir if you are giving tix to #DestinationDragons show would appreciate one or two for LA😄Flying from PHL to LAX on Friday', '@AmericanAir no one met flight 1081 to LAX to tell passengers where to go or what flights they were rebooked on #badmgmt #AmericanAirlines', '@southwestair Amazing view on the approach to LAX tonight. http://t.co/a68d5fULmH', \"@AmericanAir flight 1081 from IAD to LAX sat for more than 3hrs because ground crew couldn't drive in snow #badmgmt #AmericanAirlines\", '@SouthwestAir So I am flying Chicago-LAX-PHX just to go spotting at LAX and PHX airports, then I am flying back to Chicago :)', '@SouthwestAir Flight 1700. (PHX TO LAX) Wheels stop. Glad to be home! Thanks to the professionals both up front and in the cabin!!!', '@AmericanAir @dotnetnate Two hour wait for EXPs as I sit on a JFK PHX flight because US computers are down. Any shot at an LAX flight?', '@AmericanAir Yes I am. 2495/1170. RNO departure at 1229 on 2/25 w/connection at DFW to LGA. I can do the 1120am to LAX and then to JFK', '@AmericanAir the most stressful morning and still had to pay to check a bag. LAX is a madhouse with a lot of angry customers. Yikes', '@JetBlue service for baggage at JFK is incomprehensible. 
No employee knows where our luggage is frM flight 424 LAX to NYC ITS AN HR WAIT NOW', '@VirginAmerica Flying LAX to SFO and after looking at the awesome movie lineup I actually wish I was on a long haul.', '@united now arrives LAX @ 8:03 am', '@united I do I was on UA 495 LAX TO DEN - we are scheduled to land LAX @ 7:38 am - please rebook to Denver - best flight', '@AmericanAir thanks for the DM rescheduling. Unfortunately your operations process at LAX is chaos &amp; the reps refused to print the ticket']\n"
]
}
],
"source": [
"find_final('LAX', tweets, index, word_index)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Searching for...LAXS in 14640 tweets\n",
"Found 0 matches\n",
"did_you_mean_better() elapsed time: 0.001225\n",
"Did you mean: LAX instead of LAXS?\n"
]
}
],
"source": [
"find_final('LAXS', tweets, index, word_index)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}