Commit 8f3bf250 by Sanjay Krishnan

Added the extra credit assignment

parent b1ecb073
Showing with 165 additions and 0 deletions
# String Matching Extra Credit
*Due 6/7/19 11:59 PM*
Entity resolution is the task of identifying records or mentions that refer to the same real-world entity and linking or grouping them. For example, the same person may be addressed in different ways in text, a business may be listed under different addresses, or several photos may show the same object. In this extra credit assignment, you will link two product catalogs.
## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Then, copy the `ec` folder to your submission repository and change directories to enter your submission repository. Your code will go into `analyze.py`. You can add the files to the repository using `git add`:
```
$ git add analyze.py
$ git commit -m'initialized homework'
```
You will also need to fetch the datasets used in this homework assignment:
```
https://www.dropbox.com/s/vq5dyl5hwfhbw98/Amazon.csv?dl=0
https://www.dropbox.com/s/fbys7cqnbl3ch1s/Amzon_GoogleProducts_perfectMapping.csv?dl=0
https://www.dropbox.com/s/o6rqmscmv38rn1v/GoogleProducts.csv?dl=0
```
Download each of the files and put it into your `ec` folder.
Before we can get started, let us understand the main APIs in this project. We have provided a file named `core.py` for you. This file loads and processes the data that you've just downloaded. For example, you can load the Amazon catalog with the `amazon_catalog()` function. This returns an iterator over data tuples in the Amazon catalog. The fields are id, title, description, mfg (manufacturer), and price if any:
```
>>> for a in amazon_catalog():
...     print(a)
...     break
{'id': 'b000jz4hqo', 'title': 'clickart 950 000 - premier image pack (dvd-rom)', 'description': '', 'mfg': 'broderbund', 'price': '0'}
```
You can do the same for the Google catalog:
```
>>> for a in google_catalog():
...     print(a)
...     break
{'id': 'http://www.google.com/base/feeds/snippets/11125907881740407428', 'title': 'learning quickbooks 2007', 'description': 'learning quickbooks 2007', 'mfg': 'intuit', 'price': '38.99'}
```
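Each call to `amazon_catalog()` or `google_catalog()` returns an object that re-reads its CSV file every time you loop over it, so it can help to load the records into memory once. A minimal sketch, assuming records shaped like the dictionaries printed above; `index_by_id` is a hypothetical helper that is not part of `core.py`, and the sample list stands in for a real catalog:

```python
def index_by_id(records):
    """Build a dict mapping each record's 'id' to the full record."""
    return {r['id']: r for r in records}

# Sample record copied from the output above. In the assignment you would
# pass amazon_catalog() or google_catalog() instead of this list.
sample = [
    {'id': 'b000jz4hqo',
     'title': 'clickart 950 000 - premier image pack (dvd-rom)',
     'description': '', 'mfg': 'broderbund', 'price': '0'},
]

by_id = index_by_id(sample)
print(by_id['b000jz4hqo']['mfg'])
```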
A matching is a pairing between IDs in the Amazon catalog and IDs in the Google catalog that refer to the same product. The ground truth is listed in the file `Amzon_GoogleProducts_perfectMapping.csv`. Your job is to construct a list of pairs (or an iterator of pairs) of the form `(amazon.id, google.id)`. These matchings can be evaluated for accuracy using the `eval_matching` function:
```
>>> my_matching = [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'),...]
>>> eval_matching(my_matching)
{'false positive': 0.9768566493955095, 'false negative': 0.43351268255188313, 'accuracy': 0.04446992095577143}
```
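As computed in `core.py`, `false positive` is one minus the precision (the fraction of your proposed pairs that are not in the ground truth), `false negative` is one minus the recall (the fraction of ground-truth pairs that you missed), and `accuracy` is the harmonic mean of precision and recall (the F1 score).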
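To see how the three reported numbers relate, here is a small sketch that mirrors the arithmetic of `eval_matching` in `core.py` on a toy example. The function name `score_matching` and the IDs `a1`, `g1`, and so on are made up for illustration:

```python
def score_matching(proposed, truth):
    """Return (precision, recall, f1) for a proposed collection of pairs."""
    proposed, truth = set(proposed), set(truth)
    tp = len(proposed & truth)  # proposed pairs that are true matches
    precision = tp / len(proposed) if proposed else 0.0
    recall = tp / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy ground truth with four true pairs; three pairs proposed, two correct.
truth = {('a1', 'g1'), ('a2', 'g2'), ('a3', 'g3'), ('a4', 'g4')}
proposed = [('a1', 'g1'), ('a2', 'g2'), ('a3', 'g9')]

precision, recall, f1 = score_matching(proposed, truth)
print({'false positive': 1 - precision,
       'false negative': 1 - recall,
       'accuracy': f1})
```

Here two of the three proposed pairs are correct (precision 2/3) and two of the four true pairs are found (recall 1/2), so the reported `accuracy` is 4/7, roughly 0.57.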
## Assignment
Your job is to write the `match` function in `analyze.py`. You can run your code with:
```
python3 analyze.py
```
Running the code will print out a result report as follows:
```
----Accuracy----
{'false positive': 0.690576652601969, 'false negative': 0.4926979246733282, 'accuracy': 0.38439138031450204}
---- Timing ----
114.487954 seconds
```
*For full extra credit, you must write a program that achieves at least 35% accuracy in less than 3 minutes on a standard laptop.*
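One possible starting point, offered only as a sketch and not as the intended solution, is to compare product titles by the Jaccard similarity of their word sets and to use an inverted index so that only records sharing at least one token are compared. The helper names (`tokens`, `jaccard`, `match_by_title`), the 0.5 threshold, and the toy records are all assumptions made for illustration; nothing here is tuned or known to reach the 35% target.

```python
import re
from collections import defaultdict

def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r'[a-z0-9]+', text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_by_title(amazon, google, threshold=0.5):
    """Pair records whose title token sets have Jaccard >= threshold."""
    # Inverted index: token -> ids of Google records whose title contains it.
    google_tokens = {g['id']: tokens(g['title']) for g in google}
    index = defaultdict(set)
    for gid, toks in google_tokens.items():
        for t in toks:
            index[t].add(gid)

    pairs = []
    for a in amazon:
        a_toks = tokens(a['title'])
        # Only compare against Google records that share at least one token.
        candidates = set()
        for t in a_toks:
            candidates |= index.get(t, set())
        for gid in candidates:
            if jaccard(a_toks, google_tokens[gid]) >= threshold:
                pairs.append((a['id'], gid))
    return pairs

# Toy records with made-up ids; real records come from the two catalogs.
amazon = [{'id': 'a1', 'title': 'Learning QuickBooks 2007 (CD-ROM)'},
          {'id': 'a2', 'title': 'clickart 950 000 - premier image pack (dvd-rom)'}]
google = [{'id': 'g1', 'title': 'learning quickbooks 2007'},
          {'id': 'g2', 'title': 'photo editing suite'}]
print(match_by_title(amazon, google))
```

The index avoids scoring every Amazon record against every Google record, which matters for the time limit. Very common tokens still produce many candidates, so filtering them out is a natural next experiment.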
## Submission
After you finish the assignment you can submit your code with:
```
$ git push
```
from core import *
import datetime

def match():
    '''
    Match must return a list of tuples of amazon ids and google ids.
    For example:
    [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'),....]
    '''
    #YOUR CODE GOES HERE
    return []

#prints out the accuracy
now = datetime.datetime.now()
out = eval_matching(match())
timing = (datetime.datetime.now()-now).total_seconds()
print("----Accuracy----")
print(out)
print("---- Timing ----")
print(timing,"seconds")
'''
The core module sets up the data structures
and references for this programming assignment.
2010
'''
import platform
import csv

if platform.system() == 'Windows':
    print("This assignment will not work on a windows computer")
    exit()
#defines an iterator over the rows of a catalog CSV file
class Catalog():
    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # Re-open the file on every pass and skip the header row.
        f = open(self.filename, 'r', encoding = "ISO-8859-1")
        self.reader = csv.reader(f, delimiter=',', quotechar='"')
        next(self.reader)
        return self

    def __next__(self):
        row = next(self.reader)
        return {'id': row[0],
                'title': row[1],
                'description': row[2],
                'mfg': row[3],
                'price': row[4]
               }

def google_catalog():
    return Catalog('GoogleProducts.csv')

def amazon_catalog():
    return Catalog('Amazon.csv')
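def eval_matching(matching):
    matches = set()
    proposed_matches = set()
    tp = set()
    fp = set()
    fn = set()

    # Load the ground-truth (amazon id, google id) pairs.
    with open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding = "ISO-8859-1") as f:
        reader = csv.reader(f, delimiter=',', quotechar='"')
        for row in reader:
            matches.add((row[0],row[1]))

    # Split the proposed pairs into true positives and false positives.
    for m in matching:
        proposed_matches.add(m)
        if m in matches:
            tp.add(m)
        else:
            fp.add(m)

    # Ground-truth pairs that were never proposed are false negatives.
    for m in matches:
        if m not in proposed_matches:
            fn.add(m)

    # Guard against division by zero, e.g. when match() returns an empty list.
    prec = len(tp)/(len(tp) + len(fp)) if (tp or fp) else 0.0
    rec = len(tp)/(len(tp) + len(fn)) if (tp or fn) else 0.0
    f1 = 2*(prec*rec)/(prec+rec) if (prec + rec) > 0 else 0.0
    return {'false positive': 1-prec,
            'false negative': 1-rec,
            'accuracy': f1 }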