Commit e8fc68ed by Sanjay Krishnan

Updated the linkage assignment

parent 111ea51a
Showing with 29 additions and 13 deletions
# HW3 String Matching
*Due 5/6/20 11:59 PM*
*Due 5/14/20 11:59 PM*
Entity Resolution is the task of disambiguating manifestations of real-world entities in various records or mentions by linking and grouping them. For example, there could be different ways of addressing the same person in text, different addresses for the same business, or different photos of a particular object. In this extra credit assignment, you will link two product catalogs.
## Getting Started
......@@ -49,15 +49,23 @@ Your job is to write the `match` function in `analyze.py`. You can run your code by
```
python3 auto_grader.py
```
Running the code will print out a result report as follows:
Running the code will print out a result report as follows (accuracy, precision, and recall):
```
----Accuracy----
{'false positive': 0.690576652601969, 'false negative': 0.4926979246733282, 'accuracy': 0.38439138031450204}
0.5088062622309197 0.6998654104979811 0.3996925441967717
---- Timing ----
114.487954 seconds
168.670348 seconds
```
*For full extra credit, you must write a program that achieves at least 40% accuracy in less than 3 mins on a standard laptop.*
*For full extra credit, you must write a program that achieves at least 50% accuracy in less than 5 mins on a standard laptop.*
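Note that the "accuracy" figure reported by the grader is computed as `2*(prec*rec)/(prec+rec)`, i.e. the harmonic mean of precision and recall (commonly called the F1 score). The sketch below reproduces that arithmetic for the sample report above. The function name `f1_score` and the zero-denominator guard are assumptions added for illustration and are not part of `auto_grader.py`.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall.

    Mirrors the grader's formula 2*(prec*rec)/(prec+rec).
    The guard for precision + recall == 0 is an added assumption;
    the grader itself does not include it.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# Precision and recall taken from the sample report above.
sample_precision = 0.6998654104979811
sample_recall = 0.3996925441967717
print(f1_score(sample_precision, sample_recall))
```

With the sample precision and recall, this comes out at roughly 0.509, matching the first number in the report. Because the score is a harmonic mean, raising only one of precision or recall has limited effect; both need to improve to clear the 50% threshold.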
The project is completely unstructured and it is up to you to figure out how to make this happen. Here are some hints:
* The Amazon product database is redundant (the same product can appear multiple times), while the Google database is essentially unique.
* Jaccard similarity will be useful but you may have to consider "n-grams" of words (look at the lecture notes!) and "cleaning" up the strings to strip formatting and punctuation.
* Price and manufacturer will also be important attributes to use.
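As a rough illustration of the Jaccard hint, here is a minimal sketch of Jaccard similarity over word n-grams with basic string cleaning. The helper names (`tokens`, `word_ngrams`, `jaccard`), the choice of bigrams, and the cleaning rule are all assumptions made for this example, not a prescribed solution.

```python
import re


def tokens(text):
    """Lowercase the string, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def word_ngrams(text, n=2):
    """Return the set of word n-grams (as tuples) in the cleaned string."""
    toks = tokens(text)
    if len(toks) < n:
        # Too short for a full n-gram: fall back to the whole token tuple.
        return {tuple(toks)} if toks else set()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(a, b, n=2):
    """Jaccard similarity |A & B| / |A | B| of the two n-gram sets."""
    sa, sb = word_ngrams(a, n), word_ngrams(b, n)
    if not sa and not sb:
        return 0.0  # assumption: two empty strings are treated as non-matching
    return len(sa & sb) / len(sa | sb)


# Hypothetical product titles, for illustration only.
print(jaccard("Adobe Photoshop CS3!", "adobe photoshop cs3"))
print(jaccard("adobe photoshop cs3", "adobe photoshop elements"))
```

The first pair differs only in case and punctuation, so the cleaned bigram sets are identical. The second pair shares one of three distinct bigrams. A `match` function could compare such scores against a threshold, possibly combined with price and manufacturer, though comparing every Amazon record against every Google record may be too slow for the time limit without some form of filtering.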
## Submission
After you finish the assignment you can submit your code with:
......
def eval_matching(matching):
import datetime
import csv
from analyze import match
def eval_matching(your_matching):
    f = open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding = "ISO-8859-1")
    reader = csv.reader(f, delimiter=',', quotechar='"')
    matches = set()
......@@ -11,8 +15,9 @@ def eval_matching(matching):
    for row in reader:
        matches.add((row[0],row[1]))
        #print((row[0],row[1]))
    for m in matching:
    for m in your_matching:
        proposed_matches.add(m)
        if m in matches:
......@@ -24,11 +29,15 @@ def eval_matching(matching):
        if m not in proposed_matches:
            fn.add(m)
    prec = len(tp)/(len(tp) + len(fp))
    if len(your_matching) == 0:
        prec = 1.0
    else:
        prec = len(tp)/(len(tp) + len(fp))
    rec = len(tp)/(len(tp) + len(fn))
    return {'false positive': 1-prec,
            'false negative': 1-rec,
    return {'precision': prec,
            'recall': rec,
            'accuracy': 2*(prec*rec)/(prec+rec) }
#prints out the accuracy
......@@ -36,6 +45,6 @@ now = datetime.datetime.now()
out = eval_matching(match())
timing = (datetime.datetime.now()-now).total_seconds()
print("----Accuracy----")
print(out['accuracy'])
print(out['accuracy'], out['precision'] ,out['recall'])
print("---- Timing ----")
print(timing,"seconds")
\ No newline at end of file
print(timing,"seconds")