Updated the linkage assignment

e8fc68ed · Sanjay Krishnan · 111ea51a · e8fc68ed · e8fc68ed
Commit e8fc68ed authored May 01, 2020 by Sanjay Krishnan
Showing with 29 additions and 13 deletions
hw3/README.md
hw3/auto_grader.py
--- a/hw3/README.md
+++ b/hw3/README.md
 # HW3 String Matching

-*Due 5/6/20 11:59 PM*
+*Due 5/14/20 11:59 PM*
 Entity Resolution is the task of disambiguating manifestations of real world entities in various records or mentions by linking and grouping. For example, there could be different ways of addressing the same person in text, different addresses for businesses, or photos of a particular object. In this extra credit assignment, you will link two product catalogs.

 ## Getting Started
@@ -49,15 +49,23 @@ Your job is write the `match` function in `analzye.py`. You can run your code by
 ```
 python3 auto_grader.py
 ```
-Running the code will print out a result report as follows:
+Running the code will print out a result report as follows (accuracy, precision, and recall):
 ```
 ----Accuracy----
-{'false positive': 0.690576652601969, 'false negative': 0.4926979246733282, 'accuracy': 0.38439138031450204}
+0.5088062622309197 0.6998654104979811 0.3996925441967717
 ---- Timing ----
-114.487954 seconds
+168.670348 seconds

 ```
-*For full extra credit, you must write a program that achieves at least 40% accuracy in less than 3 mins on a standard laptop.*
+*For full extra credit, you must write a program that achieves at least 50% accuracy in less than 5 mins on a standard laptop.*
+
+The project is complete unstructured and it is up to you to figure out how to make this happen. Here are some hints:
+
+* The amazon product database is redundant (multiple same products), the google database is essentially unique. 
+
+* Jaccard similarity will be useful but you may have to consider "n-grams" of words (look at the lecture notes!) and "cleaning" up the strings to strip formatting and punctuation.
+
+* Price and manufacturer will also be important attributes to use.

 ## Submission
 After you finish the assignment you can submit your code with:

--- a/hw3/auto_grader.py
+++ b/hw3/auto_grader.py
-def eval_matching(matching):
+import datetime
+import csv
+from analyze import match
+
+def eval_matching(your_matching):
    f = open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding = "ISO-8859-1")
    reader = csv.reader(f, delimiter=',', quotechar='"')
    matches = set()
@@ -11,8 +15,9 @@ def eval_matching(matching):

    for row in reader:
        matches.add((row[0],row[1]))
+        #print((row[0],row[1]))

-    for m in matching:
+    for m in your_matching:
        proposed_matches.add(m)

        if m in matches:
@@ -24,11 +29,15 @@ def eval_matching(matching):
        if m not in proposed_matches:
            fn.add(m)

-    prec = len(tp)/(len(tp) + len(fp))
+    if len(your_matching) == 0:
+        prec = 1.0
+    else:
+        prec = len(tp)/(len(tp) + len(fp))
+
    rec = len(tp)/(len(tp) + len(fn))

-    return {'false positive': 1-prec, 
-            'false negative': 1-rec,
+    return {'precision': prec, 
+            'recall': rec,
            'accuracy': 2*(prec*rec)/(prec+rec) }

 #prints out the accuracy
@@ -36,6 +45,6 @@ now = datetime.datetime.now()
 out = eval_matching(match())
 timing = (datetime.datetime.now()-now).total_seconds()
 print("----Accuracy----")
-print(out['accuracy'])
+print(out['accuracy'], out['precision'] ,out['recall'])
 print("---- Timing ----")
-print(timing,"seconds")
\ No newline at end of file
+print(timing,"seconds")