Added the extra credit assignment

8f3bf250 · Sanjay Krishnan · b1ecb073
Commit 8f3bf250 authored May 21, 2019 by Sanjay Krishnan
Showing with 165 additions and 0 deletions
ec/README.md
ec/analyze.py
ec/core.py
--- a/ec/README.md
+++ b/ec/README.md
+# String Matching Extra Credit
+
+*Due 6/7/19 11:59 PM*
+Entity resolution is the task of identifying records or mentions that refer to the same real-world entity and linking or grouping them together. For example, there could be different ways of addressing the same person in text, different addresses for the same business, or several photos of a particular object. In this extra credit assignment, you will link two product catalogs.
+
+## Getting Started
+First, pull the most recent changes from the cmsc13600-public repository:
+```
+$ git pull
+```
+Then, copy the `ec` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `analyze.py`. You can add the files to the repository using `git add`:
+```
+$ git add analyze.py
+$ git commit -m'initialized homework'
+```
+You will also need to fetch the datasets used in this homework assignment:
+```
+https://www.dropbox.com/s/vq5dyl5hwfhbw98/Amazon.csv?dl=0
+https://www.dropbox.com/s/fbys7cqnbl3ch1s/Amzon_GoogleProducts_perfectMapping.csv?dl=0
+https://www.dropbox.com/s/o6rqmscmv38rn1v/GoogleProducts.csv?dl=0
+```
+Download each of the files and put them into your `ec` folder.
+
+Before we can get started, let us understand the main APIs in this project. We have provided a file named `core.py` for you. This file loads and processes the data that you've just downloaded. For example, you can load the Amazon catalog with the `amazon_catalog()` function. This returns an iterator over data tuples in the Amazon catalog. The fields are id, title, description, mfg (manufacturer), and price if any:
+```
+>>> for a in amazon_catalog():
+...  print(a)
+...  break
+
+{'id': 'b000jz4hqo', 'title': 'clickart 950 000 - premier image pack (dvd-rom)', 'description': '', 'mfg': 'broderbund', 'price': '0'}
+```
+Similarly, you can do the same for the Google catalog:
+```
+>>> for a in google_catalog():
+...  print(a)
+...  break
+
+{'id': 'http://www.google.com/base/feeds/snippets/11125907881740407428', 'title': 'learning quickbooks 2007', 'description': 'learning quickbooks 2007', 'mfg': 'intuit', 'price': '38.99'}
+```
+A matching is a pairing between IDs in the Amazon catalog and IDs in the Google catalog that refer to the same product. The ground truth is listed in the file `Amzon_GoogleProducts_perfectMapping.csv`. Your job is to construct a list of pairs (or an iterator of pairs) of `(amazon.id, google.id)`. These matchings can be evaluated for accuracy using the `eval_matching` function:
+```
+>>> my_matching = [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'),...]
+>>> eval_matching(my_matching)
+{'false positive': 0.9768566493955095, 'false negative': 0.43351268255188313, 'accuracy': 0.04446992095577143}
+```
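+`false positive` is the fraction of proposed pairs that are not true matches (1 - precision), `false negative` is the fraction of true matches that were missed (1 - recall), and `accuracy` is the harmonic mean of precision and recall (the F1 score), not classification accuracy.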
+
+## Assignment
+Your job is to write the `match` function in `analyze.py`. You can run your code by running:
+```
+$ python3 analyze.py
+```
+Running the code will print out a result report as follows:
+```
+----Accuracy----
+{'false positive': 0.690576652601969, 'false negative': 0.4926979246733282, 'accuracy': 0.38439138031450204}
+---- Timing ----
+114.487954 seconds
+
+```
+*For full extra credit, you must write a program that achieves at least 35% accuracy in less than 3 minutes on a standard laptop.*
+
+## Submission
+After you finish the assignment you can submit your code with:
+```
+$ git push
+```
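Since the README frames the assignment as string matching, one common starting point is a token-based similarity between product titles. The following is a minimal sketch, not part of the commit: `tokens` and `jaccard` are hypothetical helper names, and the second title is an invented variant of the Google example title used only for illustration.

```python
import re

def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0  # two empty titles carry no evidence of a match
    return len(a & b) / len(a | b)

# First title is taken from the README example; the second is a made-up variant.
t1 = tokens("learning quickbooks 2007")
t2 = tokens("Learning QuickBooks 2007 (PC)")
score = jaccard(t1, t2)
print(score)
```

Here the two titles share three of four distinct tokens, so the score is 0.75. Other similarity measures (edit distance, TF-IDF weighting) may work better on this data; Jaccard is chosen here only because it is short to write down.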
--- a/ec/analyze.py
+++ b/ec/analyze.py
+from core import *
+import datetime
+
+def match():
+    '''
+    Match must return a list of tuples of amazon ids and google ids.
+    For example:
+    [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'),....]
+
+    '''
+
+    #YOUR CODE GOES HERE
+
+    return []
+
+#prints out the accuracy
+now = datetime.datetime.now()
+out = eval_matching(match())
+timing = (datetime.datetime.now()-now).total_seconds()
+print("----Accuracy----")
+print(out)
+print("---- Timing ----")
+print(timing,"seconds")
\ No newline at end of file
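One possible way to fill in the `match` stub above is sketched here, under the assumption that titles alone carry enough signal. `match_by_title` and `tokens` are hypothetical names, the `threshold` value is a guess that would need tuning, and no accuracy or timing is claimed. An inverted index over Google title tokens avoids comparing every Amazon record against every Google record.

```python
import re
from collections import defaultdict

def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def match_by_title(amazon_rows, google_rows, threshold=0.5):
    """Pair each Amazon record with its most similar Google record by title."""
    google_tokens = {}          # google id -> token set of its title
    index = defaultdict(set)    # token -> google ids whose title contains it
    for g in google_rows:
        toks = tokens(g['title'])
        google_tokens[g['id']] = toks
        for t in toks:
            index[t].add(g['id'])

    pairs = []
    for a in amazon_rows:
        a_toks = tokens(a['title'])
        # Only Google records sharing at least one token are candidates.
        candidates = set()
        for t in a_toks:
            candidates |= index.get(t, set())
        best_id, best_score = None, 0.0
        for gid in sorted(candidates):  # sorted for deterministic tie-breaking
            g_toks = google_tokens[gid]
            score = len(a_toks & g_toks) / len(a_toks | g_toks)
            if score > best_score:
                best_id, best_score = gid, score
        if best_id is not None and best_score >= threshold:
            pairs.append((a['id'], best_id))
    return pairs

# Toy records in the same dict shape the catalogs yield (ids are invented).
amazon = [{'id': 'a1', 'title': 'learning quickbooks 2007'},
          {'id': 'a2', 'title': 'clickart 950 000 - premier image pack (dvd-rom)'}]
google = [{'id': 'g1', 'title': 'learning quickbooks 2007'},
          {'id': 'g2', 'title': 'photoshop elements 5'}]
result = match_by_title(amazon, google)
print(result)
```

In `analyze.py`, `match()` could then return `match_by_title(amazon_catalog(), google_catalog())`. Very common tokens produce large candidate sets, so dropping them from the index is one way to trade recall for speed.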
--- a/ec/core.py
+++ b/ec/core.py
+'''
+The core module sets up the data structures
+and references for this programming assignment.
+
+2010
+'''
+
+import platform
+import csv
+
+if platform.system() == 'Windows':
+  print("This assignment will not work on a Windows computer")
+  exit()
+
+
+#defines an iterator over a product catalog CSV file (used for both Google and Amazon)
+class Catalog():
+
+    def __init__(self, filename):
+      self.filename = filename
+
+    def __iter__(self):
+      f = open(self.filename, 'r', encoding = "ISO-8859-1")
+      self.reader = csv.reader(f, delimiter=',', quotechar='"')
+      next(self.reader)
+      return self
+
+    def __next__(self):
+      row = next(self.reader)
+      return {'id': row[0],
+               'title': row[1],
+               'description': row[2],
+               'mfg': row[3],
+               'price': row[4]
+              }
+
+def google_catalog():
+    return Catalog('GoogleProducts.csv')
+
+def amazon_catalog():
+    return Catalog('Amazon.csv')
+
+
+def eval_matching(matching):
+    f = open('Amzon_GoogleProducts_perfectMapping.csv', 'r', encoding = "ISO-8859-1")
+    reader = csv.reader(f, delimiter=',', quotechar='"')
+    matches = set()
+    proposed_matches = set()
+
+    tp = set()
+    fp = set()
+    fn = set()
+    tn = set()
+
+    for row in reader:
+        matches.add((row[0],row[1]))
+
+    for m in matching:
+        proposed_matches.add(m)
+
+        if m in matches:
+            tp.add(m)
+        else:
+            fp.add(m)
+
+    for m in matches:
+        if m not in proposed_matches:
+            fn.add(m)
+
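+    # guard against division by zero, e.g. when the starter match() returns []
+    prec = len(tp)/(len(tp) + len(fp)) if (tp or fp) else 0.0
+    rec = len(tp)/(len(tp) + len(fn)) if (tp or fn) else 0.0
+
+    return {'false positive': 1-prec,
+            'false negative': 1-rec,
+            'accuracy': 2*(prec*rec)/(prec+rec) if (prec + rec) > 0 else 0.0}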
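The three numbers returned by `eval_matching` are derived from precision and recall, which can be checked by hand on a toy example. The sketch below is a standalone re-derivation with invented ids, not the graded code; `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(proposed, truth):
    """Compute precision, recall and F1 for a proposed set of pairs."""
    proposed, truth = set(proposed), set(truth)
    tp = len(proposed & truth)                      # pairs that are correct
    prec = tp / len(proposed) if proposed else 0.0  # correct / proposed
    rec = tp / len(truth) if truth else 0.0         # correct / all true matches
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1

# Invented ids: four true matches, three proposed pairs, two of them correct.
truth = {('a1', 'g1'), ('a2', 'g2'), ('a3', 'g3'), ('a4', 'g4')}
proposed = [('a1', 'g1'), ('a2', 'g2'), ('a9', 'g9')]
prec, rec, f1 = precision_recall_f1(proposed, truth)
print({'false positive': 1 - prec, 'false negative': 1 - rec, 'accuracy': f1})
```

With precision 2/3 and recall 1/2, the reported `false positive` is 1/3, `false negative` is 1/2, and `accuracy` is 4/7 (about 0.571).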