Sanjay Krishnan committed
# HW3 String Matching

*Due 5/14/20 11:59 PM*
Entity resolution is the task of identifying records or mentions that refer to the same real-world entity and linking or grouping them. For example, the same person may be addressed in different ways in text, a business may be listed under different addresses, or a particular object may appear in several photos. In this extra credit assignment, you will link two product catalogs.

## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Then, copy the `hw3` folder to your submission repository and change directories to enter your submission repository. Your code will go into `analyze.py`. You can add the files to the repository using `git add`:
```
$ git add analyze.py
$ git commit -m'initialized homework'
```
You will also need to fetch the datasets used in this homework assignment:
```
https://www.dropbox.com/s/vq5dyl5hwfhbw98/Amazon.csv?dl=0
https://www.dropbox.com/s/fbys7cqnbl3ch1s/Amzon_GoogleProducts_perfectMapping.csv?dl=0
https://www.dropbox.com/s/o6rqmscmv38rn1v/GoogleProducts.csv?dl=0
```
Download each of the files and put it into your `hw3` folder.

Before we can get started, let us understand the main APIs in this project. We have provided a file named `core.py` for you. This file loads and processes the data that you've just downloaded. For example, you can load the Amazon catalog with the `amazon_catalog()` function. This returns an iterator over data tuples in the Amazon catalog. The fields are id, title, description, mfg (manufacturer), and price if any:
```
>>> for a in amazon_catalog():
...     print(a)
...     break

{'id': 'b000jz4hqo', 'title': 'clickart 950 000 - premier image pack (dvd-rom)', 'description': '', 'mfg': 'broderbund', 'price': '0'}
```
Similarly, you can load the Google catalog with the `google_catalog()` function:
```
>>> for a in google_catalog():
...     print(a)
...     break

{'id': 'http://www.google.com/base/feeds/snippets/11125907881740407428', 'title': 'learning quickbooks 2007', 'description': 'learning quickbooks 2007', 'mfg': 'intuit', 'price': '38.99'}
```
A matching is a pairing between ids in the Amazon catalog and ids in the Google catalog that refer to the same product. The ground truth is listed in the file `Amzon_GoogleProducts_perfectMapping.csv`. Your job is to construct a list of pairs (or an iterator over pairs) of the form `(amazon.id, google.id)`. A matching can be evaluated for accuracy using the `eval_matching` function:
```
>>> my_matching = [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'),...]
>>> eval_matching(my_matching)
{'false positive': 0.9768566493955095, 'false negative': 0.43351268255188313, 'accuracy': 0.04446992095577143}
```
False positive refers to the false positive rate, false negative refers to the false negative rate, and accuracy refers to the overall accuracy.
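The sample numbers above are consistent with one particular reading of these terms: false positive as `1 - precision`, false negative as `1 - recall`, and accuracy as the harmonic mean of precision and recall. The sketch below scores a matching under that reading. `score_matching` and its explicit `truth` argument are hypothetical and not part of `core.py`; the provided `eval_matching` may differ in detail.

```python
def score_matching(predicted, truth):
    """Score predicted (amazon_id, google_id) pairs against ground-truth pairs.

    Hypothetical helper, not part of core.py. It assumes 'accuracy' is the
    harmonic mean of precision and recall.
    """
    predicted, truth = set(predicted), set(truth)
    correct = len(predicted & truth)

    # Precision: fraction of predicted pairs that are true matches.
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: fraction of true matches that were predicted.
    recall = correct / len(truth) if truth else 0.0

    total = precision + recall
    accuracy = 2 * precision * recall / total if total else 0.0
    return {
        'false positive': 1 - precision,
        'false negative': 1 - recall,
        'accuracy': accuracy,
    }


# Toy data: one of two predictions is right, one of two true pairs is found.
truth = [('a1', 'g1'), ('a2', 'g2')]
predicted = [('a1', 'g1'), ('a3', 'g9')]
scores = score_matching(predicted, truth)
```

Under this reading, a matching that lists many wrong pairs is penalized through the false positive rate even if it finds most of the true pairs.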

## Assignment
Your job is to write the `match` function in `analyze.py`. You can run your code by running:
```
python3 auto_grader.py
```
Running the code will print out a result report as follows (accuracy, precision, and recall):
```
----Accuracy----
0.5088062622309197 0.6998654104979811 0.3996925441967717
---- Timing ----
168.670348 seconds

```
*For full credit, you must write a program that achieves at least 50% accuracy in less than 5 minutes on a standard laptop.*
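Comparing every Amazon record with every Google record may be too slow for the time limit. One common way to reduce the work, which the assignment does not prescribe, is to shortlist candidates that share a few title words before scoring them in detail. The sketch below is a minimal illustration on toy records; `build_token_index`, `candidates`, and the `min_shared` threshold are hypothetical names and values, and only the `id` and `title` fields come from the catalogs.

```python
from collections import defaultdict


def build_token_index(records):
    """Map each title token to the set of record ids whose title contains it."""
    index = defaultdict(set)
    for record in records:
        for token in record['title'].lower().split():
            index[token].add(record['id'])
    return index


def candidates(title, index, min_shared=2):
    """Return ids of indexed records sharing at least `min_shared` title tokens."""
    counts = defaultdict(int)
    for token in set(title.lower().split()):
        # .get avoids creating empty entries in the defaultdict index.
        for record_id in index.get(token, ()):
            counts[record_id] += 1
    return {record_id for record_id, n in counts.items() if n >= min_shared}


# Toy stand-ins for catalog rows (real rows come from core.py).
google = [
    {'id': 'g1', 'title': 'learning quickbooks 2007'},
    {'id': 'g2', 'title': 'clickart premier image pack'},
]
index = build_token_index(google)
shortlist = candidates('clickart 950 000 - premier image pack (dvd-rom)', index)
```

Only the records in `shortlist` then need a more expensive similarity comparison. A higher `min_shared` makes the shortlist smaller but risks dropping true matches.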

The project is completely unstructured, and it is up to you to figure out how to make this happen. Here are some hints:

* The Amazon product database is redundant (it lists the same product multiple times), while the Google database is essentially unique.

* Jaccard similarity will be useful but you may have to consider "n-grams" of words (look at the lecture notes!) and "cleaning" up the strings to strip formatting and punctuation.

* Price and manufacturer will also be important attributes to use.
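As a rough illustration of the Jaccard hint, the sketch below cleans two titles, splits them into word bigrams, and computes the Jaccard similarity of the resulting sets. The function names, the cleaning rule (keep only lowercase letters and digits), and the choice of `n=2` are assumptions for illustration, not requirements of the assignment.

```python
import re


def clean_tokens(text):
    """Lowercase, replace punctuation and other symbols with spaces, and split."""
    return re.sub(r'[^a-z0-9]+', ' ', text.lower()).split()


def word_ngrams(tokens, n=2):
    """Return the set of n-word tuples; inputs shorter than n yield one tuple."""
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def title_similarity(title_a, title_b, n=2):
    """Jaccard similarity of the word n-gram sets of two cleaned titles."""
    return jaccard(word_ngrams(clean_tokens(title_a), n),
                   word_ngrams(clean_tokens(title_b), n))


same = title_similarity('Learning QuickBooks 2007!', 'learning quickbooks 2007')
partial = title_similarity('clickart 950 000 premier image pack',
                           'clickart premier image pack')
```

Here `same` is 1.0 because the titles differ only in case and punctuation, while `partial` is 2/6 because the two titles share two of six distinct bigrams. A threshold on this score, possibly combined with price and manufacturer agreement, is one way to decide whether a pair is a match.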

## Submission
After you finish the assignment, you can submit your code with:
```
$ git push
```