# HW3 String Matching

*Due 5/14/20 11:59 PM*

Entity resolution is the task of identifying records or mentions that refer to the same real-world entity and linking or grouping them. For example, the same person may be addressed in different ways in text, a business may be listed under different addresses, or one object may appear in several photos. In this assignment, you will link two product catalogs.

## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Then, copy the `hw3` folder to your submission repository and change directories into that repository. Your code will go into `analyze.py`. You can add the files to the repository using `git add`:
```
$ git add analyze.py
$ git commit -m'initialized homework'
```
You will also need to fetch the datasets used in this homework assignment:
```
https://www.dropbox.com/s/vq5dyl5hwfhbw98/Amazon.csv?dl=0
https://www.dropbox.com/s/fbys7cqnbl3ch1s/Amzon_GoogleProducts_perfectMapping.csv?dl=0
https://www.dropbox.com/s/o6rqmscmv38rn1v/GoogleProducts.csv?dl=0
```
Download each of the files and put it into your `hw3` folder.

Before we can get started, let us understand the main APIs in this project. We have provided a file named `core.py` for you. This file loads and processes the data that you've just downloaded. For example, you can load the Amazon catalog with the `amazon_catalog()` function. This returns an iterator over data tuples in the Amazon catalog. The fields are id, title, description, mfg (manufacturer), and price if any:
```
>>> from core import amazon_catalog
>>> for a in amazon_catalog():
...     print(a)
...     break
...
{'id': 'b000jz4hqo', 'title': 'clickart 950 000 - premier image pack (dvd-rom)', 'description': '', 'mfg': 'broderbund', 'price': '0'}
```
You can do the same for the Google catalog:
```
>>> from core import google_catalog
>>> for a in google_catalog():
...     print(a)
...     break
...
{'id': 'http://www.google.com/base/feeds/snippets/11125907881740407428', 'title': 'learning quickbooks 2007', 'description': 'learning quickbooks 2007', 'mfg': 'intuit', 'price': '38.99'}
```
A matching is a pairing between IDs in the Google catalog and the Amazon catalog that refer to the same product. The ground truth is listed in the file `Amzon_GoogleProducts_perfectMapping.csv`. Your job is to construct a list (or iterator) of `(amazon.id, google.id)` pairs. These matchings can be evaluated for accuracy using the `eval_matching` function:
```
>>> my_matching = [('b000jz4hqo', 'http://www.google.com/base/feeds/snippets/11125907881740407428'),...]
>>> eval_matching(my_matching)
{'false positive': 0.9768566493955095, 'false negative': 0.43351268255188313, 'accuracy': 0.04446992095577143}
```
False positive refers to the false positive rate, false negative refers to the false negative rate, and accuracy refers to the overall accuracy.
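
The sample numbers above are consistent with one particular reading of these metrics, sketched below as an assumption rather than a description of what `core.py` actually does: the false positive rate is the fraction of proposed pairs that are not in the ground truth, the false negative rate is the fraction of ground-truth pairs that were not proposed, and accuracy is the harmonic mean of their complements (precision and recall). The function name `score_matching` and its `truth` argument are hypothetical.

```python
def score_matching(matching, truth):
    """Score proposed (amazon_id, google_id) pairs against ground-truth pairs.

    Hypothetical helper: the formulas are inferred from the sample output,
    not taken from core.py.
    """
    proposed = set(matching)
    truth = set(truth)
    correct = proposed & truth

    # Fraction of proposed pairs that are wrong (1 - precision).
    false_positive = 1 - len(correct) / len(proposed) if proposed else 0.0
    # Fraction of true pairs that were missed (1 - recall).
    false_negative = 1 - len(correct) / len(truth) if truth else 0.0

    precision = 1 - false_positive
    recall = 1 - false_negative
    # Harmonic mean of precision and recall; 0 when nothing is correct.
    if precision + recall == 0:
        accuracy = 0.0
    else:
        accuracy = 2 * precision * recall / (precision + recall)

    return {'false positive': false_positive,
            'false negative': false_negative,
            'accuracy': accuracy}
```

Under this reading, the sample output corresponds to a precision of about 0.023 and a recall of about 0.567, whose harmonic mean is about 0.044, in agreement with the reported accuracy.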

## Assignment
Your job is to write the `match` function in `analyze.py`. You can run your code with:
```
python3 auto_grader.py
```
Running the code will print out a result report as follows (accuracy, precision, and recall):
```
----Accuracy----
0.5088062622309197 0.6998654104979811 0.3996925441967717
---- Timing ----
168.670348 seconds
```
*For full credit, you must write a program that achieves at least 50% accuracy in less than 5 mins on a standard laptop.*

The project is completely unstructured, and it is up to you to figure out how to make this happen. Here are some hints:

* The Amazon product database is redundant (the same product can appear multiple times), while the Google database is essentially unique.

* Jaccard similarity will be useful but you may have to consider "n-grams" of words (look at the lecture notes!) and "cleaning" up the strings to strip formatting and punctuation.

* Price and manufacturer will also be important attributes to use.
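
The Jaccard hint can be sketched as follows. This is a minimal illustration, not a reference solution: the helper names (`clean`, `word_ngrams`, `jaccard`, `best_match`), the choice of word bigrams, and the 0.5 threshold are all assumptions, and the records are assumed to be dictionaries with the `id` and `title` fields shown earlier.

```python
import re


def clean(text):
    """Lowercase the text and replace punctuation runs with single spaces."""
    return re.sub(r'[^a-z0-9]+', ' ', text.lower()).strip()


def word_ngrams(text, n=2):
    """Return the set of word n-grams in the cleaned text.

    Texts shorter than n words fall back to a single gram of all words.
    """
    words = clean(text).split()
    if not words:
        return set()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(amazon_row, google_rows, threshold=0.5):
    """Return the id of the most similar Google title, or None below threshold."""
    target = word_ngrams(amazon_row['title'])
    best_id, best_score = None, threshold
    for g in google_rows:
        score = jaccard(target, word_ngrams(g['title']))
        if score >= best_score:
            best_id, best_score = g['id'], score
    return best_id
```

Calling `best_match` for every Amazon row recomputes the Google n-gram sets each time, so computing them once up front is one way to stay within the time limit. Price and manufacturer agreement could be folded into the score in the same way.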

## Submission
After you finish the assignment you can submit your code with:
```
$ git push
```