Entity resolution is the task of disambiguating manifestations of real-world entities across records or mentions by linking and grouping them. For example, the same person may be addressed in different ways in text, a business may be listed under different addresses, or several photos may show the same object. In this extra credit assignment, you will link two product catalogs.
## Getting Started
...
...
Your job is to write the `match` function in `analzye.py`. You can run your code with:
```
python3 auto_grader.py
```
Running the code will print out a result report as follows (accuracy, precision, and recall):
*For full extra credit, you must write a program that achieves at least 50% accuracy in less than 5 mins on a standard laptop.*
The project is completely unstructured, and it is up to you to figure out how to reach that target. Here are some hints:
* The Amazon product database is redundant (the same product can appear multiple times), while the Google database is essentially unique.
* Jaccard similarity will be useful but you may have to consider "n-grams" of words (look at the lecture notes!) and "cleaning" up the strings to strip formatting and punctuation.
* Price and manufacturer will also be important attributes to use.
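
To make the Jaccard hint concrete, here is a minimal sketch of string cleaning, word n-grams, and Jaccard similarity. The helper names (`clean`, `word_ngrams`, `jaccard`, `name_similarity`) are hypothetical and not part of the provided starter code, and this is not the expected solution: how you combine such a score with price and manufacturer inside `match` is still up to you.

```python
import re

def clean(text):
    """Lowercase the string, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def word_ngrams(words, n=2):
    """Return the set of word n-grams; use the whole word list if it is shorter than n."""
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def name_similarity(name_a, name_b, n=2):
    """Jaccard similarity between the word n-gram sets of two product names."""
    return jaccard(word_ngrams(clean(name_a), n), word_ngrams(clean(name_b), n))
```

For example, `name_similarity("Adobe Photoshop CS3!", "adobe photoshop cs3")` evaluates to `1.0` because cleaning removes the differences in case and punctuation. A pair of names could then be treated as a candidate match when the score exceeds a threshold that you tune yourself.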
## Submission
After you finish the assignment you can submit your code with: