README.md 5.83 KB
Newer Older
Sanjay Krishnan committed
1
# Out-of-Core Group By Aggregate
Sanjay Krishnan committed
2

Krishnan Sanjay committed
3
*Due Thursday June 3, 11:59 pm*
Sanjay Krishnan committed
4

Sanjay Krishnan committed
5 6 7 8 9 10
In this assignment, you will implement an out-of-core
version of the group by aggregate (aggregation by key)
seen in lecture. You will have a set memory limit and 
you will have to count the number of times a string shows 
up in an iterator. Your program should work for any limit 
greater than 20.
Sanjay Krishnan committed
11 12 13 14 15 16

## Getting Started
First, pull the most recent changes from the cmsc13600-public repository:
```
$ git pull
```
Sanjay Krishnan committed
17
Then, copy the `hw5` folder to your submission repository. Change directories to enter your submission repository. Your code will go into `countD.py` this is the only file that you will modify. Finally, add `countD.py` using `git add`:
Sanjay Krishnan committed
18
```
Sanjay Krishnan committed
19
$ git add countD.py
Sanjay Krishnan committed
20 21 22
$ git commit -m'initialized homework'
```

Sanjay Krishnan committed
23
Now, you will need to fetch the data used in this assignment. Download title.csv put it in the hw5 folder:
Sanjay Krishnan committed
24

Sanjay Krishnan committed
25
https://www.dropbox.com/s/zl7yt8cl0lvajxg/title.csv?dl=0
Sanjay Krishnan committed
26

Sanjay Krishnan committed
27 28 29
DO NOT ADD title.csv to the git repo! After downloading the 
dataset, there is a python module provided for you called `core.py`, which reads the dataset. This module loads the data in as
an iterator in two functions `imdb_years()` and `imdb_title_words()`:
Sanjay Krishnan committed
30
```
Sanjay Krishnan committed
31 32 33 34 35 36
>> for i in imdb_years():
... print(i)

1992
1986
<so on>
Sanjay Krishnan committed
37
```
Sanjay Krishnan committed
38
Play around with both `imdb_years()` and `imdb_title_words()` to get a feel for how the data works. 
39

Sanjay Krishnan committed
40 41
## MemoryLimitedHashMap
In this project, the main data structure is the `MemoryLimitedHashMap`. This is a hash map that has an explicit limit on the number of keys it can store. To create one of these data structure, you can import it from core module:
Sanjay Krishnan committed
42
```
Sanjay Krishnan committed
43 44 45
from core import *
#create a memory limited hash map
m = MemoryLimitedHashMap()
Sanjay Krishnan committed
46
```
Sanjay Krishnan committed
47
To find out what the limit of this hash map is, you can:
Sanjay Krishnan committed
48
```
Sanjay Krishnan committed
49
print("The max size of m is: ", m.limit)
Sanjay Krishnan committed
50
```
Sanjay Krishnan committed
51 52
The data structure can be constructed with an explicit limit (the default is 1000), e.g., `MemoryLimitedHashMap(limit=10)`.
Adding data to this hash map is like you've probably seen before in a data structure class. There is a `put` function that takes in a key and assigns that key a value:
Sanjay Krishnan committed
53
```
Sanjay Krishnan committed
54 55 56 57
# put some keys
m.put('a', 1)
m.put('b', 45)
print("The size of m is: ", m.size())
Sanjay Krishnan committed
58
```
Sanjay Krishnan committed
59 60 61 62 63
You can fetch the data using the `get` function and `keys` function:
```
# get keys
for k in m.keys():
    print("The value at key=", k, 'is', m.get(k))    
64

Sanjay Krishnan committed
65 66 67
# You can test to see if a key exists
print('Does m contain a?', m.contains('a'))
print('Does m contain c?', m.contains('c'))
Sanjay Krishnan committed
68
```
Sanjay Krishnan committed
69
When a key does not exist in the data structure the `get` function will raise an error:
Sanjay Krishnan committed
70
```
Sanjay Krishnan committed
71 72
#This gives an error: 
m.get('c')
Sanjay Krishnan committed
73
```
Sanjay Krishnan committed
74
Similarly, if you assign too many unique keys (more than the limit) you will get an error:
Sanjay Krishnan committed
75
```
Sanjay Krishnan committed
76 77
for i in range(0,1001):
    m.put(str(i), i)
Sanjay Krishnan committed
78
```
Sanjay Krishnan committed
79
The `MemoryLimitedHashMap` allows you to manage this limited storage with a `flush` function that allows you to persist a key and its assignment to disk. When you flush a key it removes it from the data structure and decrements the limit. Flush takes a key as a parameter.
Sanjay Krishnan committed
80
```
Sanjay Krishnan committed
81 82
m.flushKey('a')
print("The size of m is: ", m.size())
Sanjay Krishnan committed
83
```
Sanjay Krishnan committed
84
Note that the disk is not intelligent! If you flush a key multiple times it simply appends the flushed value to a file on disk:
Sanjay Krishnan committed
85
```
Sanjay Krishnan committed
86 87 88
m.flushKey('a')
<some work...>
m.flushKey('a')
Sanjay Krishnan committed
89
```
Sanjay Krishnan committed
90
Once a key has been flushed it can be read back using the `load` function (which takes a key as a parameter). This loads back *all* of the flushed values:
Sanjay Krishnan committed
91
```
Sanjay Krishnan committed
92 93 94
#You can also load values from disk
for k,v in m.load('a'):
    print(k,v)
Sanjay Krishnan committed
95
```
Sanjay Krishnan committed
96
If you try to load a key  that has not been flushed, you will get an error:
Sanjay Krishnan committed
97
```
Sanjay Krishnan committed
98 99 100
#Error!!
for k,v in m.load('d'):
	print(k,v)
101
```
Sanjay Krishnan committed
102 103

If you want multiple flushes of the same key to be differentiated, you can set a *subkey*:
Sanjay Krishnan committed
104
```
Sanjay Krishnan committed
105 106
#first flush
m.flushKey('a', '0')
Sanjay Krishnan committed
107

Sanjay Krishnan committed
108
<some work...>
Sanjay Krishnan committed
109

Sanjay Krishnan committed
110 111
#second flush
m.flushKey('a', '1')
Sanjay Krishnan committed
112
```
Sanjay Krishnan committed
113 114 115 116 117
The `load` function allows you to selectively pull 
certain subkeys:
```
# pull only the first flush
m.load('a', '0')
Sanjay Krishnan committed
118 119
```

Sanjay Krishnan committed
120
We can similarly iterate over all of the flushed data (which optionally takes a subkey as well!):
Sanjay Krishnan committed
121
```
Sanjay Krishnan committed
122 123
for k,v in m.loadAll():
    print(k,v)
Sanjay Krishnan committed
124
```
Sanjay Krishnan committed
125
It also takes in an optional parameter that includes the in memory keys as well:
Krishnan Sanjay committed
126
```
Sanjay Krishnan committed
127 128
for k,v in m.loadAll(subkey='myskey', inMemory=True):
    print(k,v)
Krishnan Sanjay committed
129
```
Krishnan Sanjay committed
130

Sanjay Krishnan committed
131
Since there are some keys in memory and some flushed to disk there are two commands to iterate over keys.
Krishnan Sanjay committed
132
```
Sanjay Krishnan committed
133
m.keys() #returns all keys that are in memory
Krishnan Sanjay committed
134
```
Sanjay Krishnan committed
135
There is also a way to iterate over all of the flushed keys (will strip out any subkeys):
Sanjay Krishnan committed
136
```
Sanjay Krishnan committed
137
m.flushed() #return keys that are flushed.
Sanjay Krishnan committed
138 139
```

Sanjay Krishnan committed
140 141 142 143
## Count Per Group
In this assignment, you will implement an out-of-core count operator which for all distinct strings in an iterator returns
the number of times it appears (in no particular order). 
For example,
Sanjay Krishnan committed
144
```
Sanjay Krishnan committed
145 146
In: "the", "cow", "jumped", "over", "the", "moon"
Out: ("the",2), ("cow",1), ("jumped",1), ("over",1), ("moon",1)
Sanjay Krishnan committed
147
```
Sanjay Krishnan committed
148
Or,
Sanjay Krishnan committed
149
```
Sanjay Krishnan committed
150 151
In: "a", "b", "b", "a", "c"
Out: ("c",1),("b",2), ("a", 2) 
Sanjay Krishnan committed
152
```
Sanjay Krishnan committed
153 154 155
The catch is that you CANNOT use a list, dictionary, or set from 
Python. We provide a general purpose data structure called a MemoryLimitedHashMap (see ooc.py). You must maintain the iterator
state using this data structure. 
Sanjay Krishnan committed
156

Sanjay Krishnan committed
157 158 159
The class that you will implement is called Count (in countD.py).
The constructor is written for you, and ittakes in an input iterator and a MemoryLimitedHashMap. You will use these objects
in your implementation. You will have to implement `__next__` and `__iter__`. Any solution using a list, dictionary, or set inside `Count` will recieve 0 points.
Sanjay Krishnan committed
160

Sanjay Krishnan committed
161
The hint is to do this in multiple passes and use a subkey to track keys flushed between different passes.
Sanjay Krishnan committed
162

Sanjay Krishnan committed
163 164
## Testing and Submission
 We have provided a series of basic tests in `auto_grader.py`, these tests are incomplete and are not meant to comprehensively grade your assignment. There is a file `years.json` with an expected output. After you finish the assignment you can submit your code with:
Sanjay Krishnan committed
165 166 167
```
$ git push
```