hyperloglog.py 2.78 KB
Newer Older
Sanjay Krishnan committed
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
"""
The HyperLogLog algorithm is an algorithm designed to count the number 
of distinct values in a set when the set is too large to fit in memory. 
Like MinHash, it uses probabilistic methods to approximate the true 
value of the distinct cardinality.

Write your implementation of the HyperLogLog algorithm here. The outline 
of the algorithm can be found in the paper, 

    "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" 
        by Flajolet et al. 
            (http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf)

The assignment asks you to implement a simplification of this algorithm. Figure 3 in the 
paper describes the whole algorithm

We will test your implementation against the IMDB dataset (title.csv) used in HW1. View 
the tests.py file to see how the class will be used. The method skeletons provided are 
for your convenience, but only the __init__, count, and update methods will be used for 
tests.

Your implementation should run in less than 5 minutes. Do not use Python counting data 
structures such as Counter, histograms, or dictionaries adapted for counting purposes.
"""
import numpy as np

# Step 1.
def hashcode(s):
    '''
    This function takes in a string and returns a a 32-bit 
    unsigned integer.
    '''

    #YOUR CODE GOES HERE!!!
    pass


# Step 2.
def code2bin(code):
    '''
    This function takes in a 32-bit unsigned code from 
    the previous example and turns it into a binary 
    list of 32 1/0 values. The most significant bit is 
    the first element of the list.
    '''

    #YOUR CODE GOES HERE!!!


# Step 3.
def bin2code(lst):
    '''
    This function takes in a 32-bit binary list
    and returns a unsigned 32-bit integer. Assuming
    most significant bit first.
    '''

    #YOUR CODE GOES HERE!!!

#Step 4.
def rho(lst):
    '''
    Returns the position of the left-most bit
    that is one.
    '''

    #YOUR CODE GOES HERE!!!



class HyperLogLog(object):
    '''
    HyperLogLog is an approximate distinct count 
    data structure. 
    '''

    def __init__(self, b):
        '''
        Takes in an integer parameter b > 4 and less than 16
        '''
        self.b = b
        self.M = [0]*(2 ** b)
        

    #given to you don't worry about it   
    def get_alpha(self):
        m = 2 ** self.b
        if m == 16:
            return 0.673
        elif m == 32:
            return 0.697
        elif m == 64:
            return 0.709
        else:
            return 0.7213/(1+ 1.079/m)


    #given to you don't worry about it  
    def count(self):
        m = 2 ** self.b
        a = self.get_alpha()

        reg_sum = 0
        for j in range(0,m):
            reg_sum += 2 ** (- self.M[j])

        E = 2*a*(m**2)/reg_sum

        return E


    #your code
    def update(self, s):
        #YOUR CODE GOES HERE!!!