hyperloglog.py 2.78 KB
 Sanjay Krishnan committed May 29, 2019 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 """ The HyperLogLog algorithm is an algorithm designed to count the number of distinct values in a set when the set is too large to fit in memory. Like MinHash, it uses probabilistic methods to approximate the true value of the distinct cardinality. Write your implementation of the HyperLogLog algorithm here. The outline of the algorithm can be found in the paper, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm" by Flajolet et al. (http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf) The assignment asks you to implement a simplification of this algorithm. Figure 3 in the paper describes the whole algorithm We will test your implementation against the IMDB dataset (title.csv) used in HW1. View the tests.py file to see how the class will be used. The method skeletons provided are for your convenience, but only the __init__, count, and update methods will be used for tests. Your implementation should run in less than 5 minutes. Do not use Python counting data structures such as Counter, histograms, or dictionaries adapted for counting purposes. """ import numpy as np # Step 1. def hashcode(s): ''' This function takes in a string and returns a a 32-bit unsigned integer. ''' #YOUR CODE GOES HERE!!! pass # Step 2. def code2bin(code): ''' This function takes in a 32-bit unsigned code from the previous example and turns it into a binary list of 32 1/0 values. The most significant bit is the first element of the list. ''' #YOUR CODE GOES HERE!!! # Step 3. def bin2code(lst): ''' This function takes in a 32-bit binary list and returns a unsigned 32-bit integer. Assuming most significant bit first. ''' #YOUR CODE GOES HERE!!! #Step 4. def rho(lst): ''' Returns the position of the left-most bit that is one. ''' #YOUR CODE GOES HERE!!! class HyperLogLog(object): ''' HyperLogLog is an approximate distinct count data structure. ''' def __init__(self, b): ''' Takes in an integer parameter b > 4 and less than 16 ''' self.b = b self.M = [0]*(2 ** b) #given to you don't worry about it def get_alpha(self): m = 2 ** self.b if m == 16: return 0.673 elif m == 32: return 0.697 elif m == 64: return 0.709 else: return 0.7213/(1+ 1.079/m) #given to you don't worry about it def count(self): m = 2 ** self.b a = self.get_alpha() reg_sum = 0 for j in range(0,m): reg_sum += 2 ** (- self.M[j]) E = 2*a*(m**2)/reg_sum return E #your code def update(self, s): #YOUR CODE GOES HERE!!!