Hi,
  I have three RDDs: X, y and p.
X is a matrix RDD (m x n), y is an (m x 1) vector,
and p is an (m x 1) probability vector.
Now, I am trying to sample k rows from X, and the corresponding entries in y,
based on the probability vector p.
Here is the Python implementation:

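import random
from bisect import bisect
from operator import itemgetter

def sample(population, k, prob):

    def cdf(population, k, prob):
        # Sort items and probabilities together so they stay aligned.
        pairs = sorted(zip(prob, population), key=itemgetter(0))
        population = list(map(itemgetter(1), pairs))
        prob = list(map(itemgetter(0), pairs))
        # Build the cumulative distribution over the sorted probabilities.
        cumm = [prob[0]]
        for i in range(1, len(prob)):
            cumm.append(cumm[-1] + prob[i])
        # Clamp the index in case rounding leaves cumm[-1] slightly below 1.
        last = len(cumm) - 1
        return [population[min(bisect(cumm, random.random()), last)] for i in range(k)]

    return cdf(population, k, prob)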
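
To make the goal concrete, here is a minimal local sketch of keeping X and y paired while sampling. It assumes X, y and p are plain Python lists rather than RDDs, and it swaps the hand-built CDF for the standard library's random.choices; the name sample_rows and the seed parameter are hypothetical, added only for illustration. Like the code above, it samples with replacement.

```python
import random

def sample_rows(X, y, p, k, seed=None):
    """Draw k (row, label) pairs with replacement, weighted by p."""
    rng = random.Random(seed)
    # Sample row indices rather than rows, so X and y stay aligned.
    idx = rng.choices(range(len(p)), weights=p, k=k)
    return [X[i] for i in idx], [y[i] for i in idx]

# Toy data standing in for the collected contents of the RDDs.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [0, 1, 0]
p = [0.2, 0.5, 0.3]
Xs, ys = sample_rows(X, y, p, k=2, seed=0)
```

Sampling indices instead of rows is what guarantees that each sampled row of X comes back with its own entry of y.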
