Hi, I have three RDDs: X, y, and p. X is a matrix RDD (m x n), y is an (m x 1) vector, and p is an (m x 1) probability vector. I am trying to sample k rows from X, together with the corresponding entries of y, according to the probability vector p. Here is the Python implementation:
    import random
    from bisect import bisect

    def sample(population, k, prob):
        # Build the cumulative distribution: cumm[i] = prob[0] + ... + prob[i]
        cumm = [prob[0]]
        for i in range(1, len(prob)):
            cumm.append(cumm[-1] + prob[i])
        # Draw k items with replacement: bisect maps a uniform draw in
        # [0, cumm[-1]) to the index whose cumulative interval contains it.
        # min() guards against floating-point round-off at the upper end.
        last = len(population) - 1
        return [population[min(bisect(cumm, random.random() * cumm[-1]), last)]
                for _ in range(k)]
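
To make the goal concrete, here is a minimal sketch on plain Python lists (not RDDs) of the behavior I am after: sample indices by weight, then use the same indices for both X and y so the rows stay paired. The names `sample_indices`, `X_local`, `y_local`, and `p_local` are made up for illustration, and the sampling is with replacement.

```python
import random
from bisect import bisect
from itertools import accumulate

def sample_indices(prob, k, rng=random):
    """Draw k indices with replacement; index i is chosen with weight prob[i]."""
    cumulative = list(accumulate(prob))   # running totals of the weights
    total = cumulative[-1]                # equals 1.0 if prob is normalized
    last = len(prob) - 1
    # Map a uniform draw in [0, total) to the index whose cumulative
    # interval contains it; min() guards against floating-point round-off.
    return [min(bisect(cumulative, rng.random() * total), last)
            for _ in range(k)]

# Hypothetical local stand-ins for the X, y, and p RDDs
X_local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y_local = [0, 1, 0]
p_local = [0.2, 0.5, 0.3]

idx = sample_indices(p_local, k=4)
X_sample = [X_local[i] for i in idx]   # sampled rows of X
y_sample = [y_local[i] for i in idx]   # matching entries of y
```

Sampling indices rather than rows is what keeps X and y aligned. On a single machine, `random.choices(range(len(p_local)), weights=p_local, k=4)` from the standard library should give the same kind of weighted draw with replacement.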