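import random
from bisect import bisect

# Assumes p has been collected to a local list of probabilities.
_cumm = [p[0]]
for i in range(1, len(p)):
    _cumm.append(_cumm[-1] + p[i])
# A set drops duplicate draws, so fewer than k indices may remain.
index = set([bisect(_cumm, random.random()) for i in range(k)])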
chosed_x = X.zipWithIndex().filter(lambda vi: vi[1] in index).map(lambda vi: vi[0])
chosed_y = [v for i, v in enumerate(y) if i in index]

On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Oops, the reference for the above code:
> http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
>
> On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com>
> wrote:
>>
>> Hi,
>>   I have three rdds.. X,y and p
>> X is matrix rdd (mXn), y is (mX1) dimension vector
>> and p is (mX1) dimension probability vector.
>> Now, I am trying to sample k rows from X and corresponding entries in y
>> based on probability vector p.
>> Here is the python implementation
>>
>> import random
>> from bisect import bisect
>> from operator import itemgetter
>>
>> def sample(population, k, prob):
>>
>>     def cdf(population, k, prob):
>>         population = map(itemgetter(1), sorted(zip(prob, population)))
>>         cumm = [prob[0]]
>>         for i in range(1, len(prob)):
>>             cumm.append(cumm[-1] + prob[i])
>>         return [population[bisect(cumm, random.random())] for i in range(k)]
>>
>>
>>     return cdf(population, k, prob)
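
For reference, here is a minimal local sketch of the same idea (cumulative sums plus bisect), assuming X, y and p fit in memory as plain Python lists rather than RDDs. The names sample_indices and select_rows are hypothetical helpers, not part of Spark or of the code quoted above. Unlike the set-based version, this one samples with replacement and keeps duplicates, so it always returns exactly k indices.

```python
import random
from bisect import bisect
from itertools import accumulate


def sample_indices(p, k, rng=random):
    """Draw k row indices with replacement; index i is drawn with weight p[i]."""
    cumulative = list(accumulate(p))  # running sums of p, i.e. the CDF
    total = cumulative[-1]
    last = len(p) - 1
    # Scale each draw by total so p need not sum to exactly 1,
    # and clamp to the last index to guard against float rounding.
    return [min(bisect(cumulative, rng.random() * total), last) for _ in range(k)]


def select_rows(X, y, indices):
    """Return the rows of X and the entries of y at the given indices, in order."""
    return [X[i] for i in indices], [y[i] for i in indices]


# Toy data (made up for illustration only).
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [10, 20, 30, 40]
p = [0.1, 0.2, 0.3, 0.4]

indices = sample_indices(p, k=3, rng=random.Random(0))
chosen_x, chosen_y = select_rows(X, y, indices)
```

If X is an RDD, the sampled indices could presumably still be applied with zipWithIndex() and filter(), but note that a filter keeps each matching row only once, so repeated indices would need separate handling.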