Is there an equivalent way of doing the following?

a = [1,2,3,4]
reduce(lambda x, y: x+[x[-1]+y], a, [0])[1:]

The issue with the above suggestion is that population is a hefty data structure :-/

On Tue, Oct 28, 2014 at 12:42 AM, Davies Liu <dav...@databricks.com> wrote:

> _cumm = [p[0]]
> for i in range(1, len(p)):
>     _cumm.append(_cumm[-1] + p[i])
> index = set([bisect(_cumm, random.random()) for i in range(k)])
>
> chosed_x = X.zipWithIndex().filter(lambda (v, i): i in index).map(lambda (v, i): v)
> chosed_y = [v for i, v in enumerate(y) if i in index]
>
>
> On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> > Oops, the reference for the above code:
> >
> > http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
> >
> > On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> >>
> >> Hi,
> >> I have three RDDs: X, y and p.
> >> X is a matrix RDD (m x n), y is an (m x 1) vector,
> >> and p is an (m x 1) probability vector.
> >> Now, I am trying to sample k rows from X, and the corresponding entries in y,
> >> based on the probability vector p.
> >> Here is the Python implementation:
> >>
> >> import random
> >> from bisect import bisect
> >> from operator import itemgetter
> >>
> >> def sample(population, k, prob):
> >>
> >>     def cdf(population, k, prob):
> >>         population = map(itemgetter(1), sorted(zip(prob, population)))
> >>         cumm = [prob[0]]
> >>         for i in range(1, len(prob)):
> >>             cumm.append(cumm[-1] + prob[i])
> >>         return [population[bisect(cumm, random.random())] for i in range(k)]
> >>
> >>     return cdf(population, k, prob)
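
On the running-sum question at the top of this message: a minimal local sketch, assuming Python 3 (where itertools.accumulate is available; it is not in Python 2). The names cumulative_sum and sample_indices are made up for illustration and are not part of any library. The sketch only draws indices, so the population itself is never copied or sorted. It does not address distributing the cumulative sum across an RDD.

```python
import random
from bisect import bisect
from itertools import accumulate


def cumulative_sum(values):
    """Running totals; same result as reduce(lambda x, y: x+[x[-1]+y], a, [0])[1:]."""
    return list(accumulate(values))


def sample_indices(prob, k, rng=random):
    """Draw k indices with replacement; index i is picked with weight prob[i].

    Hypothetical helper: only indices are returned, so a large population
    does not have to be materialized here.
    """
    cdf = cumulative_sum(prob)
    total = cdf[-1]
    last = len(cdf) - 1
    # Scale the draw by the total so the weights need not sum exactly to 1,
    # and clamp the result to guard against floating-point rounding at the top end.
    return [min(bisect(cdf, rng.random() * total), last) for _ in range(k)]


print(cumulative_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

The returned indices could then be used to filter X and y as in the quoted reply. Note that wrapping them in set(), as that reply does, drops duplicate indices, so fewer than k rows may end up selected.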