Re: sampling in spark

Chengi Liu Tue, 28 Oct 2014 00:55:36 -0700

Is there an equivalent way of doing the following:

a = [1,2,3,4]


reduce(lambda x, y: x+[x[-1]+y], a, [0])[1:]

??
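
For reference, the expression above is a running (cumulative) sum. A minimal local sketch in Python 3, where `reduce` has to be imported from `functools`; the names `running_reduce` and `running_accumulate` are only for illustration, and nothing here touches Spark:

```python
from functools import reduce  # builtin in Python 2, imported in Python 3
from itertools import accumulate

a = [1, 2, 3, 4]

# The expression from this message: carry a growing list and append
# (last element + y) at every step, then drop the seed value 0.
running_reduce = reduce(lambda x, y: x + [x[-1] + y], a, [0])[1:]

# The same running sum via the standard library.
running_accumulate = list(accumulate(a))

print(running_reduce)      # [1, 3, 6, 10]
print(running_accumulate)  # [1, 3, 6, 10]
```

Both forms need the whole list on one machine, which is the part that does not carry over to an RDD directly.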


The issue with the above suggestion is that population is a hefty data
structure :-/
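
One possible direction, sketched in plain Python only: a running sum can be computed chunk by chunk, so that only one total per chunk has to be gathered in one place. The function `chunked_running_sum` and its `chunks` argument are hypothetical stand-ins for partitions of a distributed collection; this is an assumption about how the idea could be split up, not a Spark API.

```python
from itertools import accumulate

def chunked_running_sum(chunks):
    """Running sum over a list of lists, processed chunk by chunk.

    `chunks` is a hypothetical stand-in for the partitions of a
    distributed collection; no Spark API is used here.
    """
    # Step 1: one total per chunk (small: a single number per chunk).
    totals = [sum(chunk) for chunk in chunks]
    # Step 2: the offset of a chunk is the sum of all earlier chunk totals.
    offsets = [0] + list(accumulate(totals))[:-1]
    # Step 3: each chunk computes its own running sum, shifted by its offset.
    return [
        [offset + s for s in accumulate(chunk)]
        for chunk, offset in zip(chunks, offsets)
    ]

result = chunked_running_sum([[1, 2], [3, 4]])
print(result)  # [[1, 3], [6, 10]]
```

Flattening `result` gives the same values as the `reduce` expression on `[1, 2, 3, 4]`, but steps 1 and 3 only ever look at one chunk at a time.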

On Tue, Oct 28, 2014 at 12:42 AM, Davies Liu <dav...@databricks.com> wrote:

>         _cumm = [p[0]]
>         for i in range(1, len(p)):
>             _cumm.append(_cumm[-1] + p[i])
>         index = set([bisect(_cumm, random.random()) for i in range(k)])
>
>         chosed_x = X.zipWithIndex().filter(lambda (v, i): i in index).map(lambda (v, i): v)
>         chosed_y = [v for i, v in enumerate(y) if i in index]
>
>
> On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com>
> wrote:
> > Oops, the reference for the above code:
> >
> http://stackoverflow.com/questions/26583462/selecting-corresponding-k-rows-from-matrix-and-vector/26583945#26583945
> >
> > On Tue, Oct 28, 2014 at 12:26 AM, Chengi Liu <chengi.liu...@gmail.com>
> > wrote:
> >>
> >> Hi,
> >>   I have three RDDs: X, y and p.
> >> X is a matrix RDD (m x n), y is an (m x 1) vector,
> >> and p is an (m x 1) probability vector.
> >> Now, I am trying to sample k rows from X, and the corresponding entries
> >> in y, based on the probability vector p.
> >> Here is the Python implementation:
> >>
> >> import random
> >> from bisect import bisect
> >> from operator import itemgetter
> >>
> >> def sample(population, k, prob):
> >>
> >>     def cdf(population, k, prob):
> >>         # sort (prob, item) pairs together so cumm lines up with population
> >>         pairs = sorted(zip(prob, population), key=itemgetter(0))
> >>         population = [item for _, item in pairs]
> >>         prob = [p for p, _ in pairs]
> >>         cumm = [prob[0]]
> >>         for i in range(1, len(prob)):
> >>             cumm.append(cumm[-1] + prob[i])
> >>         return [population[bisect(cumm, random.random())] for i in range(k)]
> >>
> >>     return cdf(population, k, prob)
> >
> >
>
