Re: Any quick method to sample rdd based on one filed?

2015-08-28 Thread Sonal Goyal
Filter into a true RDD and a false RDD, then union the true RDD with a sample of the false RDD.

On Aug 28, 2015 2:57 AM, Gavin Yue yue.yuany...@gmail.com wrote:
> Hey, I have an RDD[(String, Boolean)]. I want to keep all the Boolean: true rows and randomly keep some of the Boolean: false rows. And hope in the final result,
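The filter-and-union suggestion might look roughly like this in PySpark. This is only a sketch: the function name and the fraction parameter are made up for illustration, and it assumes each row is a (key, flag) pair. It uses only RDD.filter, RDD.sample and RDD.union.

```python
def keep_true_sample_false(rdd, fraction, seed=None):
    """Keep every row whose flag is True plus a random sample of the False rows.

    rdd      -- an RDD of (key, flag) pairs
    fraction -- hypothetical share of False rows to keep, between 0.0 and 1.0
    seed     -- optional seed passed through to RDD.sample
    """
    true_rows = rdd.filter(lambda kv: kv[1])
    false_rows = rdd.filter(lambda kv: not kv[1])
    # sample(withReplacement, fraction, seed): each False row is kept
    # independently with probability `fraction`, so the size is approximate.
    sampled_false = false_rows.sample(False, fraction, seed)
    return true_rows.union(sampled_false)
```

Something like keep_true_sample_false(rdd, 0.1) would then keep about 10% of the false rows. Note that this scans the input twice (once per filter), so caching the input RDD first may help if it is expensive to recompute.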

Re: Any quick method to sample rdd based on one filed?

2015-08-28 Thread Alexey Grishchenko
In my opinion aggregate + flatMap would be faster, as it makes fewer passes through the data. It would work like this:

    import random

    def agg(x, y):
        x[0] += 1 if not y[1] else 0
        x[1] += 1 if y[1] else 0
        return x

    # Source data
    rdd = sc.parallelize(xrange(10), 5)
    rdd2 =
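One way the aggregate + flatMap idea could be filled in, as a sketch only and not the poster's actual code: count the false and true rows with RDD.aggregate, derive a keep-probability from the counts, then drop false rows at random inside flatMap. The helper names (seq_op, comb_op, make_sampler) are hypothetical, and how the fraction is chosen is an assumption, since the desired final ratio is not stated.

```python
import random


def seq_op(acc, row):
    """Fold one (key, flag) row into acc = [false_count, true_count]."""
    if row[1]:
        acc[1] += 1
    else:
        acc[0] += 1
    return acc


def comb_op(a, b):
    """Merge the per-partition counts."""
    return [a[0] + b[0], a[1] + b[1]]


def make_sampler(fraction):
    """Build a flatMap function: keep True rows, keep False rows with probability `fraction`."""
    def emit(row):
        if row[1] or random.random() < fraction:
            return [row]
        return []
    return emit


# Hypothetical usage on an RDD of (key, flag) pairs:
#   false_count, true_count = rdd.aggregate([0, 0], seq_op, comb_op)
#   fraction = min(1.0, float(true_count) / false_count) if false_count else 0.0
#   sampled = rdd.flatMap(make_sampler(fraction))
```

The fraction in the usage comment aims for roughly as many false rows as true rows; any other target ratio would only change that one line.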