In my opinion aggragate+flatMap would work faster as it would make less passes through the data. Would work like this:
import random def agg(x,y): x[0] += 1 if not y[1] else 0 x[1] += 1 if y[1] else 0 return x # Source data rdd = sc.parallelize(xrange(100000), 5) rdd2 = rdd.map(lambda x: (x, random.choice([True, False]))).cache() # Calculate counts for True and False counts = rdd2.aggregate([0,0], agg, lambda x,y: (x[0]+y[0],x[1]+y[1])) # If filtering is needed if counts[0]*10 > counts[1]: # Calculate sampling ratio prob0 = float(counts[1])/10.0 / float(counts[0]) # Filter falses rdd2 = rdd2.flatMap(lambda x: [x] if (x[1] or x[0] and random.random() < prob0) else []) # Count True and False again for validation - falses should be 10% of trues rdd2.aggregate([0,0], agg, lambda x,y: (x[0]+y[0],x[1]+y[1])) On Fri, Aug 28, 2015 at 6:39 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote: > Filter into true rdd and false rdd. Union true rdd and sample of false rdd. > On Aug 28, 2015 2:57 AM, "Gavin Yue" <yue.yuany...@gmail.com> wrote: > >> Hey, >> >> >> I have a RDD[(String,Boolean)]. I want to keep all Boolean: True rows and >> randomly keep some Boolean:false rows. And hope in the final result, the >> negative ones could be 10 times more than positive ones. >> >> >> What would be most efficient way to do this? >> >> Thanks, >> >> >> >> -- Best regards, Alexey Grishchenko phone: +353 (87) 262-2154 email: programme...@gmail.com web: http://0x0fff.com