Filter the RDD into a true RDD and a false RDD, then union the true RDD with a sample of the false RDD.
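For what it's worth, the filter/sample/union idea could be sketched in PySpark roughly as follows. The helper name keep_true_sample_false and its fraction and seed parameters are made up for illustration. It assumes rows are (key, flag) pairs and relies only on RDD.filter, RDD.sample, and RDD.union. Note that sample treats fraction as an expected proportion, so the number of false rows kept is approximate.

```python
def keep_true_sample_false(rdd, fraction, seed=None):
    """Keep every (key, True) row plus a random sample of the (key, False) rows.

    `rdd` is assumed to be an RDD of (key, flag) pairs; `fraction` and `seed`
    are hypothetical parameter names passed through to RDD.sample.
    """
    # Split the input by the Boolean flag in position 1 of each row.
    true_rdd = rdd.filter(lambda kv: kv[1])
    false_rdd = rdd.filter(lambda kv: not kv[1])

    # Sample the false rows without replacement; the result size is
    # only approximately fraction * count(false rows).
    sampled_false = false_rdd.sample(False, fraction, seed)

    # Recombine: all true rows plus the sampled false rows.
    return true_rdd.union(sampled_false)
```

With an RDD[(String, Boolean)] called rdd, keep_true_sample_false(rdd, 0.1, seed=42) would keep all true rows and roughly 10% of the false rows. This reads the source twice, so caching rdd first may help if it is expensive to recompute.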
On Aug 28, 2015 2:57 AM, Gavin Yue yue.yuany...@gmail.com wrote:
Hey,
I have an RDD[(String, Boolean)]. I want to keep all rows where the Boolean is
true and randomly keep some rows where it is false. And hope in the final result,
In my opinion aggregate + flatMap would work faster, as it would make fewer
passes through the data. It would work like this:
import random

def agg(x, y):
    # x is the running [false_count, true_count]; y[1] is the row's Boolean flag.
    x[0] += 1 if not y[1] else 0
    x[1] += 1 if y[1] else 0
    return x

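# Source data: 10 integers spread across 5 partitions
# (range works on both Python 2 and 3; xrange is Python 2 only)
rdd = sc.parallelize(range(10), 5)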
rdd2 =