bq. the train and test have overlap in the numbers being outputted

Can the call to repartition explain the above?

Which release of Spark are you using?
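
If the shuffle really is being recomputed on each action, persisting rdd2
before splitting should pin its partition contents down. A minimal sketch
(assuming a live SparkContext named sc, as in your snippet):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// Persist the repartitioned RDD so its partitions are computed only once;
// otherwise each action may re-run the shuffle, and randomSplit's seeded
// per-partition sampling is applied to potentially different contents.
val rdd2 = rdd.repartition(64).persist(StorageLevel.MEMORY_AND_DISK)
rdd2.count() // materialize the cache before splitting

val Array(train, test) = rdd2.randomSplit(Array(0.7, 0.3), seed = 1)
// train and test should now be disjoint and stable across actions.
```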

Thanks

On Sun, Dec 27, 2015 at 9:56 PM, Gaurav Kumar <gauravkuma...@gmail.com>
wrote:

> Hi,
>
> I noticed an inconsistent behavior when using rdd.randomSplit when the
> source rdd is repartitioned, but only in YARN mode. It works fine in local
> mode though.
>
> *Code:*
> val rdd = sc.parallelize(1 to 1000000)
> val rdd2 = rdd.repartition(64)
> rdd.partitions.size
> rdd2.partitions.size
> val Array(train, test) = *rdd2*.randomSplit(Array(70, 30), 1)
> train.takeOrdered(10)
> test.takeOrdered(10)
>
> *Master: local*
> Both the take statements produce consistent results and have no overlap in
> numbers being outputted.
>
>
> *Master: YARN*
> However, when these are run in YARN mode, they produce different results
> on every run, and train and test overlap in the numbers being outputted.
> If I use *rdd*.randomSplit, then it works fine even on YARN.
>
> So it appears that the repartition is re-evaluated every time the
> splitting occurs.
>
> Interestingly, if I cache rdd2 before splitting it, the behavior is
> consistent, since the repartition is not evaluated again and again.
>
> Best Regards,
> Gaurav Kumar
> Big Data • Data Science • Photography • Music
> +91 9953294125
>
