bq. the train and test have overlap in the numbers being outputted

Can the call to repartition explain the above?
Which release of Spark are you using?

Thanks

On Sun, Dec 27, 2015 at 9:56 PM, Gaurav Kumar <gauravkuma...@gmail.com> wrote:

> Hi,
>
> I noticed an inconsistent behavior when using rdd.randomSplit when the
> source rdd is repartitioned, but only in YARN mode. It works fine in
> local mode though.
>
> *Code:*
> val rdd = sc.parallelize(1 to 1000000)
> val rdd2 = rdd.repartition(64)
> rdd.partitions.size
> rdd2.partitions.size
> val Array(train, test) = *rdd2*.randomSplit(Array(70, 30), 1)
> train.takeOrdered(10)
> test.takeOrdered(10)
>
> *Master: local*
> Both the take statements produce consistent results and have no overlap
> in the numbers being outputted.
>
> *Master: YARN*
> However, when these are run in YARN mode, they produce different results
> every time, and the train and test sets also overlap in the numbers
> being outputted.
> If I use *rdd*.randomSplit instead, it works fine even on YARN.
>
> So it appears that the repartition is being re-evaluated every time the
> splitting occurs.
>
> Interestingly, if I cache rdd2 before splitting it, the behavior is
> consistent, since the repartition is not evaluated again and again.
>
> Best Regards,
> Gaurav Kumar
> Big Data • Data Science • Photography • Music
> +91 9953294125
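The caching workaround mentioned in the quoted mail could be sketched as below. This is only a sketch under the assumptions of the original code (a live SparkContext named sc); the idea is that repartition's shuffle assigns elements to partitions non-deterministically, so pinning rdd2 in memory before splitting makes both splits sample from the same materialized partitions:

```scala
import org.apache.spark.storage.StorageLevel

// Same setup as in the original mail.
val rdd  = sc.parallelize(1 to 1000000)

// Persist the repartitioned RDD so it is computed once, rather than
// re-evaluated (with a fresh, non-deterministic shuffle) by each split.
val rdd2 = rdd.repartition(64).persist(StorageLevel.MEMORY_ONLY)
rdd2.count()  // force materialization before splitting

val Array(train, test) = rdd2.randomSplit(Array(70, 30), seed = 1)
train.takeOrdered(10)
test.takeOrdered(10)
// With rdd2 cached, train and test should be disjoint and stable
// across runs, matching the local-mode behavior described above.
```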