A sampled RDD contains the expected number of elements, but
RDD.subtract(sample) appears to do nothing: the returned RDD has the same
number of elements as the original RDD.
Here is the code:
    long count0 = resultRDD.count();
    JavaPairRDD<KeyBytesWritable, AggregationWritable> r1 =
        resultRDD.sample(false, 0.50, 7);
    JavaPairRDD<KeyBytesWritable, AggregationWritable> r2 =
        resultRDD.subtract(r1);
    long count1 = r1.count();
    long count2 = r2.count(); /* count0 == count2 ?! */
    System.out.printf("RESULT-RDD.COUNT=%d ... COUNT1=%d ... COUNT2=%d \n",
        count0, count1, count2);
The printf output shows that nothing was subtracted from the original RDD. Am I
using the subtract method incorrectly?
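One thing that may be worth checking: subtract matches elements by equals()
and hashCode(), so a key or value class that inherits the identity-based
versions from Object would never match its own deserialized copy. Whether
that applies here is an assumption, since KeyBytesWritable and
AggregationWritable are not shown. The sketch below is plain Java with no
Spark involved; IdentityKey, ValueKey and SubtractSketch are hypothetical
names used only to illustrate the effect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SubtractSketch {

    // Hypothetical key that inherits Object's identity-based equals()/hashCode().
    static final class IdentityKey {
        final byte[] bytes;
        IdentityKey(byte[] bytes) { this.bytes = bytes.clone(); }
        // Stands in for a serialize/deserialize round trip: same bytes, new object.
        IdentityKey copy() { return new IdentityKey(bytes); }
    }

    // Same payload, but with content-based equals()/hashCode().
    static final class ValueKey {
        final byte[] bytes;
        ValueKey(byte[] bytes) { this.bytes = bytes.clone(); }
        ValueKey copy() { return new ValueKey(bytes); }
        @Override public boolean equals(Object o) {
            return o instanceof ValueKey && Arrays.equals(bytes, ((ValueKey) o).bytes);
        }
        @Override public int hashCode() { return Arrays.hashCode(bytes); }
    }

    // Hash-based difference: count the elements of 'all' that equal nothing in 'sample'.
    static <T> int countAfterSubtract(List<T> all, List<T> sample) {
        Set<T> toRemove = new HashSet<>(sample);
        int kept = 0;
        for (T t : all) {
            if (!toRemove.contains(t)) {
                kept++;
            }
        }
        return kept;
    }

    // 10 keys, 5 of them "sampled" as copies; identity equality never matches a copy.
    static int identityCase() {
        List<IdentityKey> all = new ArrayList<>();
        List<IdentityKey> sample = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            IdentityKey k = new IdentityKey(new byte[] {(byte) i});
            all.add(k);
            if (i % 2 == 0) sample.add(k.copy());
        }
        return countAfterSubtract(all, sample);
    }

    // Same setup with content-based equality; the 5 sampled keys are removed.
    static int valueCase() {
        List<ValueKey> all = new ArrayList<>();
        List<ValueKey> sample = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            ValueKey k = new ValueKey(new byte[] {(byte) i});
            all.add(k);
            if (i % 2 == 0) sample.add(k.copy());
        }
        return countAfterSubtract(all, sample);
    }

    public static void main(String[] args) {
        System.out.println("identity-based keys left: " + identityCase()); // 10
        System.out.println("content-based keys left:  " + valueCase());    // 5
    }
}
```

If the real classes behave like IdentityKey, the count after subtract would
equal the original count, which is the symptom described above.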
The goal is to split a very large Pair-RDD that has a single partition into a
Pair-RDD with many smaller partitions, in the hope that this will fix OOM
errors during a reduceByKey task.
After much tweaking of JVM GC parameters there is some improvement, but
eventually the job just stalls out - no OOM, but no progress (jstack and jmap
show the workers' new-gen is maxed out during RDD de-serialization).
Does anyone have any suggestions?
Thanks,
Stan