Hi Flavio,

Doing batch += ... won't work: it would create a new batch for each
element of myRDD (also, val declares an immutable value; var is for
mutable variables). You can use something like accumulators
<http://spark.apache.org/docs/latest/programming-guide.html#accumulators>.

val accum = sc.accumulator(0, "Some accumulator")
val NUM = 100
val SIZE = myRDD.count()
val LAST = SIZE % NUM
val MULTIPLE = (SIZE / NUM) * NUM // should return 300 when SIZE is 350
myRDD.map(x => {
  accum += 1                        // Scala has no ++ operator
  if (accum.localValue < MULTIPLE)  // read the task-local count
    null
  else
    x
})

So if there are 350 elements, it should return null for the first 300
elements and the actual values for the last 50 elements. Then we can apply
a filter for null to get the remainder elements (and finally flush the new
RDD containing the remainder elements).
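
For illustration, the filter-and-flush step might look roughly like this
(just a sketch; the "remainder" name and the output path are placeholders):

val remainder = myRDD.map(x => {
  accum += 1
  if (accum.localValue < MULTIPLE) null else x
}).filter(_ != null)                 // drop the first MULTIPLE elements
remainder.saveAsTextFile("/tmp/remainder")  // or whatever "flush" means in your case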

On map vs mapPartitions, mapPartitions is used mainly for efficiency (as
you can see here
<http://spark-summit.org/wp-content/uploads/2013/10/Wendell-Spark-Performance.pdf>,
here
<http://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions>
and here
<http://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/>).
So for simpler code, you can go with map, and for efficiency, you can go
with mapPartitions.
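
For example, a mapPartitions version of the same kind of transformation
could look like this (a sketch only; the per-element logic is a placeholder):

val result = myRDD.mapPartitions { iter =>
  // any per-partition setup (connections, buffers, ...) happens once here,
  // instead of once per element as with map
  iter.map(x => x)  // placeholder per-element transformation
}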

Regards,
Kamal
