Hi all,
I'm trying to convert my old MapReduce job to a Spark one, but I have some
doubts.
My application basically buffers a batch of updates and flushes the batch
to a server every 100 elements. This is very easy in MapReduce, but I
don't know how to do it in Spark/Scala.
For example, if I do:
import scala.collection.mutable.ArrayBuffer

myRdd.map(x => {
  val batch = new ArrayBuffer[someClass]()
  batch += someClassFactory.fromString(x._2, x._3)
  if (batch.size % 100 == 0) {
    println("Flush legacy entities")
    batch.clear()
  }
})
I still have two problems:
- how can I commit the remaining elements (in MapReduce, the elements
still in the batch array are flushed in the cleanup() method)?
- if I have to create a connection to the server for pushing updates, is
it better to use mapPartitions instead of map?
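Right now the best I can come up with is a mapPartitions sketch along
these lines (PushClient is just a hypothetical stand-in for my server
connection, and the function takes a plain Iterator so it can be tried
outside Spark):

```scala
// Hypothetical stand-in for the real server connection
// (an assumption, not part of the original job).
class PushClient {
  def push(batch: Seq[String]): Unit =
    println(s"Flushing ${batch.size} legacy entities")
  def close(): Unit = ()
}

// One connection per partition; grouped(100) yields full batches of 100
// plus a final short batch, which covers the cleanup() remainder case.
def flushInBatches(records: Iterator[String]): Iterator[Int] = {
  val client = new PushClient()
  val flushedSizes = records.grouped(100).map { batch =>
    client.push(batch)
    batch.size
  }.toList // force evaluation before closing the connection
  client.close()
  flushedSizes.iterator
}
```

The idea would then be to call myRdd.mapPartitions(flushInBatches), so each
partition opens a single connection, and grouped(100) handles both the
100-element flushes and the final remainder in one place.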
Best,
Flavio