You should use foreachPartition, and take care to open and close your
connection following the pattern described in:

http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E

Within a partition you iterate over the elements and call whatever code you
want to output results, whenever you want: you can flush every 100 elements
and then flush once more at the end. It is just your function, so you can do
as you like.

On Mon, Oct 27, 2014 at 11:45 PM, Flavio Pompermaier <pomperma...@okkam.it>
wrote:
> Hi to all,
> I'm trying to convert my old mapreduce job to a spark one but I have some
> doubts..
> My application basically buffers a batch of updates and every 100 elements
> it flushes the batch to a server. This is very easy in mapreduce but I don't
> know how you can do that in scala..
>
> For example, if I do:
>
> myRdd.map(x => {
>   val batch = new ArrayBuffer[someClass]()
>   batch += someClassFactory.fromString(x._2, x._3)
>   if (batch.size % 100 == 0) {
>     println("Flush legacy entities");
>     batch.clear
>   }
> })
>
> I still have two problems:
>
> - how can I commit the remaining elements (in mapreduce, the elements still
> in the batch array are flushed in the cleanup() method)?
> - if I have to create a connection to a server for pushing updates, is it
> better to use mapPartitions instead of map?
>
> Best,
> Flavio
>
>
