It's worth adding that there's no guarantee that re-evaluated work will be on the same host as before; in the case of node failure, it will necessarily be elsewhere.
This means anything that depends on host-local information is going to generate different numbers even if there are no other side effects; random number generation for seeding RDD.sample() would be a case in point.

There's also the fact that if you enable speculative execution, then operations may be repeated, even in the absence of any failure. If you are doing side-effect work, or don't have an output committer whose actions are guaranteed to be atomic, then you want to turn that option off.

> On 27 Mar 2015, at 19:46, Patrick Wendell <pwend...@gmail.com> wrote:
>
> If you invoke this, you will get at-least-once semantics on failure.
> For instance, if a machine dies in the middle of executing the foreach
> for a single partition, that will be re-executed on another machine.
> It could even fully complete on one machine, but the machine dies
> immediately before reporting the result back to the driver.
>
> This means you need to make sure the side-effects are idempotent, or
> use some transactional locking. Spark's own output operations, such as
> saving to Hadoop, use such mechanisms. For instance, in the case of
> Hadoop it uses the OutputCommitter classes.
>
> - Patrick
>
> On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos <michal.klo...@gmail.com> wrote:
>> Hi Spark group,
>>
>> We haven't been able to find clear descriptions of how Spark handles the
>> resiliency of RDDs in relation to executing actions with side effects.
>> If you do an `rdd.foreach(someSideEffect)`, then you are doing a side effect
>> for each element in the RDD. If a partition goes down, the resiliency
>> rebuilds the data, but did it keep track of how far it got in the
>> partition's set of data, or will it start from the beginning again? So will
>> it do at-least-once execution of foreach closures or at-most-once?
>>
>> thanks,
>> Michal
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------
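The at-least-once behaviour Patrick describes is easy to reproduce outside Spark. Below is a minimal, self-contained Python sketch (none of these names are Spark APIs; the task, sinks, and retry are all simulated) of a partition-level foreach being re-executed against a naive append sink versus an idempotent keyed sink:

```python
# Hypothetical sketch, not Spark API: why foreach side effects must be
# idempotent under at-least-once execution. We simulate a partition task
# that may be re-run after a failure, writing to two kinds of "sinks".

def run_task(partition_id, records, sink):
    """Apply the side effect for every record in one partition."""
    for index, value in enumerate(records):
        sink.write(partition_id, index, value)

class AppendSink:
    """Naive sink: blindly appends, so a re-executed task double-counts."""
    def __init__(self):
        self.rows = []
    def write(self, partition_id, index, value):
        self.rows.append(value)

class IdempotentSink:
    """Keyed sink: (partition_id, index) identifies each write, so a
    re-executed task overwrites instead of duplicating."""
    def __init__(self):
        self.rows = {}
    def write(self, partition_id, index, value):
        self.rows[(partition_id, index)] = value

records = [10, 20, 30]

naive = AppendSink()
run_task(0, records, naive)
run_task(0, records, naive)   # task retried after a simulated failure
print(len(naive.rows))        # 6 -- the side effect was applied twice

safe = IdempotentSink()
run_task(0, records, safe)
run_task(0, records, safe)    # retry is harmless
print(len(safe.rows))         # 3 -- same final state either way
```

This is the same idea behind Hadoop's OutputCommitter: tasks write to a location keyed by their attempt, and only one attempt's output is committed, so retries and speculative duplicates never corrupt the final result.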