It's worth adding that there's no guarantee that re-evaluated work will be on the same host as before; in the case of node failure, it will necessarily be elsewhere.
This means anything that depends on host-local information is going to generate different numbers even if there are no other side effects; random number generation for seeding RDD.sample() would be a case in point.

There's also the fact that if you enable speculative execution, then operations may be repeated, even in the absence of any failure. If you are doing side-effect work, or don't have an output committer whose actions are guaranteed to be atomic, then you want to turn that option off.

> On 27 Mar 2015, at 19:46, Patrick Wendell <pwend...@gmail.com> wrote:
>
> If you invoke this, you will get at-least-once semantics on failure.
> For instance, if a machine dies in the middle of executing the foreach
> for a single partition, that will be re-executed on another machine.
> It could even fully complete on one machine, but the machine dies
> immediately before reporting the result back to the driver.
>
> This means you need to make sure the side-effects are idempotent, or
> use some transactional locking. Spark's own output operations, such as
> saving to Hadoop, use such mechanisms. For instance, in the case of
> Hadoop it uses the OutputCommitter classes.
>
> - Patrick
>
> On Fri, Mar 27, 2015 at 12:36 PM, Michal Klos <michal.klo...@gmail.com> wrote:
>> Hi Spark group,
>>
>> We haven't been able to find clear descriptions of how Spark handles the
>> resiliency of RDDs in relation to executing actions with side effects.
>> If you do an `rdd.foreach(someSideEffect)`, then you are doing a side effect
>> for each element in the RDD. If a partition goes down, the resiliency
>> rebuilds the data, but did it keep track of how far it got in the
>> partition's set of data, or will it start from the beginning again? So will
>> it do at-least-once execution of foreach closures or at-most-once?
>>
>> thanks,
>> Michal
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------
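The at-least-once behaviour Patrick describes is easy to reproduce outside Spark. Below is a minimal, self-contained Python sketch (none of these names are Spark APIs; the task, sinks, and retry are all simulated) of a partition-level foreach being re-executed against a naive append sink versus an idempotent keyed sink:

```python
# Hypothetical sketch, not Spark API: why foreach side effects must be
# idempotent under at-least-once execution. We simulate a partition task
# that may be re-run after a failure, writing to two kinds of "sinks".

def run_task(partition_id, records, sink):
    """Apply the side effect for every record in one partition."""
    for index, value in enumerate(records):
        sink.write(partition_id, index, value)

class AppendSink:
    """Naive sink: blindly appends, so a re-executed task double-counts."""
    def __init__(self):
        self.rows = []
    def write(self, partition_id, index, value):
        self.rows.append(value)

class IdempotentSink:
    """Keyed sink: (partition_id, index) identifies each write, so a
    re-executed task overwrites instead of duplicating."""
    def __init__(self):
        self.rows = {}
    def write(self, partition_id, index, value):
        self.rows[(partition_id, index)] = value

records = [10, 20, 30]

naive = AppendSink()
run_task(0, records, naive)
run_task(0, records, naive)   # task retried after a simulated failure
print(len(naive.rows))        # 6 -- the side effect was applied twice

safe = IdempotentSink()
run_task(0, records, safe)
run_task(0, records, safe)    # retry is harmless
print(len(safe.rows))         # 3 -- same final state either way
```

This is the same idea behind Hadoop's OutputCommitter: tasks write to a location keyed by their attempt, and only one attempt's output is committed, so retries and speculative duplicates never corrupt the final result.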