That's a good question. You have to
1. Ensure that the state is fault-tolerant, so that if the executor
dies or a task gets relaunched you don't lose prior state. This is a
fundamental challenge for all streaming systems, and it's pretty hard to do
without some persistence story outside the memory of the executor,
so that state is not lost when the executor fails. updateStateByKey
effectively persists intermediate state RDDs to HDFS using periodic RDD
checkpointing for exactly this reason (see the first sketch after this
list).
2. Assuming 1 is fulfilled somehow, maintain versioning (that is, batch
time) information along with the in-memory state, so that when tasks of
previous batches or even the current batch get re-executed, you can check
the version (that is, batch time) to see whether the state has already
been updated or not. Updating the batch time and state should be atomic
(see the second sketch below).
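For reference, here is a minimal Scala word-count sketch of point 1,
assuming a local socket source on port 9999 and an HDFS checkpoint path
(both placeholders). The key line is ssc.checkpoint(), which enables the
periodic state RDD checkpointing that updateStateByKey relies on:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("StatefulWordCount")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Required for updateStateByKey: state RDDs are periodically
  // checkpointed here, so state survives executor failure
  ssc.checkpoint("hdfs:///tmp/checkpoints/wordcount")

  val words = ssc.socketTextStream("localhost", 9999)
    .flatMap(_.split(" "))
    .map(word => (word, 1))

  // Fold each batch's new counts into the checkpointed per-key state
  val counts = words.updateStateByKey[Int] {
    (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))
  }

  counts.print()
  ssc.start()
  ssc.awaitTermination()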
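And a hand-rolled sketch of point 2, if you keep state yourself outside
Spark Streaming (all names here are hypothetical): each entry carries the
batch time of its last update, and the "already applied?" check and the
write happen atomically under a lock, so a relaunched task replaying an
old batch can't double-count:

  import scala.collection.mutable

  // Hypothetical executor-local store: count plus the batch time
  // of the update that produced it
  case class Versioned(count: Long, batchTime: Long)

  object LocalStateStore {
    private val store = mutable.Map.empty[String, Versioned]

    // Check-and-update atomically: a replayed task whose batchTime is
    // not newer than the recorded one is a duplicate and is skipped
    def update(word: String, delta: Long, batchTime: Long): Unit =
      store.synchronized {
        store.get(word) match {
          case Some(v) if v.batchTime >= batchTime =>
            () // this batch was already applied for this key; skip
          case prev =>
            store(word) =
              Versioned(prev.map(_.count).getOrElse(0L) + delta, batchTime)
        }
      }
  }

This assumes each batch produces at most one update per key; it sketches
the versioning idea, not a production store.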

TD

On Thu, Jul 9, 2015 at 12:30 PM, micvog <mich...@micvog.com> wrote:

> For example, in the simplest word count case I want to update the count in
> memory and always have the same word updated by the same task, not use any
> distributed memstore.
>
> I know that updateStateByKey should guarantee that, but how do you approach
> this problem outside of Spark Streaming?
>
> Thanks,
> Michael
>
>
>
> -----
> Michael Vogiatzis
> @mvogiatzis
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-spark-guarantee-that-the-same-task-will-process-the-same-key-over-time-tp23753.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
