That's a good question. You have to:

1. Ensure that the state is fault-tolerant, so that if an executor dies or a task gets relaunched you don't lose prior state. This is a fundamental challenge for all streaming systems, and it is hard to solve without some persistence story outside the executor's memory, so that state is not lost when the executor fails. updateStateByKey persists the intermediate state RDDs to HDFS through periodic RDD checkpointing for exactly this reason.

2. Assuming 1 is fulfilled somehow, you have to maintain versioning (that is, batch time) information along with the in-memory state, so that when tasks of previous batches, or even the current batch, get re-executed, you can check the version (that is, batch time) to see whether the state has already been updated. Updating the batch time and the state must be atomic.
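To make point 2 concrete, here is a minimal Spark-free sketch (all names hypothetical, not part of any Spark API) of a per-key state store that records the batch time alongside the count, so a replayed task from an already-applied batch can be detected and skipped:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical sketch: state carries its version (batch time) so that
// the "has this batch already been applied?" check is possible.
final case class VersionedCount(batchTime: Long, count: Long)

class VersionedStateStore {
  private val store = TrieMap.empty[String, VersionedCount]

  /** Apply `delta` for `key` at `batchTime`; returns false if the state
    * already reflects this batch (i.e. a replayed or stale task). */
  def update(key: String, batchTime: Long, delta: Long): Boolean =
    store.get(key) match {
      case Some(v) if batchTime <= v.batchTime =>
        false // duplicate or stale batch: skip, state is already updated
      case prev =>
        // Batch time and count are written together in a single put, so a
        // reader never sees a new count with an old version (a real
        // concurrent store would need a compare-and-set loop here).
        store.put(key, VersionedCount(batchTime, prev.map(_.count).getOrElse(0L) + delta))
        true
    }

  def get(key: String): Option[Long] = store.get(key).map(_.count)
}
```

This is only an illustration of the versioning idea; it is not fault-tolerant by itself and would still need the persistence from point 1 (e.g. checkpointing or an external store) to survive executor failure.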
TD

On Thu, Jul 9, 2015 at 12:30 PM, micvog <mich...@micvog.com> wrote:
> For example in the simplest word count example, I want to update the count in
> memory and always have the same word getting updated by the same task, not
> use any distributed memstore.
>
> I know that updateStateByKey should guarantee that, but how do you approach
> this problem outside of spark streaming?
>
> Thanks,
> Michael
>
> -----
> Michael Vogiatzis
> @mvogiatzis
>
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-spark-guarantee-that-the-same-task-will-process-the-same-key-over-time-tp23753.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.