Thanks for the pointers. I meant the previous run of spark-submit.

For 1: this would add a bit more computation to every batch.
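A minimal sketch of what the join in option 1 could look like, with plain Maps standing in for RDDs (in real Spark this would be something like a `transform` on the stateStream joining against the saved RDD; all names here are hypothetical):

```scala
// Sketch of option 1: merge a one-time saved result into each batch.
// Maps stand in for RDDs to keep the example self-contained.
object JoinInitSketch {
  type K = String
  type S = Int

  // Assumption: the saved result from the previous spark-submit run, read once.
  val saved: Map[K, S] = Map("a" -> 10, "c" -> 7)

  // Full-outer-join-like merge of one batch with the saved state.
  def mergeBatch(batch: Map[K, S]): Map[K, S] =
    (batch.keySet ++ saved.keySet).map { k =>
      k -> (batch.getOrElse(k, 0) + saved.getOrElse(k, 0))
    }.toMap
}
```

The extra work this implies per batch is exactly the join over the saved keys, which is the overhead mentioned above.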

For 2: it's a good idea, but retrieving each value individually may be inefficient.

In general, for a generic state machine, the initialization and the input
sequence are critical for correctness.
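For concreteness, here is a sketch of option 2 from the reply below: an update function with the iterator-based signature that falls back to previously saved state when Spark passes None. `savedState` is a hypothetical stand-in for state persisted by the previous spark-submit run (in a real job it might be a broadcast map or an external store):

```scala
object StateInitSketch {
  type K = String
  type V = Int
  type S = Int // running sum as a toy state

  // Assumption: loaded once at startup from the previous run's saved result.
  val savedState: Map[K, S] = Map("a" -> 10)

  // Matches the shape of the updateFunc the quoted mail cites:
  // (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)]
  def updateFunc(it: Iterator[(K, Seq[V], Option[S])]): Iterator[(K, S)] =
    it.map { case (key, newValues, prevState) =>
      // If there is no in-memory state yet, try the saved result first.
      val base = prevState.orElse(savedState.get(key)).getOrElse(0)
      (key, base + newValues.sum)
    }
}
```

Note the initialization-order caveat: the fallback only fires the first time a key is seen, so the saved state must be consistent with the point in the input sequence where the new run resumes.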




On Sat, Sep 13, 2014 at 12:17 PM, qihong <qc...@pivotal.io> wrote:

> I'm not sure what you mean by "previous run". Is it previous batch? or
> previous run of spark-submit?
>
> If it's "previous batch" (spark streaming creates a batch every batch
> interval), then there's nothing to do.
>
> If it's previous run of spark-submit (assuming you are able to save the
> result somewhere), then I can think of two possible ways to do it:
>
> 1. read the saved result as an RDD (just do this once), and join that RDD
> with each RDD of the stateStream.
>
> 2. add extra logic to updateFunction: when the previous state is None (one
> of the two Option values), fetch the saved state for the given key from the
> saved result somehow, then apply your original logic to create the new state
> object based on Seq[V] and the previous state. Note that you need to use
> this version of updateFunc: "updateFunc: (Iterator[(K, Seq[V], Option[S])])
> => Iterator[(K, S)]", which makes the key available to the update function.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-StateDStream-tp14113p14176.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
