Thanks for the pointers. I meant the previous run of spark-submit. Regarding 1: this would add a bit more computation to every batch.
Regarding 2: it's a good idea, but it may be inefficient to retrieve each value individually. In general, for a generic state machine, the initialization and the input sequence are critical for correctness.

On Sat, Sep 13, 2014 at 12:17 PM, qihong <qc...@pivotal.io> wrote:

> I'm not sure what you mean by "previous run". Is it the previous batch? Or
> the previous run of spark-submit?
>
> If it's the previous batch (Spark Streaming creates a batch every batch
> interval), then there's nothing to do.
>
> If it's the previous run of spark-submit (assuming you are able to save the
> result somewhere), then I can think of two possible ways to do it:
>
> 1. Read the saved result as an RDD (just do this once), and join that RDD
> with each RDD of the stateStream.
>
> 2. Add extra logic to the update function: when the previous state is None
> (one of the two Option values), retrieve the saved state for the given key
> from the saved result, then apply your original logic to create the new
> state object based on Seq[V] and the previous state. Note that you need to
> use this version of the update function:
> "updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)]",
> which makes the key available to the update function.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-initialize-StateDStream-tp14113p14176.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
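For anyone following along, approach 2 from the quoted message can be sketched in plain Scala without the Spark runtime. This is only an illustration: `savedState` stands in for results loaded from the previous spark-submit run (e.g. read once from HDFS), the running-sum state is a made-up example, and `updateFunc` just matches the shape of the iterator-based signature quoted above.

```scala
// Sketch of approach 2: an update function that falls back to state saved
// by a previous run when the in-memory state is None.
object InitStateSketch {
  type K = String
  type V = Int
  type S = Int // example state: a running sum per key

  // Hypothetical results from the previous spark-submit run,
  // assumed to have been loaded once from external storage.
  val savedState: Map[K, S] = Map("a" -> 10, "b" -> 5)

  // Matches the shape of the iterator-based signature:
  //   updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)]
  // Having the key available is what makes the saved-state lookup possible.
  def updateFunc(it: Iterator[(K, Seq[V], Option[S])]): Iterator[(K, S)] =
    it.map { case (key, values, state) =>
      // If there is no in-memory state yet (first batch after restart),
      // fall back to the saved state; otherwise keep the current state.
      val prev = state.orElse(savedState.get(key)).getOrElse(0)
      (key, prev + values.sum)
    }
}
```

In a real job this function would be passed to `updateStateByKey`, and the lookup into `savedState` is exactly the per-key retrieval that may be inefficient, as noted above.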