You may have already seen it, but I will mention it anyway. This example may help. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala
Here the state is essentially a running count of the words seen. So the value type (i.e., V) is Int (the count of a word in each batch) and the state type (i.e., S) is also Int (the running count). The update function essentially sums the running count with the new counts to generate a new running count.

TD

On Tue, Apr 29, 2014 at 1:49 PM, Sean Owen <so...@cloudera.com> wrote:
> The original DStream is of (K,V). This function creates a DStream of
> (K,S). Each time slice brings one or more new V for each K. The old
> state S (which can be different from V!) for each K -- possibly
> non-existent -- is updated in some way by a bunch of new V, to produce
> a new state S -- which also might not exist anymore after the update.
> That's why the function is from a Seq[V] and an Option[S] to an
> Option[S].
>
> If your RDD has value type V = Double, then your function needs to
> update state based on a new Seq[Double] at each time slice, since
> Doubles are the new things arriving for each key at each time slice.
>
> On Tue, Apr 29, 2014 at 7:50 PM, Adrian Mocanu
> <amoc...@verticalscope.com> wrote:
> > What is Seq[V] in updateStateByKey?
> >
> > Does this store the collected tuples of the RDD in a collection?
> >
> > Method signature:
> >
> > def updateStateByKey[S: ClassTag] ( updateFunc: (Seq[V], Option[S]) =>
> > Option[S] ): DStream[(K, S)]
> >
> > In my case I used Seq[Double], assuming a sequence of Doubles in the
> > RDD; the moment I switched to a different type like Seq[(String,
> > Double)] the code didn't compile.
> >
> > -Adrian
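To make the types concrete, here is a small sketch of the update function on its own, outside Spark. The first function mirrors the running word count from the example above (V = Int, S = Int); the second shows the shape the function must take when V = (String, Double), which is the case where the Seq[Double] version stopped compiling. The function names and the Double-summing state logic are illustrative, not from the thread.

```scala
// V = Int (count of a word in one batch), S = Int (running count).
// Spark calls this once per key per batch: newCounts holds every new V
// for that key in the batch; runningCount is the previous state S
// (None if the key has never been seen). Returning None drops the key.
def updateRunningCount(newCounts: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(runningCount.getOrElse(0) + newCounts.sum)

// If the DStream's value type is V = (String, Double), the first
// parameter must be Seq[(String, Double)] to match (Seq[V], Option[S]).
// Here the hypothetical state S = Double sums the Double components.
def updatePairTotal(newVals: Seq[(String, Double)], total: Option[Double]): Option[Double] =
  Some(total.getOrElse(0.0) + newVals.map(_._2).sum)
```

Wiring the first function into a stream would then look roughly like `wordDstream.updateStateByKey[Int](updateRunningCount _)`, yielding a `DStream[(String, Int)]` of running counts.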