I have much the same issue. While I haven't totally solved it yet, I've found the "window" method useful for batching up archive blocks, but updateStateByKey is probably what we want, perhaps applied multiple times - if that works.
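Something along these lines is what I've been playing with - just a rough sketch, where the checkpoint path, host/port, and the word-count body are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedState {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedState")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoint")  // updateStateByKey requires a checkpoint dir

    // Placeholder source: lines of text from a socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // window() batches several intervals together before processing
    // (both durations must be multiples of the batch interval).
    val batched = lines.window(Seconds(60), Seconds(60))

    // updateStateByKey then carries running state per key across batches.
    val counts = batched
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .updateStateByKey[Long]((values: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + values.sum))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}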
My bigger worry now is storage. Unlike non-streaming apps, we tend to build up state that cannot be regenerated, and Hadoop files don't seem to be the best solution.

Jeremy Lee  BCompSci (Hons)
  The Unorthodox Engineers

> On 10 Jun 2014, at 11:00 am, Henggang Cui <cuihengg...@gmail.com> wrote:
>
> Hi,
>
> I'm wondering whether it's possible to continuously merge the RDDs coming from a stream into a single RDD efficiently.
>
> One thought is to use the union() method. But with union, I get a new RDD each time I do a merge. I don't know how I should name these RDDs, because I remember Spark does not encourage users to create an array of RDDs.
>
> Another possible solution is to follow the example of "StatefulNetworkWordCount", which uses the updateStateByKey() method. But my RDD type is not key-value pairs (it's a struct with multiple fields). Is there a workaround?
>
> Thanks,
> Cui
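Re the struct question above: one workaround I've seen is to key each record by one of its fields first (or by a constant key if there's no natural one), which makes updateStateByKey applicable. Rough sketch, with a made-up Event type standing in for your struct:

import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-in for your multi-field struct.
case class Event(id: String, value: Double)

def mergeStream(events: DStream[Event]): DStream[(String, Seq[Event])] = {
  events
    .map(e => (e.id, e))  // key by a field so updateStateByKey applies
    .updateStateByKey[Seq[Event]] { (fresh: Seq[Event], state: Option[Seq[Event]]) =>
      // Accumulate all records seen so far for each key.
      Some(state.getOrElse(Seq.empty) ++ fresh)
    }
}

The accumulated state lives in Spark's checkpointed state rather than as a growing chain of union()ed RDDs, which avoids the lineage/naming problem you mention.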