I have much the same issue. While I haven't totally solved it yet, I've found the "window" method useful for batching up archive blocks, but updateStateByKey is probably what we want, perhaps applied multiple times - if that works.
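Something along these lines is what I've been playing with - just a rough sketch, where the checkpoint path, host/port, and the word-count body are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedState {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedState")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/checkpoint")  // updateStateByKey requires a checkpoint dir

    // Placeholder source: lines of text from a socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // window() batches several intervals together before processing
    // (both durations must be multiples of the batch interval).
    val batched = lines.window(Seconds(60), Seconds(60))

    // updateStateByKey then carries running state per key across batches.
    val counts = batched
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .updateStateByKey[Long]((values: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + values.sum))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}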
My bigger worry now is storage. Unlike non-streaming apps, we tend to build up state that cannot be regenerated, and Hadoop files don't seem to be the best solution.

Jeremy Lee  BCompSci (Hons)
  The Unorthodox Engineers

> On 10 Jun 2014, at 11:00 am, Henggang Cui <cuihengg...@gmail.com> wrote:
>
> Hi,
>
> I'm wondering whether it's possible to continuously merge the RDDs coming from a stream into a single RDD efficiently.
>
> One thought is to use the union() method. But with union, I get a new RDD each time I do a merge. I don't know how I should name these RDDs, because I remember Spark does not encourage users to create an array of RDDs.
>
> Another possible solution is to follow the example of "StatefulNetworkWordCount", which uses the updateStateByKey() method. But my RDD type is not key-value pairs (it's a struct with multiple fields). Is there a workaround?
>
> Thanks,
> Cui
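Re the struct question above: one workaround I've seen is to key each record by one of its fields first (or by a constant key if there's no natural one), which makes updateStateByKey applicable. Rough sketch, with a made-up Event type standing in for your struct:

import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-in for your multi-field struct.
case class Event(id: String, value: Double)

def mergeStream(events: DStream[Event]): DStream[(String, Seq[Event])] = {
  events
    .map(e => (e.id, e))  // key by a field so updateStateByKey applies
    .updateStateByKey[Seq[Event]] { (fresh: Seq[Event], state: Option[Seq[Event]]) =>
      // Accumulate all records seen so far for each key.
      Some(state.getOrElse(Seq.empty) ++ fresh)
    }
}

The accumulated state lives in Spark's checkpointed state rather than as a growing chain of union()ed RDDs, which avoids the lineage/naming problem you mention.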