Hi Sean, well, there is certainly a difference between a "batch" RDD and a 
"streaming" RDD, and in your previous reply you have already outlined some of 
the differences. Others lie in the object-oriented model / API of Spark, which 
also matters besides the RDD / Spark cluster platform architecture.

Secondly, in my previous email I clearly described what I mean by "update": 
it is the result of an RDD transformation, and hence a new RDD derived from 
the previously joined/unioned/cogrouped one, i.e. not "mutating" an existing RDD.
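For instance, a minimal sketch (batchRdd and newElements are just illustrative 
names of mine):

    // union() returns a brand-new RDD; the original batchRdd is untouched
    JavaRDD<String> updatedBatch = batchRdd.union(newElements);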

Let's also leave aside the architectural goal of why I want to keep updating a 
batch RDD with new data coming from DStream RDDs; FYI, it is NOT to "make 
streaming RDDs long-lived".

Let me now go back to the overall objective. The app context is a Spark 
Streaming job. I want to "update" / "add" the content of incoming streaming 
RDDs (e.g. JavaDStream RDDs) to an already loaded batch RDD, e.g. a JavaRDD 
loaded from an HDFS file. The only way to union / join / cogroup a DStream RDD 
with a batch RDD is via the "transform" method, which always returns a DStream 
RDD, NOT a batch RDD - check the API.
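To illustrate, a minimal sketch (assuming "lines" is a JavaDStream<String> and 
"batchRdd" is a JavaRDD<String> loaded from HDFS; the names are mine):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // transform() hands us the RDD of each micro-batch; whatever RDD we
    // return is wrapped back into a DStream, so the unioned result is a
    // DStream RDD and gets discarded at the end of the micro-batch.
    JavaDStream<String> unioned =
        lines.transform((JavaRDD<String> streamRdd) -> streamRdd.union(batchRdd));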

On a separate note, your suggestion to keep reloading a batch RDD from a file 
may have applications in other scenarios, so let's drill down into it. In the 
context of a Spark Streaming app, where the driver launches a DAG pipeline and 
then essentially just hangs, I guess the only way to keep reloading a batch 
RDD from a file is from a separate thread, still using the same SparkContext. 
The thread would reload the batch RDD under the same reference, i.e. reassign 
the reference to the newly instantiated/loaded batch RDD. Is that what you 
mean by reloading a batch RDD from a file?
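If so, something along these lines is what I have in mind (a hedged sketch; 
ctx, hdfsPath, lines and the 10-minute period are assumptions of mine, not 
anything from your reply):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicReference;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.streaming.api.java.JavaDStream;

    // Shared, swappable reference to the "current" batch RDD
    AtomicReference<JavaRDD<String>> batchRef =
        new AtomicReference<>(ctx.textFile(hdfsPath).cache());

    // Separate thread, same SparkContext: periodically re-read the file
    // and reassign the reference to the newly loaded batch RDD
    ScheduledExecutorService reloader = Executors.newSingleThreadScheduledExecutor();
    reloader.scheduleAtFixedRate(() -> {
        JavaRDD<String> fresh = ctx.textFile(hdfsPath).cache();
        fresh.count();                           // materialize before swapping
        JavaRDD<String> old = batchRef.getAndSet(fresh);
        old.unpersist();                         // release the stale copy
    }, 10, 10, TimeUnit.MINUTES);

    // The streaming side always reads through the reference, so each
    // micro-batch picks up whatever batch RDD is current at that moment
    JavaDStream<String> joined =
        lines.transform(rdd -> rdd.union(batchRef.get()));

One caveat: calling unpersist() on an RDD that a running micro-batch may still 
be using could need unpersist(false) to avoid blocking, and the reload jobs 
will compete with the streaming job for cluster resources.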

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Wednesday, April 15, 2015 7:43 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: adding new elements to batch RDD from DStream RDD

What do you mean by "batch RDD"? They're just RDDs, though they store their 
data in different ways and come from different sources. You can union an RDD 
from an HDFS file with one from a DStream.

It sounds like you want streaming data to live longer than its batch interval, 
but that's not something you can expect the streaming framework to provide. 
It's perfectly possible to save the RDD's data to persistent store and use it 
later.

You can't update RDDs; they're immutable. You can re-read data from persistent 
store by making a new RDD at any time.

On Wed, Apr 15, 2015 at 7:37 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> The only way to join / union / cogroup a DStream RDD with a Batch RDD is 
> via the "transform" method, which returns another DStream RDD and 
> hence it gets discarded at the end of the micro-batch.
>
> Is there any way to e.g. union a DStream RDD with a Batch RDD which 
> produces a new Batch RDD containing the elements of both the DStream 
> RDD and the Batch RDD?
>
> And once such a Batch RDD is created in the above way, can it be used 
> by other DStream RDDs to e.g. join with, as this time the result can 
> be another DStream RDD?
>
> Effectively the functionality described above will result in 
> periodic updates (additions) of elements to a Batch RDD; the 
> additional elements will keep coming from DStream RDDs which keep 
> streaming in with every micro-batch.
> Also, newly arriving DStream RDDs will be able to join with the thus 
> previously updated Batch RDD and produce a result DStream RDD.
>
> Something almost like that can be achieved with updateStateByKey, but 
> is there a way to do it as described here?

