Hi Sean, there is certainly a difference between a "batch" RDD and a "streaming" RDD, and in your previous reply you have already outlined some of it. Other differences are in the object-oriented model / API of Spark, which also matters besides the RDD / Spark cluster platform architecture.
Secondly, in the previous email I clearly described what I mean by "update": it is the result of an RDD transformation and hence a new RDD derived from the previously joined/unioned/cogrouped one, i.e. NOT a "mutation" of an existing RDD. Let's also leave aside the architectural goal of why I want to keep updating a batch RDD with new data coming from DStream RDDs; FYI, it is NOT to "make streaming RDDs long-lived".

Let me now go back to the overall objective. The app context is a Spark Streaming job. I want to "update" / "add" the content of incoming streaming RDDs (e.g. from a JavaDStream) to an already loaded (e.g. from an HDFS file) batch RDD, e.g. a JavaRDD. The only way to union / join / cogroup a DStream RDD with a batch RDD is via the "transform" method, which always returns a DStream RDD, NOT a batch RDD - check the API.

On a separate note, your suggestion to keep reloading a batch RDD from a file may have some applications in other scenarios, so let's drill down into it. In the context of a Spark Streaming app, where the driver launches a DAG pipeline and then essentially just hangs, I guess the only way to keep reloading a batch RDD from a file is from a separate thread, still using the same Spark context. The thread would reload the batch RDD under the same reference, i.e. reassign the reference to the newly instantiated/loaded batch RDD. Is that what you mean by reloading a batch RDD from a file?

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Wednesday, April 15, 2015 7:43 PM
To: Evo Eftimov
Cc: user@spark.apache.org
Subject: Re: adding new elements to batch RDD from DStream RDD

What do you mean by "batch RDD"? They're just RDDs, though they store their data in different ways and come from different sources. You can union an RDD from an HDFS file with one from a DStream. It sounds like you want streaming data to live longer than its batch interval, but that's not something you can expect the streaming framework to provide.
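To make the reload-from-a-separate-thread idea concrete, here is a minimal Scala sketch of the pattern under discussion (the path, the 60-second interval, and the parsing into pairs are illustrative assumptions, not from the thread): a volatile driver-side reference is periodically reassigned to a freshly loaded RDD, and each micro-batch picks up the current reference inside `transform`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

object ReloadableBatchRdd {

  def run(ssc: StreamingContext, stream: DStream[(String, String)]): Unit = {
    val sc = ssc.sparkContext
    val batchPath = "hdfs:///data/reference" // assumed location

    // Mutable reference to the current "batch" RDD; volatile so the
    // driver-side transform closure sees reassignments from the reloader thread.
    @volatile var batchRdd: RDD[(String, String)] =
      sc.textFile(batchPath).map(line => (line, line)).cache()

    // Separate thread that periodically replaces the reference with a newly
    // loaded RDD; the old RDD is dropped and unpersisted, never mutated.
    val reloader = new Thread(new Runnable {
      def run(): Unit = while (true) {
        Thread.sleep(60000) // reload every minute (assumed interval)
        val fresh = sc.textFile(batchPath).map(line => (line, line)).cache()
        fresh.count()       // materialize before swapping in
        val old = batchRdd
        batchRdd = fresh
        old.unpersist()
      }
    })
    reloader.setDaemon(true)
    reloader.start()

    // transform's function runs on the driver once per micro-batch, so it
    // reads the latest batchRdd each time; the join result is still a DStream.
    stream.transform(rdd => rdd.join(batchRdd)).print()
  }
}
```

This sketch illustrates the point made above: the join result remains a DStream, but each micro-batch joins against whatever batch RDD the reference currently points to.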
It's perfectly possible to save the RDD's data to a persistent store and use it later. You can't update RDDs; they're immutable. You can re-read data from a persistent store by making a new RDD at any time.

On Wed, Apr 15, 2015 at 7:37 PM, Evo Eftimov <evo.efti...@isecc.com> wrote:
> The only way to join / union / cogroup a DStream RDD with a batch RDD is
> via the "transform" method, which returns another DStream RDD and
> hence it gets discarded at the end of the micro-batch.
>
> Is there any way to e.g. union a DStream RDD with a batch RDD which
> produces a new batch RDD containing the elements of both the DStream
> RDD and the batch RDD?
>
> And once such a batch RDD is created in the above way, can it be used by
> other DStream RDDs to e.g. join with, as this time the result can be
> another DStream RDD?
>
> Effectively the functionality described above will result in
> periodic updates (additions) of elements to a batch RDD; the
> additional elements will keep coming from DStream RDDs which keep
> streaming in with every micro-batch.
> Also, newly arriving DStream RDDs will be able to join with the thus
> previously updated batch RDD and produce a result DStream RDD.
>
> Something almost like that can be achieved with updateStateByKey, but
> is there a way to do it as described here?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/adding-new-elements-to-batch-RDD-from-DStream-RDD-tp22504.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
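For comparison, the updateStateByKey alternative mentioned at the end of the quoted message keeps per-key state alive across micro-batches instead of maintaining a batch RDD. A minimal sketch (the value type, the merge logic, and the checkpoint path are assumptions for illustration):

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Merge the values that arrived in this micro-batch into the accumulated
// state for the key; returning Some keeps the key's state alive.
def accumulate(newValues: Seq[String],
               state: Option[Seq[String]]): Option[Seq[String]] =
  Some(state.getOrElse(Seq.empty) ++ newValues)

def withState(ssc: StreamingContext,
              stream: DStream[(String, String)]): DStream[(String, Seq[String])] = {
  ssc.checkpoint("hdfs:///tmp/checkpoints") // updateStateByKey requires checkpointing
  stream.updateStateByKey(accumulate _)     // DStream of (key, accumulated values)
}
```

As the thread notes, this gives "something almost like" the desired behavior: state persists across micro-batches, but it is keyed state managed by the framework rather than a standalone batch RDD that other jobs can join against.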