Anand,

AFAIK, you will need to change two settings:

spark.streaming.unpersist = false // so that Spark Streaming does not drop
the raw RDD data
spark.cleaner.ttl = <some reasonable value in seconds>
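
For example, a minimal sketch of putting both on the SparkConf (the values
are illustrative, not recommendations):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("union-with-stream")
    .set("spark.streaming.unpersist", "false") // keep the raw batch RDDs around
    .set("spark.cleaner.ttl", "3600")          // periodic cleaner drops old data after ~1h
  val ssc = new StreamingContext(conf, Seconds(10))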

Also be aware that the lineage of your union RDD will grow with each batch
interval. You will need to break the lineage often with cache(), and rely on
the ttl for cleanup.
You will probably be on tricky ground with this approach.
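
Something along these lines, reusing myRDD and myfilter from your code below
(just a sketch):

  dstream.foreachRDD { rdd =>
    myRDD = myRDD.union(rdd.filter(myfilter)).cache()
    // force evaluation so later batches reuse the cached blocks instead of
    // replaying an ever-longer lineage
    myRDD.count()
  }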

A more reliable way would be to do dstream.window(...) for the length of
time you want to keep the data and then union that data with your RDD for
further processing using transform.
Something like:
dstream.window(Seconds(900), Seconds(900))
  .transform(rdd => rdd union otherRdd)...
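
A slightly fuller sketch of that variant (the 900-second window is
illustrative, and otherRdd stands for your pre-loaded RDD):

  import org.apache.spark.streaming.Seconds

  val windowed = dstream.window(Seconds(900), Seconds(900))
  val combined = windowed.transform(rdd => rdd.union(otherRdd))
  // further processing on the combined data goes here
  combined.foreachRDD(rdd => rdd.take(10).foreach(println))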

If you need to keep an unbounded number of dstream batch intervals, consider
writing the data to secondary storage instead.
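
For instance, something like this (the path is a placeholder, and myfilter
is the filter from your code below):

  dstream.foreachRDD { (rdd, time) =>
    rdd.filter(myfilter)
       .saveAsTextFile(s"hdfs:///archive/batch-${time.milliseconds}")
  }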

-kr, Gerard.



On Tue, Jul 7, 2015 at 1:34 PM, Anand Nalya <anand.na...@gmail.com> wrote:

> Hi,
>
> Suppose I have an RDD that is loaded from some file and then I also have a
> DStream that has data coming from some stream. I want to keep unioning some of
> the tuples from the DStream into my RDD. For this I can use something like
> this:
>
>   var myRDD: RDD[(String, Long)] = sc.fromText...
>   dstream.foreachRDD{ rdd =>
>     myRDD = myRDD.union(rdd.filter(myfilter))
>   }
>
> My question is: for how long will Spark keep the RDDs underlying the
> dstream around? Is there some configuration knob that can control that?
>
> Regards,
> Anand
>
