I think you could use checkpoint to cut the lineage of `MyRDD`. I have a 
similar scenario, and I use checkpoint to work around this problem :)
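
A minimal sketch of what this could look like, not a definitive implementation. The object and method names, the (String, Int) element type, the summing merge, the checkpoint directory, and the checkpoint interval are all assumptions made for illustration; only `checkpoint`, `cache`, `unpersist`, and `foreachRDD` come from the discussion itself.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object LineageTruncation {
  // Hypothetical setting: checkpoint once every N batches.
  val checkpointEvery = 10

  /** Merges one batch into the running RDD, optionally truncating its lineage. */
  def step(previous: RDD[(String, Int)],
           batch: RDD[(String, Int)],
           doCheckpoint: Boolean): RDD[(String, Int)] = {
    // Hypothetical merge logic: add each batch value to the stored value.
    // Assumes keys are unique within a batch.
    val updated = previous.leftOuterJoin(batch)
      .mapValues { case (old, delta) => old + delta.getOrElse(0) }
      .cache()
    // checkpoint() only marks the RDD; it must be called before the first action.
    if (doCheckpoint) updated.checkpoint()
    // The action materializes the cache and, if marked, writes the checkpoint.
    updated.count()
    // The new RDD is materialized, so the old cached copy can be dropped.
    previous.unpersist()
    updated
  }

  /** Wires `step` into a DStream; the closure runs on the driver for each batch. */
  def maintain(sc: SparkContext,
               initial: RDD[(String, Int)],
               updates: DStream[(String, Int)]): Unit = {
    sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path
    var myRDD = initial.cache()
    var batches = 0L
    updates.foreachRDD { rdd =>
      batches += 1
      myRDD = step(myRDD, rdd, batches % checkpointEvery == 0)
    }
  }
}
```

Checkpointing every batch keeps the lineage shortest but writes to the checkpoint directory each time, so an interval is a trade-off you would need to tune.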

Thanks
Jerry

-----Original Message-----
From: yaochunnan [mailto:yaochun...@gmail.com] 
Sent: Friday, May 8, 2015 1:57 PM
To: user@spark.apache.org
Subject: Possible long lineage issue when using DStream to update a normal RDD

Hi all,
Recently in our project, we need to update an RDD regularly using data received 
from a DStream. I plan to use the "foreachRDD" API to achieve this:
var MyRDD = ...
dstream.foreachRDD { rdd =>
  MyRDD = MyRDD.join(rdd).......
  ...
}

Is this usage correct? My concern is that, because I am repeatedly and endlessly 
reassigning MyRDD in order to update it, the RDD lineage may become too long 
to process when I want to query MyRDD later on (similar to
https://issues.apache.org/jira/browse/SPARK-4672). 

Maybe I should:
1. cache or checkpoint the latest MyRDD and unpersist the old MyRDD every time 
a new batch arrives from the DStream.
2. use the unpublished IndexedRDD
(https://github.com/amplab/spark-indexedrdd) to update the RDD efficiently.

As I lack experience using Spark Streaming and IndexedRDD, I am here to make 
sure my thoughts are on the right track. Your wise suggestions will be greatly 
appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-long-lineage-issue-when-using-DStream-to-update-a-normal-RDD-tp22812.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

