Hi all, 
Recently in our project we need to keep updating an RDD with data regularly
received from a DStream. I plan to use the "foreachRDD" API to achieve this:
var MyRDD = ...                    // the RDD I want to keep updating
dstream.foreachRDD { rdd =>
  MyRDD = MyRDD.join(rdd).......   // update MyRDD with each incoming batch
  ...
}

Is this usage correct? My concern is that, since I am repeatedly and endlessly
reassigning MyRDD in order to update it, it will build up an overly long RDD
lineage to process when I later query MyRDD (similar to
https://issues.apache.org/jira/browse/SPARK-4672).

Maybe I should:
1. cache or checkpoint the latest MyRDD and unpersist the old MyRDD every time
a new batch comes in (see the sketch after this list), or
2. use the unpublished IndexedRDD
(https://github.com/amplab/spark-indexedrdd) to update the RDD efficiently.
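
For option 1, here is a rough, untested sketch of what I have in mind
(assuming "dstream" carries (String, Long) pairs and "sc" is the SparkContext
from my snippet above; the checkpoint directory and the union/reduceByKey
update step are only placeholders for my real join logic):

import org.apache.spark.rdd.RDD

sc.setCheckpointDir("hdfs:///tmp/myrdd-checkpoints")   // placeholder path

// running state, here a simple (key, count) RDD just for illustration
var MyRDD: RDD[(String, Long)] = sc.emptyRDD[(String, Long)]

dstream.foreachRDD { rdd =>
  val oldRDD = MyRDD
  // placeholder update step (my real code would do the join mentioned above)
  val newRDD = oldRDD.union(rdd).reduceByKey(_ + _)
  newRDD.cache()
  // checkpoint() truncates the lineage, but it only takes effect once the
  // RDD is actually materialized, hence the count() below
  newRDD.checkpoint()
  newRDD.count()
  MyRDD = newRDD
  // drop the previous copy from memory now that the new one is persisted
  oldRDD.unpersist()
}

One thing I am not sure about: checkpointing on every batch may mean a lot of
HDFS writes, so maybe I should only checkpoint every N batches and let the
lineage grow a little in between?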

As I lack experience with Spark Streaming and IndexedRDD, I want to make sure
my thoughts are on the right track. Any suggestions would be greatly
appreciated.


