I’m facing the same situation. It would be great if someone could provide a code snippet as an example.
On Jun 28, 2014, at 12:36 PM, Nilesh Chakraborty <nil...@nileshc.com> wrote:

> Hello,
>
> In a thread about "java.lang.StackOverflowError when calling count()" [1] I
> saw Tathagata Das share an interesting approach for truncating RDD lineage -
> this helps prevent StackOverflowErrors in high-iteration jobs while avoiding
> the disk-writing performance penalty. Here's an excerpt from TD's post:
>
> If you are brave enough, you can try the following. Instead of relying on
> checkpointing to HDFS for truncating lineage, you can do the following.
> 1. Persist the Nth RDD with replication (see the different StorageLevels); this
> replicates the in-memory RDD between workers within Spark. Let's call
> this RDD R.
> 2. Force it to materialize in memory.
> 3. Create a modified RDD R` which has the same data as RDD R but does not
> have the lineage. This is done by creating a new BlockRDD using the ids of
> the blocks of data representing the in-memory R (I can elaborate on that if you
> want).
>
> This avoids writing to HDFS (the replication happens in Spark memory),
> truncates the lineage (by creating new BlockRDDs), and avoids the
> StackOverflowError.
>
> ---------------------------------------------------------------------
>
> Now I'm not sure how to do step 3. Any ideas? I'm CC'ing Tathagata too.
>
> Cheers,
> Nilesh
>
> [1]:
> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201405.mbox/%3ccamwrk0kiqxhktfuaamhborov5lv+d8y+c5nycmsxtqasze4...@mail.gmail.com%3E
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Alternative-to-checkpointing-and-materialization-for-truncating-lineage-in-high-iteration-jobs-tp8488.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
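For what it's worth, here is a rough sketch of how I understand TD's three steps in Scala. Caveats: `BlockRDD` is `private[spark]`, so this only compiles if you place it under the `org.apache.spark` package (or use reflection); `truncateLineage` is my own hypothetical helper name, not a Spark API; and it assumes every partition of the RDD is actually cached as an `RDDBlockId(rdd.id, partitionIndex)` block, which is how the block manager names cached RDD partitions but which I have not verified across Spark versions. Treat it as a starting point, not a tested implementation.

```scala
// BlockRDD is private[spark], so this sketch must live in that package.
package org.apache.spark

import scala.reflect.ClassTag

import org.apache.spark.rdd.{BlockRDD, RDD}
import org.apache.spark.storage.{BlockId, RDDBlockId, StorageLevel}

object LineageTruncation {

  /**
   * Hypothetical helper (not a Spark API): returns an RDD with the same
   * data as `rdd` but no lineage, backed directly by its cached blocks.
   */
  def truncateLineage[T: ClassTag](rdd: RDD[T]): RDD[T] = {
    // 1. Persist with in-memory replication (MEMORY_ONLY_2 keeps two
    //    copies), so losing one executor does not lose data -- there is
    //    no lineage left to recompute it from.
    rdd.persist(StorageLevel.MEMORY_ONLY_2)

    // 2. Force materialization so every partition is actually cached.
    rdd.count()

    // 3. Cached RDD partitions are stored under RDDBlockId(rddId, index);
    //    build a new BlockRDD over those blocks. It has no parent RDDs,
    //    so the lineage (and the recursive DAG depth) is cut here.
    val blockIds: Array[BlockId] =
      rdd.partitions.indices.map(i => RDDBlockId(rdd.id, i): BlockId).toArray
    new BlockRDD[T](rdd.sparkContext, blockIds)
  }
}
```

In an iterative job you would then call something like `current = truncateLineage(current)` every N iterations instead of `current.checkpoint()`. Note the trade-off versus checkpointing to HDFS: if enough executors fail that all replicas of a block are lost, the BlockRDD cannot be recomputed and the job fails, whereas a checkpointed RDD can be re-read from HDFS.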