I’m facing the same situation. It would be great if someone could provide a code snippet as an example.
On Jun 28, 2014, at 12:36 PM, Nilesh Chakraborty <nil...@nileshc.com> wrote:

> Hello,
>
> In a thread about "java.lang.StackOverflowError when calling count()" [1] I
> saw Tathagata Das share an interesting approach for truncating RDD lineage -
> this helps prevent StackOverflowErrors in high-iteration jobs while avoiding
> the disk-writing performance penalty. Here's an excerpt from TD's post:
>
> If you are brave enough, you can try the following. Instead of relying on
> checkpointing to HDFS for truncating lineage, you can do the following.
> 1. Persist the Nth RDD with replication (see the different StorageLevels); this
> replicates the in-memory RDD between workers within Spark. Let's call
> this RDD R.
> 2. Force it to materialize in memory.
> 3. Create a modified RDD R` which has the same data as RDD R but does not
> have the lineage. This is done by creating a new BlockRDD using the ids of
> the blocks of data representing the in-memory R (I can elaborate on that if you
> want).
>
> This avoids writing to HDFS (the replication happens in Spark memory),
> truncates the lineage (by creating new BlockRDDs), and avoids the
> StackOverflowError.
>
> ---------------------------------------------------------------------
>
> Now I'm not sure how to do step 3. Any ideas? I'm CC'ing Tathagata too.
>
> Cheers,
> Nilesh
>
> [1]:
> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201405.mbox/%3ccamwrk0kiqxhktfuaamhborov5lv+d8y+c5nycmsxtqasze4...@mail.gmail.com%3E
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Alternative-to-checkpointing-and-materialization-for-truncating-lineage-in-high-iteration-jobs-tp8488.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
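For what it's worth, here is a rough sketch of how I understand TD's three steps in Scala. Caveats: `BlockRDD` is `private[spark]`, so this only compiles if you place it under the `org.apache.spark` package (or use reflection); `truncateLineage` is my own hypothetical helper name, not a Spark API; and it assumes every partition of the RDD is actually cached as an `RDDBlockId(rdd.id, partitionIndex)` block, which is how the block manager names cached RDD partitions but which I have not verified across Spark versions. Treat it as a starting point, not a tested implementation.

```scala
// BlockRDD is private[spark], so this sketch must live in that package.
package org.apache.spark

import scala.reflect.ClassTag

import org.apache.spark.rdd.{BlockRDD, RDD}
import org.apache.spark.storage.{BlockId, RDDBlockId, StorageLevel}

object LineageTruncation {

  /**
   * Hypothetical helper (not a Spark API): returns an RDD with the same
   * data as `rdd` but no lineage, backed directly by its cached blocks.
   */
  def truncateLineage[T: ClassTag](rdd: RDD[T]): RDD[T] = {
    // 1. Persist with in-memory replication (MEMORY_ONLY_2 keeps two
    //    copies), so losing one executor does not lose data -- there is
    //    no lineage left to recompute it from.
    rdd.persist(StorageLevel.MEMORY_ONLY_2)

    // 2. Force materialization so every partition is actually cached.
    rdd.count()

    // 3. Cached RDD partitions are stored under RDDBlockId(rddId, index);
    //    build a new BlockRDD over those blocks. It has no parent RDDs,
    //    so the lineage (and the recursive DAG depth) is cut here.
    val blockIds: Array[BlockId] =
      rdd.partitions.indices.map(i => RDDBlockId(rdd.id, i): BlockId).toArray
    new BlockRDD[T](rdd.sparkContext, blockIds)
  }
}
```

In an iterative job you would then call something like `current = truncateLineage(current)` every N iterations instead of `current.checkpoint()`. Note the trade-off versus checkpointing to HDFS: if enough executors fail that all replicas of a block are lost, the BlockRDD cannot be recomputed and the job fails, whereas a checkpointed RDD can be re-read from HDFS.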