Thanks for the info, Burak!

I filed a bug on myself at https://issues.apache.org/jira/browse/SPARK-3631
to turn this information into a new section in the programming guide.
Thanks for the explanation; it's very helpful.

Andrew

On Wed, Sep 17, 2014 at 12:08 PM, Burak Yavuz <bya...@stanford.edu> wrote:

> Yes, writing to HDFS is more expensive, but I feel it is still a small
> price to pay when compared to having a Disk Space Full error three hours in
> and having to start from scratch.
>
> The main goal of checkpointing is to truncate the lineage. Clearing up
> shuffle writes comes as a bonus; it is not the main goal. The subtlety
> here is that .checkpoint() is just like .cache(): until you call an
> action, nothing happens. Therefore, if you're going to do 1000 maps in a
> row and you hold off on checkpointing until a shuffle happens, you will
> still get a StackOverflowError, because the lineage is too long.
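>
> For example, a minimal sketch (assuming `sc` is an existing
> SparkContext, and the checkpoint directory is a placeholder path):
>
>     sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // placeholder path
>     val rdd = sc.parallelize(1 to 1000000).map(_ + 1)
>     rdd.checkpoint()  // like .cache(): nothing is written yet
>     rdd.count()       // the action runs the job and writes the checkpoint
>     // from here on, rdd's dependencies point at the checkpointed data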
>
> I went through some of the code for checkpointing. As far as I can tell,
> it materializes the data in HDFS and resets all of the RDD's
> dependencies, so you start a fresh lineage. My understanding is that
> checkpointing should still be done every N operations to reset the
> lineage, but an action must be performed before the lineage grows too
> long.
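>
> To make that concrete, here is a rough sketch of the every-N-operations
> pattern (N = 50, the input data, and the `step` function are all
> arbitrary placeholders):
>
>     def step(x: Int): Int = x + 1  // stand-in for a real transformation
>
>     var data = sc.parallelize(0 until 1000000)
>     for (i <- 1 to 1000) {
>       data = data.map(step)
>       if (i % 50 == 0) {
>         data.cache()       // otherwise the checkpoint job recomputes data
>         data.checkpoint()
>         data.count()       // action: materializes and resets the lineage
>       }
>     }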
>
> I believe it would be nice to write up checkpointing in the programming
> guide. The reason it's not there yet, I believe, is that most
> applications don't grow such a long lineage, except in Spark Streaming
> and some MLlib algorithms. If you can help with the guide, I think it
> would be a nice feature to have!
>
> Burak
>
>
> ----- Original Message -----
> From: "Andrew Ash" <and...@andrewash.com>
> To: "Burak Yavuz" <bya...@stanford.edu>
> Cc: "Макар Красноперов" <connector....@gmail.com>, "user" <
> user@spark.apache.org>
> Sent: Wednesday, September 17, 2014 11:04:02 AM
> Subject: Re: Spark and disk usage.
>
> Thanks for the info!
>
> Are there performance impacts from writing to HDFS instead of local
> disk? I'm assuming that's why ALS checkpoints every third iteration
> instead of every iteration.
>
> Also, I can imagine that checkpointing should be done every N shuffles
> instead of every N operations (counting maps), since only the shuffle
> leaves data on disk. Do you have any suggestions on this?
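>
> Roughly what I have in mind, as a toy sketch (`words` is a placeholder
> RDD[String]; the import is needed on older Spark versions for
> reduceByKey):
>
>     import org.apache.spark.SparkContext._
>
>     val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // shuffle
>     counts.checkpoint()  // truncate the lineage at the shuffle boundary
>     counts.count()       // force it with an action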
>
> We should write up some guidance on the use of checkpointing in the
> programming guide
> <https://spark.apache.org/docs/latest/programming-guide.html>; I can
> help with this.
>
> Andrew
>
>
