Thanks for the info, Burak! I filed a bug on myself at https://issues.apache.org/jira/browse/SPARK-3631 to turn this information into a new section in the programming guide. Thanks for the explanation; it's very helpful.
Andrew

On Wed, Sep 17, 2014 at 12:08 PM, Burak Yavuz <bya...@stanford.edu> wrote:

> Yes, writing to HDFS is more expensive, but I feel it is still a small
> price to pay compared to hitting a Disk Space Full error three hours in
> and having to start from scratch.
>
> The main goal of checkpointing is to truncate the lineage. Clearing up
> shuffle writes comes as a bonus to checkpointing; it is not the main goal.
> The subtlety here is that .checkpoint() is just like .cache(): until you
> call an action, nothing happens. Therefore, if you're going to do 1000
> maps in a row and you don't checkpoint in the meantime until a shuffle
> happens, you will still get a StackOverflowError, because the lineage is
> too long.
>
> I went through some of the code for checkpointing. As far as I can tell,
> it materializes the data in HDFS and resets all its dependencies, so you
> start a fresh lineage. My understanding is that checkpointing should
> still be done every N operations to reset the lineage, but an action must
> be performed before the lineage grows too long.
>
> I believe it would be nice to write up checkpointing in the programming
> guide. The reason it's not there yet, I believe, is that most
> applications don't grow such a long lineage, except in Spark Streaming
> and some MLlib algorithms. If you can help with the guide, I think it
> would be a nice feature to have!
>
> Burak
>
>
> ----- Original Message -----
> From: "Andrew Ash" <and...@andrewash.com>
> To: "Burak Yavuz" <bya...@stanford.edu>
> Cc: "Макар Красноперов" <connector....@gmail.com>, "user" <user@spark.apache.org>
> Sent: Wednesday, September 17, 2014 11:04:02 AM
> Subject: Re: Spark and disk usage.
>
> Thanks for the info!
>
> Are there performance impacts from writing to HDFS instead of local disk?
> I'm assuming that's why ALS checkpoints every third iteration instead of
> every iteration.
>
> Also, I can imagine that checkpointing should be done every N shuffles
> instead of every N operations (counting maps), since only the shuffle
> leaves data on disk. Do you have any suggestions on this?
>
> We should write up some guidance on the use of checkpointing in the
> programming guide
> <https://spark.apache.org/docs/latest/programming-guide.html> - I can
> help with this.
>
> Andrew
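For illustration, here is a minimal Scala sketch of the pattern Burak describes: checkpoint every N iterations and follow the checkpoint with an action, since .checkpoint() is lazy like .cache() and the lineage is only truncated once the RDD is materialized. The checkpoint directory, iteration counts, and data below are made up for the example; the cache() call before checkpoint() is a common practice to avoid recomputing the RDD when the checkpoint job runs, not something from the thread above.

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
  sc.setCheckpointDir("hdfs:///tmp/checkpoints")  // hypothetical HDFS path

  var rdd = sc.parallelize(1 to 1000000)
  for (i <- 1 to 100) {
    rdd = rdd.map(_ + 1)       // each transformation adds a link to the lineage
    if (i % 3 == 0) {          // every third iteration, as ALS does
      rdd.cache()              // avoid recomputing when the checkpoint job runs
      rdd.checkpoint()         // lazy: only marks this RDD for checkpointing
      rdd.count()              // an action materializes the checkpoint, which
                               // resets the dependencies to a fresh lineage
    }
  }

Without the count() (or some other action) inside the loop, nothing is written and a long enough chain of maps would still overflow the stack, exactly as described above.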