Ah, for #3, maybe this is what `rdd.checkpoint` does! https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
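For illustration, a minimal sketch of how checkpointing an RDD truncates its lineage (assuming a local Spark context and a hypothetical `/tmp/spark-checkpoints` directory; in production the checkpoint dir should be on reliable storage such as HDFS):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]"))

    // Checkpoint files are written here; use HDFS (or similar) in production.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint() // marks the RDD; it is materialized on the next action
    rdd.count()      // action triggers the checkpoint write

    // After the checkpoint, the lineage is cut at this RDD, so the
    // scheduler no longer has to walk (and list as skipped) all the
    // upstream stages of earlier jobs.
    println(rdd.toDebugString)

    sc.stop()
  }
}
```

If the growing task count per job really comes from an ever-longer lineage, cutting it this way (or saving and re-loading the RDD, as suggested below) should shrink the DAG the scheduler has to process at job submission.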
Thomas

On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber <thomas.ger...@radius.com> wrote:

> Hello,
>
> It is my understanding that shuffles are written to disk and that they act
> as checkpoints.
>
> I wonder whether this is true only within a job, or across jobs. Please
> note that I use the words "job" and "stage" carefully here.
>
> 1. Can a shuffle created during JobN be used to skip many stages of
> JobN+1? Or is the lifecycle of the shuffle files bound to the job that
> created them?
>
> 2. When are shuffle files actually deleted? Is it TTL-based, or are they
> cleaned up when the job is over?
>
> 3. We have a very long batch application, and as it goes on, the number
> of total tasks for each job gets larger and larger. This is not really a
> problem, because most of those tasks get skipped since we cache RDDs.
> We noticed, however, a delay of about 1 minute per 2M tasks before a job
> actually starts. Are there suggested workarounds to avoid that delay?
> Maybe saving the RDD and re-loading it?
>
> Thanks,
> Thomas