Thanks Silvio.
On Mon, Jun 29, 2015 at 7:41 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> Regarding 1 and 2, yes, shuffle output is stored on the worker local
> disks and will be reused across jobs as long as it is available. You can
> identify when it is reused by seeing skipped stages in the job UI. Shuffle
> files are periodically cleaned up based on the available space of the
> configured spark.local.dirs paths.
>
> From: Thomas Gerber
> Date: Monday, June 29, 2015 at 10:12 PM
> To: user
> Subject: Shuffle files lifecycle
>
> Hello,
>
> It is my understanding that shuffles are written to disk and that they
> act as checkpoints.
>
> I wonder if this is true only within a job, or across jobs. Please note
> that I use the words job and stage carefully here.
>
> 1. Can a shuffle created during JobN be used to skip many stages of
> JobN+1? Or is the lifecycle of the shuffle files bound to the job that
> created them?
>
> 2. When are shuffle files actually deleted? Is it TTL-based, or are they
> cleaned up when the job is over?
>
> 3. We have a very long batch application, and as it goes on, the number
> of total tasks for each job gets larger and larger. This is not really a
> problem, because most of those tasks will be skipped since we cache RDDs.
> We noticed, however, a delay of about 1 minute in the actual start of a
> job for every 2M tasks in the job. Are there suggested workarounds to
> avoid that delay? Maybe saving the RDD and re-loading it?
>
> Thanks,
> Thomas
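
For reference, a minimal, untested Scala sketch of both points (the paths and object name below are made up, not from the thread): the second action reuses the shuffle written by the first, which shows up as a skipped stage in the web UI, and saving the RDD out and re-loading it gives later jobs a short lineage.

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleReuseSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-reuse-sketch"))

        // reduceByKey introduces a shuffle; its map-side output lands in
        // the workers' spark.local.dirs.
        val counts = sc.textFile("hdfs:///tmp/input")            // assumed path
          .flatMap(_.split("\\s+"))
          .map(w => (w, 1L))
          .reduceByKey(_ + _)

        // Job 1: runs the map stage and writes the shuffle files.
        counts.count()

        // Job 2: the map stage is listed as "skipped" in the UI because the
        // shuffle files from Job 1 are still on disk and are reused.
        counts.filter { case (_, n) => n > 10 }.count()

        // Workaround for very long lineages: write the RDD out and read it
        // back so later jobs start from a short lineage.
        counts.saveAsObjectFile("hdfs:///tmp/counts-snapshot")   // assumed path
        val reloaded = sc.objectFile[(String, Long)]("hdfs:///tmp/counts-snapshot")
        reloaded.count()

        sc.stop()
      }
    }

RDD.checkpoint() (after sc.setCheckpointDir) is another way to truncate the lineage without a manual save/reload.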