Thanks Silvio.
On Mon, Jun 29, 2015 at 7:41 PM, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> Regarding 1 and 2, yes, shuffle output is stored on the worker local
> disks and will be reused across jobs as long as it is available. You can
> identify when it is reused by seeing skipped stages in the job UI. Shuffle
> files are periodically cleaned up based on the available space of the
> configured spark.local.dirs paths.
>
> From: Thomas Gerber
> Date: Monday, June 29, 2015 at 10:12 PM
> To: user
> Subject: Shuffle files lifecycle
>
> Hello,
>
> It is my understanding that shuffles are written to disk and that they
> act as checkpoints.
>
> I wonder if this is true only within a job, or across jobs. Please note
> that I use the words job and stage carefully here.
>
> 1. Can a shuffle created during JobN be used to skip many stages of
> JobN+1? Or is the lifecycle of the shuffle files bound to the job that
> created them?
>
> 2. When are shuffle files actually deleted? Is it TTL-based, or are they
> cleaned up when the job is over?
>
> 3. We have a very long batch application, and as it goes on, the number
> of total tasks for each job gets larger and larger. This is not really a
> problem, because most of those tasks will be skipped since we cache RDDs.
> We noticed, however, a delay of about 1 minute in the actual start of a
> job for every 2M tasks in the job. Are there suggested workarounds to
> avoid that delay? Maybe saving the RDD and re-loading it?
>
> Thanks,
> Thomas
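
For reference, a minimal, untested Scala sketch of both points (the paths and object name below are made up, not from the thread): the second action reuses the shuffle written by the first, which shows up as a skipped stage in the web UI, and saving the RDD out and re-loading it gives later jobs a short lineage.

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleReuseSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shuffle-reuse-sketch"))

        // reduceByKey introduces a shuffle; its map-side output lands in
        // the workers' spark.local.dirs.
        val counts = sc.textFile("hdfs:///tmp/input")            // assumed path
          .flatMap(_.split("\\s+"))
          .map(w => (w, 1L))
          .reduceByKey(_ + _)

        // Job 1: runs the map stage and writes the shuffle files.
        counts.count()

        // Job 2: the map stage is listed as "skipped" in the UI because the
        // shuffle files from Job 1 are still on disk and are reused.
        counts.filter { case (_, n) => n > 10 }.count()

        // Workaround for very long lineages: write the RDD out and read it
        // back so later jobs start from a short lineage.
        counts.saveAsObjectFile("hdfs:///tmp/counts-snapshot")   // assumed path
        val reloaded = sc.objectFile[(String, Long)]("hdfs:///tmp/counts-snapshot")
        reloaded.count()

        sc.stop()
      }
    }

RDD.checkpoint() (after sc.setCheckpointDir) is another way to truncate the lineage without a manual save/reload.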