Ah, for #3, maybe this is what `rdd.checkpoint` does! https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
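For illustration, a minimal sketch of how checkpointing an RDD truncates its lineage (assuming a local Spark context and a hypothetical `/tmp/spark-checkpoints` directory; in production the checkpoint dir should be on reliable storage such as HDFS):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]"))

    // Checkpoint files are written here; use HDFS (or similar) in production.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint() // marks the RDD; it is materialized on the next action
    rdd.count()      // action triggers the checkpoint write

    // After the checkpoint, the lineage is cut at this RDD, so the
    // scheduler no longer has to walk (and list as skipped) all the
    // upstream stages of earlier jobs.
    println(rdd.toDebugString)

    sc.stop()
  }
}
```

If the growing task count per job really comes from an ever-longer lineage, cutting it this way (or saving and re-loading the RDD, as suggested below) should shrink the DAG the scheduler has to process at job submission.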
Thomas

On Mon, Jun 29, 2015 at 7:12 PM, Thomas Gerber <thomas.ger...@radius.com> wrote:

> Hello,
>
> It is my understanding that shuffles are written to disk and that they act
> as checkpoints.
>
> I wonder whether this is true only within a job, or across jobs. Please
> note that I use the words "job" and "stage" carefully here.
>
> 1. Can a shuffle created during JobN be used to skip many stages of
> JobN+1? Or is the lifecycle of the shuffle files bound to the job that
> created them?
>
> 2. When are shuffle files actually deleted? Is it TTL-based, or are they
> cleaned up when the job is over?
>
> 3. We have a very long batch application, and as it goes on, the number
> of total tasks for each job gets larger and larger. This is not really a
> problem, because most of those tasks get skipped since we cache RDDs.
> We noticed, however, a delay of about 1 minute per 2M tasks before a job
> actually starts. Are there suggested workarounds to avoid that delay?
> Maybe saving the RDD and re-loading it?
>
> Thanks,
> Thomas