Hi Marius,

Are you using the sort or hash shuffle?

Also, do you have the external shuffle service enabled (so that the Worker
JVM or NodeManager can still serve the map spill files after an Executor
crashes)?

How many partitions are in your RDDs before and after the problematic
shuffle operation?
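
(For reference, here is roughly where those settings live — a sketch only; the property names below assume Spark 1.2.x and should be double-checked against the docs for your version:)

```
# spark-defaults.conf -- assuming Spark 1.2.x property names

# Shuffle implementation: "sort" (the default since 1.2) or "hash"
spark.shuffle.manager            sort

# External shuffle service, so map output files outlive a crashed executor.
# On standalone mode the Worker serves them; on YARN the NodeManager must
# also be configured to run the auxiliary shuffle service.
spark.shuffle.service.enabled    true
```

Partition counts can be checked in the spark-shell with `rdd.partitions.size` before and after the shuffle.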



On Monday, February 23, 2015, Marius Soutier <mps....@gmail.com> wrote:

> Hi guys,
>
> I keep running into a strange problem where my jobs start to fail with the
> dreaded "Resubmitted (resubmitted due to lost executor)" error because of
> too many temp files left over from previous runs.
>
> Both /var/run and /spill have enough disk space left, but after a certain
> number of jobs have run, subsequent jobs struggle to complete. There are a
> lot of failures without any exception message, only the lost-executor
> notice mentioned above. As soon as I clear out /var/run/spark/work/ and
> the spill disk, everything goes back to normal.
>
> Thanks for any hint,
> - Marius
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>