Do you have YARN log aggregation enabled?

You can try retrieving the log for the container with the following command:

yarn logs -applicationId application_1445957755572_0176 \
  -containerId container_1445957755572_0176_01_000003
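
If it helps, here is a rough sketch of how that command could be wrapped to save the container log to a file and scan it for obvious errors. The output file name and the grep pattern are my own assumptions, not anything Spark or YARN prescribes. Also, depending on the Hadoop version, `yarn logs` may require `-nodeAddress` alongside `-containerId`; check `yarn logs -help` on your cluster.

```shell
#!/bin/sh
# IDs copied from the log lines quoted in this thread.
APP_ID=application_1445957755572_0176
CONTAINER_ID=container_1445957755572_0176_01_000003
# Hypothetical output file name, chosen only for this example.
OUT="${CONTAINER_ID}.log"

if command -v yarn >/dev/null 2>&1; then
    # Fetch the aggregated log for the lost executor's container.
    yarn logs -applicationId "$APP_ID" -containerId "$CONTAINER_ID" > "$OUT"
    # Assumed pattern: surface error lines and exceptions with line numbers.
    grep -n -E 'ERROR|Exception|OutOfMemoryError' "$OUT" || echo "no matches in $OUT"
else
    echo "yarn CLI not found on PATH; run this on a cluster node" >&2
fi
```

Aggregated logs only become available once log aggregation is enabled (`yarn.log-aggregation-enable` in yarn-site.xml) and the application has finished.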

Cheers

On Thu, Nov 19, 2015 at 8:02 AM, <ross.cramb...@thomsonreuters.com> wrote:

> I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
> transforms on a JSON data set that I load into a data frame. The data set
> is not large (~100GB) and most stages execute without any issues. However,
> some more complex stages tend to lose executors/nodes regularly. What would
> cause this to happen? The logs don’t give too much information -
>
> 15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on
> ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container
> container_1445957755572_0176_01_000003)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID
> 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID
> 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID
> 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID
> 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID
> 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID
> 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID
> 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID
> 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID
> 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> 15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID
> 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
> [Stage 33:===============================>                     (117 + 50)
> / 200]15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with
> remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275]
> has failed, address is now gated for [5000] ms. Reason: [Disassociated]
>
>  - Followed by a list of lost tasks on each executor.
