[
https://issues.apache.org/jira/browse/YARN-5374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375759#comment-15375759
]
Wangda Tan commented on YARN-5374:
----------------------------------
[~LucasW], it seems to me that the issue is caused by the Spark application not
handling the container preemption message properly. If so, I suggest dropping a
mail to the Spark mailing list or filing a Spark JIRA instead.
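For reference, a minimal sketch of what handling the RM's preemption warning
looks like on the AM side, using the plain YARN AMRMClient API. This is only an
illustration, not Spark's actual YarnAllocator code; the surrounding class and
method names are made up, while the YARN calls themselves (allocate,
getPreemptionMessage, getStrictContract, releaseAssignedContainer) are the
standard hooks:
{code:java}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Hypothetical helper, for illustration only.
public class PreemptionAwareAllocator {

  // Called on each AM heartbeat; 'client' is the AM's AMRMClient.
  static void heartbeat(AMRMClient<AMRMClient.ContainerRequest> client) throws Exception {
    AllocateResponse response = client.allocate(0.1f);

    // The RM attaches a PreemptionMessage to the allocate response as a warning
    // before it force-kills containers.
    PreemptionMessage preemption = response.getPreemptionMessage();
    if (preemption != null && preemption.getStrictContract() != null) {
      // Containers in the strict contract will be killed if not released in time.
      for (PreemptionContainer c : preemption.getStrictContract().getContainers()) {
        // A real AM would first tell the executor in this container to shut down
        // cleanly (checkpoint/offload work), then release the container.
        System.out.println("Container scheduled for preemption: " + c.getId());
        client.releaseAssignedContainer(c.getId());
      }
    }
  }
}
{code}
Spark's AM would need equivalent logic so it can shut executors down cleanly and
release the named containers before the RM force-kills them.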
> Preemption causing communication loop
> -------------------------------------
>
> Key: YARN-5374
> URL: https://issues.apache.org/jira/browse/YARN-5374
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler, nodemanager, resourcemanager, yarn
> Affects Versions: 2.7.1
> Environment: Yarn version: Hadoop 2.7.1-amzn-0
> AWS EMR Cluster running:
> 1 x r3.8xlarge (Master)
> 52 x r3.8xlarge (Core)
> Spark version : 1.6.0
> Scala version: 2.10.5
> Java version: 1.8.0_51
> Input size: ~10 tb
> Input coming from S3
> Queue Configuration:
> Dynamic allocation: enabled
> Preemption: enabled
> Q1: 70% capacity with max of 100%
> Q2: 30% capacity with max of 100%
> Job Configuration:
> Driver memory = 10g
> Executor cores = 6
> Executor memory = 10g
> Deploy mode = cluster
> Master = yarn
> maxResultSize = 4g
> Shuffle manager = hash
> Reporter: Lucas Winkelmann
> Priority: Blocker
>
> Here is the scenario:
> I launch job 1 into Q1 and allow it to grow to 100% cluster utilization.
> I wait between 15-30 mins (with 100% of the cluster available this job takes
> about 1 hr to complete, so job 1 is between 25-50% complete). Note that if I
> wait less time the issue sometimes does not occur; it appears to happen only
> after job 1 is at least 25% complete.
> I launch job 2 into Q2, and preemption occurs on Q1, shrinking job 1 down to
> 70% of cluster utilization.
> At this point job 1 basically halts progress while job 2 continues to execute
> as normal and finishes. Job 1 then either:
> - Fails its attempt and restarts. By the time this attempt fails, the other job
> is already complete, meaning the second attempt has the full cluster available
> and finishes.
> - Remains at its current progress and simply does not finish (I have waited
> ~6 hrs before finally killing the application).
>
> Looking into the error log, there is this constantly repeating error message:
> WARN NettyRpcEndpointRef: Error sending message [message =
> RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host:
> ip-NUMBERS.ec2.internal was preempted.)] in X attempts
>
> My observations have led me to believe that the application master does not
> know this container has been killed and continuously asks it to remove the
> executor, until it either eventually fails the attempt or just keeps trying to
> remove the executor indefinitely.
>
> I have done much digging online for anyone else experiencing this issue but
> have come up with nothing.
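Regarding the hypothesis in the report above: once a container has actually been
preempted, the RM reports it back to the AM as a completed container with exit
status PREEMPTED (-102), so the AM can tell a preempted executor apart from a
live one and stop retrying it. A minimal sketch, again against the plain YARN
API rather than Spark's code, with a hypothetical handler class:
{code:java}
import java.util.List;

import org.apache.hadoop.yarn.api.records.ContainerExitStatus;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Hypothetical handler, for illustration only. In practice this logic would sit
// in the AMRMClientAsync.CallbackHandler#onContainersCompleted callback.
public class PreemptionStatusHandler {

  static void onContainersCompleted(List<ContainerStatus> statuses) {
    for (ContainerStatus status : statuses) {
      if (status.getExitStatus() == ContainerExitStatus.PREEMPTED) {
        // The executor in this container is already dead: drop it from the AM's
        // bookkeeping and re-request capacity, instead of retrying RPCs to it.
        System.out.println("Container " + status.getContainerId()
            + " was preempted by the scheduler; treating the executor as lost");
      }
      // Other exit statuses (normal exit, node failure, OOM) handled separately.
    }
  }
}
{code}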