[
https://issues.apache.org/jira/browse/YARN-4254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955134#comment-14955134
]
Sunil G commented on YARN-4254:
-------------------------------
Hi [~bibinchundatt]
Thank you for sharing the details. As you mentioned, the AM attempt is in the
scheduled state, but the container has not yet been launched.
However, the container for this app attempt (the AM container) has already
been allocated in the NM heartbeat and is yet to be pulled by RMAppAttempt's
AMContainerAllocatedTransition. Based on our offline discussion, the pull must
be failing because of the DNS or /etc/hosts lookup, which causes the attempt to
retry in a loop, as you mentioned. I agree with your point of view that this
needs to be handled.
In some cases DNS may be unavailable for a while, so we must retry pulling
such containers. This is currently done in FiCaSchedulerApp. However, in cases
like this JIRA, the retry causes a permanent hang for the application, since
the container has been allocated by the RM but can never be pulled because the
host lookup keeps failing.
So if we validate the host during register/heartbeat, we must also ensure that
we remove such containers from the newly allocated list. Alternatively, we
could handle the exception raised while creating the container token and then
remove the container from the {{newlyAllocatedContainers}} list. Thoughts?
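
To make the second option concrete, here is a rough, standalone sketch. It is not the actual FiCaSchedulerApp code: {{PullContainersSketch}}, {{AllocatedContainer}}, {{TokenFactory}} and {{rejectedContainerIds}} are hypothetical names used only for illustration, and token creation is reduced to a callback that may throw {{UnknownHostException}}.

```java
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative only: a simplified model of pulling newly allocated containers.
class PullContainersSketch {

    /** Hypothetical stand-in for an allocated container (not the YARN Container class). */
    static final class AllocatedContainer {
        final String id;
        final String nodeHost;

        AllocatedContainer(String id, String nodeHost) {
            this.id = id;
            this.nodeHost = nodeHost;
        }
    }

    /** Hypothetical stand-in for token creation, which needs the node host to resolve. */
    interface TokenFactory {
        String createToken(AllocatedContainer container) throws UnknownHostException;
    }

    final List<AllocatedContainer> newlyAllocatedContainers = new ArrayList<>();
    final List<String> rejectedContainerIds = new ArrayList<>();

    /**
     * Tries to create a token for every newly allocated container. A container
     * whose host cannot be resolved is recorded as rejected and removed from
     * the list, so it is not retried forever.
     */
    List<String> pullNewlyAllocatedContainers(TokenFactory tokens) {
        List<String> pulledTokens = new ArrayList<>();
        Iterator<AllocatedContainer> it = newlyAllocatedContainers.iterator();
        while (it.hasNext()) {
            AllocatedContainer container = it.next();
            try {
                pulledTokens.add(tokens.createToken(container));
            } catch (UnknownHostException e) {
                // Assumption: the caller releases the container and fails the attempt.
                rejectedContainerIds.add(container.id);
            }
            // Remove in both cases: either pulled successfully or rejected.
            it.remove();
        }
        return pulledTokens;
    }

    public static void main(String[] args) {
        PullContainersSketch app = new PullContainersSketch();
        app.newlyAllocatedContainers.add(new AllocatedContainer("c1", "known-host"));
        app.newlyAllocatedContainers.add(new AllocatedContainer("c2", "unknown-host"));

        // Simulated token creation that fails for the unresolvable host.
        List<String> pulled = app.pullNewlyAllocatedContainers(c -> {
            if (c.nodeHost.equals("unknown-host")) {
                throw new UnknownHostException(c.nodeHost);
            }
            return "token-" + c.id;
        });

        System.out.println("pulled=" + pulled + " rejected=" + app.rejectedContainerIds);
    }
}
```

This sketch drops a container on the first lookup failure. Since DNS can be off only temporarily, a real change would probably allow a bounded number of retries before removing the container.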
> ApplicationAttempt stuck for ever due to UnknowHostexception
> ------------------------------------------------------------
>
> Key: YARN-4254
> URL: https://issues.apache.org/jira/browse/YARN-4254
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Attachments: 0001-YARN-4254.patch, Logs.txt, Test.patch
>
>
> Scenario
> =======
> 1. RM HA is enabled and 5 NMs are available in the cluster and working fine.
> 2. Add one more NM to the same cluster, but the RM's /etc/hosts is not updated.
> 3. Submit an application to the same cluster.
> If the AM gets allocated to the newly added NM, the *application attempt will get
> stuck forever*. The user will not get to know why this happened.
> Impact
> 1. RM logs get overloaded with exceptions.
> 2. The application gets stuck forever.
> Handling suggestion: YARN-261 allows failing an application attempt.
> If we fail the attempt, the next attempt could get assigned to another NM.