Sunil G commented on YARN-4254:

Hi [~bibinchundatt]
Thank you for sharing the details. As you mentioned, the AM attempt is in the 
scheduled state, but the container has not been launched yet.
The container allocation for this app attempt (the AM container) was done in 
the NM heartbeat, but the container has yet to be pulled by the RMAppAttempt 
AMContainerAllocatedTransition. Based on our offline discussion, the pull must 
be failing due to the DNS or /etc/hosts lookup, which causes the looping 
attempt retries you mentioned. I agree with your point of view that this 
needs to be handled.

Currently, DNS may be unavailable for a while in some cases, so we must retry 
pulling such containers. This is done today in FiCaSchedulerApp. However, in 
cases like this JIRA, the retry causes a permanent hang for the application, 
since the container is allocated by the RM but can never be pulled because of 
continuous host lookup errors.

So if we validate the host in register/heartbeat, we must also ensure that we 
remove such containers from the newly allocated list. Alternatively, we could 
handle the exception while trying to create the container token and then 
remove the container from the {{newlyAllocatedContainers}} list. Thoughts?
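
To make the second option concrete, here is a minimal sketch of the idea, not the actual scheduler code. Every name in it ({{PullSketch}}, {{Allocated}}, {{PullResult}}, {{createToken}}, {{pull}}) is hypothetical, and host resolution is simulated with a set of resolvable host names instead of a real lookup. It only shows the control flow: when token creation throws UnknownHostException, the container is removed from the newly allocated list and reported as dropped instead of being left in the list and retried on every pull.

```java
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-ins for illustration; none of these are real YARN classes. */
final class PullSketch {

    /** A container the scheduler has allocated but the attempt has not pulled yet. */
    static final class Allocated {
        final String containerId;
        final String nodeHost;

        Allocated(String containerId, String nodeHost) {
            this.containerId = containerId;
            this.nodeHost = nodeHost;
        }
    }

    /** Outcome of one pull: containers handed over and containers given up on. */
    static final class PullResult {
        final List<String> pulled = new ArrayList<>();
        final List<String> dropped = new ArrayList<>();
    }

    /**
     * Stand-in for container token creation, which needs the node's address.
     * Assumption: resolution is simulated by membership in resolvableHosts.
     */
    static String createToken(Allocated c, Set<String> resolvableHosts)
            throws UnknownHostException {
        if (!resolvableHosts.contains(c.nodeHost)) {
            throw new UnknownHostException(c.nodeHost);
        }
        return "token-for-" + c.containerId;
    }

    /**
     * Pulls every newly allocated container. A container whose host cannot be
     * resolved is removed from the list and recorded as dropped, so it can no
     * longer block the attempt.
     */
    static PullResult pull(List<Allocated> newlyAllocated, Set<String> resolvableHosts) {
        PullResult result = new PullResult();
        Iterator<Allocated> it = newlyAllocated.iterator();
        while (it.hasNext()) {
            Allocated c = it.next();
            try {
                createToken(c, resolvableHosts);
                result.pulled.add(c.containerId);
            } catch (UnknownHostException e) {
                // Give up on this container instead of keeping it for a retry.
                result.dropped.add(c.containerId);
            }
            // Either way the container leaves the newly allocated list.
            it.remove();
        }
        return result;
    }
}
```

Note that dropping on the first failure gives up the retry behaviour described above for short DNS outages, so a real change would have to decide how the dropped containers are released and when the attempt is failed.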

> ApplicationAttempt stuck for ever due to UnknowHostexception
> ------------------------------------------------------------
>                 Key: YARN-4254
>                 URL: https://issues.apache.org/jira/browse/YARN-4254
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>         Attachments: 0001-YARN-4254.patch, Logs.txt, Test.patch
> Scenario
> =======
> 1. RM HA and 5 NMs available in cluster and are working fine 
> 2. Add one more NM to the same cluster but RM /etc/hosts not updated.
> 3. Submit application to the same cluster
> If the AM gets allocated to the newly added NM, the *application attempt will get 
> stuck forever*. The user will not get to know why this happened.
> Impact
> 1. RM logs get overloaded with exceptions.
> 2. Application gets stuck forever.
> Handling suggestion: YARN-261 allows failing an application attempt.
> If we fail the attempt, the next attempt could get assigned to another NM.
