Jian He updated YARN-2074:

    Attachment: YARN-2074.1.patch

This patch stops AM preemption from being counted as an AM failure.
It checks the diagnostics of the attempt to determine whether the attempt was 
preempted.
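
As a rough illustration of the idea (not the code in YARN-2074.1.patch), the 
check could look something like the sketch below. The class, the method names, 
and the marker string are all hypothetical; the actual diagnostics text and the 
RM classes involved may differ.

```java
import java.util.List;

// Hypothetical sketch: none of these names come from the patch itself.
final class AmFailureAccounting {

    // Assumed marker text; the real diagnostics string used by the RM may differ.
    static final String PREEMPTED_MARKER = "Container preempted by scheduler";

    /** Minimal stand-in for a finished AM attempt: only its diagnostics string. */
    static final class Attempt {
        final String diagnostics;

        Attempt(String diagnostics) {
            this.diagnostics = diagnostics;
        }
    }

    /** An attempt is treated as preempted if its diagnostics contain the marker. */
    static boolean isPreempted(Attempt attempt) {
        return attempt.diagnostics != null
            && attempt.diagnostics.contains(PREEMPTED_MARKER);
    }

    /** Counts finished attempts as failures, skipping the preempted ones. */
    static int countFailures(List<Attempt> finishedAttempts) {
        int failures = 0;
        for (Attempt attempt : finishedAttempts) {
            if (!isPreempted(attempt)) {
                failures++;
            }
        }
        return failures;
    }

    /** The application fails only once non-preempted failures reach the limit. */
    static boolean shouldFailApplication(List<Attempt> finishedAttempts,
                                         int maxAttempts) {
        return countFailures(finishedAttempts) >= maxAttempts;
    }
}
```

With a limit of 2 attempts, one preempted attempt plus one genuine failure 
would leave the application running in this sketch, since only the genuine 
failure is counted.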

There is a race condition related to RM restart that this patch does not 
address. If the attempt is preempted and the RM restarts before the attempt 
state is saved in the state store, the new RM won't be able to tell whether the 
previous attempt was preempted.
Fixing this may require an NM-RM protocol change that tells the NM whether the 
AM was preempted or killed, so that when the RM recovers, the NM can report 
back whether the previous AM container was preempted. In addition, the 
RMContainer transitions may also need to change accordingly. We may fix this 
in a separate JIRA.
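
The race can be pictured with a toy model such as the one below. It is only a 
sketch: StateStore, Outcome, and outcomeSeenByNewRm are made-up names standing 
in for the RM state store and the recovery path, not actual YARN classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of the restart race; not actual RM code.
final class RestartRaceSketch {

    enum Outcome { PREEMPTED, FAILED, UNKNOWN }

    /** Stand-in for the RM state store, which survives an RM restart. */
    static final class StateStore {
        private final Map<String, Outcome> saved = new HashMap<>();

        void save(String attemptId, Outcome outcome) {
            saved.put(attemptId, outcome);
        }

        /** Returns UNKNOWN when nothing was persisted for the attempt. */
        Outcome load(String attemptId) {
            return saved.getOrDefault(attemptId, Outcome.UNKNOWN);
        }
    }

    /**
     * The old RM learns in memory that the attempt was preempted, and may
     * crash before persisting it. The new RM can only consult the store.
     */
    static Outcome outcomeSeenByNewRm(StateStore store, String attemptId,
                                      boolean crashBeforeSave) {
        Outcome inMemory = Outcome.PREEMPTED;
        if (!crashBeforeSave) {
            store.save(attemptId, inMemory);
        }
        // Simulated restart: the in-memory value is lost, only the store remains.
        return store.load(attemptId);
    }
}
```

In this model the new RM sees UNKNOWN whenever the crash happens before the 
save, which is the gap that a report from the NM after recovery would be meant 
to fill.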

> Preemption of AM containers shouldn't count towards AM failures
> ---------------------------------------------------------------
>                 Key: YARN-2074
>                 URL: https://issues.apache.org/jira/browse/YARN-2074
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Jian He
>         Attachments: YARN-2074.1.patch
> One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM 
> containers getting preempted shouldn't count towards AM failures and thus 
> shouldn't eventually fail applications.
> We should explicitly handle AM container preemption/kill as a separate issue 
> and not count it towards the limit on AM failures.
