[
https://issues.apache.org/jira/browse/YARN-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jian He updated YARN-2074:
--------------------------
Attachment: YARN-2074.1.patch
Patch to stop counting AM preemption as an AM failure.
The patch checks the attempt's diagnostics to determine whether the attempt
was preempted.
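To illustrate the idea, here is a minimal sketch of a diagnostics-based check.
It is not the code in the attached patch: the class, the method names, and the
marker string are all hypothetical, chosen only to show how preempted attempts
could be left out of the AM failure count.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names are taken from the YARN code base.
final class AmFailureCountSketch {

    // Assumed marker text that the scheduler writes into the attempt
    // diagnostics when it preempts the AM container.
    static final String PREEMPTED_MARKER = "Container preempted by scheduler";

    /** Returns true if the diagnostics say the AM container was preempted. */
    static boolean isPreempted(String diagnostics) {
        return diagnostics != null && diagnostics.contains(PREEMPTED_MARKER);
    }

    /** Counts only the finished attempts that are charged against the limit. */
    static int countedFailures(List<String> attemptDiagnostics) {
        int count = 0;
        for (String diagnostics : attemptDiagnostics) {
            if (!isPreempted(diagnostics)) {
                count++;
            }
        }
        return count;
    }

    /** The application fails only once the non-preempted failures hit the limit. */
    static boolean shouldFailApplication(List<String> attemptDiagnostics,
                                         int maxAttempts) {
        return countedFailures(attemptDiagnostics) >= maxAttempts;
    }

    public static void main(String[] args) {
        List<String> diagnostics = Arrays.asList(
            "AM container exited with code 1",
            "Container preempted by scheduler",
            "AM container exited with code 137");
        System.out.println(countedFailures(diagnostics));          // prints 2
        System.out.println(shouldFailApplication(diagnostics, 3)); // prints false
    }
}
```

With a limit of 3, the three finished attempts above do not fail the
application, because the preempted one is not counted.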
There's a race condition related to RM restart that this patch does not
address: if the attempt is preempted and the RM restarts before the attempt
state is saved in the state store, the new RM won't be able to figure out
whether the previous attempt was preempted.
Fixing this may require an NM-RM protocol change that tells the NM whether the
AM was preempted or killed, so that when the RM recovers, the NM can report
back whether the previous AM container was preempted. In addition, the
RMContainer transitions may also need to change accordingly. We may fix this
in a separate JIRA.
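One possible shape for that follow-up, as a rough sketch only: the NM records
the reason it was given when asked to stop a container and reports it back
when a restarted RM resyncs. Every type and method below is hypothetical; this
is not the existing NM-RM protocol, and treating an unknown reason as a
failure is just one possible choice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed recovery flow; not an existing YARN API.
final class PreemptionRecoverySketch {

    enum KillReason { PREEMPTED, KILLED, UNKNOWN }

    /** Stand-in for the NM: remembers why it was told to stop each container. */
    static final class NodeSide {
        private final Map<String, KillReason> reasons = new HashMap<>();

        void stopContainer(String containerId, KillReason reason) {
            reasons.put(containerId, reason);
        }

        /** What the NM would report to a recovering RM for this container. */
        KillReason reportOnResync(String containerId) {
            return reasons.getOrDefault(containerId, KillReason.UNKNOWN);
        }
    }

    /** Stand-in for the restarted RM, which has no saved state for the attempt. */
    static boolean countsAsAmFailure(NodeSide nm, String amContainerId) {
        // Assumption: only an explicit PREEMPTED report exempts the attempt.
        return nm.reportOnResync(amContainerId) != KillReason.PREEMPTED;
    }

    public static void main(String[] args) {
        NodeSide nm = new NodeSide();
        nm.stopContainer("am_container_1", KillReason.PREEMPTED);
        System.out.println(countsAsAmFailure(nm, "am_container_1")); // prints false
        System.out.println(countsAsAmFailure(nm, "am_container_2")); // prints true
    }
}
```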
> Preemption of AM containers shouldn't count towards AM failures
> ---------------------------------------------------------------
>
> Key: YARN-2074
> URL: https://issues.apache.org/jira/browse/YARN-2074
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Jian He
> Attachments: YARN-2074.1.patch
>
>
> One orthogonal concern with issues like YARN-2055 and YARN-2022 is that AM
> containers getting preempted shouldn't count towards AM failures and thus
> shouldn't eventually fail applications.
> We should explicitly handle AM container preemption/kill as a separate issue
> and not count it towards the limit on AM failures.