[ 
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15129911#comment-15129911
 ] 

Jian He commented on YARN-4635:
-------------------------------

I have some questions about the existing code.
Why should the container exit statuses below blacklist the node?
- KILLED_EXCEEDED_PMEM and KILLED_EXCEEDED_VMEM: I feel these are specific to 
the container only.
- ABORTED: it's used when the AM releases the container, the app finishes, or 
the container expires, etc.
- INVALID: this is the default exit code.

As for DISKS_FAILED, which this JIRA treats as a reason to blacklist the node 
globally: I think in this case the node will report itself as unhealthy and 
the RM should already have removed it.
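
To make the question concrete, here is a rough sketch of the split I have in 
mind. It is not code from the patch: the ExitStatus enum is a local stand-in 
for the ContainerExitStatus names above, and the Category enum and classify() 
method are hypothetical names made up for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

public class ExitStatusSketch {
    /** Local stand-in for the exit status names discussed above; not the real Hadoop class. */
    public enum ExitStatus {
        KILLED_EXCEEDED_PMEM, KILLED_EXCEEDED_VMEM, ABORTED, INVALID, DISKS_FAILED
    }

    /** Hypothetical categories, introduced only for this illustration. */
    public enum Category {
        CONTAINER_SPECIFIC,      // says nothing about the health of the node
        COVERED_BY_NODE_HEALTH   // node is expected to report itself unhealthy
    }

    // Statuses argued above to be specific to the container or the app.
    private static final Set<ExitStatus> CONTAINER_SPECIFIC = EnumSet.of(
            ExitStatus.KILLED_EXCEEDED_PMEM,
            ExitStatus.KILLED_EXCEEDED_VMEM,
            ExitStatus.ABORTED,
            ExitStatus.INVALID);

    /** Classifies a status according to the reading proposed in this comment. */
    public static Category classify(ExitStatus status) {
        return CONTAINER_SPECIFIC.contains(status)
                ? Category.CONTAINER_SPECIFIC
                : Category.COVERED_BY_NODE_HEALTH;
    }

    public static void main(String[] args) {
        for (ExitStatus status : ExitStatus.values()) {
            System.out.println(status + " -> " + classify(status));
        }
    }
}
```

Under this reading, none of the five statuses would need to add the node to a 
global blacklist by itself; whether that is the right call is exactly what I 
am asking.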

In YARN-4389, AMBlackListingRequest contains a boolean flag and a threshold 
number. Do you think it's OK to use the threshold number alone? 0 would mean 
disabled, and any number larger than 0 would mean enabled. cc [~sunilg]
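
A minimal sketch of the threshold-only idea, to show what I mean. The class 
and method names are hypothetical and this is not the actual 
AMBlackListingRequest API; the float type and the rejection of negative values 
are my assumptions.

```java
public class AmBlacklistThresholdSketch {
    // Assumed to be a float; 0 means blacklisting is disabled.
    private final float threshold;

    public AmBlacklistThresholdSketch(float threshold) {
        // Assumption: negative or NaN thresholds are treated as invalid input.
        if (Float.isNaN(threshold) || threshold < 0f) {
            throw new IllegalArgumentException(
                    "threshold must be >= 0, got " + threshold);
        }
        this.threshold = threshold;
    }

    /** Enabled state is derived from the threshold, so no separate boolean flag is stored. */
    public boolean isEnabled() {
        return threshold > 0f;
    }

    public float getThreshold() {
        return threshold;
    }

    public static void main(String[] args) {
        System.out.println(new AmBlacklistThresholdSketch(0f).isEnabled());
        System.out.println(new AmBlacklistThresholdSketch(0.8f).isEnabled());
    }
}
```

With a single field there is no way to end up in an inconsistent state such 
as "enabled, threshold 0".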


> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
>                 Key: YARN-4635
>                 URL: https://issues.apache.org/jira/browse/YARN-4635
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
>
> We need a global blacklist, in addition to each app's blacklist, to track AM 
> container failures with global impact. That means we need to differentiate 
> non-successful ContainerExitStatus values caused by the NM from those more 
> related to the app. 
> For more details, please refer to the document in YARN-4576.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
