[ 
https://issues.apache.org/jira/browse/YARN-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15132634#comment-15132634
 ] 

Sunil G commented on YARN-4637:
-------------------------------

As discussed in YARN-4635, we intend to cover the following scenarios for the
blacklist purge mechanism.

- Time based: After a defined interval, we can move nodes from the global 
blacklist back to normal. The time delta is measured from the current time 
back to the time at which the node was last added to the global blacklist 
(this means we also need to record the time of other apps' AM container 
failures on the same node).
The interval will be configurable in minutes. We can discuss a default 
interval (such as 4 hours). This mechanism can be turned on/off via a 
configuration setting for better control.
- NM event based: After a node is added to the global blacklist, the NM may 
later report healthy disks. In such cases, the node can be removed from the 
global blacklist.
- Other container success event based: There can be multiple cases here. For 
example, an AM container fails due to DISK_FAILED, but at the same time some 
other app's AM container launches successfully on the same node. It's tough 
to assess these corner cases. Still, we can try giving such nodes to one more 
AM to confirm.
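
To make the first two scenarios concrete, here is a minimal sketch of how the
bookkeeping could look. It is not taken from any patch or from existing YARN
code: the class GlobalBlacklistSketch and all of its method names are
hypothetical, and the interval and on/off switch are passed in as plain
constructor arguments rather than read from a real configuration key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only; this is not an existing YARN class.
class GlobalBlacklistSketch {
  private final boolean timePurgeEnabled;
  private final long purgeIntervalMs;
  // node -> time (ms) of the most recent AM container failure on that node,
  // regardless of which app the failing AM belonged to
  private final Map<String, Long> lastBlacklistedMs = new HashMap<>();

  GlobalBlacklistSketch(long purgeIntervalMinutes, boolean timePurgeEnabled) {
    this.purgeIntervalMs = TimeUnit.MINUTES.toMillis(purgeIntervalMinutes);
    this.timePurgeEnabled = timePurgeEnabled;
  }

  // Called on every AM container failure that blacklists the node; a repeat
  // failure overwrites the timestamp, which restarts the purge interval.
  synchronized void recordAmFailure(String node, long nowMs) {
    lastBlacklistedMs.put(node, nowMs);
  }

  // NM event based purge: the NM reported healthy disks for this node.
  synchronized void onNodeReportedHealthy(String node) {
    lastBlacklistedMs.remove(node);
  }

  // Time based purge: remove and return every node whose last blacklisting
  // is at least one full interval old. Does nothing when disabled.
  synchronized List<String> purgeExpired(long nowMs) {
    List<String> purged = new ArrayList<>();
    if (!timePurgeEnabled) {
      return purged;
    }
    Iterator<Map.Entry<String, Long>> it =
        lastBlacklistedMs.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> e = it.next();
      if (nowMs - e.getValue() >= purgeIntervalMs) {
        purged.add(e.getKey());
        it.remove();
      }
    }
    return purged;
  }

  synchronized boolean isBlacklisted(String node) {
    return lastBlacklistedMs.containsKey(node);
  }
}
```

In this sketch, purgeExpired would be called periodically with the current
time. The third scenario (another AM launching successfully on the same node)
is left out on purpose, since how to treat those corner cases is still open.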


> AM launching blacklist purge mechanism (time based)
> ---------------------------------------------------
>
>                 Key: YARN-4637
>                 URL: https://issues.apache.org/jira/browse/YARN-4637
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Sunil G
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
