[
https://issues.apache.org/jira/browse/YARN-4637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15132634#comment-15132634
]
Sunil G commented on YARN-4637:
-------------------------------
As discussed in YARN-4635, we intend to cover two scenarios for the blacklist
purge mechanism.
- Time based: After a configured interval, nodes can be moved from the global
blacklist back to normal. The elapsed time is measured from the time the node
was last added to the global blacklist up to the current time (this means we
also need to record the time of AM container failures from other apps on the
same node).
The interval will be configurable in minutes. We can discuss a default
interval (for example, 4 hours). This mechanism can be turned on/off via
configuration for better control.
- NM event based: After a node is added to the global blacklist, the NM may
later report healthy disks. In such cases, the node can be removed from the
global blacklist.
- Other container success events based: There can be multiple cases here. For
example, an AM container fails due to DISK_FAILED, but at the same time another
app's AM container launches successfully on the same node. It's tough to assess
these corner cases. Still, we can try giving such a node to one more AM to
confirm.
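To make the time based and NM event based purges concrete, here is a rough
sketch. It is only an illustration under assumptions: the class name
GlobalBlacklistSketch, its methods, and the constructor parameters are
hypothetical and are not existing ResourceManager code or configuration keys.
It keeps the most recent AM container failure time per node, so the purge
interval is always measured from the last addition to the blacklist.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a global AM blacklist with purge support. */
class GlobalBlacklistSketch {
  // Node id -> time (ms) of the most recent AM container failure on that node,
  // regardless of which app the AM belonged to.
  private final Map<String, Long> lastFailureMs = new ConcurrentHashMap<>();
  private final long purgeIntervalMs;
  private final boolean timeBasedPurgeEnabled;

  // Both parameters stand in for configuration values (names are assumptions).
  GlobalBlacklistSketch(long purgeIntervalMinutes, boolean timeBasedPurgeEnabled) {
    this.purgeIntervalMs = TimeUnit.MINUTES.toMillis(purgeIntervalMinutes);
    this.timeBasedPurgeEnabled = timeBasedPurgeEnabled;
  }

  /** Adds the node to the blacklist, or refreshes its last failure time. */
  void recordAmFailure(String nodeId, long nowMs) {
    lastFailureMs.merge(nodeId, nowMs, Long::max);
  }

  /** NM event based purge: the node reported healthy disks. */
  void onNodeReportedHealthy(String nodeId) {
    lastFailureMs.remove(nodeId);
  }

  /** Time based purge: drop nodes whose last failure is older than the interval. */
  void purgeExpired(long nowMs) {
    if (!timeBasedPurgeEnabled) {
      return;
    }
    lastFailureMs.values().removeIf(t -> nowMs - t >= purgeIntervalMs);
  }

  boolean isBlacklisted(String nodeId) {
    return lastFailureMs.containsKey(nodeId);
  }
}
```

With a 240 minute interval, a node that failed at hour 0 and again at hour 2
stays blacklisted at hour 4 and is purged at hour 6, since the delta is taken
from the latest failure.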
> AM launching blacklist purge mechanism (time based)
> ---------------------------------------------------
>
> Key: YARN-4637
> URL: https://issues.apache.org/jira/browse/YARN-4637
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Sunil G
>