[
https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128664#comment-15128664
]
Junping Du commented on YARN-4635:
----------------------------------
Thanks for the comments, Sunil.
bq. Is this the case where node1 to node6 are blacklisted by the app and node7 to
node10 are blacklisted by the global manager?
Yes. This is correct.
bq. Could we also check disableThreshold on the total Set which we create now?
And if we cross the limit, clear the app-based / global blacklists from
this list. Could this solve the above-mentioned scenario?
Things could be slightly more complicated than this. Several points to consider:
- The threshold can differ between global and app, since we already give apps
this flexibility in YARN-4389, so we should choose one bar (the upper one, the
lower one, or always the app bar).
- When the two lists together exceed the bar we chose above, should we flip both
lists or only one of them? Also, the flip mechanism is worth discussing further,
as I think other mechanisms, such as LRU, could be better.
- If one list gets flipped, how shall we merge it with the other, unflipped one?
The removals could overlap with the additions even though they belong to
different scopes, etc.
I would suggest having a further discussion in a separate JIRA.
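To make the scenario concrete, here is a rough sketch of checking a threshold
against the merged set. It is not code from the patch: the class and method
names are hypothetical, and the 10-node cluster and the 0.8 / 0.5 thresholds
used in the note below are assumed values chosen only for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; these names are illustrative and are not YARN APIs.
class BlacklistMergeSketch {

  /** Union of the app-level and global blacklists (duplicates collapse). */
  static Set<String> merge(Set<String> appList, Set<String> globalList) {
    Set<String> merged = new LinkedHashSet<>(appList);
    merged.addAll(globalList);
    return merged;
  }

  /**
   * True when the list covers at least `threshold` (a fraction in [0, 1])
   * of the cluster. Which threshold to pass in (app, global, upper, lower)
   * is exactly the open question raised in the first point above.
   */
  static boolean overThreshold(Set<String> list, int clusterSize, double threshold) {
    return clusterSize > 0 && list.size() >= threshold * clusterSize;
  }
}
```

With node1 to node6 in the app list, node7 to node10 in the global list, an
assumed app threshold of 0.8 and an assumed global threshold of 0.5, neither
list alone reaches its own bar, yet the merged set covers all ten nodes and is
over either bar. The sketch deliberately stops there: what to flip, and how to
merge a flipped list with an unflipped one, are the parts that still need
discussion.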
> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
> Key: YARN-4635
> URL: https://issues.apache.org/jira/browse/YARN-4635
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Priority: Critical
> Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
>
> We need a global blacklist in addition to each app’s blacklist to track AM
> container failures that have global impact. That means we need to
> differentiate whether a non-successful ContainerExitStatus is caused by the
> NM or is more related to the app.
> For more details, please refer to the document in YARN-4576.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)