[
https://issues.apache.org/jira/browse/YARN-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201895#comment-15201895
]
Vinod Kumar Vavilapalli commented on YARN-4837:
-----------------------------------------------
[~sunilg] and [~sjlee0],
Appreciate your feedback.
- Yes, AMs going to 'bad' nodes again and again and failing is a real problem.
There are multiple reasons as to why this happens.
-- It is true we cannot enumerate *all* the reasons.
-- It is also true that we have some reasons that we *can* already deal
with explicitly.
- The primary reason for this JIRA is that I don't believe users need explicit
control *today* over how AM scheduling reacts to faults (i.e. [~sunilg]'s
agreement above - "agreeing to your point and its early for user to take
blacklisting decisions w/o having much needed/useful information").
- As I also mentioned, the feature is misnamed too. So, let me just call it
_AM-container-scheduling_ for the time being.
h4. Modified proposal
So how about we
- Completely keep _AM-container-scheduling_ inside the ResourceManager and
don't expose any user-APIs to skip-nodes
- Explicitly treat known exit-codes:
|DISKS_FAILED|Node is already unhealthy, no need to skip nodes|
|PREEMPTED, KILLED_BY_RESOURCEMANAGER, KILLED_AFTER_APP_COMPLETION|Not the app's or the system's fault, it's by design, no need to skip nodes|
|KILLED_EXCEEDED_VMEM, KILLED_EXCEEDED_PMEM|No point in skipping the node as it's not the system's fault|
|KILLED_BY_APPMASTER|Cannot happen for an AM container|
|All other non-zero codes|Need some action|
- And book-keep all other failure cases and do soft-skipping *only* on the
server-side. By this I refer to something similar to node->rack locality
progression - avoid this node for a few scheduling opportunities and then come
back to it after waiting out enough time. This way no node gets locked out, nor
does any app get stuck.
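To make the proposal concrete, here is a rough sketch of the bookkeeping I have in mind. It is only an illustration under assumptions: the class, the enum, the method names and the {{skipOpportunities}} knob are all made up for this comment and are not existing ResourceManager code, and "waiting out enough time" is approximated by counting scheduling opportunities.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only: none of these names are actual ResourceManager classes. */
final class AmSoftSkipSketch {

    /** Exit statuses named in the table above, plus a catch-all for other non-zero codes. */
    enum ExitStatus {
        DISKS_FAILED, PREEMPTED, KILLED_BY_RESOURCEMANAGER, KILLED_AFTER_APP_COMPLETION,
        KILLED_EXCEEDED_VMEM, KILLED_EXCEEDED_PMEM, KILLED_BY_APPMASTER, OTHER_NON_ZERO
    }

    /** Hypothetical knob: number of scheduling opportunities a node is avoided for. */
    private final int skipOpportunities;

    /** nodeId -> scheduling opportunities for which the node is still avoided. */
    private final Map<String, Integer> remainingSkips = new HashMap<>();

    AmSoftSkipSketch(int skipOpportunities) {
        if (skipOpportunities <= 0) {
            throw new IllegalArgumentException("skipOpportunities must be positive");
        }
        this.skipOpportunities = skipOpportunities;
    }

    /** Per the table: only the "all other non-zero codes" row needs any action. */
    static boolean needsAction(ExitStatus status) {
        return status == ExitStatus.OTHER_NON_ZERO;
    }

    /** Book-keep an AM container failure on a node; known benign codes are ignored. */
    void recordAmFailure(String nodeId, ExitStatus status) {
        if (needsAction(status)) {
            remainingSkips.put(nodeId, skipOpportunities);
        }
    }

    /**
     * Called once per scheduling opportunity for a node. Returns true if an AM
     * may be placed there now. A skipped node is never locked out: its counter
     * runs down and the node becomes eligible again.
     */
    boolean tryPlaceAm(String nodeId) {
        Integer left = remainingSkips.get(nodeId);
        if (left == null) {
            return true;
        }
        if (left <= 1) {
            remainingSkips.remove(nodeId);
        } else {
            remainingSkips.put(nodeId, left - 1);
        }
        return false;
    }
}
```

With {{new AmSoftSkipSketch(2)}}, a node that saw an unclassified AM failure is passed over for its next two scheduling opportunities and accepted again on the third, while a PREEMPTED exit on the same node changes nothing.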
If we just do this, we will take care of our most important problem - apps
getting affected due to AMs going repeatedly to the same places. And we also
(a) won't force our users to make these decisions before they really
understand how to make them and (b) won't introduce the bad problems of
'blacklisting' that exist today - e.g. YARN-4685.
h4. 2.8.0
Even if we don't yet reach consensus on the above or a similar proposal, I
feel strongly that we should remove these user-facing configs / APIs from 2.8.0.
Thoughts?
/cc [~vvasudev], [~jianhe], [~wangda] who may not be looking at this.
> User facing aspects of 'AM blacklisting' feature need fixing
> ------------------------------------------------------------
>
> Key: YARN-4837
> URL: https://issues.apache.org/jira/browse/YARN-4837
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Vinod Kumar Vavilapalli
>
> Was reviewing the user-facing aspects that we are releasing as part of 2.8.0.
> Looking at the 'AM blacklisting feature', I see several things to be fixed
> before we release it in 2.8.0.