[
https://issues.apache.org/jira/browse/YARN-3946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635202#comment-14635202
]
Sumit Nigam commented on YARN-3946:
-----------------------------------
Hi [~varun_saxena] -
Yes, the idea is not only to debug the issue (which, as you rightly mentioned,
an admin can do). I am currently on 2.6.0 and will definitely try 2.7.0 when I
can.
There are too many possible reasons to easily correlate what may have happened
- AM level, resource level, queue level, possibly a combination of these, etc.
A programmatic API is also useful for applying corrective measures - say, I can
program my client to submit the app to a whole new queue after I notice it is a
queue-level capacity issue, or to try reserving a container, etc. - all
programmatically!
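To illustrate the kind of corrective logic meant here, a minimal sketch follows. It assumes the RM could report a machine-readable reason; no such API exists today, so `PendingReason`, its members, and `choose_action` are hypothetical names invented for this example, with the categories taken from the issue description.

```python
from enum import Enum


class PendingReason(Enum):
    """Hypothetical reason codes; the RM does not expose these today."""
    QUEUE_LIMIT = "queue_limit"              # queue capacity limits reached
    RESOURCE_UNAVAILABLE = "resource"        # core/memory requirement not met
    AM_LIMIT = "am_limit"                    # AM limit reached
    UNKNOWN = "unknown"


def choose_action(reason, fallback_queue=None):
    """Map a (hypothetical) pending reason to a corrective action label."""
    if reason is PendingReason.QUEUE_LIMIT and fallback_queue:
        return "resubmit to queue " + fallback_queue
    if reason is PendingReason.RESOURCE_UNAVAILABLE:
        return "try reserving a container"
    # AM limit, unknown reason, or no fallback queue: nothing safe to automate.
    return "keep waiting and report"
```

A client would call `choose_action(reason, fallback_queue="adhoc")` once a reason is available; the queue name is only a placeholder.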
Another important use case is submitting the app (say, through one's own AM)
and, after it has remained in ACCEPTED state for a while, automatically
reporting back why it is still in that state. A REST API is extremely useful in
such a case. With this, it would even be possible to ascertain when a job moves
back to ACCEPTED state from RUNNING state (RM restart, AM crash + restart).
Again, this currently requires looking through logs / UI to ascertain what
happened. Especially in big clusters, this is indeed non-trivial.
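As a rough sketch of what a client can do today, the snippet below polls the RM REST application state resource (`/ws/v1/cluster/apps/{appid}/state`) and reports when an app has stayed in ACCEPTED past a timeout. The RM address, the function names, and the timeout values are assumptions for illustration. Note that the response carries only the state, not the reason, which is exactly the gap this issue is about.

```python
import json
import time
from urllib.request import urlopen


def fetch_app_state(rm_url, app_id, opener=urlopen):
    """Return the app state string, e.g. "ACCEPTED", from the RM REST API."""
    url = "%s/ws/v1/cluster/apps/%s/state" % (rm_url.rstrip("/"), app_id)
    with opener(url) as resp:
        return json.load(resp)["state"]


def wait_while_accepted(get_state, timeout_s, poll_s=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the app leaves ACCEPTED or the timeout expires.

    Returns (last_state, timed_out).
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state != "ACCEPTED":
            return state, False
        if clock() >= deadline:
            return state, True
        sleep(poll_s)


def regressed_to_accepted(previous_state, current_state):
    """True when an app drops from RUNNING back to ACCEPTED
    (for example after an RM restart or an AM crash and restart)."""
    return previous_state == "RUNNING" and current_state == "ACCEPTED"
```

For example, `wait_while_accepted(lambda: fetch_app_state("http://rm-host:8088", app_id), timeout_s=300)` would return `("ACCEPTED", True)` if the app is still waiting after five minutes; `rm-host:8088` is a placeholder address.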
I agree with Nagannarasimha that we should be able to find this out without
an administrator's understanding of the cluster. Also, I am not working on this.
> Allow fetching exact reason as to why a submitted app is in ACCEPTED state.
> ---------------------------------------------------------------------------
>
> Key: YARN-3946
> URL: https://issues.apache.org/jira/browse/YARN-3946
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: Sumit Nigam
>
> Currently there is no direct way to get the exact reason why a submitted
> app is still in ACCEPTED state. It should be possible to find out through
> the RM REST API which requirement is not being met - say, queue limits
> being reached, core/memory requirements not being met, the AM limit being
> reached, etc.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)