Sumit Nigam commented on YARN-3946:

Hi [~varun_saxena] - 
Yes, the idea is not only to debug the issue (which, as you rightly mentioned, 
an admin can do). I am currently on 2.6.0 and will try 2.7.0 when I can, for sure.

There are too many possible causes to correlate easily when working out what 
happened - AM level, resource level, queue level, possibly a combination of 
these, etc. A programmatic API is also useful for applying corrective measures 
- say, I can programmatically submit my app to a whole new queue after I notice 
it is a queue-level capacity issue, or try reserving a container, etc - all

Another important use case is attempting to submit the app (say, 
through one's own AM) and, after it has remained in ACCEPTED state for a 
while, reporting back automatically why the state remains so. A REST API is 
extremely useful in such a case. With this, it would even be possible to 
ascertain when a job moves back to ACCEPTED state from RUNNING state (RM 
restart, AM crash + restart). Again, this currently requires looking through 
logs / UI to ascertain what happened. Especially in big clusters, this is 
non-trivial.
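
To illustrate the polling use case, here is a minimal sketch in Python that 
watches an app through the existing RM REST endpoint 
/ws/v1/cluster/apps/{appid} and flags two conditions: the app staying in 
ACCEPTED beyond a limit, and the app falling back from RUNNING to ACCEPTED. 
The RM address, the helper names (fetch_app, classify, watch), the event 
labels and the timing defaults are all assumptions made for illustration. 
The sketch can only detect the condition; the "diagnostics" field it returns 
may not explain why the app is waiting, which is the gap this issue is about.

```python
import json
import time
import urllib.request

RM_URL = "http://localhost:8088"  # assumption: ResourceManager web address
TERMINAL = {"FINISHED", "FAILED", "KILLED"}


def fetch_app(rm_url, app_id):
    """Fetch the application report from the RM REST API."""
    url = f"{rm_url}/ws/v1/cluster/apps/{app_id}"
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["app"]


def classify(prev_state, state, seconds_in_state, limit_s):
    """Hypothetical helper: map raw state observations to an event label."""
    if prev_state == "RUNNING" and state == "ACCEPTED":
        return "REGRESSED"  # e.g. RM restart, or AM crash + restart
    if state == "ACCEPTED" and seconds_in_state >= limit_s:
        return "STUCK"
    return "OK"


def watch(app_id, limit_s=300, poll_s=15,
          fetch=fetch_app, clock=time.monotonic, sleep=time.sleep):
    """Poll until the app is stuck, regresses, or reaches a terminal state.

    Returns (event, app), where app is the last report fetched.
    """
    prev_state = None
    since = clock()
    while True:
        app = fetch(RM_URL, app_id)
        state = app["state"]
        now = clock()
        if state != prev_state:
            since = now  # restart the timer on every state change
        event = classify(prev_state, state, now - since, limit_s)
        if event != "OK" or state in TERMINAL:
            return event, app
        prev_state = state
        sleep(poll_s)
```

A caller could then report app["diagnostics"] when the event is "STUCK" or 
"REGRESSED", or trigger a corrective action such as resubmitting to another 
queue.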

I'd agree with Nagannarasimha that we should be able to find this out without 
an administrator's understanding of the cluster. Plus, I am not working on this.

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state.
> ---------------------------------------------------------------------------
>                 Key: YARN-3946
>                 URL: https://issues.apache.org/jira/browse/YARN-3946
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sumit Nigam
> Currently there is no direct way to get the exact reason as to why a 
> submitted app is still in ACCEPTED state. It should be possible to know 
> through RM REST API as to what aspect is not being met - say, queue limits 
> being reached, or core/memory requirement not being met, or AM limit being 
> reached, etc.
