Wangda Tan commented on YARN-3946:

[~Naganarasimha], thanks for the update. Some comments:

1) RMAppImpl:
When the app reaches a final state (FINISHED, KILLED, etc.), should we simply
set AMLaunchDiagnostics to null?

2) SchedulerApplicationAttempt:
Why do we need two separate methods, updateDiagnosticsIfNotRunning and
updateDiagnostics? They're a little confusing to me. I think AM launch
diagnostics should be updated only if the AM container is not running. If that
makes sense to you, I suggest renaming/merging them into
updateAMContainerDiagnostics.

3) Do you think it would be better to rename AMState.PENDING to INACTIVATED? I
think "PENDING" could mean "activated-but-not-activated" to end users (assuming
users don't have enough background knowledge about the scheduler).

4) Instead of setting AMLaunchDiagnostics to null when the RMAppAttempt enters
the SCHEDULED state, do you think it would be better to do that in the RUNNING
and FINAL_SAVING states? An unmanaged AM could skip the SCHEDULED state.

5) It would also be very useful to update the AM launch diagnostics when the
RMAppAttempt goes to the LAUNCHED state. Sometimes the AM container is
allocated and sent to the NM but is not successfully launched or registered
with the RM. Currently we can't tell when this happens, because
YarnApplicationState doesn't have a "launched" state.
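
To make 2), 4) and 5) concrete, here is a rough, self-contained sketch of the
behavior I have in mind. It is not the actual RMAppAttemptImpl or
SchedulerApplicationAttempt code: the class AttemptDiagnosticsSketch, its
nested State enum, transitionTo() and the message text are all made up for
illustration, and the real state machine has more states and transitions.

```java
import java.util.EnumSet;

// Hypothetical, simplified stand-in for an app attempt; not real YARN code.
class AttemptDiagnosticsSketch {
  // Assumed subset of attempt states, declared in lifecycle order.
  enum State { NEW, SUBMITTED, SCHEDULED, ALLOCATED, LAUNCHED, RUNNING, FINAL_SAVING, FINISHED }

  // Point 4: clear on RUNNING / FINAL_SAVING rather than on SCHEDULED,
  // so attempts that never pass through SCHEDULED are covered too.
  private static final EnumSet<State> CLEAR_ON =
      EnumSet.of(State.RUNNING, State.FINAL_SAVING);

  private State state = State.NEW;
  private String amLaunchDiagnostics;

  // Point 2: a single merged update method. The message is recorded only
  // while the AM container is not yet running; later updates are ignored.
  void updateAMContainerDiagnostics(String message) {
    if (state.compareTo(State.RUNNING) >= 0) {
      return;
    }
    amLaunchDiagnostics = message;
  }

  void transitionTo(State next) {
    state = next;
    if (next == State.LAUNCHED) {
      // Point 5: make "sent to NM but not yet registered" visible.
      updateAMContainerDiagnostics(
          "AM container launched, waiting for AM to register with RM");
    } else if (CLEAR_ON.contains(next)) {
      amLaunchDiagnostics = null;
    }
  }

  String getAMLaunchDiagnostics() {
    return amLaunchDiagnostics;
  }
}
```

With this shape, an attempt that goes straight from SUBMITTED to LAUNCHED
still has its diagnostics cleared once it reaches RUNNING, and any scheduler
update that arrives after that point is dropped.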

[~jianhe], could you take a look at this patch as well?

> Allow fetching exact reason as to why a submitted app is in ACCEPTED state in 
> CS
> --------------------------------------------------------------------------------
>                 Key: YARN-3946
>                 URL: https://issues.apache.org/jira/browse/YARN-3946
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sumit Nigam
>            Assignee: Naganarasimha G R
>         Attachments: 3946WebImages.zip, YARN-3946.v1.001.patch, 
> YARN-3946.v1.002.patch, YARN-3946.v1.003.Images.zip, YARN-3946.v1.003.patch, 
> YARN-3946.v1.004.patch
> Currently there is no direct way to get the exact reason as to why a 
> submitted app is still in ACCEPTED state. It should be possible to know 
> through RM REST API as to what aspect is not being met - say, queue limits 
> being reached, or core/memory requirement not being met, or AM limit being 
> reached, etc.
