Jason Lowe commented on YARN-3131:

bq. Referring to my earlier comment, does it make more sense to do the simple 
checks inline instead of doing them as part of the app state machine?

I believe the main issue is that when an app is submitted, the RM must first 
persist it to the state store, which could take some time.  There was some 
concern originally that this would be too expensive to process inline.  This 
design came from YARN-549; maybe [~zjshen] has some additional input as to 
whether it would be reasonable to do more of the app submission processing inline.

bq. We added a unit test for this and have seen it failing randomly on a 
minicluster as catching the failure on the first getAppReport() call is not 

Ah, the state machine transitions for an app rejected by the scheduler look 
like this:

NEW -> NEW_SAVING -> SUBMITTED -> [...] AppReport) -> FAILED

So if we look at the first app report that isn't NEW/NEW_SAVING we might still 
see SUBMITTED just before FAILED.  We'd have to also continue polling during 
the SUBMITTED state to verify it wasn't rejected.  When things are running 
normally the SUBMITTED state is usually very short-lived, which would explain 
the flaky test.
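
To make the polling idea concrete, here is a minimal sketch of "keep polling 
through SUBMITTED as well".  It is not the YarnClientImpl code: AppState is a 
local stand-in enum, and waitUntilAcceptedOrRejected, reportSource, maxPolls 
and sleepMillis are hypothetical names used only for illustration.  A real 
client would presumably read the state from the application report instead of 
a Supplier.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

public class SubmitPollSketch {
    /** Local stand-in for the application states discussed above (assumption, not the YARN enum). */
    enum AppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    /**
     * Polls until the state is no longer NEW, NEW_SAVING or SUBMITTED and
     * returns the first state seen after that, so a rejection (FAILED/KILLED)
     * is not hidden behind a short-lived SUBMITTED report.
     */
    static AppState waitUntilAcceptedOrRejected(Supplier<AppState> reportSource,
                                                int maxPolls, long sleepMillis) {
        AppState state = null;
        for (int i = 0; i < maxPolls; i++) {
            state = reportSource.get();
            boolean undecided = state == AppState.NEW
                    || state == AppState.NEW_SAVING
                    || state == AppState.SUBMITTED;
            if (!undecided) {
                return state;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up rather than spin.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
        }
        throw new IllegalStateException("still " + state + " after " + maxPolls + " polls");
    }

    public static void main(String[] args) {
        // Simulated reports for an app that the scheduler rejects.
        Iterator<AppState> reports = Arrays.asList(
                AppState.NEW, AppState.NEW_SAVING, AppState.SUBMITTED, AppState.FAILED).iterator();
        System.out.println(waitUntilAcceptedOrRejected(reports::next, 10, 0L)); // prints FAILED
    }
}
```

A caller that stopped at the first report that isn't NEW/NEW_SAVING would 
return SUBMITTED for the same sequence, which matches the intermittent test 
failure described above.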

bq. The issue mainly stems from the fact that in Tez, we start an AM and then 
submit work to it directly.

So is it important to know the AM is actually running?  An AM could take an 
indeterminate amount of time to eventually launch if submitted to a crowded 
queue, and the AM could fail to run in a number of ways before that (e.g.: 
killed by user, fail to launch due to JVM or container launch context issues, 
etc.).  Seems like there would need to be other failsafes in place besides just 
the fact that the scheduler accepted the job.

> YarnClientImpl should check FAILED and KILLED state in submitApplication
> ------------------------------------------------------------------------
>                 Key: YARN-3131
>                 URL: https://issues.apache.org/jira/browse/YARN-3131
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Chang Li
>            Assignee: Chang Li
> I just ran into an issue where submitting a job to a non-existent queue makes 
> YarnClient raise no exception. Though the job does get submitted 
> successfully and only fails immediately afterwards, it would be better if 
> YarnClient could handle the immediate-failure situation like YarnRunner does.

This message was sent by Atlassian JIRA
