[ 
https://issues.apache.org/jira/browse/YARN-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329001#comment-14329001
 ] 

Jason Lowe commented on YARN-3131:
----------------------------------

bq. I do not think that continuously polling until RUNNING is a good idea. The 
most common case on a busy cluster is that an app can be submitted at time X 
but not start running until a long time later.

The patch does not cause the client to poll until the job is RUNNING.  It polls 
until the job has progressed past the SUBMITTED state.  The SUBMITTED state is 
a brief transient state before the ACCEPTED state.  So the client will wait 
approximately as long as it does today, and it fixes that flaky submit unit 
test in Tez.  It will not block until the AM is actually running.

bq. As I mentioned earlier, I still believe that doing some basic checks 
in-line in ClientRMService itself and throwing an exception back straight away 
is probably a better idea than polling for any RUNNING/FAILED state. 

I agree that a blocking method is much easier on the client, but I don't think 
this is an easy change to make in the short term.  Again I think it requires a 
major change to the RPC layer and the RM to support server-side asynchronous 
call handling; otherwise we have to throw an army of threads at the client 
service to avoid blocking other clients and that has scaling issues.  We could 
probably add an API to the scheduler to do an in-line sanity check on the 
requested queue (which is a backwards-incompatible change for schedulers not in 
the Hadoop repo).  However, there are many other things that could go wrong 
during submission that take a long time to perform, such as saving the 
application state and renewing delegation tokens.  I'm not sure it's a win if 
we check for one thing in-line that could go wrong but still have to poll for 
all the other things that could go wrong.  In the end, Tez and other YARN 
clients need to know if the app was accepted or not.  The queue being wrong is 
just one of the ways the submit could fail.

Continuing to poll in the SUBMITTED state also meshes with the thoughts on the 
SUBMITTED state being something the client probably shouldn't see anyway.  See 
the discussion about NEW_SAVING and SUBMITTED in YARN-3230.

Thanks, Chang, for updating the patch.  Please investigate the unit test 
failure, as it looks like it could be related.  My only nit on the patch is it 
would be a bit clearer and more efficient if we used EnumSet constants to 
capture the set of states we're waiting for the app to leave and the set of states 
that are failed-to-submit states.
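
To illustrate the EnumSet suggestion, here is a minimal sketch rather than the 
actual patch.  The AppState enum is a local stand-in that mirrors the values of 
YarnApplicationState, and the class name, constant names, and the iterator of 
states (standing in for repeated application report lookups) are all 
assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Iterator;

class SubmitPollSketch {
    // Local stand-in mirroring the values of YarnApplicationState (assumption:
    // a real client would use the Hadoop enum instead).
    enum AppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    // States the client keeps polling through.
    static final EnumSet<AppState> WAITING_STATES =
        EnumSet.of(AppState.NEW, AppState.NEW_SAVING, AppState.SUBMITTED);

    // States that mean the submission did not succeed.
    static final EnumSet<AppState> FAILED_TO_SUBMIT_STATES =
        EnumSet.of(AppState.FAILED, AppState.KILLED);

    // Hypothetical helper: each element of 'reports' stands in for the state
    // returned by one poll of the application report.
    static AppState waitForSubmission(Iterator<AppState> reports) {
        while (reports.hasNext()) {
            AppState state = reports.next();
            if (WAITING_STATES.contains(state)) {
                continue; // a real client would sleep between polls here
            }
            if (FAILED_TO_SUBMIT_STATES.contains(state)) {
                throw new IllegalStateException(
                    "Application failed to submit, state: " + state);
            }
            return state; // ACCEPTED or later: stop polling, do not wait for RUNNING
        }
        throw new IllegalStateException("Ran out of reports while still waiting");
    }

    public static void main(String[] args) {
        System.out.println(waitForSubmission(
            Arrays.asList(AppState.NEW, AppState.SUBMITTED, AppState.ACCEPTED).iterator()));
        try {
            waitForSubmission(Arrays.asList(AppState.NEW_SAVING, AppState.FAILED).iterator());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the two sets named up front, the loop reads as "keep waiting while in a 
waiting state, fail if in a failed-to-submit state, otherwise return", and each 
membership check is a cheap bit test.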

I suppose another way to solve this problem is to take the approach discussed 
in YARN-3230 and have the RM not expose the NEW_SAVING and SUBMITTED states to 
the client -- they would just see NEW.  We'd have to leave the states in the 
enumeration for backwards compatibility, but we'd stop exposing them in app 
reports.  Any thoughts on that [~zjshen] or [~jianhe]?
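
As a rough sketch of what hiding those states could look like, the mapping 
below collapses NEW_SAVING and SUBMITTED into NEW before a state is reported to 
the client.  This is only an illustration: the AppState enum is a local 
stand-in for YarnApplicationState, and clientVisibleState is a hypothetical 
helper, not an existing RM method.

```java
class ClientVisibleStateSketch {
    // Local stand-in mirroring the values of YarnApplicationState (assumption).
    enum AppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    // Hypothetical helper: the enum values stay for backwards compatibility,
    // but the two transient states are reported to clients as NEW.
    static AppState clientVisibleState(AppState internal) {
        switch (internal) {
            case NEW_SAVING:
            case SUBMITTED:
                return AppState.NEW;
            default:
                return internal;
        }
    }

    public static void main(String[] args) {
        for (AppState s : AppState.values()) {
            System.out.println(s + " -> " + clientVisibleState(s));
        }
    }
}
```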

> YarnClientImpl should check FAILED and KILLED state in submitApplication
> ------------------------------------------------------------------------
>
>                 Key: YARN-3131
>                 URL: https://issues.apache.org/jira/browse/YARN-3131
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Chang Li
>            Assignee: Chang Li
>         Attachments: yarn_3131_v1.patch
>
>
> Just ran into an issue: when submitting a job to a non-existent queue, 
> YarnClient raises no exception. Though the job is indeed submitted 
> successfully and just fails immediately afterward, it would be better if 
> YarnClient handled the immediate-failure situation like YarnRunner does.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
