[ https://issues.apache.org/jira/browse/YARN-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329001#comment-14329001 ]
Jason Lowe commented on YARN-3131:
----------------------------------
bq. I do not think that continuously polling until RUNNING is a good idea. The
most common case on a busy cluster is that an app can be submitted at time X
but not start running until a long time later.
The patch does not cause the client to poll until the job is RUNNING. It polls
until the job has progressed past the SUBMITTED state. The SUBMITTED state is
a brief transient state before the ACCEPTED state. So the client will wait
approximately as long as it does today, and it fixes that flaky submit unit
test in Tez. It will not block until the AM is actually running.
bq. As I mentioned earlier, I still believe that doing some basic checks
in-line in ClientRMService itself and throwing an exception back straight away
is probably a better idea than polling for any RUNNING/FAILED state.
I agree that a blocking method is much easier on the client, but I don't think
this is an easy change to make in the short term. Again, I think it requires a
major change to the RPC layer and the RM to support server-side asynchronous
call handling; otherwise we have to throw an army of threads at the client
service to avoid blocking other clients, and that has scaling issues. We could
probably add an API to the scheduler to do an in-line sanity check on the
requested queue (which is a backwards-incompatible change for schedulers not in
the Hadoop repo). However, there are many other things that could go wrong
during submission that take a long time to perform, such as saving the
application state and renewing delegation tokens. I'm not sure it's a win if
we check for one thing in-line that could go wrong but still have to poll for
all the other things that could go wrong. In the end, Tez and other YARN
clients need to know if the app was accepted or not. The queue being wrong is
just one of the ways the submit could fail.
Continuing to poll in the SUBMITTED state also meshes with the thoughts on the
SUBMITTED state being something the client probably shouldn't see anyway. See
the discussion about NEW_SAVING and SUBMITTED in YARN-3230.
Thanks, Chang, for updating the patch. Please investigate the unit test
failure, as it looks like it could be related. My only nit on the patch is that
it would be a bit clearer and more efficient if we used EnumSet constants to
capture the set of states we're waiting for the app to leave and the set of
states that are failed-to-submit states.
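Something along these lines is what I have in mind for the EnumSet constants.
This is only a sketch, not the patch: AppState is a local stand-in that mirrors
the values of YarnApplicationState so the example compiles without Hadoop on
the classpath, and the constant names, the inclusion of NEW and NEW_SAVING in
the waiting set, and the firstSettledState helper are all assumptions made up
for illustration. A real client would fetch a fresh application report and
sleep between polls rather than walk an iterator.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.Set;

class SubmitPollSketch {
    // Stand-in for YarnApplicationState (assumption: same value names).
    enum AppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    // States the client keeps polling through (hypothetical constant name).
    static final Set<AppState> WAITING_STATES =
        EnumSet.of(AppState.NEW, AppState.NEW_SAVING, AppState.SUBMITTED);

    // States that mean the submission did not succeed (hypothetical constant name).
    static final Set<AppState> FAILED_TO_SUBMIT_STATES =
        EnumSet.of(AppState.FAILED, AppState.KILLED);

    // Walks successive state observations and returns the first one that is
    // past the waiting states. Throws if the app failed or was killed first,
    // or if the observations run out while the app is still waiting.
    static AppState firstSettledState(Iterator<AppState> observedStates) {
        while (observedStates.hasNext()) {
            AppState state = observedStates.next();
            if (FAILED_TO_SUBMIT_STATES.contains(state)) {
                throw new IllegalStateException("Failed to submit application, state: " + state);
            }
            if (!WAITING_STATES.contains(state)) {
                return state; // e.g. ACCEPTED; no need to wait for RUNNING
            }
        }
        throw new IllegalStateException("Application never left the waiting states");
    }

    public static void main(String[] args) {
        Iterator<AppState> polls = Arrays.asList(
            AppState.NEW, AppState.NEW_SAVING, AppState.SUBMITTED, AppState.ACCEPTED).iterator();
        System.out.println(firstSettledState(polls));
    }
}
```

Note that the loop returns as soon as the app reaches ACCEPTED, so the client
does not block until the AM is running, and a submit to a bad queue surfaces
as an exception instead of a silent success.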
I suppose another way to solve this problem is to take the approach discussed
in YARN-3230 and have the RM not expose the NEW_SAVING and SUBMITTED states to
the client -- they would just see NEW. We'd have to leave the states in the
enumeration for backwards compatibility, but we'd stop exposing them in app
reports. Any thoughts on that [~zjshen] or [~jianhe]?
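To make the alternative concrete, here is a rough sketch of the mapping the RM
would apply when building an app report. It is purely illustrative: AppState is
again a local stand-in mirroring YarnApplicationState, and HIDDEN_STATES and
exposedState are hypothetical names, not anything that exists in the RM today.

```java
import java.util.EnumSet;
import java.util.Set;

class ExposedStateSketch {
    // Stand-in for YarnApplicationState (assumption: same value names).
    enum AppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    // Internal states that would stay in the enum but no longer be reported.
    static final Set<AppState> HIDDEN_STATES =
        EnumSet.of(AppState.NEW_SAVING, AppState.SUBMITTED);

    // Maps the RM-internal state to the state a client would see in a report.
    static AppState exposedState(AppState internalState) {
        return HIDDEN_STATES.contains(internalState) ? AppState.NEW : internalState;
    }
}
```

With a mapping like this, a client that polls until the app leaves NEW would
get the same behavior without knowing about the intermediate states.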
> YarnClientImpl should check FAILED and KILLED state in submitApplication
> ------------------------------------------------------------------------
>
> Key: YARN-3131
> URL: https://issues.apache.org/jira/browse/YARN-3131
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Chang Li
> Assignee: Chang Li
> Attachments: yarn_3131_v1.patch
>
>
> I just ran into an issue when submitting a job to a non-existent queue:
> YarnClient raises no exception. Although the job is indeed submitted
> successfully and just fails immediately afterwards, it would be better if
> YarnClient could handle the immediate failure the way YarnRunner does.