Wang, Xinglong created YARN-9980:
------------------------------------
Summary: App hangs in accepted when moved from DEFAULT_PARTITION
queue to an exclusive partition queue
Key: YARN-9980
URL: https://issues.apache.org/jira/browse/YARN-9980
Project: Hadoop YARN
Issue Type: Improvement
Reporter: Wang, Xinglong
Assignee: Wang, Xinglong
Attachments: Screen Shot 2019-11-14 at 5.11.39 PM.png
App hangs in accepted when moved from DEFAULT_PARTITION queue to an exclusive
partition queue.
queue_root
    queue_a ----- default_partition
    queue_b ----- exclusive partition x, default partition is x
When an app is submitted to queue_a with AM_LABEL_EXPRESSION unset, the RM
assigns default_partition as the app's AM_LABEL_EXPRESSION; the app then gets
am1 and runs. If the app is later moved to queue_b and am1 is
preempted/killed/failed, the RM schedules another attempt, am2, if the AM retry
limit allows. This time, however, the resource request for am2 still carries
AM_LABEL_EXPRESSION = default_partition. Since queue_b doesn't have any
resources in default_partition, the app stays in the accepted state forever in
the RM UI.
My understanding is that, since the app was submitted with no
AM_LABEL_EXPRESSION, and the code base already allows such an app to run in
its current queue's default partition, the move-queue scenario should also let
the app run successfully. That means am2 should get resources from queue_b's
default partition x instead of pending forever.
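To make the difference concrete, here is a minimal, self-contained sketch of
the two behaviours. It is not ResourceManager code: AmLabelSketch, Queue,
resolveAmLabel and canSatisfy are hypothetical names introduced only for
illustration, and the default partition is modeled as the empty string.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the scenario; none of these types exist in YARN.
final class AmLabelSketch {
    // Assumption: the default partition is represented by the empty label.
    static final String DEFAULT_PARTITION = "";

    // Hypothetical queue: a default partition plus the partitions it can use.
    static final class Queue {
        final String name;
        final String defaultPartition;
        final Set<String> accessiblePartitions;

        Queue(String name, String defaultPartition, String... accessible) {
            this.name = name;
            this.defaultPartition = defaultPartition;
            this.accessiblePartitions = new HashSet<>(Arrays.asList(accessible));
        }

        // True if this queue has resources in the requested partition.
        boolean canSatisfy(String label) {
            return accessiblePartitions.contains(label);
        }
    }

    // A label set by the user wins; otherwise fall back to the given
    // queue's default partition.
    static String resolveAmLabel(String userLabel, Queue queue) {
        return userLabel != null ? userLabel : queue.defaultPartition;
    }

    public static void main(String[] args) {
        Queue queueA = new Queue("queue_a", DEFAULT_PARTITION, DEFAULT_PARTITION);
        Queue queueB = new Queue("queue_b", "x", "x");
        String userLabel = null; // app submitted without AM_LABEL_EXPRESSION

        // Reported behaviour: the label resolved at submission (against
        // queue_a) is reused for am2 after the move to queue_b.
        String frozen = resolveAmLabel(userLabel, queueA);
        System.out.println("reuse submission label: schedulable="
                + queueB.canSatisfy(frozen));

        // Proposed behaviour: resolve again against the current queue.
        String reResolved = resolveAmLabel(userLabel, queueB);
        System.out.println("re-resolve on current queue: schedulable="
                + queueB.canSatisfy(reResolved));
    }
}
```

In this model the first check fails, because queue_b has no default_partition
resources, and the second succeeds, because am2 asks for partition x. A label
explicitly set by the user is left untouched in both cases.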
In our production, we have a landing queue with default_partition, and a
routing mechanism that moves apps from this queue to other queues, including
queues with an exclusive partition.
!Screen Shot 2019-11-14 at 5.11.39 PM.png!