[
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013171#comment-17013171
]
Szilard Nemeth edited comment on YARN-9879 at 1/10/20 8:04 PM:
---------------------------------------------------------------
Thank you for your feedbacks [~wilfreds], [~pbacsko], [~leftnoteasy] and thans
[~shuzirra] for moving this forward!
Chiming in a bit late, but I would like to give my opinion as well. Apologies
for being so lengthy, but I wanted to quote and reply some parts of the
comments had been given so far.
Please note that some parts of what I'm saying here is taking the FS to CS
migration into account.
*1. Quoting from Wilfred's comment:*
A.
{quote}If we simply relax the rule that the leaf queue must be unique in the
system in favour of the fact that a queue must be unique based on the full
queue path. This does not break existing configurations as the unique leaf
queue is also unique when you take into account the whole path. That means
there is nothing for the current clusters that needs to change. Internally the
scheduler does have to change to make sure that all references use the queue
path. This will require a lot of changes throughout the scheduler when you look
up a queue and the way we store the reference if it is not directly to the leaf
queue.
{quote}
I do agree with this, existing configurations does not break. However, allowing
non-unique leaf queues can make the CS behave weirdly and inconsistently as
queues are being added later on, as [~shuzirra] described this lately in his
comment. More on this matter later.
B.
{quote}When an application is submitted with just a queue name (not a path) we
expect that the name is a unique leaf queue name. If that queue does not exist
or is not uniquely identifiable we reject the application submission.
Resolution of the real leaf queue follows the same steps as it does now. The
queue name in the end is converted to the correct leaf queue identified by the
a path. For existing configurations nothing has changed. Internally we hide all
the changes.
{quote}
AND
C.
{quote}The important part is applying a new configuration. If the configuration
adds a leaf queue that is not unique the configuration update currently is
rejected. With this change we would allow that config to become active. This
could break existing applications when they try to submit to the leaf queue
that is no longer unique.
We should at least log and warn clearly in the response of the update. Maybe
even show it in the UI or we could ask for a confirmation. The first update
that adds a non unique queue to the configuration should always fail
complaining loudly. It should then keep warning the user and rejecting the
update unless a confirmation flag is set to force the update through. After the
first update that would not be needed anymore.
Reading a config from a file or store which is used to initialise the
scheduler should not trigger such behaviour. We still should show a warning in
the logs to make sure it is not lost.
{quote}
About what you said: "This could break existing applications when they try to
submit to the leaf queue that is no longer unique."
Again, this is really problematic and error-prone, as detailed by [~shuzirra]
in his comment above.
We discussed this together with Gergo and I'm trying to give more weight on
how tricky this case can get and what I'm doing now is just reiterate on
Gergo's comment
Imagine the following queue setup:
{code:java}
root
|____a
|
|____b
| |____b
| |____c
| |____d
|
|____c
|____e
|____f
{code}
If someone is adding root.c.b, queue 'b' is non-unique anymore without even
knowing or taking care of root.b.b is the other queue 'b'.
Even if we log and warn in the response of the update, adding the root.c.b
queue can break the app submitted to queue root.b.b.
I can imagine many real-world scenarios where users were submitting apps to a
queue (root.b.b) with just the leaf queue name ('b') and their app submission
would get suddenly rejected since another queue do exist with the same name.
Just consider a user-group mapping and two same usernames exist under different
organizations/user groups and we can be in the situation what I depicted. I
also agree to Gergo that we can't expect or assume every user and apps
submission is guarded by monitoring or warning systems.
I think it's definitely not a good idea to break existing users and app
submissions with this change.
*2. Quoting from Wangda's comment:*
{quote}I personally think it is not a big deal if application reject reasons
from RM can clearly guide users to use full qualified queue path when
duplicated queue names exists. It is like if a team has only one Peter we can
use the first name only otherwise we will add last name to avoid confusion. It
isn't counter-intuitive to me.
{quote}
How can we know it's not a big deal? As we are not fully familiar how users are
submitting apps: They can have scripts (or even many scripts), cron jobs, other
unknown setup and whatnot that triggers app submissions. We can't and shouldn't
do any non-backward compatible changes. For me, this approach is not an
acceptable one as described with my ASCII art + description in detail.
*3. Quoting from Gergo's comment*
A.
{quote}I think this can be a major issue. In a scenario when the user have
let's say a job which runs daily, and it's been working like that for years,
can simply break, because an other user, totally unrelated to his team or even
department creates a queue with the same name. The suddenly this application
will start failing, and even if we add logs which explicitly tell what is the
reason behind the application rejection, user one might not even notice it
started failing. (We shouldn't assume everyone has proper monitoring and
warning systems). So technically any user who can create a queue, can disable
an other user's application if it is started by using single queue name
reference. And it concerns me a bit.
{quote}
Agree with this, the whole point of my comment was based around this fact Gergo
shared.
B.
{quote}However Adam Antal had an idea to even fix this issue, and then we can
move forwards with Wilfred's suggestion:
We should make it possible to flag QUEUES for using full queue name reference,
and these flags should mean ALL queues under the flagged queue can ONLY be
referenced by full queue name. For regular queues (non-flagged ones), we would
still enforce the unique leaf queue policy, while newer or migrating users
could stick to the full queue reference. This proposal can also helps gradual
migration, for older CS users buy slowly flagging queues in which the
applications are already started by queue name.
{quote}
*Some things I have in mind, so listing the ADVANTAGES and some properties of
this approach:*
1. First and foremost, duplicate queue names cannot be added under unflagged
queues.
2. Using the department / user analogy here. We don't have the problem anymore
that some other department adds a user and this user becomes non-unique as
other department has the same username in it, so app submissions can work well
for all departments / users.
As the new department added a user (root.c.b) has their subqueues flagged, app
submissions should only work with full queue paths here and at the same time,
the another department had the same user (root.b.b), app submission will still
work well as all queues under root.c should be referenced only referenced by
full queue path.
3. Customers can gradually move queues to use full queue path, as much as they
want.
4. FS to CS migration tool should add the flag to the root queue if moved from
FS.
4. Quoting from Wangda's comment:
{quote}Thanks Gergely Pollak, I think adding a flag (suggestion from Adam
Antal) will prevent admin to change it accidentally, but it is hard to be
understand (thinking about a regular Hadoop user). And we need to maintain it
in a long run.
{quote}
I'm not sure why this is harder to understand than regular queue and scheduler
config parameters, which can be overly complex to setup.
With this approach, at least we would give something in our user's hands: They
can use the flag to move queues under to the new way to address queues (queue
path) gradually. Without the flag-based approach, we don't give users anything
but we require them to change their submission code everytime a leaf queue is
added that is already in the hierarchy with the same name. If we don't consider
this as a hard requirement, apps would fail randomly, from the perspective of
the users.
*As a summary:*
*Please note that I did not dive into the CS code to decide what is the
development cost of such a flag-based solution, this is up to* [~shuzirra] *to
give some estimates as the jira is on his name.*
*However, I wanted to state and share my thoughts: Rejecting apps on
submission time that worked before and won't work afterwards seems like very
far from an acceptable solution.*
was (Author: snemeth):
Thank you for your feedbacks [~wilfreds], [~pbacsko], [~leftnoteasy] and thans
[~shuzirra] for moving this forward!
Chiming in a bit late, but I would like to give my opinion as well. Apologies
for being so lengthy, but I wanted to quote and reply some parts of the
comments had been given so far.
Please note that some parts of what I'm saying here is taking the FS to CS
migration into account.
1. Quoting from Wilfred's comment:
A.
{quote}If we simply relax the rule that the leaf queue must be unique in the
system in favour of the fact that a queue must be unique based on the full
queue path. This does not break existing configurations as the unique leaf
queue is also unique when you take into account the whole path. That means
there is nothing for the current clusters that needs to change. Internally the
scheduler does have to change to make sure that all references use the queue
path. This will require a lot of changes throughout the scheduler when you look
up a queue and the way we store the reference if it is not directly to the leaf
queue.
{quote}
I do agree with this, existing configurations does not break. However,
allowing non-unique leaf queues can make the CS behave weirdly and
inconsistently as queues are being added later on, as [~shuzirra] described
this lately in his comment. More on this matter later.
B.
{quote}When an application is submitted with just a queue name (not a path) we
expect that the name is a unique leaf queue name. If that queue does not exist
or is not uniquely identifiable we reject the application submission.
Resolution of the real leaf queue follows the same steps as it does now. The
queue name in the end is converted to the correct leaf queue identified by the
a path. For existing configurations nothing has changed. Internally we hide all
the changes.
{quote}
AND
C.
{quote}The important part is applying a new configuration. If the configuration
adds a leaf queue that is not unique the configuration update currently is
rejected. With this change we would allow that config to become active. This
could break existing applications when they try to submit to the leaf queue
that is no longer unique.
We should at least log and warn clearly in the response of the update. Maybe
even show it in the UI or we could ask for a confirmation. The first update
that adds a non unique queue to the configuration should always fail
complaining loudly. It should then keep warning the user and rejecting the
update unless a confirmation flag is set to force the update through. After the
first update that would not be needed anymore.
Reading a config from a file or store which is used to initialise the
scheduler should not trigger such behaviour. We still should show a warning in
the logs to make sure it is not lost.
{quote}
About what you said: "This could break existing applications when they try to
submit to the leaf queue that is no longer unique."
Again, this is really problematic and error-prone, as detailed by [~shuzirra]
in his comment above.
We discussed this together with Gergo and I'm trying to give more weight on
how tricky this case can get and what I'm doing now is just reiterate on
Gergo's comment
Imagine the following queue setup:
{code:java}
root
|____a
|
|____b
| |____b
| |____c
| |____d
|
|____c
|____e
|____f
{code}
If someone is adding root.c.b, queue 'b' is non-unique anymore without even
knowing or taking care of root.b.b is the other queue 'b'.
Even if we log and warn in the response of the update, adding the root.c.b
queue can break the app submitted to queue root.b.b.
I can imagine many real-world scenarios where users were submitting apps to a
queue (root.b.b) with just the leaf queue name ('b') and their app submission
would get suddenly rejected since another queue do exist with the same name.
Just consider a user-group mapping and two same usernames exist under different
organizations/user groups and we can be in the situation what I depicted. I
also agree to Gergo that we can't expect or assume every user and apps
submission is guarded by monitoring or warning systems.
I think it's definitely not a good idea to break existing users and app
submissions with this change.
2. Quoting from Wangda's comment:
{quote}I personally think it is not a big deal if application reject reasons
from RM can clearly guide users to use full qualified queue path when
duplicated queue names exists. It is like if a team has only one Peter we can
use the first name only otherwise we will add last name to avoid confusion. It
isn't counter-intuitive to me.
{quote}
How can we know it's not a big deal? As we are not fully familiar how users are
submitting apps: They can have scripts (or even many scripts), cron jobs, other
unknown setup and whatnot that triggers app submissions. We can't and shouldn't
do any non-backward compatible changes. For me, this approach is not an
acceptable one as described with my ASCII art + description in detail.
3. Quoting from Gergo's comment
A.
{quote}I think this can be a major issue. In a scenario when the user have
let's say a job which runs daily, and it's been working like that for years,
can simply break, because an other user, totally unrelated to his team or even
department creates a queue with the same name. The suddenly this application
will start failing, and even if we add logs which explicitly tell what is the
reason behind the application rejection, user one might not even notice it
started failing. (We shouldn't assume everyone has proper monitoring and
warning systems). So technically any user who can create a queue, can disable
an other user's application if it is started by using single queue name
reference. And it concerns me a bit.
{quote}
Agree with this, the whole point of my comment was based around this fact Gergo
shared.
B.
{quote}However Adam Antal had an idea to even fix this issue, and then we can
move forwards with Wilfred's suggestion:
We should make it possible to flag QUEUES for using full queue name reference,
and these flags should mean ALL queues under the flagged queue can ONLY be
referenced by full queue name. For regular queues (non-flagged ones), we would
still enforce the unique leaf queue policy, while newer or migrating users
could stick to the full queue reference. This proposal can also helps gradual
migration, for older CS users buy slowly flagging queues in which the
applications are already started by queue name.
{quote}
*Some things I have in mind, so listing the ADVANTAGES and some properties of
this approach:*
1. First and foremost, duplicate queue names cannot be added under unflagged
queues.
2. Using the department / user analogy here. We don't have the problem anymore
that some other department adds a user and this user becomes non-unique as
other department has the same username in it, so app submissions can work well
for all departments / users.
As the new department added a user (root.c.b) has their subqueues flagged, app
submissions should only work with full queue paths here and at the same time,
the another department had the same user (root.b.b), app submission will still
work well as all queues under root.c should be referenced only referenced by
full queue path.
3. Customers can gradually move queues to use full queue path, as much as they
want.
4. FS to CS migration tool should add the flag to the root queue if moved from
FS.
4. Quoting from Wangda's comment:
{quote}Thanks Gergely Pollak, I think adding a flag (suggestion from Adam
Antal) will prevent admin to change it accidentally, but it is hard to be
understand (thinking about a regular Hadoop user). And we need to maintain it
in a long run.
{quote}
I'm not sure why this is harder to understand than regular queue and scheduler
config parameters, which can be overly complex to setup.
With this approach, at least we would give something in our user's hands: They
can use the flag to move queues under to the new way to address queues (queue
path) gradually. Without the flag-based approach, we don't give users anything
but we require them to change their submission code everytime a leaf queue is
added that is already in the hierarchy with the same name. If we don't consider
this as a hard requirement, apps would fail randomly, from the perspective of
the users.
*As a summary:*
*Please note that I did not dive into the CS code to decide what is the
development cost of such a flag-based solution, this is up to* [~shuzirra] *to
give some estimates as the jira is on his name.*
*However, I wanted to state and share my thoughts: Rejecting apps on
submission time that worked before and won't work afterwards seems like very
far from an acceptable solution.*
> Allow multiple leaf queues with the same name in CS
> ---------------------------------------------------
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Gergely Pollak
> Assignee: Gergely Pollak
> Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in
> the queue hierarchy.
> Design doc and first proposal is being made, I'll attach it as soon as it's
> done.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]