[jira] [Comment Edited] (YARN-9879) Allow multiple leaf queues with the same name in CS

Szilard Nemeth (Jira) Fri, 10 Jan 2020 12:05:55 -0800


    [ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013171#comment-17013171
 ]


Szilard Nemeth edited comment on YARN-9879 at 1/10/20 8:04 PM:
---------------------------------------------------------------

Thank you for your feedbacks [~wilfreds], [~pbacsko], [~leftnoteasy] and thans 
[~shuzirra] for moving this forward!

Chiming in a bit late, but I would like to give my opinion as well. Apologies 
for being so lengthy, but I wanted to quote and reply some parts of the 
comments had been given so far.
 Please note that some parts of what I'm saying here is taking the FS to CS 
migration into account.

*1. Quoting from Wilfred's comment:*

A.
{quote}If we simply relax the rule that the leaf queue must be unique in the 
system in favour of the fact that a queue must be unique based on the full 
queue path. This does not break existing configurations as the unique leaf 
queue is also unique when you take into account the whole path. That means 
there is nothing for the current clusters that needs to change. Internally the 
scheduler does have to change to make sure that all references use the queue 
path. This will require a lot of changes throughout the scheduler when you look 
up a queue and the way we store the reference if it is not directly to the leaf 
queue.
{quote}
I do agree with this, existing configurations does not break. However, allowing 
non-unique leaf queues can make the CS behave weirdly and inconsistently as 
queues are being added later on, as [~shuzirra] described this lately in his 
comment. More on this matter later.

 

B.
{quote}When an application is submitted with just a queue name (not a path) we 
expect that the name is a unique leaf queue name. If that queue does not exist 
or is not uniquely identifiable we reject the application submission. 
Resolution of the real leaf queue follows the same steps as it does now. The 
queue name in the end is converted to the correct leaf queue identified by the 
a path. For existing configurations nothing has changed. Internally we hide all 
the changes.
{quote}
 

AND

C.
{quote}The important part is applying a new configuration. If the configuration 
adds a leaf queue that is not unique the configuration update currently is 
rejected. With this change we would allow that config to become active. This 
could break existing applications when they try to submit to the leaf queue 
that is no longer unique.
 We should at least log and warn clearly in the response of the update. Maybe 
even show it in the UI or we could ask for a confirmation. The first update 
that adds a non unique queue to the configuration should always fail 
complaining loudly. It should then keep warning the user and rejecting the 
update unless a confirmation flag is set to force the update through. After the 
first update that would not be needed anymore.
 Reading a config from a file or store which is used to initialise the 
scheduler should not trigger such behaviour. We still should show a warning in 
the logs to make sure it is not lost.
{quote}
About what you said: "This could break existing applications when they try to 
submit to the leaf queue that is no longer unique."
 Again, this is really problematic and error-prone, as detailed by [~shuzirra] 
in his comment above. 
 We discussed this together with Gergo and I'm trying to give more weight on 
how tricky this case can get and what I'm doing now is just reiterate on 
Gergo's comment

Imagine the following queue setup:
{code:java}
root
  |____a
  |
  |____b
  |    |____b
  |    |____c
  |    |____d
  |
  |____c
       |____e
       |____f
{code}
If someone is adding root.c.b, queue 'b' is non-unique anymore without even 
knowing or taking care of root.b.b is the other queue 'b'.
 Even if we log and warn in the response of the update, adding the root.c.b 
queue can break the app submitted to queue root.b.b.
 I can imagine many real-world scenarios where users were submitting apps to a 
queue (root.b.b) with just the leaf queue name ('b') and their app submission 
would get suddenly rejected since another queue do exist with the same name. 
Just consider a user-group mapping and two same usernames exist under different 
organizations/user groups and we can be in the situation what I depicted. I 
also agree to Gergo that we can't expect or assume every user and apps 
submission is guarded by monitoring or warning systems. 
 I think it's definitely not a good idea to break existing users and app 
submissions with this change.

*2. Quoting from Wangda's comment:*
{quote}I personally think it is not a big deal if application reject reasons 
from RM can clearly guide users to use full qualified queue path when 
duplicated queue names exists. It is like if a team has only one Peter we can 
use the first name only otherwise we will add last name to avoid confusion. It 
isn't counter-intuitive to me.
{quote}
How can we know it's not a big deal? As we are not fully familiar how users are 
submitting apps: They can have scripts (or even many scripts), cron jobs, other 
unknown setup and whatnot that triggers app submissions. We can't and shouldn't 
do any non-backward compatible changes. For me, this approach is not an 
acceptable one as described with my ASCII art + description in detail.

*3. Quoting from Gergo's comment*

A.
{quote}I think this can be a major issue. In a scenario when the user have 
let's say a job which runs daily, and it's been working like that for years, 
can simply break, because an other user, totally unrelated to his team or even 
department creates a queue with the same name. The suddenly this application 
will start failing, and even if we add logs which explicitly tell what is the 
reason behind the application rejection, user one might not even notice it 
started failing. (We shouldn't assume everyone has proper monitoring and 
warning systems). So technically any user who can create a queue, can disable 
an other user's application if it is started by using single queue name 
reference. And it concerns me a bit.
{quote}
Agree with this, the whole point of my comment was based around this fact Gergo 
shared.

B.
{quote}However Adam Antal had an idea to even fix this issue, and then we can 
move forwards with Wilfred's suggestion:

We should make it possible to flag QUEUES for using full queue name reference, 
and these flags should mean ALL queues under the flagged queue can ONLY be 
referenced by full queue name. For regular queues (non-flagged ones), we would 
still enforce the unique leaf queue policy, while newer or migrating users 
could stick to the full queue reference. This proposal can also helps gradual 
migration, for older CS users buy slowly flagging queues in which the 
applications are already started by queue name.
{quote}
*Some things I have in mind, so listing the ADVANTAGES and some properties of 
this approach:*

1. First and foremost, duplicate queue names cannot be added under unflagged 
queues.
 2. Using the department / user analogy here. We don't have the problem anymore 
that some other department adds a user and this user becomes non-unique as 
other department has the same username in it, so app submissions can work well 
for all departments / users.
 As the new department added a user (root.c.b) has their subqueues flagged, app 
submissions should only work with full queue paths here and at the same time, 
the another department had the same user (root.b.b), app submission will still 
work well as all queues under root.c should be referenced only referenced by 
full queue path.
 3. Customers can gradually move queues to use full queue path, as much as they 
want.
 4. FS to CS migration tool should add the flag to the root queue if moved from 
FS.

4. Quoting from Wangda's comment:
{quote}Thanks Gergely Pollak, I think adding a flag (suggestion from Adam 
Antal) will prevent admin to change it accidentally, but it is hard to be 
understand (thinking about a regular Hadoop user). And we need to maintain it 
in a long run.
{quote}
I'm not sure why this is harder to understand than regular queue and scheduler 
config parameters, which can be overly complex to setup.
 With this approach, at least we would give something in our user's hands: They 
can use the flag to move queues under to the new way to address queues (queue 
path) gradually. Without the flag-based approach, we don't give users anything 
but we require them to change their submission code everytime a leaf queue is 
added that is already in the hierarchy with the same name. If we don't consider 
this as a hard requirement, apps would fail randomly, from the perspective of 
the users.

*As a summary:*
 *Please note that I did not dive into the CS code to decide what is the 
development cost of such a flag-based solution, this is up to* [~shuzirra] *to 
give some estimates as the jira is on his name.*
 *However, I wanted to state and share my thoughts: Rejecting apps on 
submission time that worked before and won't work afterwards seems like very 
far from an acceptable solution.*


was (Author: snemeth):
Thank you for your feedbacks [~wilfreds], [~pbacsko], [~leftnoteasy] and thans 
[~shuzirra] for moving this forward!

Chiming in a bit late, but I would like to give my opinion as well. Apologies 
for being so lengthy, but I wanted to quote and reply some parts of the 
comments had been given so far.
 Please note that some parts of what I'm saying here is taking the FS to CS 
migration into account.

1. Quoting from Wilfred's comment:

A.
{quote}If we simply relax the rule that the leaf queue must be unique in the 
system in favour of the fact that a queue must be unique based on the full 
queue path. This does not break existing configurations as the unique leaf 
queue is also unique when you take into account the whole path. That means 
there is nothing for the current clusters that needs to change. Internally the 
scheduler does have to change to make sure that all references use the queue 
path. This will require a lot of changes throughout the scheduler when you look 
up a queue and the way we store the reference if it is not directly to the leaf 
queue.
{quote}

 I do agree with this, existing configurations does not break. However, 
allowing non-unique leaf queues can make the CS behave weirdly and 
inconsistently as queues are being added later on, as [~shuzirra] described 
this lately in his comment. More on this matter later.

 

B.
{quote}When an application is submitted with just a queue name (not a path) we 
expect that the name is a unique leaf queue name. If that queue does not exist 
or is not uniquely identifiable we reject the application submission. 
Resolution of the real leaf queue follows the same steps as it does now. The 
queue name in the end is converted to the correct leaf queue identified by the 
a path. For existing configurations nothing has changed. Internally we hide all 
the changes.
{quote}
 

AND

C.
{quote}The important part is applying a new configuration. If the configuration 
adds a leaf queue that is not unique the configuration update currently is 
rejected. With this change we would allow that config to become active. This 
could break existing applications when they try to submit to the leaf queue 
that is no longer unique.
 We should at least log and warn clearly in the response of the update. Maybe 
even show it in the UI or we could ask for a confirmation. The first update 
that adds a non unique queue to the configuration should always fail 
complaining loudly. It should then keep warning the user and rejecting the 
update unless a confirmation flag is set to force the update through. After the 
first update that would not be needed anymore.
 Reading a config from a file or store which is used to initialise the 
scheduler should not trigger such behaviour. We still should show a warning in 
the logs to make sure it is not lost.
{quote}
About what you said: "This could break existing applications when they try to 
submit to the leaf queue that is no longer unique."
 Again, this is really problematic and error-prone, as detailed by [~shuzirra] 
in his comment above. 
 We discussed this together with Gergo and I'm trying to give more weight on 
how tricky this case can get and what I'm doing now is just reiterate on 
Gergo's comment

Imagine the following queue setup:
{code:java}
root
  |____a
  |
  |____b
  |    |____b
  |    |____c
  |    |____d
  |
  |____c
       |____e
       |____f
{code}
If someone is adding root.c.b, queue 'b' is non-unique anymore without even 
knowing or taking care of root.b.b is the other queue 'b'.
 Even if we log and warn in the response of the update, adding the root.c.b 
queue can break the app submitted to queue root.b.b.
 I can imagine many real-world scenarios where users were submitting apps to a 
queue (root.b.b) with just the leaf queue name ('b') and their app submission 
would get suddenly rejected since another queue do exist with the same name. 
Just consider a user-group mapping and two same usernames exist under different 
organizations/user groups and we can be in the situation what I depicted. I 
also agree to Gergo that we can't expect or assume every user and apps 
submission is guarded by monitoring or warning systems. 
 I think it's definitely not a good idea to break existing users and app 
submissions with this change.

2. Quoting from Wangda's comment:
{quote}I personally think it is not a big deal if application reject reasons 
from RM can clearly guide users to use full qualified queue path when 
duplicated queue names exists. It is like if a team has only one Peter we can 
use the first name only otherwise we will add last name to avoid confusion. It 
isn't counter-intuitive to me.
{quote}
How can we know it's not a big deal? As we are not fully familiar how users are 
submitting apps: They can have scripts (or even many scripts), cron jobs, other 
unknown setup and whatnot that triggers app submissions. We can't and shouldn't 
do any non-backward compatible changes. For me, this approach is not an 
acceptable one as described with my ASCII art + description in detail.

3. Quoting from Gergo's comment

A.
{quote}I think this can be a major issue. In a scenario when the user have 
let's say a job which runs daily, and it's been working like that for years, 
can simply break, because an other user, totally unrelated to his team or even 
department creates a queue with the same name. The suddenly this application 
will start failing, and even if we add logs which explicitly tell what is the 
reason behind the application rejection, user one might not even notice it 
started failing. (We shouldn't assume everyone has proper monitoring and 
warning systems). So technically any user who can create a queue, can disable 
an other user's application if it is started by using single queue name 
reference. And it concerns me a bit.
{quote}
Agree with this, the whole point of my comment was based around this fact Gergo 
shared.

B.
{quote}However Adam Antal had an idea to even fix this issue, and then we can 
move forwards with Wilfred's suggestion:

We should make it possible to flag QUEUES for using full queue name reference, 
and these flags should mean ALL queues under the flagged queue can ONLY be 
referenced by full queue name. For regular queues (non-flagged ones), we would 
still enforce the unique leaf queue policy, while newer or migrating users 
could stick to the full queue reference. This proposal can also helps gradual 
migration, for older CS users buy slowly flagging queues in which the 
applications are already started by queue name.
{quote}
*Some things I have in mind, so listing the ADVANTAGES and some properties of 
this approach:*

 1. First and foremost, duplicate queue names cannot be added under unflagged 
queues.
 2. Using the department / user analogy here. We don't have the problem anymore 
that some other department adds a user and this user becomes non-unique as 
other department has the same username in it, so app submissions can work well 
for all departments / users.
 As the new department added a user (root.c.b) has their subqueues flagged, app 
submissions should only work with full queue paths here and at the same time, 
the another department had the same user (root.b.b), app submission will still 
work well as all queues under root.c should be referenced only referenced by 
full queue path.
 3. Customers can gradually move queues to use full queue path, as much as they 
want.
 4. FS to CS migration tool should add the flag to the root queue if moved from 
FS.

4. Quoting from Wangda's comment:
{quote}Thanks Gergely Pollak, I think adding a flag (suggestion from Adam 
Antal) will prevent admin to change it accidentally, but it is hard to be 
understand (thinking about a regular Hadoop user). And we need to maintain it 
in a long run.
{quote}
I'm not sure why this is harder to understand than regular queue and scheduler 
config parameters, which can be overly complex to setup.
 With this approach, at least we would give something in our user's hands: They 
can use the flag to move queues under to the new way to address queues (queue 
path) gradually. Without the flag-based approach, we don't give users anything 
but we require them to change their submission code everytime a leaf queue is 
added that is already in the hierarchy with the same name. If we don't consider 
this as a hard requirement, apps would fail randomly, from the perspective of 
the users.

*As a summary:*
 *Please note that I did not dive into the CS code to decide what is the 
development cost of such a flag-based solution, this is up to* [~shuzirra] *to 
give some estimates as the jira is on his name.*
 *However, I wanted to state and share my thoughts: Rejecting apps on 
submission time that worked before and won't work afterwards seems like very 
far from an acceptable solution.*

> Allow multiple leaf queues with the same name in CS
> ---------------------------------------------------
>
>                 Key: YARN-9879
>                 URL: https://issues.apache.org/jira/browse/YARN-9879
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Gergely Pollak
>            Assignee: Gergely Pollak
>            Priority: Major
>         Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (YARN-9879) Allow multiple leaf queues with the same name in CS

Reply via email to