[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times

2021-06-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363387#comment-17363387
 ] 

Hadoop QA commented on YARN-10767:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
44s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to 
include any new or modified tests. Please justify why no new tests are needed 
for this patch. Also please list what manual steps were performed to verify 
this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
30s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
48s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
29s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
45s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 41s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 19m 
43s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  1m 
38s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
37s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
41s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
37s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
37s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
37s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m  8s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| 

[jira] [Updated] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times

2021-06-14 Thread D M Murali Krishna Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

D M Murali Krishna Reddy updated YARN-10767:

Attachment: YARN-10767.004.patch

> Yarn Logs Command retrying on Standby RM for 30 times
> -
>
> Key: YARN-10767
> URL: https://issues.apache.org/jira/browse/YARN-10767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10767.001.patch, YARN-10767.002.patch, 
> YARN-10767.003.patch, YARN-10767.004.patch
>
>
> When ResourceManager HA is enabled and the first RM is unavailable, on 
> executing "bin/yarn logs -applicationId  -am 1", we get a 
> ConnectionException when connecting to the first RM; the ConnectionException 
> occurs 30 times before the client tries to connect to the second RM.
>  
> This can be optimized by trying to fetch the logs from the Active RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-14 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: YARN-10789.branch-3.3.001.patch
YARN-10789.branch-3.2.001.patch

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop install and RM initial 
> startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
> configured. This causes one of the RMs to not start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We are trying to create the znode /confstore/CONF_STORE when we initialize 
> the ZKConfigurationStore. But the problem is that the ZKConfigurationStore is 
> initialized when CapacityScheduler does a serviceInit. This serviceInit is 
> done by both the Active and Standby RM, so we can run into a race condition when 
> both Active and Standby try to create the same znode when both RMs are started 
> at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating the 
> znodes only after serviceStart. serviceStart only happens for the active RM 
> that won the leader election, unlike serviceInit, which happens irrespective 
> of leader election.
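
For context, a minimal sketch of one common way to tolerate this kind of create race in ZooKeeper, namely treating NodeExists as success. This is purely illustrative and not necessarily how the attached patches address it (they may instead move creation to serviceStart of the elected leader, as described above); the class and method names are made up.
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch only: make the znode creation idempotent so that whichever
// RM loses the race simply observes that the node already exists.
class ConfStoreInitSketch {
  static void createIfAbsent(ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // The other RM created it first; nothing to do.
    }
  }
}
{code}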



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values

2021-06-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363215#comment-17363215
 ] 

Eric Payne commented on YARN-10802:
---

[~bteke], Thanks for raising this issue and for working on it. I have a 
question and an observation.
{quote}Capacity Scheduler's minimum-user-limit-percent only accepts integers, 
which means at most 100 users can use a single queue fairly
{quote}
This isn't exactly accurate.

Minimum user limit percent is only enforced when a queue's max capacity is 
reached _AND_ (100 / {{min-user-limit-pct}}) users are both using resources and 
asking for more resources. As long as the queue's max capacity is not reached 
_AND_ there are more resources available in the system, the 101st, 102nd, 
103rd, etc., users will be assigned resources.

So, my question is, do you have a use case where
 1. 100 users are using up the max capacity in the queue
 2. All 100 users are active (that is, requesting more resources)
 3. The 101st user comes in and is starved because, as containers are released, 
they are assigned to one of the first 100 (again, because they are all asking 
for resources)?

We have several very-heavily-used multi-tenant queues that often have 100 or 
more users running, but only a subset of them are actively requesting resources.

My observation is that when we have set the min-user-limit-pct to be 1 in a 
very highly used multi-tenant queue, the user limit grows way too slowly. The 
min-user-limit-pct is used in calculating the user limit (seen as "Max 
Resources" in the queue's pull-down menu in the RM GUI). When the queue grows 
above its capacity but is still below its max capacity, the calculations for 
user limit in {{UsersManager#computeUserLimit}} use the min-user-limit-pct to 
limit how fast the user limit can grow. The smaller the min-user-limit-pct is, 
the slower it grows. What ends up happening is that a few users want to grow 
larger, but several smaller users come in, request resources, and leave without 
ever reaching the current user limit. This process repeats because there are 
several new active users all the time, so the longer-running, larger users 
can't grow beyond a certain limit even though there are still available queue 
and cluster resources.
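
To make the slow-growth effect concrete, here is a rough numeric sketch of the user-limit formula quoted later in this thread (in the YARN-10821 comments). All numbers and names below are illustrative assumptions, not values from a real cluster:
{code:java}
// Rough numeric sketch only (made-up numbers): a simplified form of the
// userLimitResource formula from UsersManager#computeUserLimit, i.e.
// max(resourceUsed / activeUsers, currentCapacity * min-user-limit-pct / 100).
public class UserLimitGrowthSketch {
  public static void main(String[] args) {
    long resourceUsedGB = 80;      // resources used by active users in the queue
    int activeUsers = 4;           // users currently asking for resources
    long currentCapacityGB = 100;  // queue's current capacity
    float minUserLimitPct = 1f;    // the configured minimum-user-limit-percent

    long perUserShare = (long) Math.ceil((double) resourceUsedGB / activeUsers);   // 20
    long pctOfCapacity = (long) (currentCapacityGB * minUserLimitPct / 100f);      // 1
    long userLimitGB = Math.max(perUserShare, pctOfCapacity);                      // 20

    // With min-user-limit-pct = 1 the second term is tiny, so the limit can only
    // creep up as fast as resourceUsed / activeUsers does.
    System.out.println("user limit = " + userLimitGB + " GB");
  }
}
{code}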

> Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
> -
>
> Key: YARN-10802
> URL: https://issues.apache.org/jira/browse/YARN-10802
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10802.001.patch, YARN-10802.002.patch, 
> YARN-10802.003.patch, YARN-10802.004.patch
>
>
> Capacity Scheduler's minimum-user-limit-percent only accepts integers, which 
> means at most 100 users can use a single queue fairly. Using decimal values 
> could solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values

2021-06-14 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-10802:
-
Description: Capacity Scheduler's minimum-user-limit-percent only accepts 
integers, which means at most 100 users can use a single queue fairly. Using 
decimal values could solve this problem.  (was: Capacity Scheduler's 
minimum-user-limit-percent only accepts integers, which means at most 100 users 
can use a single fairly. Using decimal values could solve this problem.)

> Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
> -
>
> Key: YARN-10802
> URL: https://issues.apache.org/jira/browse/YARN-10802
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10802.001.patch, YARN-10802.002.patch, 
> YARN-10802.003.patch, YARN-10802.004.patch
>
>
> Capacity Scheduler's minimum-user-limit-percent only accepts integers, which 
> means at most 100 users can use a single queue fairly. Using decimal values 
> could solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times

2021-06-14 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363205#comment-17363205
 ] 

Jim Brennan commented on YARN-10767:


Thanks for the update [~dmmkr]! I can see that you changed
{noformat}
  public static String findActiveRMHAId(YarnConfiguration conf) {
YarnConfiguration yarnConf = new YarnConfiguration(conf);
{noformat}
to
{noformat}
  public static String findActiveRMHAId(YarnConfiguration yarnConf) {
{noformat}
Effectively, this moves the construction of the temporary YarnConfiguration to 
the caller.
I see that in the other place where this method is called, it was already doing 
that, so in that sense this makes sense.

I am wondering about the change in behavior for findActiveRMHAId(), though. 
Previously, it did not change the conf that was passed in - it made changes in 
a local copy. Now it will modify the passed-in conf whether it succeeds or 
fails, by setting RM_HA_ID.

That is why I suggested changing it to this:
{noformat}
  public static String findActiveRMHAId(Configuration conf) {
YarnConfiguration yarnConf = new YarnConfiguration(conf);
{noformat}
Then you can just use the conf you were passed in.

This does not make any functional difference for the current callers, but it 
could matter to future callers if they assume findActiveRMHAId won't modify 
the passed-in conf.
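
For illustration, a minimal sketch of the non-mutating variant suggested above. The body is abbreviated (the actual probing of each RM's HA state is elided), so this is a shape sketch rather than the real Hadoop implementation:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.HAUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch only: the signature takes a plain Configuration and works on a local
// YarnConfiguration copy, so the caller's conf is never modified.
class FindActiveRMHAIdSketch {
  public static String findActiveRMHAId(Configuration conf) {
    YarnConfiguration yarnConf = new YarnConfiguration(conf);
    for (String rmId : HAUtil.getRMHAIds(yarnConf)) {
      yarnConf.set(YarnConfiguration.RM_HA_ID, rmId);  // only the local copy is touched
      // ... query this RM's HAServiceState and return rmId if it is ACTIVE ...
    }
    return null;  // no active RM found
  }
}
{code}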

 

> Yarn Logs Command retrying on Standby RM for 30 times
> -
>
> Key: YARN-10767
> URL: https://issues.apache.org/jira/browse/YARN-10767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10767.001.patch, YARN-10767.002.patch, 
> YARN-10767.003.patch
>
>
> When ResourceManager HA is enabled and the first RM is unavailable, on 
> executing "bin/yarn logs -applicationId  -am 1", we get a 
> ConnectionException when connecting to the first RM; the ConnectionException 
> occurs 30 times before the client tries to connect to the second RM.
>  
> This can be optimized by trying to fetch the logs from the Active RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values

2021-06-14 Thread Szilard Nemeth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-10802:
--
Fix Version/s: 3.4.0

> Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
> -
>
> Key: YARN-10802
> URL: https://issues.apache.org/jira/browse/YARN-10802
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10802.001.patch, YARN-10802.002.patch, 
> YARN-10802.003.patch, YARN-10802.004.patch
>
>
> Capacity Scheduler's minimum-user-limit-percent only accepts integers, which 
> means at most 100 users can use a single fairly. Using decimal values could 
> solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values

2021-06-14 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363194#comment-17363194
 ] 

Szilard Nemeth commented on YARN-10802:
---

OK, I had a quick offline discussion with [~bteke]; it turns out I confused 
decimal with whole numbers (English is not my native language), but it's still a 
bit embarrassing.
Anyway, the Java primitive data types documentation also mentions double / 
float as decimal types: 
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
I don't think it's worth uploading a new patch to fix the nit, so I fixed it 
just before committing.
Thanks again [~bteke] for the patch; committed to trunk and resolving the Jira now.

> Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
> -
>
> Key: YARN-10802
> URL: https://issues.apache.org/jira/browse/YARN-10802
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10802.001.patch, YARN-10802.002.patch, 
> YARN-10802.003.patch, YARN-10802.004.patch
>
>
> Capacity Scheduler's minimum-user-limit-percent only accepts integers, which 
> means at most 100 users can use a single fairly. Using decimal values could 
> solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values

2021-06-14 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363192#comment-17363192
 ] 

Szilard Nemeth commented on YARN-10802:
---

Hi [~bteke],
Thanks for working on this.

Some comments for the latest patch:
1. Checking the description: Capacity Scheduler's minimum-user-limit-percent 
only accepts integers, which means at most 100 users can use a single fairly. 
*Using decimal values could solve this problem.*
Didn't you want to add "using fractional values could solve this problem"?
Also, the name of the testcase 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue#testDecimalUserLimits
 says decimal user limits, but you are setting 50.1%, which is a fractional value.

2. Nit: In 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue#testDecimalUserLimits
You may replace 0*GB with 0 in assertions like:
{code}
assertEquals(0*GB, app1.getCurrentConsumption().getMemorySize());
{code}

Other than these, the patch looks okay.

> Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
> -
>
> Key: YARN-10802
> URL: https://issues.apache.org/jira/browse/YARN-10802
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10802.001.patch, YARN-10802.002.patch, 
> YARN-10802.003.patch, YARN-10802.004.patch
>
>
> Capacity Scheduler's minimum-user-limit-percent only accepts integers, which 
> means at most 100 users can use a single fairly. Using decimal values could 
> solve this problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10813) Root queue capacity is not set when using node labels

2021-06-14 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363185#comment-17363185
 ] 

Szilard Nemeth commented on YARN-10813:
---

Thanks [~gandras] for reporting this.
This is quite a trivial fix, but it's good that you spotted it.

Some questions / observations: 
1. How come our tests didn't catch this? If there isn't one already, is it easy 
to add a unit test to cover the fixed scenario?
2. I would have been baffled if we didn't have a common constant for the queue 
name "root" anywhere. As it turns out, we have many such constants; just 
search for "root" in the package 
org/apache/hadoop/yarn/server/resourcemanager/scheduler.
I know it's not strongly related to this, but could you please file a 
follow-up to clean those up? I just don't want to increase the number of 
occurrences of "root" in production code any further.
Thanks.

> Root queue capacity is not set when using node labels
> -
>
> Key: YARN-10813
> URL: https://issues.apache.org/jira/browse/YARN-10813
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10813.001.patch
>
>
> CapacitySchedulerConfiguration#getNonLabeledQueueCapacity handles root in the 
> following way:
> {code:java}
> if (absoluteResourceConfigured || configuredWeightAsCapacity(
> configuredCapacity)) {
>   // Return capacity in percentage as 0 for non-root queues and 100 for
>   // root.From AbstractCSQueue, absolute resource will be parsed and
>   // updated. Once nodes are added/removed in cluster, capacity in
>   // percentage will also be re-calculated.
>   return queue.equals("root") ? 100.0f : 0f;
> }
> {code}
> CapacitySchedulerConfiguration#internalGetLabeledQueueCapacity on the other 
> hand does not take root queue into consideration:
> {code:java}
> if (absoluteResourceConfigured || configuredWeightAsCapacity(
> configuredCapacity)) {
>   // Return capacity in percentage as 0 for non-root queues and 100 for
>   // root.From AbstractCSQueue, absolute resource, and weight will be 
> parsed
>   // and updated separately. Once nodes are added/removed in cluster,
>   // capacity is percentage will also be re-calculated.
>   return defaultValue;
> }
> float capacity = getFloat(capacityPropertyName, defaultValue);
> {code}
> Due to this, labeled root capacity is 0, which is not set in 
> AbstractCSQueue#derivedCapacityFromAbsoluteConfigurations, because root is 
> never in Absolute mode.
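
For reference, a tiny illustrative sketch of the shape such a fix could take, mirroring the root special-case in the non-labeled branch quoted above; the helper name is made up and the actual YARN-10813.001.patch may differ:
{code:java}
// Illustrative sketch of the assumed shape of the fix (mirroring the non-labeled
// branch quoted above); not the actual patch.
class LabeledRootCapacitySketch {
  static float labeledCapacityFallback(String queuePath, float defaultValue) {
    // Root is never re-derived from absolute resources, so it has to report 100%
    // here instead of falling back to the default (0) like other queues.
    return "root".equals(queuePath) ? 100.0f : defaultValue;
  }
}
{code}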



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10801) Fix Auto Queue template to properly set all configuration properties

2021-06-14 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363181#comment-17363181
 ] 

Szilard Nemeth commented on YARN-10801:
---

Hi [~gandras],
Thanks for working on this.

One thing I don't quite understand: 
ParentQueue#createNewQueue used to call ParentQueue#getConfForAutoCreatedQueue 
for child queues:
{code}
 childQueue = new LeafQueue(csContext,
getConfForAutoCreatedQueue(childQueuePath, isLeaf), queueShortName,
this, null);
{code}

The method definition was removed with your patch, the original method:
{code}
ParentQueue#getConfForAutoCreatedQueue

  private CapacitySchedulerConfiguration getConfForAutoCreatedQueue(
  String childQueuePath, boolean isLeaf) {
// Copy existing config
CapacitySchedulerConfiguration dupCSConfig =
new CapacitySchedulerConfiguration(
csContext.getConfiguration(), false);
autoCreatedQueueTemplate.setTemplateEntriesForChild(dupCSConfig,
childQueuePath);
if (isLeaf) {
  // set to -1, to disable it
  dupCSConfig.setUserLimitFactor(childQueuePath, -1);

  // Set Max AM percentage to a higher value
  dupCSConfig.setMaximumApplicationMasterResourcePerQueuePercent(
  childQueuePath, 0.5f);
}

return dupCSConfig;
  }
{code}

However, you replaced the calls with: 
{code}
  if (isLeaf) {
childQueue = new LeafQueue(csContext, csContext.getConfiguration(),
queueShortName, this, null, true);
  }
{code}

Method definition of LeafQueue#setDynamicQueueProperties: 
{code}
  @Override
  protected void setDynamicQueueProperties(
  CapacitySchedulerConfiguration configuration) {
super.setDynamicQueueProperties(configuration);
// set to -1, to disable it
configuration.setUserLimitFactor(getQueuePath(), -1);
// Set Max AM percentage to a higher value
configuration.setMaximumApplicationMasterResourcePerQueuePercent(
getQueuePath(), 1f);
  }
{code}

I can see that the old setMaximumApplicationMasterResourcePerQueuePercent was 
called with 0.5f and the new one is called with 1f.
Could you please explain the intention of this change?
Could you also add some more unit test assertions, verifying that the AM resource 
percentage takes the correct value and that the userLimitFactor is -1?

Thanks.

> Fix Auto Queue template to properly set all configuration properties
> 
>
> Key: YARN-10801
> URL: https://issues.apache.org/jira/browse/YARN-10801
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10801.001.patch, YARN-10801.002.patch, 
> YARN-10801.003.patch, YARN-10801.004.patch, YARN-10801.005.patch
>
>
> Currently Auto Queue templates set configuration properties only on the 
> Configuration object passed in the constructor. Due to the fact that a lot 
> of configuration values are read from the Configuration object in csContext, 
> template properties are not set in every case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363171#comment-17363171
 ] 

Eric Payne commented on YARN-10821:
---

{quote}

- In UsersManager#computeUserLimit the userLimit is calculated as is 
(currentCapacity * userLimit)
{code}
 Resource userLimitResource = Resources.max(resourceCalculator,
 partitionResource,
 Resources.divideAndCeil(resourceCalculator, resourceUsed,
 usersSummedByWeight),
 Resources.divideAndCeil(resourceCalculator,
 Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
 100));
{code}
{quote}
One more thing to note: another difference between the preemption and 
allocation calculations is that in the preemption path, {{resourceUsed}} in the 
above algorithm is the resources used by all users, whereas in the allocation 
path it is only the resources used by active users (that is, users currently 
asking for resources).
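
To make the difference concrete, a tiny numeric sketch of the quoted max(...) computed once with each input; all numbers are made-up assumptions:
{code:java}
// Sketch only (made-up numbers): the same max(...) from the quoted formula, fed
// once with the usage of all users (preemption path) and once with the usage of
// active users only (allocation path).
public class ResourceUsedDifferenceSketch {
  public static void main(String[] args) {
    double usersSummedByWeight = 3.0;  // three active users, weight 1 each
    long pctTerm = 100;                // currentCapacity * min-user-limit-pct / 100

    long usedByAllUsers = 900;         // preemption path input
    long usedByActiveUsers = 300;      // allocation path input

    long preemptionSideLimit =
        Math.max((long) Math.ceil(usedByAllUsers / usersSummedByWeight), pctTerm);    // 300
    long allocationSideLimit =
        Math.max((long) Math.ceil(usedByActiveUsers / usersSummedByWeight), pctTerm); // 100

    // Same queue state, two different user limits, purely from which usage figure
    // feeds the divideAndCeil term.
    System.out.println(preemptionSideLimit + " vs " + allocationSideLimit);
  }
}
{code}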

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more pronounced this effect will be 
> in preemption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363169#comment-17363169
 ] 

Eric Payne commented on YARN-10821:
---

{quote}
In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
calculated first:
In UsersManager#computeUserLimit the userLimit is calculated as is 
(currentCapacity * userLimit) 
{quote}
[~gandras], thanks for raising this issue.

{{LeafQueue#getUserAMResourceLimitPerPartition}} and 
{{UsersManager#computeUserLimit}} are used to calculate different things.
{{getUserAMResourceLimitPerPartition}} calculates the maximum resources that can 
be used for AMs by all apps from a single user in the {{LeafQueue}}.
{{computeUserLimit}} calculates the maximum total resources that can be used by 
all apps from a single user in the {{LeafQueue}}.
{{computeUserLimit}} is used not only during calculations by the preemption 
monitor, but also to calculate headroom during container allocation and 
assignment to a queue. In this way, the preemption monitor and the Capacity 
Scheduler allocations use the same computations for each user's user limit.

The calculations in {{getUserAMResourceLimitPerPartition}} are more lenient 
than those in {{computeUserLimit}}, but they are calculating different limits. 
This difference is not between preemption vs. allocation, but between the AM 
resource limit vs. the total resource limit per user.

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10821.001.patch
>
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more pronounced this effect will be 
> in preemption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363164#comment-17363164
 ] 

Hadoop QA commented on YARN-10821:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 23m  
3s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 
 6s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
47s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
57s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  2s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 20m 
20s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  1m 
56s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
50s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 39s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1062/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color}
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  3s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | 

[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-14 Thread Szilard Nemeth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363120#comment-17363120
 ] 

Szilard Nemeth commented on YARN-10789:
---

Hi [~tarunparimi],

Can you please upload the branch-3.3 patch so Jenkins will trigger and does the 
build?

Thanks.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch
>
>
> We are observing the below error randomly during Hadoop install and RM initial 
> startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
> configured. This causes one of the RMs to not start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We are trying to create the znode /confstore/CONF_STORE when we initialize 
> the ZKConfigurationStore. But the problem is that the ZKConfigurationStore is 
> initialized when CapacityScheduler does a serviceInit. This serviceInit is 
> done by both the Active and Standby RM, so we can run into a race condition when 
> both Active and Standby try to create the same znode when both RMs are started 
> at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating the 
> znodes only after serviceStart. serviceStart only happens for the active RM 
> that won the leader election, unlike serviceInit, which happens irrespective 
> of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values

2021-06-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363103#comment-17363103
 ] 

Hadoop QA commented on YARN-10802:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 21m  
6s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 3 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
38s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
57s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
56s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  5s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
42s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 20m 
20s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  1m 
54s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
52s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
56s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
47s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
47s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 53s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1060/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color}
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 2 new + 672 unchanged - 15 fixed = 674 total (was 687) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
50s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 50s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| 

[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times

2021-06-14 Thread D M Murali Krishna Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363087#comment-17363087
 ] 

D M Murali Krishna Reddy commented on YARN-10767:
-

[~Jim_Brennan], I have fixed the spotbugs issue in the v3 patch.

Can you have a look?

> Yarn Logs Command retrying on Standby RM for 30 times
> -
>
> Key: YARN-10767
> URL: https://issues.apache.org/jira/browse/YARN-10767
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: D M Murali Krishna Reddy
>Priority: Major
> Attachments: YARN-10767.001.patch, YARN-10767.002.patch, 
> YARN-10767.003.patch
>
>
> When ResourceManager HA is enabled and the first RM is unavailable, on 
> executing "bin/yarn logs -applicationId  -am 1", we get a 
> ConnectionException when connecting to the first RM; the ConnectionException 
> occurs 30 times before the client tries to connect to the second RM.
>  
> This can be optimized by trying to fetch the logs from the Active RM.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10822) Containers going from New to Scheduled transition even though container is killed before NM restart when NM recovery is enabled

2021-06-14 Thread Minni Mittal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Minni Mittal updated YARN-10822:

Description: 
INFO  [91] ContainerImpl: Container 
container_e1171_1623422468672_2229_01_000738 transitioned from NEW to LOCALIZING

INFO  [91] ContainerImpl: Container 
container_e1171_1623422468672_2229_01_000738 transitioned from LOCALIZING to 
SCHEDULED

INFO  [91] ContainerScheduler: Opportunistic container 
container_e1171_1623422468672_2229_01_000738 will be queued at the NM.

INFO  [127] ContainerManagerImpl: Stopping container with container Id: 
container_e1171_1623422468672_2229_01_000738

INFO  [91] ContainerImpl: Container 
container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to 
KILLING

INFO  [91] ContainerImpl: Container 
container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to 
CONTAINER_CLEANEDUP_AFTER_KILL

INFO  [91] NMAuditLogger: USER=defaultcafor1stparty OPERATION=Container 
Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS 
APPID=application_1623422468672_2229 
CONTAINERID=container_e1171_1623422468672_2229_01_000738

INFO  [91] ApplicationImpl: Removing 
container_e1171_1623422468672_2229_01_000738 from application 
application_1623422468672_2229

INFO  [91] ContainersMonitorImpl: Stopping resource-monitoring for 
container_e1171_1623422468672_2229_01_000738

INFO  [163] NodeStatusUpdaterImpl: Removed completed containers from NM 
context:[container_e1171_1623422468672_2229_01_000738]


NM restart happened and recovery is attempted

 

INFO  [1] ContainerManagerImpl: Recovering 
container_e1171_1623422468672_2229_01_000738 in state QUEUED with exit code 
-1000

INFO  [1] ApplicationImpl: Adding container_e1171_1623422468672_2229_01_000738 
to application application_1623422468672_2229

INFO  [89] ContainerImpl: Container 
container_e1171_1623422468672_2229_01_000738 transitioned from NEW to SCHEDULED

INFO  [89] ContainerImpl: Container 
container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to 
KILLING

INFO  [89] ContainerImpl: Container 
container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to 
CONTAINER_CLEANEDUP_AFTER_KILL

Ideally, when the container was killed before the restart, recovery should 
finish the container immediately. 

> Containers going from New to Scheduled transition even though container is 
> killed before NM restart when NM recovery is enabled
> ---
>
> Key: YARN-10822
> URL: https://issues.apache.org/jira/browse/YARN-10822
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
>
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from NEW to 
> LOCALIZING
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from LOCALIZING to 
> SCHEDULED
> INFO  [91] ContainerScheduler: Opportunistic container 
> container_e1171_1623422468672_2229_01_000738 will be queued at the NM.
> INFO  [127] ContainerManagerImpl: Stopping container with container Id: 
> container_e1171_1623422468672_2229_01_000738
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to 
> KILLING
> INFO  [91] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to 
> CONTAINER_CLEANEDUP_AFTER_KILL
> INFO  [91] NMAuditLogger: USER=defaultcafor1stparty OPERATION=Container 
> Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS 
> APPID=application_1623422468672_2229 
> CONTAINERID=container_e1171_1623422468672_2229_01_000738
> INFO  [91] ApplicationImpl: Removing 
> container_e1171_1623422468672_2229_01_000738 from application 
> application_1623422468672_2229
> INFO  [91] ContainersMonitorImpl: Stopping resource-monitoring for 
> container_e1171_1623422468672_2229_01_000738
> INFO  [163] NodeStatusUpdaterImpl: Removed completed containers from NM 
> context:[container_e1171_1623422468672_2229_01_000738]
> NM restart happened and recovery is attempted
>  
> INFO  [1] ContainerManagerImpl: Recovering 
> container_e1171_1623422468672_2229_01_000738 in state QUEUED with exit code 
> -1000
> INFO  [1] ApplicationImpl: Adding 
> container_e1171_1623422468672_2229_01_000738 to application 
> application_1623422468672_2229
> INFO  [89] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from NEW to 
> SCHEDULED
> INFO  [89] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from SCHEDULED to 
> KILLING
> INFO  [89] ContainerImpl: Container 
> container_e1171_1623422468672_2229_01_000738 transitioned from KILLING to 
> 

[jira] [Created] (YARN-10822) Containers going to New to Scheduled transition even though container is killed before NM restart when NM recovery is enabled

2021-06-14 Thread Minni Mittal (Jira)
Minni Mittal created YARN-10822:
---

 Summary: Containers going to New to Scheduled transition even 
though container is killed before NM restart when NM recovery is enabled
 Key: YARN-10822
 URL: https://issues.apache.org/jira/browse/YARN-10822
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Minni Mittal
Assignee: Minni Mittal






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10822) Containers going from New to Scheduled transition even though container is killed before NM restart when NM recovery is enabled

2021-06-14 Thread Minni Mittal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Minni Mittal updated YARN-10822:

Summary: Containers going from New to Scheduled transition even though 
container is killed before NM restart when NM recovery is enabled  (was: 
Containers going to New to Scheduled transition even though container is killed 
before NM restart when NM recovery is enabled)

> Containers going from New to Scheduled transition even though container is 
> killed before NM restart when NM recovery is enabled
> ---
>
> Key: YARN-10822
> URL: https://issues.apache.org/jira/browse/YARN-10822
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Minni Mittal
>Assignee: Minni Mittal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10767) Yarn Logs Command retrying on Standby RM for 30 times

2021-06-14 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363036#comment-17363036
 ] 

Hadoop QA commented on YARN-10767:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
31s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to 
include any new or modified tests. Please justify why no new tests are needed 
for this patch. Also please list what manual steps were performed to verify 
this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 31m 
17s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
32s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m  6s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
42s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 20m 
16s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  1m 
46s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
41s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
42s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
36s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
36s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
23s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
39s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  9s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| 

[jira] [Comment Edited] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362940#comment-17362940
 ] 

Andras Gyori edited comment on YARN-10821 at 6/14/21, 3:32 PM:
---

I am not entirely convinced that this is the best solution to this problem, and 
as user limit is heavily used throughout the entire codebase, I am also unsure 
that it does not break something. Perhaps experts could help here cc [~epayne].


was (Author: gandras):
I am not entirely convinced that this is the best solution to this problem, and 
as user limit is heavily used throughout the entire codebase, I am also not 
sure that it will not break anything. Perhaps experts could help here cc 
[~epayne].

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more pronounced this effect is in 
> preemption.






[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363004#comment-17363004
 ] 

Eric Payne commented on YARN-10821:
---

Thanks [~gandras] for bringing this up. I will take a look.

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more pronounced this effect is in 
> preemption.






[jira] [Commented] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362940#comment-17362940
 ] 

Andras Gyori commented on YARN-10821:
-

I am not entirely convinced that this is the best solution to this problem, and 
as user limit is heavily used throughout the entire codebase, I am also not 
sure that it will not break anything. Perhaps experts could help here cc 
[~epayne].

> User limit is not calculated as per definition for preemption
> -
>
> Key: YARN-10821
> URL: https://issues.apache.org/jira/browse/YARN-10821
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
>
> Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
> uses pending resources to determine the resources needed by a queue, which is 
> calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
> method involves headroom calculated by UsersManager#computeUserLimit. 
> However, the pending resources for preemption are limited in an unexpected 
> fashion.
>  * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
> calculated first:
> {code:java}
>  float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
>  1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
> {code}
>  * In UsersManager#computeUserLimit the userLimit is calculated as is 
> (currentCapacity * userLimit)
> {code:java}
>  Resource userLimitResource = Resources.max(resourceCalculator,
>  partitionResource,
>  Resources.divideAndCeil(resourceCalculator, resourceUsed,
>  usersSummedByWeight),
>  Resources.divideAndCeil(resourceCalculator,
>  Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
>  100));
> {code}
> The fewer users occupying the queue, the more pronounced this effect is in 
> preemption.






[jira] [Created] (YARN-10821) User limit is not calculated as per definition for preemption

2021-06-14 Thread Andras Gyori (Jira)
Andras Gyori created YARN-10821:
---

 Summary: User limit is not calculated as per definition for 
preemption
 Key: YARN-10821
 URL: https://issues.apache.org/jira/browse/YARN-10821
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Reporter: Andras Gyori
Assignee: Andras Gyori


Minimum user limit percent (MULP) is a soft limit by definition. Preemption 
uses pending resources to determine the resources needed by a queue, which is 
calculated in LeafQueue#getTotalPendingResourcesConsideringUserLimit. This 
method involves headroom calculated by UsersManager#computeUserLimit. However, 
the pending resources for preemption are limited in an unexpected fashion.
 * In LeafQueue#getUserAMResourceLimitPerPartition an effective userLimit is 
calculated first:
{code:java}
 float effectiveUserLimit = Math.max(usersManager.getUserLimit() / 100.0f,
 1.0f / Math.max(getAbstractUsersManager().getNumActiveUsers(), 1));
{code}

 * In UsersManager#computeUserLimit the userLimit is calculated as is 
(currentCapacity * userLimit)
{code:java}
 Resource userLimitResource = Resources.max(resourceCalculator,
 partitionResource,
 Resources.divideAndCeil(resourceCalculator, resourceUsed,
 usersSummedByWeight),
 Resources.divideAndCeil(resourceCalculator,
 Resources.multiplyAndRoundDown(currentCapacity, getUserLimit()),
 100));
{code}

The fewer users occupying the queue, the more pronounced this effect is in 
preemption.
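
To make the discrepancy concrete, the following self-contained, illustrative 
snippet (not code from any patch; the 25% limit and the single active user are 
assumed values) plugs the two formulas quoted above into plain Java:

{code:java}
// Illustrative only: the minimum-user-limit-percent value (25) and the single
// active user are assumed, not taken from any cluster.
public class UserLimitExample {
  public static void main(String[] args) {
    float userLimit = 25f;   // minimum-user-limit-percent
    int activeUsers = 1;

    // LeafQueue#getUserAMResourceLimitPerPartition-style clamping
    float effectiveUserLimit =
        Math.max(userLimit / 100.0f, 1.0f / Math.max(activeUsers, 1));

    // UsersManager#computeUserLimit-style raw percentage
    float rawUserLimit = userLimit / 100.0f;

    System.out.println("effective user limit: " + effectiveUserLimit); // 1.0
    System.out.println("raw user limit: " + rawUserLimit);             // 0.25
  }
}
{code}

With one active user the AM path allows the whole queue (1.0), while the headroom 
path still works from the raw 0.25, which is why the effect grows as the number of 
users shrinks.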






[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe

2021-06-14 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362837#comment-17362837
 ] 

Surendra Singh Lilhore commented on YARN-10820:
---

[~Swathi Chandrashekar], I have added you as a contributor and assigned the issue 
to you.

> Make GetClusterNodesRequestPBImpl thread safe
> -
>
> Key: YARN-10820
> URL: https://issues.apache.org/jira/browse/YARN-10820
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: client
>Affects Versions: 3.1.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
>
> yarn node list intermittently fails with the exception below
> {code:java}
> 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on 
> [resourcemanager-1], so propagating back to caller.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>  at java.util.ArrayList.add(ArrayList.java:465)
>  at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.UnsupportedOperationException on 
> [resourcemanager-0], so propagating back to caller.
> Exception in thread "main" java.lang.UnsupportedOperationException
> at 
> java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> 
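
(The quoted stack trace is truncated in the digest.) For illustration only, the 
following self-contained sketch mimics the shape of the failure and of the fix 
asked for in the summary: a request object that lazily merges its fields and is 
shared by the hedging proxy's parallel calls becomes safe once the merge path is 
synchronized. None of this is YARN code; all names are invented for the example.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadSafeMergeSketch {
  private final List<String> nodeStates = new ArrayList<>();
  private List<String> merged;               // built lazily, like the proto builder

  // Synchronizing the lazy merge is the essence of the proposed fix: without it,
  // two hedged callers can interleave here and call addAll on the same list,
  // duplicating or corrupting it, which is the kind of interleaving behind the
  // reported ArrayIndexOutOfBoundsException.
  public synchronized List<String> getSnapshot() {
    if (merged == null) {
      merged = new ArrayList<>();
      merged.addAll(nodeStates);             // analogous to addAllNodeStates(...)
    }
    return merged;
  }

  public synchronized void addNodeState(String state) {
    nodeStates.add(state);
    merged = null;                           // force a re-merge on next snapshot
  }

  public static void main(String[] args) throws Exception {
    ThreadSafeMergeSketch request = new ThreadSafeMergeSketch();
    request.addNodeState("RUNNING");
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int i = 0; i < 4; i++) {
      pool.submit(request::getSnapshot);     // hedged RM calls share one request
    }
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println(request.getSnapshot());
  }
}
{code}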

[jira] [Assigned] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe

2021-06-14 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore reassigned YARN-10820:
-

Assignee: SwathiChandrashekar  (was: Prabhu Joseph)

> Make GetClusterNodesRequestPBImpl thread safe
> -
>
> Key: YARN-10820
> URL: https://issues.apache.org/jira/browse/YARN-10820
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: client
>Affects Versions: 3.1.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: SwathiChandrashekar
>Priority: Major
>
> yarn node list intermittently fails with the exception below
> {code:java}
> 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on 
> [resourcemanager-1], so propagating back to caller.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>  at java.util.ArrayList.add(ArrayList.java:465)
>  at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.UnsupportedOperationException on 
> [resourcemanager-0], so propagating back to caller.
> Exception in thread "main" java.lang.UnsupportedOperationException
> at 
> java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> 

[jira] [Commented] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values

2021-06-14 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362831#comment-17362831
 ] 

Benjamin Teke commented on YARN-10802:
--

Hi [~snemeth], thanks for checking this.
1. I fixed most of the checkstyle issues; two remain, but fixing them would require 
a larger effort than the patch itself and they are unrelated, so if that's okay I 
would skip them.
2. The UT failure seems unrelated; it happened 
[here|https://issues.apache.org/jira/browse/YARN-10726?jql=project%20%3D%20YARN%20AND%20text%20~%20%22TestCapacitySchedulerAsyncScheduling%22%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC]
 as well.

> Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
> -
>
> Key: YARN-10802
> URL: https://issues.apache.org/jira/browse/YARN-10802
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10802.001.patch, YARN-10802.002.patch, 
> YARN-10802.003.patch, YARN-10802.004.patch
>
>
> Capacity Scheduler's minimum-user-limit-percent only accepts integers, which 
> means at most 100 users can use a single queue fairly. Using decimal values 
> could solve this problem.
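
As a rough illustration of the direction (a sketch under assumptions, not the 
attached patch), reading the property as a float is enough to allow values below 
1%, for example 0.5 for up to 200 concurrent users. The per-queue key below 
follows the usual yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent 
pattern, and the clamping mirrors the existing integer behaviour:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class DecimalUserLimitSketch {
  // Sketch only: reads the per-queue minimum-user-limit-percent as a float.
  static float getUserLimit(Configuration conf, String queuePath) {
    String key = "yarn.scheduler.capacity." + queuePath
        + ".minimum-user-limit-percent";
    // getFloat instead of getInt is the essence of the change sketched here.
    float value = conf.getFloat(key, 100.0f);
    return Math.min(Math.max(value, 0.0f), 100.0f);   // keep it in [0, 100]
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.setFloat("yarn.scheduler.capacity.root.default.minimum-user-limit-percent",
        0.5f);                                         // ~200 users share the queue
    System.out.println(getUserLimit(conf, "root.default"));
  }
}
{code}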






[jira] [Updated] (YARN-10802) Change Capacity Scheduler minimum-user-limit-percent to accept decimal values

2021-06-14 Thread Benjamin Teke (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Teke updated YARN-10802:
-
Attachment: YARN-10802.004.patch

> Change Capacity Scheduler minimum-user-limit-percent to accept decimal values
> -
>
> Key: YARN-10802
> URL: https://issues.apache.org/jira/browse/YARN-10802
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Benjamin Teke
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10802.001.patch, YARN-10802.002.patch, 
> YARN-10802.003.patch, YARN-10802.004.patch
>
>
> Capacity Scheduler's minimum-user-limit-percent only accepts integers, which 
> means at most 100 users can use a single queue fairly. Using decimal values 
> could solve this problem.






[jira] [Commented] (YARN-10813) Root queue capacity is not set when using node labels

2021-06-14 Thread Benjamin Teke (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362825#comment-17362825
 ] 

Benjamin Teke commented on YARN-10813:
--

Thanks [~gandras] for the patch, this indeed seems to be a bug. LGTM 
(non-binding).

> Root queue capacity is not set when using node labels
> -
>
> Key: YARN-10813
> URL: https://issues.apache.org/jira/browse/YARN-10813
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10813.001.patch
>
>
> CapacitySchedulerConfiguration#getNonLabeledQueueCapacity handles root in the 
> following way:
> {code:java}
> if (absoluteResourceConfigured || configuredWeightAsCapacity(
> configuredCapacity)) {
>   // Return capacity in percentage as 0 for non-root queues and 100 for
>   // root.From AbstractCSQueue, absolute resource will be parsed and
>   // updated. Once nodes are added/removed in cluster, capacity in
>   // percentage will also be re-calculated.
>   return queue.equals("root") ? 100.0f : 0f;
> }
> {code}
> CapacitySchedulerConfiguration#internalGetLabeledQueueCapacity on the other 
> hand does not take root queue into consideration:
> {code:java}
> if (absoluteResourceConfigured || configuredWeightAsCapacity(
> configuredCapacity)) {
>   // Return capacity in percentage as 0 for non-root queues and 100 for
>   // root.From AbstractCSQueue, absolute resource, and weight will be 
> parsed
>   // and updated separately. Once nodes are added/removed in cluster,
>   // capacity in percentage will also be re-calculated.
>   return defaultValue;
> }
> float capacity = getFloat(capacityPropertyName, defaultValue);
> {code}
> Due to this, the labeled root capacity is 0, and it is not set in 
> AbstractCSQueue#derivedCapacityFromAbsoluteConfigurations either, because root is 
> never in Absolute mode.
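
A minimal sketch of one possible fix, reusing the names from the snippets above 
and not claiming to be the attached patch, is to give the labeled path the same 
root special case that the non-labeled path already has:

{code:java}
// Sketch only: inside internalGetLabeledQueueCapacity, mirror the non-labeled
// handling so a weight/absolute-resource root queue reports 100 for labeled
// partitions as well, instead of falling back to the 0 default.
if (absoluteResourceConfigured || configuredWeightAsCapacity(configuredCapacity)) {
  return queue.equals("root") ? 100.0f : defaultValue;
}
float capacity = getFloat(capacityPropertyName, defaultValue);
{code}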






[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-14 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362820#comment-17362820
 ] 

Tarun Parimi commented on YARN-10789:
-

Thanks [~snemeth] for the review and commit. Thanks [~bteke] and [~zhuqi] for 
your reviews.

We can backport it to the 3.3/3.2 branches. The trunk patch applies cleanly on 
3.3; I will add a patch for 3.2.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch
>
>
> We are observing the below error randomly during Hadoop installation and initial 
> RM startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
> configured. This causes one of the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that the ZKConfigurationStore is 
> initialized when CapacityScheduler does a serviceInit, and serviceInit is 
> done by both the Active and the Standby RM. So we can run into a race condition 
> where both the Active and the Standby try to create the same znode when both 
> RMs are started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating the 
> znodes only after serviceStart. serviceStart only happens for the active RM 
> that won the leader election, unlike serviceInit, which happens irrespective 
> of the leader election.
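
A minimal sketch of the idea, using the plain ZooKeeper client API rather than 
the helper classes the real store goes through, and not the attached patch: 
tolerate the znode already existing, since a NodeExistsException here simply 
means the other RM (or an earlier attempt) won the race.

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: creating the config-store root in a way that is safe even if
// both RMs attempt it at the same time.
class ConfStoreInitSketch {
  static void ensureConfStoreRoot(ZooKeeper zk, String confStorePath)
      throws KeeperException, InterruptedException {
    try {
      zk.create(confStorePath, new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException e) {
      // The other RM (or an earlier attempt) already created the znode;
      // treat this as success instead of failing service init.
    }
  }
}
{code}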






[jira] [Commented] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple

2021-06-14 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362752#comment-17362752
 ] 

Tarun Parimi commented on YARN-10816:
-

Thanks [~snemeth] for the review and commit.

> Avoid doing delegation token ops when 
> yarn.timeline-service.http-authentication.type=simple
> ---
>
> Key: YARN-10816
> URL: https://issues.apache.org/jira/browse/YARN-10816
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.4.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10816.001.patch, YARN-10816.002.patch
>
>
> YARN-10339 introduced changes to ensure that PseudoAuthenticationHandler is 
> used in TimelineClient when 
> yarn.timeline-service.http-authentication.type=simple
> PseudoAuthenticationHandler doesn't support delegation token ops like get, 
> renew and cancel since those ops strictly require SPNEGO auth to work. We 
> don't use timeline delegation tokens when simple auth is used.
> Prior to YARN-10339, Timeline delegation tokens were unnecessarily used when 
> yarn.timeline-service.http-authentication.type=simple but Hadoop security was 
> enabled. After YARN-10339, the tokens are not used when 
> yarn.timeline-service.http-authentication.type=simple.
> In a rolling upgrade scenario, a client that doesn't have the YARN-10339 
> changes can submit an application and request a Timeline delegation token even 
> when yarn.timeline-service.http-authentication.type=simple. The RM, on the 
> other hand, can have the YARN-10339 changes and will therefore fail with an 
> error while trying to renew the token with PseudoAuthenticationHandler.
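
A minimal sketch of the guard this describes (illustrative only; the property 
name is the one quoted above, the class and method names are invented for the 
example):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch only: decide whether timeline delegation-token get/renew/cancel should
// be attempted at all.
class TimelineTokenGuardSketch {
  static boolean useTimelineDelegationTokens(Configuration conf) {
    String authType =
        conf.get("yarn.timeline-service.http-authentication.type", "simple");
    // Token ops need SPNEGO on the server side, so only attempt them when
    // security is enabled AND the timeline auth type is not "simple".
    return UserGroupInformation.isSecurityEnabled() && !"simple".equals(authType);
  }
}
{code}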






[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe

2021-06-14 Thread Prabhu Joseph (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362724#comment-17362724
 ] 

Prabhu Joseph commented on YARN-10820:
--

Hi [~bibinchundatt], could you please add [~Swathi Chandrashekar] as a 
contributor? Thanks.

> Make GetClusterNodesRequestPBImpl thread safe
> -
>
> Key: YARN-10820
> URL: https://issues.apache.org/jira/browse/YARN-10820
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: client
>Affects Versions: 3.1.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> yarn node list intermittently fails with the exception below
> {code:java}
> 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on 
> [resourcemanager-1], so propagating back to caller.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>  at java.util.ArrayList.add(ArrayList.java:465)
>  at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.UnsupportedOperationException on 
> [resourcemanager-0], so propagating back to caller.
> Exception in thread "main" java.lang.UnsupportedOperationException
> at 
> java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> 

[jira] [Commented] (YARN-10820) Make GetClusterNodesRequestPBImpl thread safe

2021-06-14 Thread Swathi Chandrashekar (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362722#comment-17362722
 ] 

Swathi Chandrashekar commented on YARN-10820:
-

Hi Prabhu, can you please assign it to me?

> Make GetClusterNodesRequestPBImpl thread safe
> -
>
> Key: YARN-10820
> URL: https://issues.apache.org/jira/browse/YARN-10820
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: client
>Affects Versions: 3.1.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> yarn node list intermittently fails with the exception below
> {code:java}
> 2021-06-13 11:26:42,316 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.ArrayIndexOutOfBoundsException: 1 on 
> [resourcemanager-1], so propagating back to caller.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>  at java.util.ArrayList.add(ArrayList.java:465)
>  at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
>  at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>  at com.sun.proxy.$Proxy8.getClusterNodes(Unknown Source)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.yarn.client.RequestHedgingRMFailoverProxyProvider$RMRequestHedgingInvocationHandler$1.call(RequestHedgingRMFailoverProxyProvider.java:159)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-06-13 11:27:58,415 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception: java.lang.UnsupportedOperationException on 
> [resourcemanager-0], so propagating back to caller.
> Exception in thread "main" java.lang.UnsupportedOperationException
> at 
> java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$GetClusterNodesRequestProto$Builder.addAllNodeStates(YarnServiceProtos.java:28009)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToBuilder(GetClusterNodesRequestPBImpl.java:124)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.mergeLocalToProto(GetClusterNodesRequestPBImpl.java:82)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.GetClusterNodesRequestPBImpl.getProto(GetClusterNodesRequestPBImpl.java:56)
> at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterNodes(ApplicationClientProtocolPBClientImpl.java:329)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
>