[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848123#comment-17848123
 ] 

Wilfred Spiegelenburg commented on YARN-11697:
--

You need to figure out why you get two remove events in a row for the same 
application. This code has not changed in multiple years. If this was really a 
big issue we would have seen it happen more often, and years ago.

Try to reproduce without the backports and see if it still happens. You might 
have backported changes that are not compatible and cause side effects.

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there are still some edge cases not covered by YARN-5136.
> 1. On a deeper look, I could see that, as mentioned in the comment here, if a 
> moveApplication and a removeApplicationAttempt for the same attempt are 
> processed in short succession, the application attempt will still contain a 
> queue reference but has already been removed from the queue's list of 
> applications.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, I could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  a writeLock on the appAttempt is taken, whereas for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  I don't see any writeLock being taken, which can result in a race condition 
> if the same appAttempt is being processed.
> 4. Additionally, as mentioned in the comment here, when such a scenario 
> occurs we ideally should not take down the RM.
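
For point 4, one possible direction (a minimal sketch under the assumption that 
log-and-ignore is acceptable; this is not the actual FSLeafQueue code and not 
necessarily the right fix) would be to treat removing an application that is no 
longer in the queue as a no-op instead of throwing the IllegalStateException 
that takes the RM down:
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class LeafQueueSketch {
  private final List<String> runnableApps = new CopyOnWriteArrayList<>();

  // Returns true if the attempt was present and removed, false if it was
  // already gone (for example removed by an earlier move/remove), without
  // throwing and without bringing the RM down.
  boolean removeApp(String appAttemptId) {
    boolean removed = runnableApps.remove(appAttemptId);
    if (!removed) {
      System.out.println("Ignoring remove for " + appAttemptId
          + ": it is no longer present in this queue");
    }
    return removed;
  }
}
{code}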






[jira] [Commented] (YARN-11697) Fix fair scheduler race condition in removeApplicationAttempt and moveApplication

2024-05-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848072#comment-17848072
 ] 

Wilfred Spiegelenburg commented on YARN-11697:
--

The stack trace does not correspond to hadoop 3.2.1: 
[FairScheduler.java:757|https://github.com/apache/hadoop/blob/branch-3.2.1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L757]

That line number points to the following code in Hadoop 3.2.1, which is part of 
completedContainerInternal:
{code:java}
755        application.containerCompleted(rmContainer, containerStatus, event);
756        if (node != null) {
757          node.releaseContainer(rmContainer.getContainerId(), false);
758        } else if (LOG.isDebugEnabled()) {
759          LOG.debug("Skipping container release on removed node: " + nodeID);
760        }
{code}
The comments in moveApplication around locking the app attempt are about 
scheduling: an application could be scheduled while being moved, and that needs 
to be stopped. The removal of an application attempt takes a write lock on the 
scheduler itself, the same as the move does. So a moveApplication and a 
removeApplicationAttempt cannot happen at the same time; they both need that 
lock and are serialised.

I think you are looking at the wrong thing and a move is not involved.
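
As a rough illustration of that locking argument, a minimal sketch (using a 
plain ReentrantReadWriteLock; the class and method bodies are hypothetical and 
not the actual FairScheduler code) of how taking the same scheduler-wide write 
lock in both code paths serialises them:
{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SchedulerLockSketch {
  private final ReentrantReadWriteLock schedulerLock = new ReentrantReadWriteLock();

  void removeApplicationAttempt(String appAttemptId) {
    schedulerLock.writeLock().lock();
    try {
      // remove the attempt from its queue
    } finally {
      schedulerLock.writeLock().unlock();
    }
  }

  void moveApplication(String appId, String targetQueue) {
    // same lock: this cannot interleave with a remove that is in progress
    schedulerLock.writeLock().lock();
    try {
      // detach from the source queue, attach to the target queue
    } finally {
      schedulerLock.writeLock().unlock();
    }
  }
}
{code}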

> Fix fair scheduler race condition in removeApplicationAttempt and 
> moveApplication
> -
>
> Key: YARN-11697
> URL: https://issues.apache.org/jira/browse/YARN-11697
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.1
>Reporter: Syed Shameerur Rahman
>Assignee: Syed Shameerur Rahman
>Priority: Major
>
> For Hadoop version 3.2.1, the ResourceManager (RM) restarts frequently with 
> the following exception
> {code:java}
> 2024-03-11 04:41:29,329 FATAL org.apache.hadoop.yarn.event.EventDispatcher 
> (SchedulerEventDispatcher:Event Processor): Error in handling event type 
> APP_ATTEMPT_REMOVED to the Event Dispatcher
> java.lang.IllegalStateException: Given app to remove 
> appattempt_1706879498319_86660_01 Alloc:  does not 
> exist in queue [root, demand=, 
> running=, share=, w=1.0]
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.removeApp(FSLeafQueue.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplicationAttempt(FairScheduler.java:757)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1378)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:139)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> at java.lang.Thread.run(Thread.java:750)
> {code}
> The exception seems similar to the one mentioned in YARN-5136, but it looks 
> like there are still some edge cases not covered by YARN-5136.
> 1. On a deeper look, I could see that, as mentioned in the comment here, if a 
> moveApplication and a removeApplicationAttempt for the same attempt are 
> processed in short succession, the application attempt will still contain a 
> queue reference but has already been removed from the queue's list of 
> applications.
> 2. This can happen when 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1908]
>  removes the appAttempt from the queue and 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L707]
>  also tries to remove the same appAttempt from the queue.
> 3. On further checking, I could see that before doing 
> [moveApplication|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1779]
>  a writeLock on the appAttempt is taken, whereas for 
> [removeApplicationAttempt|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L665]
>  I don't see any writeLock being taken, which can result in a race condition 
> if the same appAttempt is being processed.
> 4. Additionally 

[jira] [Commented] (YARN-10966) nodeUpdate will throw an NPE when a node transitions from decommissioning to decommissioned at the same time

2021-10-20 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432150#comment-17432150
 ] 

Wilfred Spiegelenburg commented on YARN-10966:
--

I can see the issue at a slightly different point than we had seen it happening 
before. It is similar to what was seen in YARN-4677.

Instead of adding the RMNode to the method calls just to get the node ID, can we 
not pass in just the node ID? It is the only part needed for further calls and 
can also be used for logging.

Can you also look at the other schedulers (Fifo and Fair)? The test you extended 
for the capacity scheduler also exists in those schedulers; we should not break 
them and they should have the same tests.
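
To make the suggestion concrete, a minimal sketch (hypothetical signatures with 
plain Strings instead of the real NodeId/RMNode types; not the actual 
AbstractYarnScheduler code) of passing only the node ID and guarding against a 
node that was removed concurrently:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NodeUpdateSketch {
  // hypothetical registry of live nodes, keyed by node ID
  private final Map<String, Object> nodes = new ConcurrentHashMap<>();

  void containerLaunchedOnNode(String containerId, String nodeId) {
    Object node = nodes.get(nodeId);
    if (node == null) {
      // the node was decommissioned/removed between the heartbeat and this
      // call: log and skip instead of dereferencing a null node (the NPE above)
      System.out.println("Skipping launched container " + containerId
          + " on removed node " + nodeId);
      return;
    }
    // normal handling, using only nodeId for further lookups and logging
  }
}
{code}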

> nodeUpdate will throw an NPE when a node transitions from decommissioning to 
> decommissioned at the same time
> ---
>
> Key: YARN-10966
> URL: https://issues.apache.org/jira/browse/YARN-10966
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 3.1.1, 3.2.1, 3.3.1
>Reporter: tuyu
>Priority: Major
> Fix For: 3.1.1, 3.2.1
>
> Attachments: YARN-10966.001.patch
>
>
> [YARN-4677|https://issues.apache.org/jira/browse/YARN-4677] fixed a race 
> condition, but the fix is not complete: it will still cause an NPE when 
> containerLaunchedOnNode calls node.getNodeID and the node is null. 
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.containerLaunchedOnNode(AbstractYarnScheduler.java:366)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNewContainerInfo(AbstractYarnScheduler.java:1029)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.nodeUpdate(AbstractYarnScheduler.java:1130)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1938)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler.testRemovedNodeDecomissioningNode
> {code}






[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-26 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332842#comment-17332842
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

OK, in that case we're ready to commit. +1 from my side.

[~snemeth]: I do think that we need to add a release note for this as it is a 
difference in behaviour. Do we also need to make sure, before we commit this, 
that YARN-8951 works, or at least does not throw an NPE and take down the RM?

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-7769.001.patch, YARN-7769.002.patch, 
> YARN-7769.003.patch
>
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.






[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-25 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331726#comment-17331726
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

Are we pushing the documentation in this Jira or are we opening up a new one? 
If we do a new Jira to fix the docs, we're OK to commit; otherwise we need to 
get the documentation update added in this Jira before we commit.

[~bteke] & [~snemeth] any preference from your side?

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-7769.001.patch, YARN-7769.002.patch, 
> YARN-7769.003.patch
>
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.






[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325396#comment-17325396
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

Code change looks good. This does however also impact the 
[documentation|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html]:
 * {{By default, all users share a single queue, named “default”.}}
 * {{user-as-default-queue}} says it falls back to the default queue
 * {{allow-undeclared-pools}} is another point that mentions the default queue

The last impact is on the default placement rule. Not sure if that is just a 
documentation change or needs more.
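
To make the behaviour change behind these documentation points concrete, a 
minimal sketch (hypothetical names, not the actual QueueManager code) of 
creating the default queue lazily, only when a placement actually resolves to 
it, instead of at init:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LazyQueueManagerSketch {
  private final Map<String, Object> leafQueues = new ConcurrentHashMap<>();

  // Nothing is created at init time; "root.default" (or whatever name the
  // Default placement rule is configured with) only appears on first use.
  Object getOrCreateLeafQueue(String queueName) {
    return leafQueues.computeIfAbsent(queueName, name -> new Object() /* leaf queue */);
  }
}
{code}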

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-7769.001.patch, YARN-7769.002.patch, 
> YARN-7769.003.patch
>
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.






[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-13 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320581#comment-17320581
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

These test failures are all related to the change. Please check them: the tests 
assumed a default queue to be available, and it no longer is.

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-7769.001.patch
>
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.






[jira] [Commented] (YARN-7769) FS QueueManager should not create default queue at init

2021-04-12 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319902#comment-17319902
 ] 

Wilfred Spiegelenburg commented on YARN-7769:
-

No problem with you taking over.

Let me know if you need me to review for you.

> FS QueueManager should not create default queue at init
> ---
>
> Key: YARN-7769
> URL: https://issues.apache.org/jira/browse/YARN-7769
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Benjamin Teke
>Priority: Major
>
> Currently the FairScheduler QueueManager automatically creates the default 
> queue. However the default queue does not need to exist. We have two possible 
> cases which we should handle:
> * Based on the placement rule "Default" the name for the default queue might 
> not be default and it should be created with a different name
> * There might not be a "Default" placement rule at all which removes the need 
> to create the queue.
> We should leave the creation of the default queue to the point in time that 
> we can assess if it is needed or not.






[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-16 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302978#comment-17302978
 ] 

Wilfred Spiegelenburg commented on YARN-10652:
--

Thank you to [~sahuja] for the fix, and to all ([~snemeth] , [~shuzirra] , 
[~gandras] & [~pbacsko]) for the discussion and resolution around this jira.

I committed to trunk with a comment in the commit message:
{quote}This only fixes the user name resolution for weights in the queues. It 
does not add generic support for user names with dots in all use cases in the 
capacity scheduler.
{quote}

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html






[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-03-07 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297089#comment-17297089
 ] 

Wilfred Spiegelenburg commented on YARN-10652:
--

I completely agree with your assessment [~pbacsko]. This is nowhere near a full 
fix for the dot problem at all. That needs to be tackled one issue at a time. 
We should not do all of it now. We can take multiple jiras to fix these issues.

I thus second the case-by-case solution approach. Fixing this one outside of 
placement rule changes is one step. Introducing a standard way for the property 
resolution for all properties that use the user name would be a *nice* to 
have, again not needed now. The property introduced for max apps resolves 
without an issue even with dots in the name. 

Placement rules are complex, I would not recommend that this Jira should look 
at it at all. 

[~snemeth] & [~shuzirra]: based on the fact that we need to fix this 
irrespective of what is done in placement rules I would like to proceed with 
the commit for this. The change allows the administrator to just use the 
existing user name in the configuration, similar to the "max-parallel-apps" 
setting. When and if a solution is implemented for the placement rules to 
support dots in user and group names, which are part of the queue path, new 
fixes might be needed for this issue and YARN-9930. We might even leave these 
two as is. That is not a decision we need to make now.

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html






[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-02-28 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292620#comment-17292620
 ] 

Wilfred Spiegelenburg commented on YARN-10652:
--

I do not see the relation with placement rules or the FS for this fix at all.

The weight of a queue can be used with or without a placement rule. It is a 
setting on a queue. Having this setting on a user based queue also does not 
really make sense. It gives more resources to one user over others in the 
queue. So I can see this being used on a parent queue or a group queue but not 
on an individuals leaf queue. On top of that the queue path of the 
configuration setting is not affected by this change. We're talking about this 
setting:
{code:java}
yarn.scheduler.capacity.<queue-path>.user-settings.<username>.weight{code}
The weight is retrieved for a specific queue as defined in the _<queue-path>_. 
That part is already resolved and is not changed. The only resolution that is 
changed is the _<username>_ part between _user-settings._ and _.weight_. The 
queue path could be anything and is not in play here. It could even be a fixed 
configured queue or one mapped on a group name.

The administrator should need to know as little as possible, preferably nothing, 
about the internals of how users are stored in the CS. If the queue mapping rule 
for the user changes the dots to make it a single part of the queue path then 
that is independent of this change. It still does not change the way the user is 
stored in the CS. It changes the way you map a user to a queue in the placement 
rules.

On the FS side we thought about standardising dot usage. We considered both 
cases, using and not using a _dot_ in user names in the config files. When I looked at 
it I was not sure which was the correct solution. It could lead to strange 
behaviour and extra administrative work. The admin forgets to remove the dot 
and all of a sudden the config does not apply. That is why it never went 
further than just the Jira YARN-5674.

With this change as proposed you will support weights for all users with a dot, 
except for a user called something_.weights_. That will be the only user set 
that breaks, which is far less than breaking all users with a dot in the 
username. I do not see any other user-name bound properties in the 
configuration at the moment.

If you want to solve the generic dot issue for user based placement then that 
is outside of this change. 
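
As a small illustration of why the description proposes {{\S+}} over {{\w+}} for 
exactly this property (the pattern and class names here are illustrative, not 
the actual CapacitySchedulerConfiguration code):
{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserWeightPatternSketch {
  public static void main(String[] args) {
    String key =
        "yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight";

    // Word characters only: cannot cross the dot, so the key never matches.
    Matcher wordOnly = Pattern.compile("user-settings\\.(\\w+)\\.weight").matcher(key);
    // Non-whitespace characters: user names containing dots are accepted.
    Matcher nonSpace = Pattern.compile("user-settings\\.(\\S+)\\.weight").matcher(key);

    System.out.println("\\w+ finds a user: " + wordOnly.find());   // false
    System.out.println("\\S+ finds a user: "
        + (nonSpace.find() ? nonSpace.group(1) : "none"));          // firstname.lastname
  }
}
{code}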

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html






[jira] [Commented] (YARN-10652) Capacity Scheduler fails to handle user weights for a user that has a "." (dot) in it

2021-02-25 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291339#comment-17291339
 ] 

Wilfred Spiegelenburg commented on YARN-10652:
--

Change looks good +1 (binding)

I'll let it sit for a day or so for other people to have a look at this too. I 
will commit if there are no comments in the next day or so.

> Capacity Scheduler fails to handle user weights for a user that has a "." 
> (dot) in it
> -
>
> Key: YARN-10652
> URL: https://issues.apache.org/jira/browse/YARN-10652
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Major
> Attachments: Correct user weight of 0.76 picked up for the user with 
> a dot after the patch.png, Incorrect default user weight of 1.0 being picked 
> for the user with a dot before the patch.png, YARN-10652.001.patch
>
>
> AD usernames can have a "." (dot) in them i.e. they can be of the format -> 
> {{firstname.lastname}}. However, if you specify a username with this format 
> against the Capacity Scheduler setting -> 
> {{yarn.scheduler.capacity.root.default.user-settings.firstname.lastname.weight}},
>  it fails to be applied and is instead assigned the default of 1.0f weight. 
> This renders the user weight feature (being used as a means of setting user 
> priorities for a queue) unusable for such users.
> This limitation comes from [1]. From [1], only word characters (A word 
> character: [a-zA-Z_0-9]) (see [2]) are permissible at the moment which is no 
> good for AD names that contain a "." (dot).
> Similar discussion has been had in a few HADOOP jiras e.g. HADOOP-7050 and 
> HADOOP-15395 and the outcome was to use non-whitespace characters i.e. 
> instead of {{\w+}}, use {{\S+}}.
> We could go down a similar path and unblock this feature for the AD usernames 
> with a "." (dot) in them.
> [1] 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java#L1953
> [2] 
> https://docs.oracle.com/javase/tutorial/essential/regex/pre_char_classes.html






[jira] [Resolved] (YARN-10290) ResourceManager recovery failed when fair scheduler queue ACL changed

2020-06-01 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10290.
--
Resolution: Duplicate

This issue is fixed in YARN-7913.
That change fixes a number of issues around restores that fail.

The change was not backported to Hadoop 2.x

> ResourceManager recovery failed when fair scheduler queue ACL changed
> 
>
> Key: YARN-10290
> URL: https://issues.apache.org/jira/browse/YARN-10290
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.2
>Reporter: yehuanhuan
>Priority: Blocker
>
> ResourceManager recovery fails when the fair scheduler queue ACL has changed. 
> Because the queue ACL changed, recovering the application (addApplication() in 
> FairScheduler) is rejected. Recovering the application attempt 
> (addApplicationAttempt() in FairScheduler) then finds the application is null. 
> This leads to both RMs ending up in standby. Reproduce as follows:
>  
> # A user runs a long-running application.
> # Change the queue ACL (aclSubmitApps) so that the user no longer has permission.
> # Restart the RM.
> {code:java}
> 2020-05-25 16:04:06,191 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating 
> application application_1590393162216_0005 with final state: FAILED
> 2020-05-25 16:04:06,192 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to 
> load/recover state
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:663)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1246)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:116)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1072)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1036)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:845)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:102)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:897)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:850)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:723)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:322)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:427)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1173)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:584)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:980)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1021)
> at 
> 

[jira] [Commented] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-04-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091044#comment-17091044
 ] 

Wilfred Spiegelenburg commented on YARN-10063:
--

Sorry for that, and thank you for reverting, [~weichiu]. I had reverted the fix after 
the commit as per the 
[comment|https://issues.apache.org/jira/browse/YARN-10063?focusedCommentId=17077798=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17077798]
 above, but I either did not push it or the push failed. 
I have the revert in my local branch-3.2 :-(

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Fix For: 3.3.0, 3.4.0
>
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch, 
> YARN-10063.003.patch, YARN-10063.004.patch
>
>
> YARN-8448/YARN-6586 introduced new options - "\--http" (default) and 
> "\--https" - that can be passed in to the container-executor binary, see:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.






[jira] [Resolved] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-04-07 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10063.
--
Fix Version/s: 3.4.0
   3.3.0
   Resolution: Fixed

Committed to 3.3.

Removed the backport to the 3.2 branch: YARN-6586 was not backported to 3.2.

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Fix For: 3.3.0, 3.4.0
>
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch, 
> YARN-10063.003.patch, YARN-10063.004.patch
>
>
> YARN-8448/YARN-6586 introduced new options - "\--http" (default) and 
> "\--https" - that can be passed in to the container-executor binary, see:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.






[jira] [Resolved] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2020-03-17 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-9940.
-
Resolution: Not A Problem

This issue is fixed in later versions via YARN-8373. It does not exist in the 
version this issue is logged against.

The custom code that caused the issue to show up is a mix of Hadoop 2.7 and 
Hadoop 2.9.

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Assignee: kailiu_dev
>Priority: Major
> Attachments: YARN-9940-branch-2.7.2.001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)






[jira] [Commented] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-03-12 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058354#comment-17058354
 ] 

Wilfred Spiegelenburg commented on YARN-10063:
--

+1 pending build

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch, 
> YARN-10063.003.patch
>
>
> YARN-8448/YARN-6586 introduced new options - "\--http" (default) and 
> "\--https" - that can be passed in to the container-executor binary, see:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.






[jira] [Commented] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-03-12 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058334#comment-17058334
 ] 

Wilfred Spiegelenburg commented on YARN-10063:
--

Can you check this line:
{code}
"%11s launch docker container:%2d appid containerid workdir "
{code}
The patch removes the spacing before the modifier {{%2d}}, which is used to 
line up the numeric option with the rest of the commands. I don't think that 
should be removed.

The rest of the change looks OK.
Waiting for a build to run.
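
For what it is worth, the width modifier behaves the same way in Java's 
String.format/printf, which makes the alignment effect easy to see (illustrative 
only; the real usage text lives in the C source of container-executor):
{code:java}
public class UsageAlignmentSketch {
  public static void main(String[] args) {
    // %2d right-aligns one- and two-digit option codes; the manual spacing
    // before it keeps the numeric column lined up across commands.
    System.out.printf("%11s launch container:        %2d appid containerid workdir%n", "", 4);
    System.out.printf("%11s launch docker container: %2d appid containerid workdir%n", "", 10);
  }
}
{code}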

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch, 
> YARN-10063.003.patch
>
>
> YARN-8448/YARN-6586 introduced new options - "\--http" (default) and 
> "\--https" - that can be passed in to the container-executor binary, see:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.






[jira] [Resolved] (YARN-10182) SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM

2020-03-05 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-10182.
--
Resolution: Fixed

Jira is not the correct place to ask questions on how to set up or run things. 
Please use the mailing lists if you need help: u...@hadoop.apache.org

> SLS run fails with error: Couldn't create /yarn-leader-election/yarnRM
> ---
>
> Key: YARN-10182
> URL: https://issues.apache.org/jira/browse/YARN-10182
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
> Environment: Cloudera Express 6.0.0
> RM1 :active RM2:standby
> kerberos is on
> yarn-site.xml : /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===>the keytab of yarn
> when I run slsrun.sh ,I will get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml ,I will get  "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How to resolve it ?
>  
>Reporter: zhangyu
>Priority: Major
> Attachments: slsrun.log.txt
>
>
> RM1 :active RM2:standby
> kerberos is on
> yarn-site.xml : /etc/hadoop/conf.cloudera.yarn/yarn-site.xml
> keytab: /etc/krb5.keytab ===>the keytab of yarn
> when I run slsrun.sh on RM1 ,I will get an error:
> Exception in thread "main" org.apache.hadoop.service.ServiceStateException: 
> java.io.IOException: Couldn't create /yarn-leader-election/yarnRM
> If I use sample-conf/yarn-site.xml ,I will get "KerberosAuthException: Login 
> failure for user: yarn from keytab /etc/krb5.keytab 
> javax.security.auth.login.LoginException: Unable to obtain password from user"
> How to resolve it ?






[jira] [Commented] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-25 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044996#comment-17044996
 ] 

Wilfred Spiegelenburg commented on YARN-10124:
--

Ah OK, that makes sense now.
Based on the testing you have done it looks like we're all set and do not have 
any side effects from the change.

Two little nits need to be fixed before we can commit this: can we clean up 
the message generation and remove the surplus string concatenation in these 
two spots:
{code}
196 throw new IllegalArgumentException("Illegal" + " capacity of "
{code}
and
{code}
229 "Illegal" + " capacity of " + sum + " for children of queue "
{code}

Besides that: +1
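
For reference, the cleaned-up message construction could look something like 
this (a sketch only; the surrounding check and variable names are illustrative, 
not the actual ParentQueue code):
{code:java}
class CapacityCheckSketch {
  // One string literal instead of the surplus "Illegal" + " capacity of "
  // concatenation; the check itself is only an illustrative placeholder.
  static void checkChildCapacities(float sum, String queuePath) {
    if (sum > 1.0f) {
      throw new IllegalArgumentException(
          "Illegal capacity of " + sum + " for children of queue " + queuePath);
    }
  }
}
{code}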

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)






[jira] [Commented] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-25 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044286#comment-17044286
 ] 

Wilfred Spiegelenburg commented on YARN-10124:
--

1: This point by itself sounds good; we want the child queues to get resources 
and run the apps until they finish.

2: If, when 0 is set for the parent, it is always above the hard limit, then how 
does point 1 work? I would have expected that the limit is enforced in the 
hierarchy. Can you please explain a bit more how this works, as this is not what 
I would expect?

2 (second time): Any resource above the limit should be preemptable, so that 
lines up with the first point 2.

3: That is expected. I would consider setting any limit on the root (min, max, 
etc.) a bug: the root is the whole cluster and should thus mirror what is 
available in the cluster without limits. Otherwise a cluster could never grow or 
shrink.

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)






[jira] [Comment Edited] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039817#comment-17039817
 ] 

Wilfred Spiegelenburg edited comment on YARN-10124 at 2/19/20 9:11 AM:
---

When you set a parent queue to a 0 capacity could that not leave the apps in 
the leaf queues below it in a state that they are not scheduled?
How does it affect the capacity calculation, i.e. is the queue now always below 
or above its (hard)limits?

I am not against removing the limitation but I am thinking about the side 
effects it could have.


was (Author: wilfreds):
When you set a parent queue to a 0 capacity could that not leave the apps in 
the leaf queues below it in a state that they are not scheduled?
How does it affect the capacity calculation, i.e. is the queue now always below 
or above its (hard)limits?

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039817#comment-17039817
 ] 

Wilfred Spiegelenburg commented on YARN-10124:
--

When you set a parent queue to a 0 capacity, could that not leave the apps in 
the leaf queues below it in a state where they are not scheduled?
How does it affect the capacity calculation, i.e. is the queue now always below 
or above its (hard) limits?

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10124) Remove restriction of ParentQueue capacity zero when childCapacities > 0

2020-02-11 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17034266#comment-17034266
 ] 

Wilfred Spiegelenburg commented on YARN-10124:
--

Based on what I read here, you expect a stopped queue to not have a capacity 
and thus return 0 when calculating the correct distribution. If that is the use 
case then why can't we implement exactly that? In other words: why not turn 
this around and only take the capacity of a queue into account when it is not 
in a stopped state? So you return 0 for all stopped queues. You do not have to 
go further than that.

There is no need to (re)calculate anything below the parent that is stopped (as 
that is all ignored), and turning the queue back on will trigger the existing 
settings to be applied again without further changes.
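
Roughly what I have in mind, as a sketch only (illustrative names, not a patch 
against the CapacityScheduler):
{code:java}
// Illustrative only: the capacity of a STOPPED queue is simply ignored when
// the distribution is calculated; nothing below it needs to be recalculated.
public class StoppedQueueSketch {
  enum QueueState { RUNNING, STOPPED }

  static float effectiveCapacity(QueueState state, float configuredCapacity) {
    if (state == QueueState.STOPPED) {
      return 0f;                 // stopped queues contribute nothing
    }
    return configuredCapacity;   // existing settings apply again once restarted
  }
}
{code}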

> Remove restriction of ParentQueue capacity zero when childCapacities > 0
> 
>
> Key: YARN-10124
> URL: https://issues.apache.org/jira/browse/YARN-10124
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-10124-001.patch
>
>
> ParentQueue capacity cannot be set to 0 when child capacities > 0. To disable 
> a parent queue temporarily, user can only STOP the queue but the capacity of 
> the queue cannot be used for other queues. Allowing 0 capacity for parent 
> queue will allow user to use the capacity for other queues and also to retain 
> the child queue capacity values. (else user has to set all child queue 
> capacities to 0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled

2020-01-29 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg reassigned YARN-10112:


Assignee: Wilfred Spiegelenburg

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used 
> with Fair Scheduler size based weights enabled
> ---
>
> Key: YARN-10112
> URL: https://issues.apache.org/jira/browse/YARN-10112
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.5
>Reporter: Yu Wang
>Assignee: Wilfred Spiegelenburg
>Priority: Minor
>
> The user uses the FairScheduler, and yarn.scheduler.fair.sizebasedweight is 
> set true. From the JStack thread dump provided by the support engineers, we 
> could see that the getAppWeight method of FairScheduler (shown below) 
> was always holding the FairScheduler object monitor, which made 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
>  wait to enter the same object monitor, thus resulting in the 
> livelock.
>  
> The issue occurs very infrequently and we are still unable to figure out a 
> way to consistently reproduce the issue. The issue resembles what the Jira 
> YARN-1458 reports, but it seems that fix has been in effect since 
> 2.6. 
>  
>  
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x7fbcee65e800 
> nid=0x2ea4 waiting for monitor entry [0x7fbcbcd5e000] 
> java.lang.Thread.State: BLOCKED (on object monitor) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>  - waiting to lock <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>  at java.lang.Thread.run(Thread.java:748) 
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 
> tid=0x7fbceea0e800 nid=0x2ea2 runnable [0x7fbcbcf6] 
> java.lang.Thread.State: RUNNABLE at java.lang.StrictMath.log1p(Native Method) 
> at java.lang.Math.log1p(Math.java:1747) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10112) Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled

2020-01-29 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17026400#comment-17026400
 ] 

Wilfred Spiegelenburg commented on YARN-10112:
--

This does not happen in the current releases of YARN anymore.

In YARN-7414 we moved {{getAppWeight}} out of the scheduler into 
{{FSAppAttempt}}. That did not solve the locking issue but was the right thing 
to do. In the follow-up YARN-7513 I removed the lock from the new call. I would 
say that this is thus a duplicate of the combination YARN-7414 & YARN-7513.

Both are fixed in Hadoop 3.0.1 and 3.1. Backporting this change is possible.
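
As a rough illustration of what changed (simplified sketch, not the exact 
committed code; the formula only mirrors the size-based-weight idea):
{code:java}
// Simplified sketch of the before/after locking behaviour, not the actual
// committed code.
public class AppWeightSketch {
  private final boolean sizeBasedWeight = true;

  // Before: the weight was computed by FairScheduler while holding the
  // scheduler monitor, so the update thread could keep nodeUpdate() blocked.
  public synchronized double weightHoldingSchedulerLock(long demandMemory) {
    return sizeBasedWeight ? Math.log1p(demandMemory) / Math.log(2) : 1.0;
  }

  // After YARN-7414 + YARN-7513: the same calculation is done per attempt
  // without taking the scheduler lock, so it cannot starve nodeUpdate().
  public double weightWithoutSchedulerLock(long demandMemory) {
    return sizeBasedWeight ? Math.log1p(demandMemory) / Math.log(2) : 1.0;
  }
}
{code}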

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used 
> with Fair Scheduler size based weights enabled
> ---
>
> Key: YARN-10112
> URL: https://issues.apache.org/jira/browse/YARN-10112
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.8.5
>Reporter: Yu Wang
>Priority: Minor
>
> The user uses the FairScheduler, and yarn.scheduler.fair.sizebasedweight is 
> set true. From the JStack thread dump provided by the support engineers, we 
> could see that the getAppWeight method of FairScheduler (shown below) 
> was always holding the FairScheduler object monitor, which made 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
>  wait to enter the same object monitor, thus resulting in the 
> livelock.
>  
> The issue occurs very infrequently and we are still unable to figure out a 
> way to consistently reproduce the issue. The issue resembles what the Jira 
> YARN-1458 reports, but it seems that fix has been in effect since 
> 2.6. 
>  
>  
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x7fbcee65e800 
> nid=0x2ea4 waiting for monitor entry [0x7fbcbcd5e000] 
> java.lang.Thread.State: BLOCKED (on object monitor) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>  - waiting to lock <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>  at java.lang.Thread.run(Thread.java:748) 
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 
> tid=0x7fbceea0e800 nid=0x2ea2 runnable [0x7fbcbcf6] 
> java.lang.Thread.State: RUNNABLE at java.lang.StrictMath.log1p(Native Method) 
> at java.lang.Math.log1p(Math.java:1747) at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>  - locked <0x0006eb816b18> (a 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) 
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For 

[jira] [Commented] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup

2020-01-28 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025566#comment-17025566
 ] 

Wilfred Spiegelenburg commented on YARN-8990:
-

I do think that is a good idea. The patch should still apply for both to 
branch-3.2; if not, I can provide a branch-specific patch, but we need 
a committer to check it in for us.

> Fix fair scheduler race condition in app submit and queue cleanup
> -
>
> Key: YARN-8990
> URL: https://issues.apache.org/jira/browse/YARN-8990
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
> Fix For: 3.2.0, 3.3.0
>
> Attachments: YARN-8990.001.patch, YARN-8990.002.patch
>
>
> With the introduction of the dynamic queue deletion in YARN-8191 a race 
> condition was introduced that can cause a queue to be removed while an 
> application submit is in progress.
> The issue occurs in {{FairScheduler.addApplication()}} when an application is 
> submitted to a dynamic queue which is empty or the queue does not exist yet. 
> If during the processing of the application submit the 
> {{AllocationFileLoaderService}} kicks of for an update the queue clean up 
> will be run first. The application submit first creates the queue and get a 
> reference back to the queue. 
> Other checks are performed and as the last action before getting ready to 
> generate an AppAttempt the queue is updated to show the submitted application 
> ID..
> The time between the queue creation and the queue update to show the submit 
> is long enough for the queue to be removed. The application however is lost 
> and will never get any resources assigned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-22 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020931#comment-17020931
 ] 

Wilfred Spiegelenburg commented on YARN-9879:
-

I agree, {{getQueueName()}} should stay as is. We already have 
{{getQueuePath()}}, and every CSQueue can already return both. We should change 
all non-external-facing calls that get the name of a queue to use the path 
version. The only calls that can stay are the ones that provide their data in 
an externally viewable form (REST, UI or IPC), so as not to break compatibility.

I also do not see why we would need the ambiguous queue list. The queue is 
always unique when a path is used. It does not matter if the current leaf queue 
name uniqueness is enforced or not.
 Everything can always be found by its path. If I do not have a path I expect 
leaf queue uniqueness and can find the queue by just checking the part after 
the last _dot_ in the path.
 i.e.
 * queue paths defined as: root.parent.child1
 child queue unique flag is set
 find a queue with name: *child1* (no dots, expect leaf queue uniqueness) -> 
returns the queue correctly
 * add a queue defined as: root.otherparent.child1
 child queue unique flag is not set, allowed
 find a queue with name: *child1* (no dots, expect leaf queue uniqueness) -> 
returns an error

Internally we just store everything using the path. That removes the whole 
problem of keeping things in sync and makes the code consistent when combined 
with using the path everywhere internally.
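
As a rough sketch of that lookup (illustrative only, not CS code; the map and 
exception type are placeholders):
{code:java}
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Store queues by full path, resolve a short name via the last path segment
// and fail when that short name is ambiguous or unknown.
public class QueueLookupSketch {
  private final Map<String, Object> queuesByPath; // full path -> queue object

  public QueueLookupSketch(Map<String, Object> queuesByPath) {
    this.queuesByPath = queuesByPath;
  }

  public Object find(String name) {
    if (name.contains(".")) {
      return queuesByPath.get(name);   // a full path is always unique
    }
    List<String> matches = queuesByPath.keySet().stream()
        .filter(p -> p.substring(p.lastIndexOf('.') + 1).equals(name))
        .collect(Collectors.toList());
    if (matches.size() != 1) {
      throw new IllegalArgumentException(
          "Queue name is ambiguous or unknown: " + name);
    }
    return queuesByPath.get(matches.get(0));
  }
}
{code}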

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf, YARN-9879.POC001.patch
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020662#comment-17020662
 ] 

Wilfred Spiegelenburg commented on YARN-7913:
-

Test failures are not related; the branch-3.2 fix seems to be good to go.

You might also need to trigger the branch-3.1 build; the backport is slightly 
different because of YARN-8248.

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-7913-branch-3.1.001.patch, 
> YARN-7913-branch-3.2.001.patch, YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-21 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-7913:

Attachment: YARN-7913-branch-3.2.001.patch
YARN-7913-branch-3.1.001.patch

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-7913-branch-3.1.001.patch, 
> YARN-7913-branch-3.2.001.patch, YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-21 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020242#comment-17020242
 ] 

Wilfred Spiegelenburg commented on YARN-7913:
-

added branch-3.1 and branch-3.2 patches

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-7913-branch-3.1.001.patch, 
> YARN-7913-branch-3.2.001.patch, YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10087) ATS possible NPE on REST API when data is missing

2020-01-15 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-10087:
-
Attachment: ats_stack.txt

> ATS possible NPE on REST API when data is missing
> -
>
> Key: YARN-10087
> URL: https://issues.apache.org/jira/browse/YARN-10087
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Wilfred Spiegelenburg
>Priority: Major
> Attachments: ats_stack.txt
>
>
> If the data stored by the ATS is not complete REST calls to the ATS can 
> return a NPE instead of results.
> {{{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException"}}}
> The issue shows up when the ATS was down for a short period and in that time 
> new applications were started. This causes certain parts of the application 
> data to be missing in the ATS store. In most cases this is not a problem and 
> data will be returned but when you start filtering data the filtering fails 
> throwing the NPE.
>  In this case the request was for: 
> {{http://:8188/ws/v1/applicationhistory/apps?user=hive'}}
> If certain pieces of data are missing the ATS should not even consider 
> returning that data, filtered or not. We should not display partial or 
> incomplete data.
>  In case of the missing user information ACL checks cannot be correctly 
> performed and we could see more issues.
> A similar issue was fixed in YARN-7118 where the queue details were missing. 
> It just _skips_ the app to prevent the NPE but that is not the correct thing 
> when the user is missing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10087) ATS possible NPE on REST API when data is missing

2020-01-15 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016530#comment-17016530
 ] 

Wilfred Spiegelenburg commented on YARN-10087:
--

Attached the NPE as logged in the ATS logs.

The logs point to these lines of code in the release that was running:
{code:java}
188 if (userQuery != null && !userQuery.isEmpty()) {
189   if (!appReport.getUser().equals(userQuery)) {
190     continue;
191   }
192 } {code}
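
One possible local guard, as an illustration only (not a committed fix; 
{{reports}} and {{results}} are placeholders for the surrounding loop): skip 
incomplete records before any filtering so the user filter can never 
dereference a null user.
{code:java}
for (ApplicationReport appReport : reports) {
  if (appReport.getUser() == null) {
    // incomplete ATS record: do not return partial data at all
    continue;
  }
  if (userQuery != null && !userQuery.isEmpty()
      && !userQuery.equals(appReport.getUser())) {
    continue;
  }
  results.add(appReport);
}
{code}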

> ATS possible NPE on REST API when data is missing
> -
>
> Key: YARN-10087
> URL: https://issues.apache.org/jira/browse/YARN-10087
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Wilfred Spiegelenburg
>Priority: Major
> Attachments: ats_stack.txt
>
>
> If the data stored by the ATS is not complete REST calls to the ATS can 
> return a NPE instead of results.
> {{{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException"}}}
> The issue shows up when the ATS was down for a short period and in that time 
> new applications were started. This causes certain parts of the application 
> data to be missing in the ATS store. In most cases this is not a problem and 
> data will be returned but when you start filtering data the filtering fails 
> throwing the NPE.
>  In this case the request was for: 
> {{http://:8188/ws/v1/applicationhistory/apps?user=hive'}}
> If certain pieces of data are missing the ATS should not even consider 
> returning that data, filtered or not. We should not display partial or 
> incomplete data.
>  In case of the missing user information ACL checks cannot be correctly 
> performed and we could see more issues.
> A similar issue was fixed in YARN-7118 where the queue details were missing. 
> It just _skips_ the app to prevent the NPE but that is not the correct thing 
> when the user is missing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10087) ATS possible NPE on REST API when data is missing

2020-01-15 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YARN-10087:


 Summary: ATS possible NPE on REST API when data is missing
 Key: YARN-10087
 URL: https://issues.apache.org/jira/browse/YARN-10087
 Project: Hadoop YARN
  Issue Type: Bug
  Components: ATSv2
Reporter: Wilfred Spiegelenburg


If the data stored by the ATS is not complete REST calls to the ATS can return 
a NPE instead of results.

{{{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException"}}}

The issue shows up when the ATS was down for a short period and in that time 
new applications were started. This causes certain parts of the application 
data to be missing in the ATS store. In most cases this is not a problem and 
data will be returned but when you start filtering data the filtering fails 
throwing the NPE.
 In this case the request was for: 
{{http://:8188/ws/v1/applicationhistory/apps?user=hive'}}

If certain pieces of data are missing the ATS should not even consider 
returning that data, filtered or not. We should not display partial or 
incomplete data.
 In case of the missing user information ACL checks cannot be correctly 
performed and we could see more issues.

A similar issue was fixed in YARN-7118 where the queue details were missing. It 
just _skips_ the app to prevent the NPE but that is not the correct thing when 
the user is missing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015701#comment-17015701
 ] 

Wilfred Spiegelenburg commented on YARN-9879:
-

A per-queue flag looks very strange. I am also not sure it will help or add 
anything on top of having a global flag that just prevents the config change. 
Summarising the proposed solution: add a flag that prevents the admin from 
adding non-unique leaf queue names and thus fails the config change when they 
try.

The behaviour inside the scheduler must all be based on the full queue paths 
anyway. You cannot have one queue being addressed by the leaf name and the 
other by the path. The code complexity to do that would be enormous and lead to 
unsupportable code. That means that after the placement rule(s) are run and the 
app is placed everything must be based on a full path.

Placement rules throw up a totally different issue here. When we use placement 
rules we have one of two possible cases:
 * the rule generates a queue name and a parent queue name, i.e. a path
 * the rule generates just a leaf queue name

This means the rule can generate a leaf queue anywhere in the hierarchy 
without specifying one. So no parent is set by the rule, but the generated leaf 
queue could be located below a parent. With that last possibility we have the 
extra complexity that the rules are not behaving consistently.
 Example:
Two CS definitions to compare; both allow queue creation and overwrite of the 
submitted queue:
 # queues: root.parent.wilfred
 # queues: root

mapping rule defined: {{u:%user:%user}}

1) the user submitting the app is {{wilfred}}, the queue given on submission is default.
In CS config 1 we submit to the {{root.parent.wilfred}} queue, while in the 
second CS config we submit to the {{root.wilfred}} queue.

2) the user submitting the app is {{peter}}, the queue given on submission is default.
In both CS configs we submit to the {{root.peter}} queue.

With different configs at the CS level but the same rule, we place the app in 
a sub queue in one case but not in the other; that is inconsistent.

I think rules even need to start taking this flag into account to preserve this 
inconsistent behaviour.
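
To make the inconsistency concrete, a toy illustration (not CS code; names and 
logic are simplified):
{code:java}
import java.util.List;

// Toy model: the path the u:%user:%user rule resolves to depends on whether
// a leaf with that short name already exists somewhere in the hierarchy.
public class MappingRuleSketch {
  static String resolve(String user, List<String> existingLeafPaths) {
    for (String path : existingLeafPaths) {
      if (path.substring(path.lastIndexOf('.') + 1).equals(user)) {
        return path;              // config 1: wilfred -> root.parent.wilfred
      }
    }
    return "root." + user;        // config 2 (or unknown user): root.<user>
  }

  public static void main(String[] args) {
    List<String> config1 = List.of("root.parent.wilfred");
    System.out.println(resolve("wilfred", config1)); // root.parent.wilfred
    System.out.println(resolve("peter", config1));   // root.peter
  }
}
{code}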

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015435#comment-17015435
 ] 

Wilfred Spiegelenburg commented on YARN-7913:
-

Thank you [~snemeth] for looking at the patch.
1) yes, it is a different approach: we now do a proper restore instead of 
failing the app and/or crashing the RM.

2) removed that line; it does not make sense to throw errors and fail the app 
but not fail the RM.

3) You are correct in your assumption about the flag. It is checked in the case 
you are mentioning. This is the full change in that area:
{code}
487 if (!isAppRecovering) {
488   rejectApplicationWithMessage(applicationId,
489   queueName + " is not a leaf queue");
490   return;
491 }
492 // app is recovering we do not want to fail the app now as it was there
493 // before we started the recovery. Add it to the recovery queue:
494 // dynamic queue directly under root, no ACL needed (auto clean up)
495 queueName = "root.recovery";
496 queue = queueMgr.getLeafQueue(queueName, true, applicationId);
{code}
The flag is already checked in line 487 of the change. When we get to the 
creation of the recovery queue (lines 492-496) we are recovering; if we are not 
recovering we have already rejected the app and left the method via the return 
in line 490.

4) will fix the comment

5) Yes, we only want to check newly added apps that are _not_ recovering. When 
we recover there is a large chance that there are no NMs registered yet. When 
we use a percentage of the cluster resource for the maximum size of a queue, 
the size check of the AM resource will then fail. Since the AM was/is already 
running we should only perform that check if we are not recovering the app (see 
the sketch below).

6) The largest number of statements inside an if..else I can see is 6 lines. 
Not sure if refactoring the method will make it any clearer. I also don't see 
any consistency in the checks that would allow us to change the flow easily and 
improve clarity.
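
For point 5, a minimal sketch of the intent (hypothetical helper names, not 
the actual patch):
{code:java}
public class AmCheckSketch {
  // Only genuinely new submissions can be rejected on the AM size check:
  // during recovery the registered cluster resource may still be zero, so a
  // percentage-based queue maximum would wrongly reject an AM that already ran.
  static boolean shouldRejectForAmSize(boolean isAppRecovering,
      boolean amFitsInQueueMax) {
    if (isAppRecovering) {
      return false;             // restore the app as-is, skip the check
    }
    return !amFitsInQueueMax;   // new submissions keep the existing behaviour
  }
}
{code}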

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-7913:

Description: 
There are edge cases when the application recovery fails with an exception.

Example failure scenario:
 * setup: a queue is a leaf queue in the primary RM's config and the same queue 
is a parent queue in the secondary RM's config.
 * When failover happens with this setup, the recovery will fail for 
applications on this queue, and an APP_REJECTED event will be dispatched to the 
async dispatcher. On the same thread (that handles the recovery), a 
NullPointerException is thrown when the applicationAttempt is tried to be 
recovered 
(https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
 I don't see a good way to avoid the NPE in this scenario, because when the NPE 
occurs the APP_REJECTED has not been processed yet, and we don't know that the 
application recovery failed.

Currently the first exception will abort the recovery, and if there are X 
applications, there will be ~X passive -> active RM transition attempts - the 
passive -> active RM transition will only succeed when the last APP_REJECTED 
event is processed on the async dispatcher thread.


  was:
There are edge cases when the application recovery fails with an exception.

Example failure scenario:
 * setup: a queue is a leaf queue in the primary RM's config and the same queue 
is a parent queue in the secondary RM's config.
 * When failover happens with this setup, the recovery will fail for 
applications on this queue, and an APP_REJECTED event will be dispatched to the 
async dispatcher. On the same thread (that handles the recovery), a 
NullPointerException is thrown when the applicationAttempt is tried to be 
recovered 
(https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
 I don't see a good way to avoid the NPE in this scenario, because when the NPE 
occurs the APP_REJECTED has not been processed yet, and we don't know that the 
application recovery failed.

Currently the first exception will abort the recovery, and if there are X 
applications, there will be ~X passive -> active RM transition attempts - the 
passive -> active RM transition will only succeed when the last APP_REJECTED 
event is processed on the async dispatcher thread.

_The point of this ticket is to improve the error handling and reduce the 
number of passive -> active RM transition attempts (solving the above described 
failure scenario isn't in scope)._


> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (YARN-8470) Fair scheduler exception with SLS

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-8470.
-
  Assignee: Wilfred Spiegelenburg  (was: Szilard Nemeth)
Resolution: Duplicate

> Fair scheduler exception with SLS
> -
>
> Key: YARN-8470
> URL: https://issues.apache.org/jira/browse/YARN-8470
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> I ran into the following exception with sls:
> 2018-06-26 13:34:04,358 ERROR resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> FSPreemptionThread, that exited unexpectedly: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptOnNode(FSPreemptionThread.java:207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptForOneContainer(FSPreemptionThread.java:161)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreempt(FSPreemptionThread.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:81)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8470) Fair scheduler exception with SLS

2020-01-14 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015344#comment-17015344
 ] 

Wilfred Spiegelenburg commented on YARN-8470:
-

[~Steven Rand] this has been fixed via YARN-9984 and is in 3.2.2.

> Fair scheduler exception with SLS
> -
>
> Key: YARN-8470
> URL: https://issues.apache.org/jira/browse/YARN-8470
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Miklos Szegedi
>Assignee: Szilard Nemeth
>Priority: Major
>
> I ran into the following exception with sls:
> 2018-06-26 13:34:04,358 ERROR resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> FSPreemptionThread, that exited unexpectedly: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptOnNode(FSPreemptionThread.java:207)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreemptForOneContainer(FSPreemptionThread.java:161)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.identifyContainersToPreempt(FSPreemptionThread.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:81)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10063) Usage output of container-executor binary needs to include --http/--https argument

2020-01-08 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010548#comment-17010548
 ] 

Wilfred Spiegelenburg commented on YARN-10063:
--

Thank you for the review [~pbacsko] and for the update [~sahuja].

We need to add the usage correctly to the output. What I see at the moment does 
not look correct. The output indicates that we have {{--http | --https}}; it 
does not show that the two parameters following the "choice" are only supported 
with the https option.

Not sure how we can show that better. We might need to go partially back to 
the way [~sahuja] showed it in the first version and use the same construct 
as used for the {{command and command-args}} to provide the detail needed, so 
something like:
{quote}where command and command-args:
 initialize container: 0 appid tokens nm-local-dirs nm-log-dirs cmd app...
 launch container: 1 appid containerid workdir container-script tokens 
*http-option* pidfile nm-local-dirs nm-log-dirs resources 
optional-tc-command-file
 [DISABLED] launch docker container: 4 appid containerid 
 ...
 where http-option is one of:
 --http
 --https keystorepath truststorepath
{quote}

A bit difficult to show in jira but I think you get the gist.

> Usage output of container-executor binary needs to include --http/--https 
> argument
> --
>
> Key: YARN-10063
> URL: https://issues.apache.org/jira/browse/YARN-10063
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Siddharth Ahuja
>Assignee: Siddharth Ahuja
>Priority: Minor
> Attachments: YARN-10063.001.patch, YARN-10063.002.patch
>
>
> YARN-8448/YARN-6586 seems to have introduced a new option - "\--http" 
> (default) and "\--https" that is possible to be passed in to the 
> container-executor binary, see :
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L564
> and 
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L521
> however, the usage output seems to have missed this:
> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c#L74
> Raising this jira to improve this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2020-01-07 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010190#comment-17010190
 ] 

Wilfred Spiegelenburg commented on YARN-9879:
-

Thank you [~leftnoteasy] for the comments.
{quote}And once application is submitted to CS, internal to CS, we should make 
sure we use queue path instead of queue name at all other places. Otherwise we 
will complicate other logics.
{quote}
I agree that is what I had in mind too. Make it as simple as possible inside 
the scheduler and that is to use just the full path internally.

For the configuration change: I do not think it is a problem and we can just 
accept the change. To be fair to the administrator we should show a message 
when the configuration is loaded or changed and the leaf queues are not unique 
(any more). However that is probably as far as we need to go.
{quote}Instead of using scheduler.getQueue, we may need to consider to add a 
method like getAppSubmissionQueue() to get a queue based on path or name, and 
after that, we will put normalized queue_path back to submission context of 
application to make sure in the future inside scheduler we all refer to queue 
path.
{quote}
The FS already does something like this because it uses a placement 
rule in all cases. We should leverage a similar mechanism in the CS. We pass 
the queue from the submission into the queue placement which handles the full 
path or not. In both cases it just passes back the queue object which will be 
using the full path. If the queue is not found or the queue name is not unique 
it fails as per normal. The returned queue info is updated in the app and 
submission context.
 Far simpler than putting the burden on the core scheduler. It is all hidden in 
the placement of the app into the queue using the placement engine.

I did not mention queue mapping in my design. Queue mapping itself I thought 
did not need to change. We already calculate the parent queue in the rules if I 
am correct so the only change would be the return value. We do all internal 
handling for queues with the full queue path so it is a logical change. Using 
the placement rule for the qualified or not qualified mapping does require some 
changes in that area.

I might have forgotten to mention other bits and pieces like the CLI or flow-on 
effects on the UI, but that needs to be assessed when we have a design we 
agree on. There will be more jiras needed to fix separate parts when the change 
is made to the core.

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-07 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009551#comment-17009551
 ] 

Wilfred Spiegelenburg commented on YARN-7913:
-

[~snemeth] and [~sunilg] can you have a look at the change please?

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-06 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-7913:

Attachment: YARN-7913.003.patch

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._






[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-06 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009329#comment-17009329
 ] 

Wilfred Spiegelenburg commented on YARN-7913:
-

The ACL test case did not fail in the expected way when reversing the fix, because 
the ACLs were not correctly set up in the test config.
Updating the patch with the proper config. [^YARN-7913.003.patch] 
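
For illustration only (the queue and user names below are hypothetical, not taken from the actual test), this is the kind of fair-scheduler allocation ACL setup such a test needs: the root queue has to be locked down explicitly, because it defaults to allowing everyone and queue ACLs are inherited from the parent queues.

{code:xml}
<?xml version="1.0"?>
<allocations>
  <queue name="root">
    <!-- a single space means nobody: without this the root default of "*"
         would let every user submit to every queue -->
    <aclSubmitApps> </aclSubmitApps>
    <aclAdministerApps> </aclAdministerApps>
    <queue name="restricted">
      <!-- format is "user1,user2 group1,group2"; only this user may submit -->
      <aclSubmitApps>alloweduser </aclSubmitApps>
    </queue>
  </queue>
</allocations>
{code}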

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch, YARN-7913.003.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when an attempt is made to recover the 
> applicationAttempt 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._






[jira] [Updated] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-06 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-7913:

Attachment: YARN-7913.002.patch

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when an attempt is made to recover the 
> applicationAttempt 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._






[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2020-01-06 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009240#comment-17009240
 ] 

Wilfred Spiegelenburg commented on YARN-7913:
-

New patch to fix the javac warnings about deprecated method use and the checkstyle 
issues (the IDE was not set up correctly).

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7913.000.poc.patch, YARN-7913.001.patch, 
> YARN-7913.002.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when an attempt is made to recover the 
> applicationAttempt 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._






[jira] [Updated] (YARN-10053) Placement rules do not use correct group service init

2019-12-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-10053:
-
Attachment: YARN-10053-branch-3.2.002.patch

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, 
> YARN-10053-branch-3.1.002.patch, YARN-10053-branch-3.1.003.patch, 
> YARN-10053-branch-3.2.001.patch, YARN-10053-branch-3.2.002.patch, 
> YARN-10053.001.patch, YARN-10053.002.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Comment Edited] (YARN-10053) Placement rules do not use correct group service init

2019-12-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002545#comment-17002545
 ] 

Wilfred Spiegelenburg edited comment on YARN-10053 at 12/23/19 11:49 PM:
-

The junit test failures are not related to my change. The tests pass locally. 
One failure is caused by a port bind exception; they seem environment related 
or are already known (YARN-9740, YARN-9338).

Uploading new patches to fix the checkstyle issues in the 3.1 and 3.2 branches.


was (Author: wilfreds):
The junit test failures are not related to my change. The tests pass locally. 
One failures is caused by a port bind exception, they seem environment related 
or are already known (YARN-9740, YARN-9338).
in 3.1 and 3.2 branches
Uploading new patches to fix checkstyle issues

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, 
> YARN-10053-branch-3.1.002.patch, YARN-10053-branch-3.1.003.patch, 
> YARN-10053-branch-3.2.001.patch, YARN-10053.001.patch, YARN-10053.002.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Commented] (YARN-10053) Placement rules do not use correct group service init

2019-12-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002545#comment-17002545
 ] 

Wilfred Spiegelenburg commented on YARN-10053:
--

The junit test failures are not related to my change. The tests pass locally. 
One failure is caused by a port bind exception; they seem environment related 
or are already known (YARN-9740, YARN-9338).
Uploading new patches to fix the checkstyle issues in the 3.1 and 3.2 branches.

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, 
> YARN-10053-branch-3.1.002.patch, YARN-10053-branch-3.1.003.patch, 
> YARN-10053-branch-3.2.001.patch, YARN-10053.001.patch, YARN-10053.002.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Updated] (YARN-10053) Placement rules do not use correct group service init

2019-12-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-10053:
-
Attachment: YARN-10053-branch-3.1.003.patch

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, 
> YARN-10053-branch-3.1.002.patch, YARN-10053-branch-3.1.003.patch, 
> YARN-10053-branch-3.2.001.patch, YARN-10053.001.patch, YARN-10053.002.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Commented] (YARN-10047) Memory consumption of the process tree includes subprocesses, which may make the container exit unexpectedly

2019-12-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002521#comment-17002521
 ] 

Wilfred Spiegelenburg commented on YARN-10047:
--

Accounting is not done by YARN; it is done by the OS. It sounds like you are 
saying that the OS is not accounting correctly.

From a YARN perspective we consider any process that is forked from the 
container script to be part of the container. All subprocesses are therefore 
part of the container. You must account for that inside the container.

If you are referring to some other issue, or I am not understanding the problem 
you see correctly, then please show it with some logs or screenshots.



> Memory consumption of the process tree includes subprocesses, which may make 
> the container exit unexpectedly
> -
>
> Key: YARN-10047
> URL: https://issues.apache.org/jira/browse/YARN-10047
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> As shown below, we have a case where a Spark driver executes some scripts. Then 
> sometimes the driver is killed.
> {code:java}
> yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Container 
> [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is 
> running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB 
> physical memory used; xxx. Killing container.
> {code}
> {code:java}
> boolean isProcessTreeOverLimit(String containerId,
>   long currentMemUsage,
>   long curMemUsageOfAgedProcesses,
>   long vmemLimit) {
> boolean isOverLimit = false;
>
> /**
> if (currentMemUsage > (2 * vmemLimit)) {
>   LOG.warn("Process tree for container: " + containerId
>   + " running over twice " + "the configured limit. Limit=" + 
> vmemLimit
>   + ", current usage = " + currentMemUsage);
>   isOverLimit = true;
> }
> {code}






[jira] [Updated] (YARN-10053) Placement rules do not use correct group service init

2019-12-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-10053:
-
Attachment: YARN-10053-branch-3.2.001.patch

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, 
> YARN-10053-branch-3.1.002.patch, YARN-10053-branch-3.2.001.patch, 
> YARN-10053.001.patch, YARN-10053.002.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Updated] (YARN-10053) Placement rules do not use correct group service init

2019-12-23 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-10053:
-
Attachment: YARN-10053-branch-3.1.002.patch

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, 
> YARN-10053-branch-3.1.002.patch, YARN-10053.001.patch, YARN-10053.002.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Commented] (YARN-10053) Placement rules do not use correct group service init

2019-12-23 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002230#comment-17002230
 ] 

Wilfred Spiegelenburg commented on YARN-10053:
--

Similar update for the 3.1 branch; it now also needs a 3.2 branch change.

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, 
> YARN-10053-branch-3.1.002.patch, YARN-10053.001.patch, YARN-10053.002.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Updated] (YARN-10053) Placement rules do not use correct group service init

2019-12-22 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-10053:
-
Attachment: YARN-10053.002.patch

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, YARN-10053.001.patch, 
> YARN-10053.002.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Commented] (YARN-10053) Placement rules do not use correct group service init

2019-12-22 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002141#comment-17002141
 ] 

Wilfred Spiegelenburg commented on YARN-10053:
--

Some tests were overriding the group resolution in a way that does not work 
anymore.
Updating the test initialisation.

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, YARN-10053.001.patch, 
> YARN-10053.002.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Commented] (YARN-10053) Placement rules do not use correct group service init

2019-12-22 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002074#comment-17002074
 ] 

Wilfred Spiegelenburg commented on YARN-10053:
--

Patch for the same issue that applies to the 3.1 and 3.2 branches

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, YARN-10053.001.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Updated] (YARN-10053) Placement rules do not use correct group service init

2019-12-22 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-10053:
-
Attachment: YARN-10053-branch-3.1.001.patch

> Placement rules do not use correct group service init
> -
>
> Key: YARN-10053
> URL: https://issues.apache.org/jira/browse/YARN-10053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.3
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-10053-branch-3.1.001.patch, YARN-10053.001.patch
>
>
> The placement rules, CS and FS, all create a new group service instead of 
> using the shared group mapping service. This means that the cache for the 
> placement rules is not the same as for the ACL and other parts of the RM service.
> It also could cause an issue with the configuration that is passed in to 
> create the cache: the scheduler config might not have the same values as the 
> service config and could thus cause issues. This second issue just seems to 
> affect the CS not the FS.






[jira] [Created] (YARN-10053) Placement rules do not use correct group service init

2019-12-22 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YARN-10053:


 Summary: Placement rules do not use correct group service init
 Key: YARN-10053
 URL: https://issues.apache.org/jira/browse/YARN-10053
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.1.3
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


The placement rules, CS and FS, all create a new group service instead of using 
the shared group mapping service. This means that the cache for the placement 
rules is not the same as for the ACL and other parts of the RM service.

It also could cause an issue with the configuration that is passed in to create 
the cache: the scheduler config might not have the same values as the service 
config and could thus cause issues. This second issue just seems to affect the 
CS not the FS.






[jira] [Resolved] (YARN-7012) yarn refuses application if a user has no groups associated

2019-12-22 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-7012.
-
  Assignee: Wilfred Spiegelenburg
Resolution: Not A Bug

The group mapping you configure should handle both local and non-local users. A 
user must have at least a primary group; it is needed for ownership of files 
etc.
The exception you see shows that the user is not considered valid, and this is 
supposed to happen.

> yarn refuses application if a user has no groups associated
> ---
>
> Key: YARN-7012
> URL: https://issues.apache.org/jira/browse/YARN-7012
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
> Environment: redhat 7.3
>Reporter: Michael Salmon
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>
> When an application is submitted the list of queue mappings is searched for 
> the default queue for the user submitting the application. If there is a 
> queue mapping for a group and the user does not have any groups then an 
> exception is thrown and the application is rejected:
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application 
> application_1499325026025_0002 submitted by user hive reason: No groups found 
> for user hive
> at UserGroupMappingPlacementRule.java:153 although the lookup is at line 122.
> This problem arises when using local users and AD groups.
> Users without groups should be treated as users with no matching groups






[jira] [Commented] (YARN-10046) RM failed to transition to Active because of App recovery throwing java.lang.NullPointerException

2019-12-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1768#comment-1768
 ] 

Wilfred Spiegelenburg commented on YARN-10046:
--

If this is really the version that you say it is, the error happens at this 
point:
{code}
519  SchedulerApplication application = applications.get(
520  applicationAttemptId.getApplicationId());
521  String user = application.getUser();
522  FSLeafQueue queue = (FSLeafQueue) application.getQueue();
{code}

That would mean that the application is null, which makes it the same issue as 
YARN-7913. Please check that one; I have started working on a fix for it.

Normally this failure means that you have changed the scheduler configuration 
so much that we cannot handle it on recovery.
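
To make the failure mode explicit: the map lookup returns null for an application that was rejected or never recovered, and the very next line dereferences it. A minimal guard of the kind a fix needs could look roughly like the sketch below (simplified, not the actual YARN-7913 patch):

{code:java}
// Sketch only: bail out instead of dereferencing a missing application.
SchedulerApplication application = applications.get(
    applicationAttemptId.getApplicationId());
if (application == null) {
  // Nothing to attach the attempt to: the application was most likely
  // rejected during recovery, so skip the attempt instead of throwing an NPE.
  LOG.warn("Skipping attempt " + applicationAttemptId
      + ", application not found in the scheduler");
  return;
}
String user = application.getUser();
FSLeafQueue queue = (FSLeafQueue) application.getQueue();
{code}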

> RM failed to transition to Active because of App recovery throwing 
> java.lang.NullPointerException
> -
>
> Key: YARN-10046
> URL: https://issues.apache.org/jira/browse/YARN-10046
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Yong Xing
>Priority: Critical
>
>  
> CDH Distribution: Hadoop 3.0.0-cdh6.0.1
> The exception stack is as follows.
> 2019-12-12 17:09:41,422 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to 
> load/recover state
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:521)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1221)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:130)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1265)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1206)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:907)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:116)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:1046)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$2000(RMAppImpl.java:118)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1110)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1051)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>  at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:357)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:544)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1393)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1146)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1186)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1182)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> 

[jira] [Commented] (YARN-10047) container memory monitor may make container exit

2019-12-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1760#comment-1760
 ] 

Wilfred Spiegelenburg commented on YARN-10047:
--

This is expected behaviour: the container is using roughly 50 GB while it is only 
allowed to use 5.25 GB. If your container needs more memory then you need to 
request more memory for it.
The code shown is also not the code that is killing the container: your 
container is over its *physical* memory limit (i.e. real used memory), not over its 
_virtual_ memory limit. Your message is generated here in 
[ContainersMonitorImpl.java|https://github.com/apache/hadoop/blob/1ac967a6b77c262b23e10c6ca68538b7e4ed39b0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java#L434]
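
For clarity, the physical memory check is conceptually just the resident set size summed over the whole process tree and compared against the container's allocation. The sketch below is illustrative only, with hypothetical types and helper names, not the actual ContainersMonitorImpl code:

{code:java}
// Illustrative only: every process forked from the container script counts
// towards the container's physical memory usage.
static boolean isOverPhysicalLimit(List<ProcInfo> containerProcessTree,
    long pmemLimitBytes) {
  long totalRssBytes = 0;
  for (ProcInfo proc : containerProcessTree) {
    totalRssBytes += proc.getRssBytes();   // resident set size of each process
  }
  return totalRssBytes > pmemLimitBytes;   // triggers the "Killing container" path
}
{code}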


> container memory monitor may make container exit
> 
>
> Key: YARN-10047
> URL: https://issues.apache.org/jira/browse/YARN-10047
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
>
> As shown below, we have a case where a Spark driver executes some scripts. Then 
> sometimes the driver is killed.
> {code:java}
> yarn.174410.log.2019-12-17.02:2019-12-17,06:59:14,831 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
>  Container 
> [pid=50529,containerID=container_e917_1576303656075_174957_01_003197] is 
> running beyond physical memory limits. Current usage: 50.28 GB of 5.25 GB 
> physical memory used; xxx. Killing container.
> {code}
> {code:java}
> boolean isProcessTreeOverLimit(String containerId,
>   long currentMemUsage,
>   long curMemUsageOfAgedProcesses,
>   long vmemLimit) {
> boolean isOverLimit = false;
>
> /**
> if (currentMemUsage > (2 * vmemLimit)) {
>   LOG.warn("Process tree for container: " + containerId
>   + " running over twice " + "the configured limit. Limit=" + 
> vmemLimit
>   + ", current usage = " + currentMemUsage);
>   isOverLimit = true;
> }
> {code}






[jira] [Commented] (YARN-10043) FairOrderingPolicy Improvements

2019-12-18 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16999627#comment-16999627
 ] 

Wilfred Spiegelenburg commented on YARN-10043:
--

Improvements are always welcome, especially if they help bring us to one 
scheduler that does proper fair and capacity scheduling.
Giving some more specifics on what you are thinking about would help.

> FairOrderingPolicy Improvements
> ---
>
> Key: YARN-10043
> URL: https://issues.apache.org/jira/browse/YARN-10043
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>
> FairOrderingPolicy can be improved by using some of the approaches (only the 
> relevant ones) implemented in the FairSharePolicy of the FS. This improvement is 
> significant in the FS to CS migration context.






[jira] [Commented] (YARN-9879) Allow multiple leaf queues with the same name in CS

2019-12-16 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997290#comment-16997290
 ] 

Wilfred Spiegelenburg commented on YARN-9879:
-

I have read through the design document and was wondering if we cannot take a 
far simpler approach.

We could simply relax the rule that the leaf queue name must be unique in the 
system, in favour of requiring that a queue is unique based on its full queue path. 
This does not break existing configurations, as a unique leaf queue name is also 
unique when you take the whole path into account. That means there is nothing 
that needs to change for current clusters. Internally the scheduler does 
have to change to make sure that all references use the queue path. This will 
require a lot of changes throughout the scheduler, both when looking up a queue and 
in the way we store the reference if it is not directly to the leaf queue. 

The only other point that we need to handle correctly is the submit 
side. This must be handled in a backward compatible way. We have two cases to handle: 
just a queue name, and a queue path. I'll discuss updating the configuration 
later.

# When an application is submitted with just a queue name (not a path) we 
expect that the name is a unique leaf queue name. If that queue does not exist 
or is not uniquely identifiable we reject the application submission. 
Resolution of the real leaf queue follows the same steps as it does now. The 
queue name is in the end converted to the correct leaf queue, identified by a 
path. For existing configurations nothing has changed; internally we hide all 
the changes.
# When the submit has a queue path (fully qualified or not) we check that the 
queue exists based on that path. If the leaf queue is not defined using its 
path the application submission is rejected. 

In the case that the scheduler has a non-unique leaf queue name, submitting to 
those queues can only be done by using their paths. There is nothing that needs 
to be configured to switch this behaviour on or off.
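
To make the two submit cases concrete, here is a rough sketch of the proposed resolution (all names are hypothetical, this is not existing scheduler code):

{code:java}
// Sketch: a bare name must map to exactly one leaf queue, a path is matched
// as-is; anything else rejects the submission.
Queue resolveSubmissionQueue(String queueName) throws YarnException {
  if (queueName.contains(".")) {
    // Treated as a queue path (fully qualified or not): it must exist.
    Queue queue = queuesByPath.get(fullyQualify(queueName));
    if (queue == null || !queue.isLeafQueue()) {
      throw new YarnException("Unknown or non-leaf queue path: " + queueName);
    }
    return queue;
  }
  // Bare leaf queue name: only accepted when it is unique in the hierarchy.
  List<Queue> matches = leafQueuesByShortName.get(queueName);
  if (matches == null || matches.size() != 1) {
    throw new YarnException("Queue " + queueName
        + " does not exist or is ambiguous, submit using the full queue path");
  }
  return matches.get(0);
}
{code}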

The important part is applying a new configuration. If the configuration adds a 
leaf queue name that is not unique, the configuration update is currently rejected. 
With this change we would allow that config to become active. This *could* 
break existing applications when they try to submit to the leaf queue that is 
no longer unique.
We should at least log and warn clearly in the response to the update. Maybe 
even show it in the UI, or we could ask for a confirmation. The first update 
that adds a non-unique queue to the configuration should always fail, 
complaining loudly. It should then keep warning the user and rejecting the 
update unless a confirmation flag is set to force the update through. After the 
first update that would not be needed anymore.
Reading a config from a file or store that is used to initialise the scheduler 
should not trigger such behaviour. We should still show a warning in the logs 
to make sure it is not lost.

What do you think about this approach?

> Allow multiple leaf queues with the same name in CS
> ---
>
> Key: YARN-9879
> URL: https://issues.apache.org/jira/browse/YARN-9879
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: DesignDoc_v1.pdf
>
>
> Currently the leaf queue's name must be unique regardless of its position in 
> the queue hierarchy. 
> Design doc and first proposal is being made, I'll attach it as soon as it's 
> done.






[jira] [Commented] (YARN-5106) Provide a builder interface for FairScheduler allocations for use in tests

2019-12-09 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991698#comment-16991698
 ] 

Wilfred Spiegelenburg commented on YARN-5106:
-

It is correct that the XMLs would have been moved around, as testing has changed 
with the addition of YARN-8967.
Back porting could thus require major updates to the patch to make sure you 
pick up the earlier set of XMLs. It really depends on how important it is to 
get this back ported.

Not having it in the three active 3.x branches could make back porting new FS 
tests much harder. The choice becomes: do the work once now, or keep doing it 
for a longer period for each fix/change separately.

> Provide a builder interface for FairScheduler allocations for use in tests
> --
>
> Key: YARN-5106
> URL: https://issues.apache.org/jira/browse/YARN-5106
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Adam Antal
>Priority: Major
>  Labels: newbie++
> Fix For: 3.3.0
>
> Attachments: YARN-5106-branch-3.1.001.patch, 
> YARN-5106-branch-3.1.001.patch, YARN-5106-branch-3.1.001.patch, 
> YARN-5106-branch-3.1.002.patch, YARN-5106-branch-3.2.001.patch, 
> YARN-5106-branch-3.2.001.patch, YARN-5106-branch-3.2.002.patch, 
> YARN-5106.001.patch, YARN-5106.002.patch, YARN-5106.003.patch, 
> YARN-5106.004.patch, YARN-5106.005.patch, YARN-5106.006.patch, 
> YARN-5106.007.patch, YARN-5106.008.patch, YARN-5106.008.patch, 
> YARN-5106.008.patch, YARN-5106.009.patch, YARN-5106.010.patch, 
> YARN-5106.011.patch, YARN-5106.012.patch, YARN-5106.013.patch, 
> YARN-5106.014.patch, YARN-5106.015.patch, YARN-5106.016.patch
>
>
> Most, if not all, fair scheduler tests create an allocations XML file. Having 
> a helper class that potentially uses a builder would make the tests cleaner. 






[jira] [Commented] (YARN-7721) TestContinuousScheduling fails sporadically with NPE

2019-12-09 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991693#comment-16991693
 ] 

Wilfred Spiegelenburg commented on YARN-7721:
-

If there were a problem with the {{transferStateFromPreviousAttempt}} call itself 
then it should cause other tests to fail.
It is not a generic issue in the sense that we always want to wait until 
{{getLastScheduledContainer}} returns a non-null value. When we are 
scheduling normally, i.e. in non-test code, there is no problem and we do not want 
to delay there.

There is nothing left to do for this jira except commit the code; that was why 
it was assigned to [~sunilg]. [~prabhujoseph] or [~snemeth] could also do that.
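
For the test-only wait mentioned above, a plain polling loop is enough; the timeout and interval below are illustrative:

{code:java}
// Test-only: wait until the attempt has scheduled at least one container
// instead of reading getLastScheduledContainer() immediately.
long deadline = System.currentTimeMillis() + 10000;
while (app.getLastScheduledContainer() == null
    && System.currentTimeMillis() < deadline) {
  Thread.sleep(50);
}
assertNotNull("No container was scheduled before the timeout",
    app.getLastScheduledContainer());
{code}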

> TestContinuousScheduling fails sporadically with NPE
> 
>
> Key: YARN-7721
> URL: https://issues.apache.org/jira/browse/YARN-7721
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Jason Darrell Lowe
>Assignee: Sunil G
>Priority: Major
> Attachments: YARN-7721.001.patch
>
>
> TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime is 
> failing sporadically with an NPE in precommit builds, and I can usually 
> reproduce it locally after a few tries:
> {noformat}
> [ERROR] 
> testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
>   Time elapsed: 0.085 s  <<< ERROR!
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:383)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> [...]
> {noformat}






[jira] [Commented] (YARN-5106) Provide a builder interface for FairScheduler allocations for use in tests

2019-12-06 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-5106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989655#comment-16989655
 ] 

Wilfred Spiegelenburg commented on YARN-5106:
-

I did not look at back porting it to any release earlier than trunk. It is a 
major change; however, it did not change anything in the configuration of the 
scheduler. The XML is still exactly the same as it was before the change. 

The way we load it has changed, but the XML itself is the same. So for the 
builder: it writes out the configuration to a file and then calls the loader. 
This is more likely where things stumble, because the new loader takes the 
{{scheduler}} as a parameter while the old one did not.
That should be the only point where the two jiras interact.
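
As an illustration of that interaction point (class and method names below are hypothetical, not the actual YARN-5106 API): the builder still produces an ordinary allocations XML file, so the only branch-specific part is the final load call.

{code:java}
// Hypothetical builder usage: the output is the same XML a hand-written
// allocations file would contain.
AllocationFileBuilder builder = AllocationFileBuilder.create(ALLOC_FILE);
builder.addLeafQueue("root.test").weight(2.0f).maxRunningApps(10);
builder.write();

// trunk: the reload is driven through the scheduler instance ...
allocLoader.reloadAllocations(scheduler);
// ... while the older branches call the reload without the scheduler parameter.
{code}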

> Provide a builder interface for FairScheduler allocations for use in tests
> --
>
> Key: YARN-5106
> URL: https://issues.apache.org/jira/browse/YARN-5106
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Adam Antal
>Priority: Major
>  Labels: newbie++
> Fix For: 3.3.0
>
> Attachments: YARN-5106-branch-3.1.001.patch, 
> YARN-5106-branch-3.1.001.patch, YARN-5106-branch-3.1.001.patch, 
> YARN-5106-branch-3.1.002.patch, YARN-5106-branch-3.2.001.patch, 
> YARN-5106-branch-3.2.001.patch, YARN-5106-branch-3.2.002.patch, 
> YARN-5106.001.patch, YARN-5106.002.patch, YARN-5106.003.patch, 
> YARN-5106.004.patch, YARN-5106.005.patch, YARN-5106.006.patch, 
> YARN-5106.007.patch, YARN-5106.008.patch, YARN-5106.008.patch, 
> YARN-5106.008.patch, YARN-5106.009.patch, YARN-5106.010.patch, 
> YARN-5106.011.patch, YARN-5106.012.patch, YARN-5106.013.patch, 
> YARN-5106.014.patch, YARN-5106.015.patch, YARN-5106.016.patch
>
>
> Most, if not all, fair scheduler tests create an allocations XML file. Having 
> a helper class that potentially uses a builder would make the tests cleaner. 






[jira] [Commented] (YARN-10010) NM upload log cost too much time

2019-12-03 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986740#comment-16986740
 ] 

Wilfred Spiegelenburg commented on YARN-10010:
--

The thread pool size is configurable via 
{{yarn.nodemanager.logaggregation.threadpool-size-max}}; if you need more 
threads you can set it higher.
I would recommend that you set it to a number almost as high as the number of 
simultaneous applications you expect in the cluster. I cannot give you an exact 
number, but 100 threads, especially in larger clusters, might not be enough.

Can we also close this as a dupe of YARN-8364?
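
For reference, the property is a NodeManager setting in yarn-site.xml; the value below is only an example for a busy cluster:

{code:xml}
<property>
  <!-- upper bound on concurrent log aggregation uploads per NodeManager;
       size it close to the number of simultaneously finishing applications -->
  <name>yarn.nodemanager.logaggregation.threadpool-size-max</name>
  <value>300</value>
</property>
{code}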

> NM upload log cost too much time
> 
>
> Key: YARN-10010
> URL: https://issues.apache.org/jira/browse/YARN-10010
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Attachments: notfound.png
>
>
> Since the thread pool size of the log service is 100,
> sometimes the log uploading service is delayed for some apps, like below:
>  !notfound.png! 






[jira] [Created] (YARN-9993) Remove incorrectly committed files from YARN-9011

2019-11-27 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YARN-9993:
---

 Summary: Remove incorrectly committed files from YARN-9011
 Key: YARN-9993
 URL: https://issues.apache.org/jira/browse/YARN-9993
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.2.2
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


With the check-in of YARN-9011 a number of files were added that should not have 
been in the commit:
[https://github.com/apache/hadoop/tree/branch-3.2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications]

This causes the ASF license check to fail on the 3.2 branch build.






[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977911#comment-16977911
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

I missed the dot instead of the dash. Thank you, [~sunilg], for fixing the name.

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Fix For: 3.3.0, 3.2.2
>
> Attachments: YARN-8373-branch-3.1.001.patch, 
> YARN-8373-branch.3.1.001.patch, YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch, YARN-8373.004.patch, YARN-8373.005.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> FairSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}






[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-19 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977415#comment-16977415
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

Not sure why it tried to use trunk, as the patch follows the branch naming
convention.

I ran the tests with the patch on the branch and did not see any issues.

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Fix For: 3.3.0, 3.2.2
>
> Attachments: YARN-8373-branch.3.1.001.patch, YARN-8373.001.patch, 
> YARN-8373.002.patch, YARN-8373.003.patch, YARN-8373.004.patch, 
> YARN-8373.005.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Reopened] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-19 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg reopened YARN-8373:
-

Adding patch for the 3.1 branch

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Fix For: 3.3.0, 3.2.2
>
> Attachments: YARN-8373-branch.3.1.001.patch, YARN-8373.001.patch, 
> YARN-8373.002.patch, YARN-8373.003.patch, YARN-8373.004.patch, 
> YARN-8373.005.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-18 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976997#comment-16976997
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

{quote}[~wilfreds] cud we add a test to check whether reading from nodes doesnt 
cause any pblms OR any existing tests are already covering the same.
{quote}
The existing code already checks this and has a unit test. The set is a copy
of the nodes, and YARN-3675 covers the read path in its tests.
{quote}In an another note, does continuousSchedulingAttempt need writeLock by 
any chance ?
{quote}
The write lock is taken in {{attemptScheduling}} a little later. In this case
a read lock is enough: we are not making any changes, and it has less impact on
scheduling. The method that does the real scheduling, {{attemptScheduling}},
must hold the write lock and already takes one. It also checks again whether
the node still exists, as it might have been removed between building the list
and actually allocating (your concern above).
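
To make that locking pattern concrete, here is a minimal, self-contained
sketch. The class and field names are hypothetical and this is not the actual
FairScheduler code; it only illustrates taking a read lock to snapshot the
nodes and taking the write lock, with a re-check of the node, in the method
that really allocates.
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch, not FairScheduler: read lock to snapshot, write lock to allocate.
public class SchedulerLockSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Integer> availableByNode = new HashMap<>();

  // Snapshot the node ids under the read lock; nothing is modified here.
  List<String> snapshotNodes() {
    lock.readLock().lock();
    try {
      return new ArrayList<>(availableByNode.keySet());
    } finally {
      lock.readLock().unlock();
    }
  }

  // The real allocation takes the write lock and re-checks that the node
  // still exists, because it may have been removed after the snapshot.
  boolean attemptScheduling(String node, int ask) {
    lock.writeLock().lock();
    try {
      Integer available = availableByNode.get(node);
      if (available == null || available < ask) {
        return false;
      }
      availableByNode.put(node, available - ask);
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}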

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch, YARN-8373.004.patch, YARN-8373.005.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9984) FSPreemptionThread crash with NullPointerException

2019-11-17 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-9984:

Attachment: YARN-9984.001.patch

> FSPreemptionThread crash with NullPointerException
> --
>
> Key: YARN-9984
> URL: https://issues.apache.org/jira/browse/YARN-9984
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.0.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-9984.001.patch
>
>
> When an application is unregistered there is a chance that there are still 
> containers running on a node for that application. In all cases we handle the 
> application missing from the RM gracefully (log a message and continue) 
> except for the FS pre-emption thread.
> In case the application is removed but some containers are still linked to a 
> node, the FSPreemptionThread will crash with an NPE when it tries to retrieve 
> the application id for the attempt:
> {code:java}
> FSAppAttempt app =
> scheduler.getSchedulerApp(container.getApplicationAttemptId());
> ApplicationId appId = app.getApplicationId();{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9984) FSPreemptionThread crash with NullPointerException

2019-11-17 Thread Wilfred Spiegelenburg (Jira)
Wilfred Spiegelenburg created YARN-9984:
---

 Summary: FSPreemptionThread crash with NullPointerException
 Key: YARN-9984
 URL: https://issues.apache.org/jira/browse/YARN-9984
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.0.0
Reporter: Wilfred Spiegelenburg
Assignee: Wilfred Spiegelenburg


When an application is unregistered there is a chance that there are still 
containers running on a node for that application. In all cases we handle the 
application missing from the RM gracefully (log a message and continue) except 
for the FS pre-emption thread.

In case the application is removed but some containers are still linked to a 
node, the FSPreemptionThread will crash with an NPE when it tries to retrieve 
the application id for the attempt:
{code:java}
FSAppAttempt app =
scheduler.getSchedulerApp(container.getApplicationAttemptId());
ApplicationId appId = app.getApplicationId();{code}
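
A minimal sketch of the defensive handling implied above: when the scheduler no
longer knows the attempt, log and skip the container instead of dereferencing
null. The types below are simplified stand-ins rather than the real YARN
classes, and this is not the attached patch.
{code:java}
import java.util.logging.Logger;

// Simplified stand-ins for the scheduler types; only the null handling matters here.
class PreemptionSketch {
  private static final Logger LOG = Logger.getLogger("PreemptionSketch");

  interface Scheduler { AppAttempt getSchedulerApp(String attemptId); }
  interface AppAttempt { String getApplicationId(); }
  interface Container { String getApplicationAttemptId(); }

  void considerForPreemption(Scheduler scheduler, Container container) {
    AppAttempt app = scheduler.getSchedulerApp(container.getApplicationAttemptId());
    if (app == null) {
      // The application was already removed from the RM: log and continue,
      // the same graceful handling used elsewhere, instead of an NPE.
      LOG.info("Skipping container of removed attempt "
          + container.getApplicationAttemptId());
      return;
    }
    String appId = app.getApplicationId();
    LOG.info("Evaluating containers of application " + appId);
  }
}
{code}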



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-17 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976205#comment-16976205
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

{quote}1. Javadoc still says "PriorityQueue": @return sorted list in the form 
of a PriorityQueue
{quote}
fixed, I thought I had changed them all.
{quote}2. Since we're using a set, maybe it's worthwile to rename the variable 
"sortedList" to something else and also change the "list" word in the Javadoc.
{quote}
renamed to set

Uploaded a new patch with these changes

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch, YARN-8373.004.patch, YARN-8373.005.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-17 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-8373:

Attachment: YARN-8373.005.patch

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch, YARN-8373.004.patch, YARN-8373.005.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-16 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975692#comment-16975692
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

I really should have looked at my fix in YARN-8436 while writing this one. I 
used a {{TreeSet}} there too to fix the same issue...

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch, YARN-8373.004.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-16 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975689#comment-16975689
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

I missed that about the {{PriorityQueue}}; it is not really obvious that the
iterator behaves that way. I have changed it to a {{TreeSet}}.

I looked at that, however the problem is that {{Collections.sort()}} uses
{{TimSort}}, which means we must be able to guarantee that the objects being
sorted cannot change while sorting. The node objects that we sort are not
locked while we are sorting, so we cannot give that guarantee. To guarantee no
change we would need to lock all individual nodes for the time we are sorting:
a rather large performance impact.

On top of that, even if in the current code path only one thread changes the
node, there is no guarantee that behaviour will not change. Since the
{{ClusterNodeTracker}} is used by all schedulers, we also cannot guarantee that
a different scheduler does not do it or use the same method.

Based on all that it is better to be defensive and use the {{TreeSet}}.
You can easily reproduce the issue with the unit test that has been updated:
when you use {{Collections.sort()}} in the {{ClusterNodeTracker}} you will see
the test fail. At least it did for me every single time I ran it.
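
For reference, a small stand-alone demonstration (not YARN code) of the failure
mode: {{Collections.sort()}} and {{List.sort()}} use TimSort, which assumes the
ordering does not change during the sort, so mutating the sort key from another
thread can trigger the "Comparison method violates its general contract"
exception. Whether a given run actually throws is timing-dependent.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class TimSortContractDemo {
  static final class Node {
    volatile int available; // the sort key, mutated concurrently
    Node(int available) { this.available = available; }
  }

  public static void main(String[] args) {
    Random rnd = new Random();
    List<Node> nodes = new ArrayList<>();
    for (int i = 0; i < 50_000; i++) {
      nodes.add(new Node(rnd.nextInt(1_000)));
    }

    // Mutator thread: simulates node resources changing while the
    // continuous scheduling thread is sorting the node list.
    Thread mutator = new Thread(() -> {
      Random r = new Random();
      while (!Thread.currentThread().isInterrupted()) {
        nodes.get(r.nextInt(nodes.size())).available = r.nextInt(1_000);
      }
    });
    mutator.setDaemon(true);
    mutator.start();

    Comparator<Node> byAvailable = Comparator.comparingInt(n -> n.available);
    for (int attempt = 0; attempt < 100; attempt++) {
      try {
        nodes.sort(byAvailable); // same TimSort as Collections.sort()
      } catch (IllegalArgumentException e) {
        System.out.println("Reproduced: " + e.getMessage());
        break;
      }
    }
    mutator.interrupt();
  }
}
{code}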

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch, YARN-8373.004.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 

[jira] [Updated] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-16 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-8373:

Attachment: YARN-8373.004.patch

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch, YARN-8373.004.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9974) Large diagnostics may cause RM recover failed

2019-11-13 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973803#comment-16973803
 ] 

Wilfred Spiegelenburg commented on YARN-9974:
-

This looks like a duplicate of YARN-6125 and YARN-6967.
Which version of Hadoop are you using?

> Large diagnostics may cause RM recover failed
> -
>
> Key: YARN-9974
> URL: https://issues.apache.org/jira/browse/YARN-9974
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Critical
>
> {code:java}
> 2019-09-04,16:37:32,224 DEBUG org.apache.zookeeper.ClientCnxn: Reading reply 
> sessionid:0x563398cdd5889a1, packet:: clientPath:null serverPath:null 
> finished:false header:: 1659,4  replyHeader:: 1659,27117069873,0  request:: 
> '/yarn-ha/zjyprc-hadoop/rm-state/ZKRMStateRoot/RMAppRoot/application_1531361280531_691245,F
>   response:: 
> 

[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-11 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971887#comment-16971887
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

See my comment:
{quote}
The one inside the scheduler prevents updates of nodes to flow through as much 
as possible.
{quote}
There is no guarantee that there will not be a change. The only way we can
prevent all changes is by locking down all nodes in the list, which we do not
want. Putting the scheduler lock in place will quiet things down but does not
guarantee it.

The PriorityQueue does not avoid change; it handles the fact that objects can
change while in the list.

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-11 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971642#comment-16971642
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

The one inside the scheduler prevents updates of nodes to flow through as much
as possible. Node changes are linked to allocating or finishing containers,
i.e. scheduling actions, and these are held back as much as possible by this
lock.
The one in the node tracker prevents adding or deleting nodes while it creates
the list. However, a node can still be removed after sorting; that is handled
later while iterating over the returned list of nodes.

The two locks do different things.

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8990) Fix fair scheduler race condition in app submit and queue cleanup

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970395#comment-16970395
 ] 

Wilfred Spiegelenburg commented on YARN-8990:
-

Thank you [~Steven Rand] for making us aware of the omission.
And yes, you are correct: this was checked into 3.2.0 only and not into the
later 3.2.x releases. It is in 3.3.

For YARN-8992 the fix version is set incorrectly; that one is only in 3.3
(adding a comment there too).

 [~sunilg] how do we handle these two?

> Fix fair scheduler race condition in app submit and queue cleanup
> -
>
> Key: YARN-8990
> URL: https://issues.apache.org/jira/browse/YARN-8990
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Wilfred Spiegelenburg
>Assignee: Wilfred Spiegelenburg
>Priority: Blocker
> Fix For: 3.2.0, 3.3.0
>
> Attachments: YARN-8990.001.patch, YARN-8990.002.patch
>
>
> With the introduction of the dynamic queue deletion in YARN-8191 a race 
> condition was introduced that can cause a queue to be removed while an 
> application submit is in progress.
> The issue occurs in {{FairScheduler.addApplication()}} when an application is 
> submitted to a dynamic queue which is empty or the queue does not exist yet. 
> If, during the processing of the application submit, the 
> {{AllocationFileLoaderService}} kicks off an update, the queue cleanup will 
> be run first. The application submit first creates the queue and gets a 
> reference back to the queue. 
> Other checks are performed and, as the last action before getting ready to 
> generate an AppAttempt, the queue is updated to show the submitted 
> application ID.
> The time between the queue creation and the queue update to show the submit 
> is long enough for the queue to be removed. The application however is lost 
> and will never get any resources assigned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8992) Fair scheduler can delete a dynamic queue while an application attempt is being added to the queue

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970394#comment-16970394
 ] 

Wilfred Spiegelenburg commented on YARN-8992:
-


For YARN-8992 the fix version is set incorrectly; this one is only in 3.3.
Similar to the related YARN-8990, it is missing from 3.2.1.

> Fair scheduler can delete a dynamic queue while an application attempt is 
> being added to the queue
> --
>
> Key: YARN-8992
> URL: https://issues.apache.org/jira/browse/YARN-8992
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.1
>Reporter: Haibo Chen
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Fix For: 3.2.1
>
> Attachments: YARN-8992.001.patch, YARN-8992.002.patch
>
>
> As discovered in YARN-8990, QueueManager can see a leaf queue being empty 
> while FSLeafQueue.addApp() is called in the middle of  
> {code:java}
> return queue.getNumRunnableApps() == 0 &&
>   leafQueue.getNumNonRunnableApps() == 0 &&
>   leafQueue.getNumAssignedApps() == 0;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9920) YarnAuthorizationProvider AccessRequest gets Null RemoteAddress from FairScheduler

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970376#comment-16970376
 ] 

Wilfred Spiegelenburg commented on YARN-9920:
-

I am not sure what you are after with this change. The {{getRemoteAddress()}}
method of the {{AccessRequest}} is never called, and the same goes for
{{getForwardedAddresses()}}.
ACLs do not support limiting by IP either, so from a YARN perspective passing
the information through does not make sense, and having nulls does not cause
any issues. If this is to allow auditing or future extension then I can
understand it; otherwise please explain.

The other problem I have is with the IPC server call that is made. When we get
to the {{Server.getRemoteAddress()}} call, which address are we getting back?
The remote address is a thread local in the IPC server. Since multiple threads
service incoming IPC requests, how can we be sure that the scheduler, when
checking queue access for example, reads the value from the correct thread in
the server pool?

The second issue is that the web services, for example moving an app, use the
{{ClientRMService}} to eventually execute the move. The access check is
performed inside the {{ClientRMService}}, in which we call
{{Server.getRemoteAddress()}}. There is no IPC request at all, which probably
means we get the local node IP back.

Based on just this quick look I think the info is highly suspect, and it would
be better to have a proper look at when and where we build the
{{AccessRequest}} to make sure we get the proper information in all cases.
So it is a no-go from my side.
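
To illustrate why the value comes back null on the event dispatcher path, here
is a toy example of thread-local behaviour. It is not Hadoop's
{{org.apache.hadoop.ipc.Server}}; it only shows that a value set by one thread
(the IPC handler) is not visible from another thread (the scheduler's event
dispatcher), which reads its own unset copy.
{code:java}
// Toy example only; not org.apache.hadoop.ipc.Server.
public class ThreadLocalAddressDemo {
  private static final ThreadLocal<String> REMOTE_ADDR = new ThreadLocal<>();

  public static void main(String[] args) throws InterruptedException {
    Thread ipcHandler = new Thread(() -> {
      REMOTE_ADDR.set("10.0.0.42"); // set by the thread handling the RPC
      System.out.println("IPC handler sees: " + REMOTE_ADDR.get());
    }, "ipc-handler");

    Thread eventDispatcher = new Thread(() ->
        // This thread never set the value, so it reads null, matching the
        // behaviour described for the asynchronous scheduler event path.
        System.out.println("Event dispatcher sees: " + REMOTE_ADDR.get()),
        "event-dispatcher");

    ipcHandler.start();
    ipcHandler.join();
    eventDispatcher.start();
    eventDispatcher.join();
  }
}
{code}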

> YarnAuthorizationProvider AccessRequest gets Null RemoteAddress from 
> FairScheduler
> --
>
> Key: YARN-9920
> URL: https://issues.apache.org/jira/browse/YARN-9920
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, security
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9920-001.patch, YARN-9920-002.patch, 
> YARN-9920-003.patch
>
>
> YarnAuthorizationProvider AccessRequest has null RemoteAddress in case of 
> FairScheduler. FSQueue#hasAccess uses Server.getRemoteAddress() which will be 
> null when the call is from RMWebServices and EventDispatcher. It works fine 
> when called by IPC Server Handler.
> FSQueue#hasAccess is called at three places where (2) and (3) returns null.
> *1. IPC Server -> RMAppManager#createAndPopulateNewRMApp -> FSQueue#hasAccess 
> -> Server.getRemoteAddress returns correct Remote IP.*
>  
> *2. IPC Server -> RMAppManager#createAndPopulateNewRMApp -> 
> AppAddedSchedulerEvent*
>     *EventDispatcher -> FairScheduler#addApplication -> FSQueue.hasAccess -> 
> Server.getRemoteAddress returns null*
>   
> {code:java}
> org.apache.hadoop.yarn.security.ConfiguredYarnAuthorizer.checkPermission(ConfiguredYarnAuthorizer.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.hasAccess(FSQueue.java:316)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplication(FairScheduler.java:509)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1268)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:133)
> at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
> {code}
>  
> *3. RMWebServices -> QueueACLsManager#checkAccess -> FSQueue.hasAccess -> 
> Server.getRemoteAddress returns null.*
> {code:java}
> org.apache.hadoop.yarn.security.ConfiguredYarnAuthorizer.checkPermission(ConfiguredYarnAuthorizer.java:101)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSQueue.hasAccess(FSQueue.java:316)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.checkAccess(FairScheduler.java:1610)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.QueueACLsManager.checkAccess(QueueACLsManager.java:84)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.hasAccess(RMWebServices.java:270)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices.getApps(RMWebServices.java:553)
> {code}
>  
> Have verified with CapacityScheduler and it works fine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970251#comment-16970251
 ] 

Wilfred Spiegelenburg commented on YARN-9940:
-

The changes you are making are also not helpful in Hadoop 2.7. When you
synchronise a method you do the same as placing all the code inside the method
within a synchronised block: only one thread can run a synchronised method or
block on the same object at any time.
 These two code samples are effectively the same:
{code:java}
public synchronized void myMethod() {
  // all my code here...
}
{code}
and
{code:java}
public void myMethod() {
  synchronized (this) {
    // all my code here...
  }
}
{code}
Since the method {{FairScheduler.completedContainer()}} is already
synchronised, adding a synchronised block inside it does not make a
difference: the lock is reentrant, so the nested block is redundant.

Hadoop 2.7.2 does not have read/write locks in the scheduler at all, so I
don't know what version you are running, but it is not Hadoop 2.7. The
read/write locks were introduced in YARN-3139, which is only in Hadoop 2.9 and
later. The same goes for the line numbers in the stack trace: they do not line
up with the 2.7 release.

As per YARN-8373:
 - the test is not really testing anything, as it holds the scheduler lock
while calling {{deductUnallocatedResource()}}; this does not happen in the real
code and should not be there.
 - the best solution is to move to a PriorityQueue for the sorted list; that
really fixes the issue, as the test shows when the lock is not in place.

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Assignee: kailiu_dev
>Priority: Major
> Attachments: YARN-9940-branch-2.7.2.001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9952) Continuous scheduling thread crashes

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg resolved YARN-9952.
-
Resolution: Duplicate

> Continuous scheduling thread crashes
> ---
>
> Key: YARN-9952
> URL: https://issues.apache.org/jira/browse/YARN-9952
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970241#comment-16970241
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

patch-003 fixes the imports that the IDE had replaced with a wildcard.

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-8373:

Attachment: YARN-8373.003.patch

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch, 
> YARN-8373.003.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-8373:

Attachment: YARN-8373.002.patch

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-08 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970224#comment-16970224
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

Your link points to code in master, not in trunk: master has not been updated 
since 2015, while 
[trunk|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java]
 has the read lock in the code.

I do agree that fixing the data consistency would be the best thing. However, 
locking large numbers of nodes before we sort and then unlocking them again 
would have a huge performance impact.
Moving to a PriorityQueue is possible, as the FS is the only caller of the 
method at the moment. It also fixes the issue, as the unit test confirms.
The old unit test, without any special locking, modifies the nodes while sorting 
without issues. This has been confirmed in local runs with extra logging:
{code}
2019-11-09 00:57:09,958 INFO  [Thread-25] scheduler.SchedulerNode 
(SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from 
null:2147
2019-11-09 00:57:09,958 INFO  [FairSchedulerContinuousScheduling] 
scheduler.ClusterNodeTracker (ClusterNodeTracker.java:sortedNodeList(390)) - 
sorting node list of size 8000
2019-11-09 00:57:09,958 INFO  [Thread-25] scheduler.SchedulerNode 
(SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from 
null:5949
2019-11-09 00:57:09,958 INFO  [Thread-25] scheduler.SchedulerNode 
(SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from 
null:4677
...
...
2019-11-09 00:57:09,961 INFO  [Thread-25] scheduler.SchedulerNode 
(SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from 
null:2212
2019-11-09 00:57:09,961 INFO  [FairSchedulerContinuousScheduling] 
fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1005)) - 
scheduler sorted node list of size 8000
2019-11-09 00:57:09,962 INFO  [Thread-25] scheduler.SchedulerNode 
(SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from 
null:2949
2019-11-09 00:57:09,962 INFO  [Thread-25] scheduler.SchedulerNode 
(SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from 
null:3866
{code}
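
For contrast, the failure mode that the log above shows being avoided can be 
reproduced outside YARN with a small hypothetical demo (not YARN code): sort a 
large list while another thread mutates the field the comparator reads, and 
TimSort can detect the inconsistent ordering and throw the 
IllegalArgumentException seen in the stack traces in this thread.
{code:java}
// Hypothetical standalone demo: concurrent mutation of the field a
// comparator reads can make TimSort throw
// "Comparison method violates its general contract!".
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class ContractViolationDemo {
  static final class Node {
    volatile int unallocated;
    Node(int unallocated) { this.unallocated = unallocated; }
  }

  public static void main(String[] args) throws InterruptedException {
    List<Node> nodes = new ArrayList<>();
    Random rand = new Random();
    for (int i = 0; i < 8000; i++) {
      nodes.add(new Node(rand.nextInt(10_000)));
    }
    // Plays the role of allocations/releases deducting node resources
    // while the continuous scheduling thread sorts.
    Thread mutator = new Thread(() -> {
      Random r = new Random();
      while (!Thread.currentThread().isInterrupted()) {
        nodes.get(r.nextInt(nodes.size())).unallocated = r.nextInt(10_000);
      }
    });
    mutator.start();
    try {
      for (int run = 0; run < 1_000; run++) {
        // The comparator reads a value that changes mid-sort, so
        // transitivity across TimSort's merge runs is no longer guaranteed.
        nodes.sort(Comparator.comparingInt((Node n) -> n.unallocated));
      }
    } finally {
      mutator.interrupt();
      mutator.join();
    }
  }
}
{code}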

New patch uploaded

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch, YARN-8373.002.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general 

[jira] [Commented] (YARN-9952) Continuous scheduling thread crashes

2019-11-07 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969803#comment-16969803
 ] 

Wilfred Spiegelenburg commented on YARN-9952:
-

[~kailiu_dev] please explain how this issue is different from YARN-9940. The fix 
is the same and the stack trace is the same; this really is a duplicate.

Please close this as a dupe of YARN-9940.

> Continuous scheduling thread crashes
> ---
>
> Key: YARN-9952
> URL: https://issues.apache.org/jira/browse/YARN-9952
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Priority: Major
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-11-07 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969793#comment-16969793
 ] 

Wilfred Spiegelenburg commented on YARN-9940:
-

Hadoop 2.7 has been EOL'ed; there will be no more releases or fixes.
Please see here: 
https://cwiki.apache.org/confluence/display/HADOOP/EOL+%28End-of-life%29+Release+Branches



> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Assignee: kailiu_dev
>Priority: Major
> Attachments: YARN-9940-branch-2.7.2.001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-07 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969741#comment-16969741
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
-

YARN-6448 introduced the synchronisation around the 
ClusterNodeTracker.sortedNodeList() method. That was needed due to a change 
introduced by YARN-4719, and the fix looks like it was built while the scheduler 
methods were still synchronised.

The change for YARN-6448 was checked in upstream in release 2.9 and later, which 
also includes YARN-3139. YARN-3139 removed the synchronised blocks and changed 
all locking to read/write locks. That means the YARN-6448 change never really 
did anything from the moment it was added: it is the only synchronised block 
left in the FS, so it only prevents two sorts from happening at the same time 
and nothing more. My feeling is that the fix for YARN-6448 never worked because 
of that interaction.

The test that was written is not really testing the real issue. This is inside 
the test code:
{code}
synchronized (scheduler) {
  node.deductUnallocatedResource(Resource.newInstance(i * 1024, i));
}
{code}
The test uses a block that is synchronised on the scheduler, while in the real 
code {{deductUnallocatedResource()}} is not locked on the scheduler at all. The 
test should really be removed, as it gives a false sense that the code is tested 
and correct.

The fix should be as simple as replacing the synchronised block with a read 
lock. That would bring the fix back to the state that was intended. All the node 
changes, like releasing containers, run through the scheduler under a held write 
lock in {{attemptScheduling()}}, {{completedContainerInternal()}} or 
{{nodeUpdate()}}.
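
A minimal sketch of that direction, with method names borrowed from the 
discussion but otherwise hypothetical (this is not the actual patch): the sort 
runs under the read lock, so it cannot overlap with node mutations done under 
the write lock.
{code:java}
// Sketch only, not the YARN-8373 patch. Assumes a single sorting thread
// (the continuous scheduling thread).
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SchedulerLockingSketch<N> {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  List<N> sortedNodeList(List<N> nodes, Comparator<N> comparator) {
    lock.readLock().lock();
    try {
      // While any read lock is held the write lock cannot be acquired,
      // so resource deductions cannot interleave with the sort.
      nodes.sort(comparator);
      return nodes;
    } finally {
      lock.readLock().unlock();
    }
  }

  void deductUnallocatedResource(Runnable mutation) {
    lock.writeLock().lock();
    try {
      mutation.run();   // e.g. releasing a container, node update
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}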

Fixing the real issue, by locking all the nodes while sorting or by creating a 
deep copy of the node list before sorting, is costly. Neither option comes 
without a performance impact, especially in large clusters, and based on the 
analysis it would not give us anything extra.

[~snemeth] [~miklos.szeg...@cloudera.com] can you check please?

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at 

[jira] [Updated] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-07 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-8373:

Attachment: YARN-8373.001.patch

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
> Attachments: YARN-8373.001.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-11-07 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969693#comment-16969693
 ] 

Wilfred Spiegelenburg commented on YARN-9940:
-

BTW: this also looks more like a duplicate of YARN-8373.

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Assignee: kailiu_dev
>Priority: Major
> Attachments: YARN-9940-branch-2.7.2.001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8373) RM Received RMFatalEvent of type CRITICAL_THREAD_CRASH

2019-11-07 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg reassigned YARN-8373:
---

Assignee: Wilfred Spiegelenburg

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> ---
>
> Key: YARN-8373
> URL: https://issues.apache.org/jira/browse/YARN-8373
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Girish Bhat
>Assignee: Wilfred Spiegelenburg
>Priority: Major
>  Labels: newbie
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version Hadoop 2.9.0 
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 
> 756ebc8394e473ac25feac05fa493f6d612e6c50 Compiled by arsuresh on 
> 2017-11-13T23:15Z Compiled with protoc 2.5.0 From source with checksum 
> 0a76a9a32a5257331741f8d5932f183 This command was run using 
> /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Fai
> rSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-11-07 Thread Wilfred Spiegelenburg (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969691#comment-16969691
 ] 

Wilfred Spiegelenburg commented on YARN-9940:
-

The stack trace does not line up with hadoop 2.7.2. The FS call to sort is 
located at [line 
1002|https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L1002]
 in that release. Line 1117 is blank.

This fix also does not look correct: it touches code that I think it should not 
touch.

The {{FairScheduler.completedContainer()}} method is already synchronised, so 
adding a synchronised block inside it will not help. The same applies to 
{{AbstractYarnScheduler.recoverContainersOnNode()}}, which is also synchronised.

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Assignee: kailiu_dev
>Priority: Major
> Attachments: YARN-9940-branch-2.7.2.001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9940) avoid continuous scheduling thread crashes while sorting nodes get 'Comparison method violates its general contract'

2019-11-07 Thread Wilfred Spiegelenburg (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated YARN-9940:

Fix Version/s: (was: 2.7.2)

> avoid continuous scheduling thread crashes while sorting nodes get 
> 'Comparison method violates its general contract'
> 
>
> Key: YARN-9940
> URL: https://issues.apache.org/jira/browse/YARN-9940
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.2
>Reporter: kailiu_dev
>Assignee: kailiu_dev
>Priority: Major
> Attachments: YARN-9940-branch-2.7.2.001.patch
>
>
> 2019-10-16 09:14:51,215 ERROR 
> org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread 
> Thread[FairSchedulerContinuousScheduling,5,main] threw an Exception.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
>     at java.util.TimSort.mergeHi(TimSort.java:868)
>     at java.util.TimSort.mergeAt(TimSort.java:485)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
>     at java.util.TimSort.sort(TimSort.java:223)
>     at java.util.TimSort.sort(TimSort.java:173)
>     at java.util.Arrays.sort(Arrays.java:659)
>     at java.util.Collections.sort(Collections.java:217)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:1117)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


