[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists

2021-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10892:
--
Labels: pull-request-available  (was: )

> YARN Preemption Monitor got java.util.ConcurrentModificationException when 
> three or more partitions exists
> --
>
> Key: YARN-10892
> URL: https://issues.apache.org/jira/browse/YARN-10892
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.2
>Reporter: Jeongin Ju
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10892.001.patch, YARN-10892.002.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> On our cluster with a large number of NMs, the preemption monitor thread 
> consistently got a java.util.ConcurrentModificationException when specific 
> conditions were met. (And, of course, preemption does not work while this 
> happens.)
> The conditions we found are as follows (all four must be met):
>  # There are at least two non-exclusive partitions besides the default 
> partition (call them partitions X and Y).
>  # app1, in a queue belonging to the default partition (call it the 'dev' 
> queue), borrowed resources from both the X and Y partitions.
>  # app2 and app3, submitted to queues in the X and Y partitions respectively, 
> are 'PENDING' because their resources are held by app1.
>  # The preemption monitor can reclaim the borrowed resources from X or Y once 
> a container of app1 is preempted.
> The main problem is that FifoCandidatesSelector.selectCandidates tries to 
> remove a HashMap key (the partition name) while iterating over that HashMap.
> Logically the removal is correct, because the same partition is not traversed 
> again within this selectCandidates call. However, HashMap does not permit 
> structural modification while iterating; its fail-fast iterators throw 
> ConcurrentModificationException.
> I wrote a test case that reproduces the error 
> (testResourceTypesInterQueuePreemptionWithThreePartitions).
> We found and patched this on our 3.1.2 cluster, but trunk still appears to 
> have the same problem.
> I attached a patch based on trunk.
>  
> Thanks!
>  
> {quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor 
> (SchedulingMonitor.java:run(116)) - Exception raised while executing 
> preemption checker, skip this run..., exception=
>  java.util.ConcurrentModificationException
>  at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
>  at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)}}
>  
> {quote}
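The failure mode described above — removing a map key while a loop is iterating over the map's key set — is easy to reproduce in isolation, and so is the standard fix of removing through the iterator itself. A minimal illustrative sketch (not the actual FifoCandidatesSelector code; "partitions" here is just a stand-in map of partition names):

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CmeDemo {

    // Broken pattern (what the bug report describes): removing keys directly
    // from the map while a for-each loop iterates its key set. HashMap's
    // iterators are fail-fast, so the next iteration step throws.
    static boolean removeDuringIteration(Map<String, Integer> partitions) {
        try {
            for (String p : partitions.keySet()) {
                if (!"default".equals(p)) {
                    partitions.remove(p); // structural modification mid-iteration
                }
            }
            return false; // reached only if no exception was thrown
        } catch (ConcurrentModificationException e) {
            return true; // the fail-fast iterator detected the modification
        }
    }

    // Safe pattern: remove through the iterator, which keeps it consistent.
    static void removeViaIterator(Map<String, Integer> partitions) {
        Iterator<String> it = partitions.keySet().iterator();
        while (it.hasNext()) {
            if (!"default".equals(it.next())) {
                it.remove(); // permitted: iterator-driven removal
            }
        }
    }

    static Map<String, Integer> samplePartitions() {
        Map<String, Integer> m = new HashMap<>();
        m.put("default", 1);
        m.put("X", 2);
        m.put("Y", 3);
        return m;
    }

    public static void main(String[] args) {
        System.out.println(removeDuringIteration(samplePartitions())); // true
        Map<String, Integer> m = samplePartitions();
        removeViaIterator(m);
        System.out.println(m.keySet()); // only "default" remains
    }
}
```

With only one "borrowed-from" partition the broken loop can happen to exit before the next() call that would throw, which is why the bug only surfaces once there are enough partitions for iteration to continue past the first removal.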



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists

2021-08-23 Thread Jeongin Ju (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeongin Ju updated YARN-10892:
--
External issue URL: https://github.com/apache/hadoop/pull/3324
Labels:   (was: pull-request-available)
Remaining Estimate: (was: 0h)







[jira] [Commented] (YARN-10522) Document for Flexible Auto Queue Creation in Capacity Scheduler.

2021-08-23 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403215#comment-17403215
 ] 

Andras Gyori commented on YARN-10522:
-

Thank you [~bteke] for pursuing this issue! I have the following additions to 
the latest patch:
 * Nit: This line is somewhat convoluted. I think it would be easier to 
understand as a note in a separate sentence, e.g. "It is important to 
emphasize that dynamic queues created in a flexible fashion only work with 
weights as their capacity."
{noformat}
but the created queues will be and can only be configured with weights as 
capacity.{noformat}

 * Nit: I think "configured" here is redundant; the feature itself is a 
configuration value.
{noformat}
 The Flexible Dynamic Queue Auto-Creation and Management feature allows a 
ParentQueue to be configured
{noformat}

 * Nit: "pre-configured queues under the parent must be configured in the same 
way" sounds better in my opinion.
{noformat}
The auto-created queues will have weights as capacity so the pre-configured 
queues under the parent must be configured to use the same
{noformat}

 * Nit: suggested rewording: "A parent queue with flexible auto queue creation 
enabled supports the configuration of dynamically created leaf and parent 
queues."
{noformat}
The parent queue which has the flexible auto queue creation enabled supports 
the configuration dynamically created leaf and parent queues through template 
parameters
{noformat}

* Missing "by" in both sentences:
{noformat}
Specifies a queue property inherited auto-created leaf queues. Specifies a 
queue property inherited auto-created parent queues.
{noformat}

* Maybe an example for templates would be good, because it is not a 
straightforward feature at first glance.
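Regarding the last point, such a template example might look like the 
following capacity-scheduler.xml fragment. This is only a hypothetical sketch: 
the queue path (root.parent) and the exact property names are illustrative and 
should be verified against the final documentation.

```xml
<!-- Hypothetical fragment: the queue path and property names below are
     illustrative and must be checked against the CapacityScheduler docs. -->
<property>
  <name>yarn.scheduler.capacity.root.parent.auto-queue-creation-v2.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- A template property intended to be inherited by every dynamically
       created leaf queue under root.parent. -->
  <name>yarn.scheduler.capacity.root.parent.auto-queue-creation-v2.leaf-template.maximum-applications</name>
  <value>100</value>
</property>
```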

> Document for Flexible Auto Queue Creation in Capacity Scheduler.
> 
>
> Key: YARN-10522
> URL: https://issues.apache.org/jira/browse/YARN-10522
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Benjamin Teke
>Priority: Major
> Attachments: YARN-10522.001.patch
>
>
> We should update document to support this feature.






[jira] [Commented] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists

2021-08-23 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403167#comment-17403167
 ] 

Hadoop QA commented on YARN-10892:
--

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 13m  
7s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 2 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
46s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
1s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 54s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
46s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
43s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 18m 
10s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  1m 
47s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
51s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
53s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
53s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
48s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
48s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 39s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1195/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color}
 | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 3 new + 10 unchanged - 0 fixed = 13 total (was 10) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
48s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 45s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} 

[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists

2021-08-23 Thread Jeongin Ju (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeongin Ju updated YARN-10892:
--
Attachment: YARN-10892.002.patch







[jira] [Commented] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists

2021-08-23 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403077#comment-17403077
 ] 

Hadoop QA commented on YARN-10892:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 26m  
6s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
1s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 2 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 33m 
 4s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
9s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
47s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 37s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
49s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
43s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 20m  
5s{color} | {color:blue}{color} | {color:blue} Both FindBugs and SpotBugs are 
enabled, using SpotBugs. {color} |
| {color:green}+1{color} | {color:green} spotbugs {color} | {color:green}  1m 
57s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
48s{color} | 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1194/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt{color}
 | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
57s{color} | 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1194/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt{color}
 | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 57s{color} 
| 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1194/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt{color}
 | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
49s{color} | 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1194/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt{color}
 | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 49s{color} 
| 

[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists

2021-08-23 Thread Jeongin Ju (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeongin Ju updated YARN-10892:
--
Description: 
On our cluster with a large number of NMs, the preemption monitor thread 
consistently got a java.util.ConcurrentModificationException when specific 
conditions were met. (And, of course, preemption does not work while this 
happens.)

The conditions we found are as follows (all four must be met):
 # There are at least two non-exclusive partitions besides the default 
partition (call them partitions X and Y).
 # app1, in a queue belonging to the default partition (call it the 'dev' 
queue), borrowed resources from both the X and Y partitions.
 # app2 and app3, submitted to queues in the X and Y partitions respectively, 
are 'PENDING' because their resources are held by app1.
 # The preemption monitor can reclaim the borrowed resources from X or Y once 
a container of app1 is preempted.

The main problem is that FifoCandidatesSelector.selectCandidates tries to 
remove a HashMap key (the partition name) while iterating over that HashMap.

Logically the removal is correct, because the same partition is not traversed 
again within this selectCandidates call. However, HashMap does not permit 
structural modification while iterating; its fail-fast iterators throw 
ConcurrentModificationException.

I wrote a test case that reproduces the error 
(testResourceTypesInterQueuePreemptionWithThreePartitions).

We found and patched this on our 3.1.2 cluster, but trunk still appears to 
have the same problem.

I attached a patch based on trunk.

 

Thanks!

 
{quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor 
(SchedulingMonitor.java:run(116)) - Exception raised while executing preemption 
checker, skip this run..., exception=
 java.util.ConcurrentModificationException
 at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
 at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)}}

 
{quote}


[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists

2021-08-23 Thread Jeongin Ju (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeongin Ju updated YARN-10892:
--
Description: 
On our cluster with a large number of NMs, preemption monitor thread 
consistently got java.util.ConcurrentModificationException when specific 
conditions met.

What We found as conditions are as follow. (All 4 conditions should be met)
 # There are at least two non-exclusive partitions except default partition 
(let me call the partitions as X and Y partition)
 # app1 in the queue belonging to default partition (let me call the queue as 
'dev' queue) borrowed resources from both X, Y partitions 
 # app2, app3 submitted to queues belonging to each X, Y partition is 'PENDING' 
because resources are consumed by app1
 # Preemption monitor can clear borrowed resources from X or Y when the 
container of app1 is preempted.  

Main problem is that FifoCandiatesSelector.selectCandidates tried to remove 
HashMap key(partition name) while iterating HashMap.

Logically, it is correct because we didn't traverse the same partition again on 
this 'selectCandidates'. However HashMap structure does not allow modification 
while iterating.

I made test case to reproduce the error 
case(testResourceTypesInterQueuePreemptionWithThreePartitions).

We found and patched our cluster on 3.1.2 but it seems trunk still has the same 
problem.

I attached patch based on the trunk.

 

Thanks!

 
{quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor 
(SchedulingMonitor.java:run(116)) - Exception raised while executing preemption 
checker, skip this run..., exception=
 java.util.ConcurrentModificationException
 at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
 at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
 at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)}}

 
{quote}

[jira] [Created] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists

2021-08-23 Thread Jeongin Ju (Jira)
Jeongin Ju created YARN-10892:
-

 Summary: YARN Preemption Monitor got 
java.util.ConcurrentModificationException when three or more partitions exists
 Key: YARN-10892
 URL: https://issues.apache.org/jira/browse/YARN-10892
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 3.1.2
Reporter: Jeongin Ju





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10838) Implement an optimised version of Configuration getPropsWithPrefix

2021-08-23 Thread Andras Gyori (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403014#comment-17403014
 ] 

Andras Gyori commented on YARN-10838:
-

Thank you [~bteke] for taking care of this issue! As for the last point, I 
wanted to separate this introduction from the refactor part. Check YARN-10795 
for the related jiras.

> Implement an optimised version of Configuration getPropsWithPrefix
> --
>
> Key: YARN-10838
> URL: https://issues.apache.org/jira/browse/YARN-10838
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10838.001.patch, YARN-10838.002.patch, 
> YARN-10838.003.patch, YARN-10838.004.patch, YARN-10838.005.patch
>
>
> AutoCreatedQueueTemplate also has multiple calls to 
> Configuration#getPropsWithPrefix. These must be eliminated in order to improve 
> the performance on reinitialisation. 


