[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
[ https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated YARN-10892: -- Labels: pull-request-available (was: )

> YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
> --
>
> Key: YARN-10892
> URL: https://issues.apache.org/jira/browse/YARN-10892
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.2
> Reporter: Jeongin Ju
> Priority: Major
> Labels: pull-request-available
> Attachments: YARN-10892.001.patch, YARN-10892.002.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> On our cluster with a large number of NMs, the preemption monitor thread consistently got java.util.ConcurrentModificationException when specific conditions were met (and preemption stopped working, of course).
> The conditions we found are as follows (all four must be met):
> # There are at least two non-exclusive partitions besides the default partition (call them partitions X and Y).
> # app1, in a queue belonging to the default partition (call it the 'dev' queue), borrowed resources from both the X and Y partitions.
> # app2 and app3, submitted to queues belonging to the X and Y partitions respectively, are 'PENDING' because their resources are consumed by app1.
> # The preemption monitor can clear the borrowed resources from X or Y once a container of app1 is preempted.
> The main problem is that FifoCandidatesSelector.selectCandidates tried to remove a HashMap key (the partition name) while iterating over the HashMap. Logically this is correct, because the same partition is not traversed again within one 'selectCandidates' call; however, HashMap does not allow structural modification while iterating.
> I made a test case to reproduce the error (testResourceTypesInterQueuePreemptionWithThreePartitions).
> We found and patched this on our 3.1.2 cluster, but trunk still seems to have the same problem. I attached a patch based on trunk.
> Thanks!
>
> {quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor (SchedulingMonitor.java:run(116)) - Exception raised while executing preemption checker, skip this run..., exception=
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
> at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
> at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)}}
> {quote}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
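The root cause described in the report, calling Map.remove() on a HashMap while an enhanced-for loop iterates over its keySet(), can be reproduced in isolation with a few lines of Java. This is a minimal sketch: the class and partition names are illustrative, and the iterator-based removal at the end is the standard fail-fast-safe pattern, not necessarily the exact change made in the attached patch.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CmeDemo {

    // Reproduces the bug pattern: removing a key through the map itself
    // while iterating over keySet() trips HashMap's fail-fast iterator,
    // which throws ConcurrentModificationException on the next next() call.
    static boolean naiveRemovalThrows() {
        Map<String, Integer> partitions = new HashMap<>();
        partitions.put("default", 0);
        partitions.put("X", 1);
        partitions.put("Y", 2);
        try {
            for (String partition : partitions.keySet()) {
                partitions.remove(partition); // structural modification mid-iteration
            }
            return false; // not reached when the map has more than one entry
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safe pattern: remove through the iterator, which keeps the iterator's
    // expected modification count in sync with the map.
    static Map<String, Integer> removeViaIterator() {
        Map<String, Integer> partitions = new HashMap<>();
        partitions.put("default", 0);
        partitions.put("X", 1);
        partitions.put("Y", 2);
        Iterator<String> it = partitions.keySet().iterator();
        while (it.hasNext()) {
            if (!"default".equals(it.next())) {
                it.remove(); // safe removal during iteration
            }
        }
        return partitions; // only "default" remains
    }
}
```

Note that the naive loop fails even though each key is visited only once, which matches the report: the logic is sound, but HashMap's fail-fast iterator rejects any structural modification that bypasses the iterator.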
[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
[ https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeongin Ju updated YARN-10892: -- External issue URL: https://github.com/apache/hadoop/pull/3324 Labels: (was: pull-request-available) Remaining Estimate: (was: 0h)
[jira] [Commented] (YARN-10522) Document for Flexible Auto Queue Creation in Capacity Scheduler.
[ https://issues.apache.org/jira/browse/YARN-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403215#comment-17403215 ] Andras Gyori commented on YARN-10522:

Thank you [~bteke] for pursuing this issue! I have the following additions to the latest patch:
* Nit: This line is somewhat convoluted. I think it would be easier to understand as a note in a new sentence, like: It is important to emphasize that dynamic queues created in a flexible fashion only work with weights as their capacity.
{noformat} but the created queues will be and can only be configured with weights as capacity.{noformat}
* Nit: I think "configured" is redundant here; the feature itself is a configuration value.
{noformat} The Flexible Dynamic Queue Auto-Creation and Management feature allows a ParentQueue to be configured {noformat}
* Nit: "pre-configured queues under the parent must be configured in the same way" sounds better in my opinion.
{noformat} The auto-created queues will have weights as capacity so the pre-configured queues under the parent must be configured to use the same {noformat}
* Nit: "A parent queue supports configuration of dynamically created leaf and parent queues."
{noformat} The parent queue which has the flexible auto queue creation enabled supports the configuration dynamically created leaf and parent queues through template parameters {noformat}
* Missing "by":
{noformat} Specifies a queue property inherited auto-created leaf queues. Specifies a queue property inherited auto-created parent queues. {noformat}
* Maybe an example for templates would be good, because it is not a straightforward feature at first glance.

> Document for Flexible Auto Queue Creation in Capacity Scheduler.
> Key: YARN-10522
> URL: https://issues.apache.org/jira/browse/YARN-10522
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Qi Zhu
> Assignee: Benjamin Teke
> Priority: Major
> Attachments: YARN-10522.001.patch
>
> We should update the documentation to cover this feature.
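Since the reviewer asks for a template example, a sketch of what the documented configuration could look like follows. This is an assumption-heavy illustration, not text from the patch: the queue path (root.dev) is made up, and the property keys (auto-queue-creation-v2.*, the w-suffixed weight capacities, and the template/leaf-template prefixes) reflect my understanding of recent Capacity Scheduler releases and should be verified against the official documentation for the target Hadoop version.

```
# Hypothetical capacity-scheduler properties (verify names against your Hadoop version).
# Enable flexible (v2) auto queue creation under root.dev, using weight-mode capacity:
yarn.scheduler.capacity.root.dev.auto-queue-creation-v2.enabled=true
yarn.scheduler.capacity.root.dev.capacity=2w

# Template property inherited by every dynamically created child queue of root.dev:
yarn.scheduler.capacity.root.dev.auto-queue-creation-v2.template.maximum-applications=100

# Template property inherited only by dynamically created leaf queues:
yarn.scheduler.capacity.root.dev.auto-queue-creation-v2.leaf-template.default-application-priority=10
```

An example along these lines in the documentation would make the leaf-template versus parent-template distinction concrete at first read.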
[jira] [Commented] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
[ https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403167#comment-17403167 ] Hadoop QA commented on YARN-10892:

(/) +1 overall

|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 13m 7s | | Docker mode activated. |
|| Prechecks ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|| trunk Compile Tests ||
| +1 | mvninstall | 19m 46s | | trunk passed |
| +1 | compile | 1m 3s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | checkstyle | 0m 46s | | trunk passed |
| +1 | mvnsite | 1m 1s | | trunk passed |
| +1 | shadedclient | 14m 54s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 46s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javadoc | 0m 43s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| 0 | spotbugs | 18m 10s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 47s | | trunk passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 51s | | the patch passed |
| +1 | compile | 0m 53s | | the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javac | 0m 53s | | the patch passed |
| +1 | compile | 0m 48s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | javac | 0m 48s | | the patch passed |
| -0 | checkstyle | 0m 39s | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1195/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 3 new + 10 unchanged - 0 fixed = 13 total (was 10) |
| +1 | mvnsite | 0m 48s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 45s | | patch has no errors when building and testing our client artifacts. |
| +1
[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
[ https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeongin Ju updated YARN-10892: -- Attachment: YARN-10892.002.patch
[jira] [Commented] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
[ https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403077#comment-17403077 ] Hadoop QA commented on YARN-10892:

(x) -1 overall

|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 26m 6s | | Docker mode activated. |
|| Prechecks ||
| +1 | dupname | 0m 1s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|| trunk Compile Tests ||
| +1 | mvninstall | 33m 4s | | trunk passed |
| +1 | compile | 1m 9s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | compile | 0m 59s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| +1 | checkstyle | 0m 47s | | trunk passed |
| +1 | mvnsite | 1m 3s | | trunk passed |
| +1 | shadedclient | 16m 37s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 |
| +1 | javadoc | 0m 43s | | trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
| 0 | spotbugs | 20m 5s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 57s | | trunk passed |
|| Patch Compile Tests ||
| -1 | mvninstall | 0m 48s | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1194/artifact/out/patch-mvninstall-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-server-resourcemanager in the patch failed. |
| -1 | compile | 0m 57s | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1194/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt | hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04. |
| -1 | javac | 0m 57s | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1194/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkUbuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04.txt | hadoop-yarn-server-resourcemanager in the patch failed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04. |
| -1 | compile | 0m 49s | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1194/artifact/out/patch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-jdkPrivateBuild-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10.txt | hadoop-yarn-server-resourcemanager in the patch failed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10. |
| -1 | javac | 0m 49s |
[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
[ https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeongin Ju updated YARN-10892: -- Description:
[jira] [Updated] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
[ https://issues.apache.org/jira/browse/YARN-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeongin Ju updated YARN-10892: -- Description: On our cluster with a large number of NMs, preemption monitor thread consistently got java.util.ConcurrentModificationException when specific conditions met. What We found as conditions are as follow. (All 4 conditions should be met) # There are at least two non-exclusive partitions except default partition (let me call the partitions as X and Y partition) # app1 in the queue belonging to default partition (let me call the queue as 'dev' queue) borrowed resources from both X, Y partitions # app2, app3 submitted to queues belonging to each X, Y partition is 'PENDING' because resources are consumed by app1 # Preemption monitor can clear borrowed resources from X or Y when the container of app1 is preempted. Main problem is that FifoCandiatesSelector.selectCandidates tried to remove HashMap key(partition name) while iterating HashMap. Logically, it is correct because we didn't traverse the same partition again on this 'selectCandidates'. However HashMap structure does not allow modification while iterating. I made test case to reproduce the error case(testResourceTypesInterQueuePreemptionWithThreePartitions). We found and patched our cluster on 3.1.2 but it seems trunk still has the same problem. I attached patch based on the trunk. Thanks! 
{quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor (SchedulingMonitor.java:run(116)) - Exception raised while executing preemption checker, skip this run..., exception= java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)}}
{quote}
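The failure pattern in the stack trace above can be reproduced in isolation. The following is a minimal sketch, not the actual FifoCandidatesSelector code: the class and variable names are illustrative, and while Iterator.remove() shown at the end is the standard fix for this class of bug, it is not necessarily the exact approach taken in the attached patch.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class CmeDemo {
    public static void main(String[] args) {
        // Three "partitions", mirroring the default/X/Y setup in the report.
        Map<String, Integer> borrowed = new HashMap<>();
        borrowed.put("default", 0);
        borrowed.put("X", 5);
        borrowed.put("Y", 3);

        // Buggy pattern: structural modification while iterating the key set.
        // HashMap's iterator is fail-fast, so the next call to next() throws.
        boolean threw = false;
        try {
            for (String partition : borrowed.keySet()) {
                if (!partition.equals("default")) {
                    borrowed.remove(partition);
                }
            }
        } catch (ConcurrentModificationException e) {
            threw = true;
        }
        System.out.println("CME thrown: " + threw);

        // Safe pattern: remove through the iterator itself, which keeps the
        // iterator's expected modification count in sync with the map.
        borrowed.put("X", 5);
        borrowed.put("Y", 3);
        Iterator<Map.Entry<String, Integer>> it = borrowed.entrySet().iterator();
        while (it.hasNext()) {
            if (!it.next().getKey().equals("default")) {
                it.remove();
            }
        }
        System.out.println("remaining: " + borrowed.keySet());
        // → remaining: [default]
    }
}
```

Another common workaround is to iterate over a copy of the key set (e.g. `new ArrayList<>(map.keySet())`), which trades a small allocation for the freedom to mutate the original map.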
[jira] [Created] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
Jeongin Ju created YARN-10892: - Summary: YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists Key: YARN-10892 URL: https://issues.apache.org/jira/browse/YARN-10892 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 3.1.2 Reporter: Jeongin Ju
On our cluster with a large number of NMs, the preemption monitor thread consistently got java.util.ConcurrentModificationException when specific conditions were met. The conditions we found are as follows (all four must be met):
# There are at least two non-exclusive partitions besides the default partition (call them partitions X and Y).
# app1, in a queue belonging to the default partition (call it the 'dev' queue), borrowed resources from both the X and Y partitions.
# app2 and app3, submitted to queues belonging to the X and Y partitions respectively, are 'PENDING' because their resources are consumed by app1.
# The preemption monitor can reclaim borrowed resources from X or Y when a container of app1 is preempted.
The main problem is that FifoCandidatesSelector.selectCandidates tried to remove a HashMap key (the partition name) while iterating over the HashMap. Logically this is correct, because the same partition is not traversed again within one selectCandidates call; however, HashMap does not allow structural modification while iterating. I made a test case that reproduces the error (testResourceTypesInterQueuePreemptionWithThreePartitions). We found and patched this on our 3.1.2 cluster, but trunk still seems to have the same problem. I attached a patch based on trunk. Thanks!
{quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor (SchedulingMonitor.java:run(116)) - Exception raised while executing preemption checker, skip this run..., exception= java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)}}
{quote}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10838) Implement an optimised version of Configuration getPropsWithPrefix
[ https://issues.apache.org/jira/browse/YARN-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403014#comment-17403014 ] Andras Gyori commented on YARN-10838: - Thank you [~bteke] for taking care of this issue! As for the last point, I wanted to keep this introduction separate from the refactoring part. See YARN-10795 for the related jiras.
> Implement an optimised version of Configuration getPropsWithPrefix
> --
>
> Key: YARN-10838
> URL: https://issues.apache.org/jira/browse/YARN-10838
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Andras Gyori
> Assignee: Andras Gyori
> Priority: Major
> Attachments: YARN-10838.001.patch, YARN-10838.002.patch, YARN-10838.003.patch, YARN-10838.004.patch, YARN-10838.005.patch
>
> AutoCreatedQueueTemplate also makes multiple calls to Configuration#getPropsWithPrefix. These calls must be eliminated in order to improve performance on reinitialisation.
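To illustrate why repeated getPropsWithPrefix-style scans are costly: each call walks every configuration entry, so N prefixes mean N full passes. One possible shape of the optimisation is a single pass that buckets entries by prefix. The sketch below uses a plain Map and a hypothetical `collectByPrefixes` helper; it is not the actual Hadoop Configuration internals nor necessarily the approach of the attached patches.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixProps {
    // Hypothetical helper: one pass over all entries, bucketing matches by
    // prefix, instead of one full scan per prefix.
    static Map<String, Map<String, String>> collectByPrefixes(
            Map<String, String> conf, String... prefixes) {
        Map<String, Map<String, String>> result = new HashMap<>();
        for (String p : prefixes) {
            result.put(p, new HashMap<>());
        }
        for (Map.Entry<String, String> e : conf.entrySet()) {
            for (String p : prefixes) {
                if (e.getKey().startsWith(p)) {
                    // Store the key with the prefix stripped, mirroring what
                    // getPropsWithPrefix returns for a single prefix.
                    result.get(p).put(e.getKey().substring(p.length()), e.getValue());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("yarn.scheduler.capacity.root.a.capacity", "50");
        conf.put("yarn.scheduler.capacity.root.b.capacity", "50");
        Map<String, Map<String, String>> byPrefix = collectByPrefixes(conf,
                "yarn.scheduler.capacity.root.a.",
                "yarn.scheduler.capacity.root.b.");
        System.out.println(byPrefix.get("yarn.scheduler.capacity.root.a."));
        // → {capacity=50}
    }
}
```

This turns N scans of the whole property set into one scan with an inner loop over prefixes, which matters when reinitialisation touches many auto-created queue templates.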