[
https://issues.apache.org/jira/browse/YARN-9730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938009#comment-16938009
]
Jim Brennan commented on YARN-9730:
-----------------------------------
[~jhung] I believe pulling this back to branch-2 has caused failures in
TestAppManager (and others). Example stack trace:
{noformat}
[ERROR] Tests run: 21, Failures: 0, Errors: 7, Skipped: 0, Time elapsed: 7.216 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestAppManager
[ERROR] testRMAppRetireZeroSetting(org.apache.hadoop.yarn.server.resourcemanager.TestAppManager)  Time elapsed: 0.054 s <<< ERROR!
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.RMContextImpl.getExclusiveEnforcedPartitions(RMContextImpl.java:590)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.<init>(RMAppManager.java:115)
	at org.apache.hadoop.yarn.server.resourcemanager.TestAppManager$TestRMAppManager.<init>(TestAppManager.java:192)
	at org.apache.hadoop.yarn.server.resourcemanager.TestAppManager.testRMAppRetireZeroSetting(TestAppManager.java:450)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:379)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:340)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:413)
{noformat}
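The NPE at RMContextImpl.getExclusiveEnforcedPartitions suggests the getter dereferences state (likely the service Configuration) that the test's bare RMContext never initializes. A minimal, self-contained sketch of the defensive pattern — class and field names here are hypothetical stand-ins, not the actual Hadoop sources:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the RMContextImpl getter: guard against
// configuration that test doubles never set, instead of dereferencing it.
public class PartitionContext {
    // May legitimately be null in unit tests that build a bare context.
    private String exclusiveEnforcedPartitionsConf;

    public void setConf(String csv) {
        this.exclusiveEnforcedPartitionsConf = csv;
    }

    public Set<String> getExclusiveEnforcedPartitions() {
        // Null guard: return an empty set rather than throwing NPE.
        if (exclusiveEnforcedPartitionsConf == null
                || exclusiveEnforcedPartitionsConf.isEmpty()) {
            return Collections.emptySet();
        }
        Set<String> partitions = new HashSet<>();
        for (String p : exclusiveEnforcedPartitionsConf.split(",")) {
            partitions.add(p.trim());
        }
        return partitions;
    }

    public static void main(String[] args) {
        PartitionContext ctx = new PartitionContext();
        // An unconfigured context no longer throws, mirroring the test setup.
        System.out.println(ctx.getExclusiveEnforcedPartitions().size());
        ctx.setConf("P,Q");
        System.out.println(ctx.getExclusiveEnforcedPartitions().contains("P"));
    }
}
```

The equivalent fix on the Hadoop side could instead stub the mocked RMContext in TestAppManager; either way the constructor path through RMAppManager must tolerate an unset value.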
> Support forcing configured partitions to be exclusive based on app node label
> -----------------------------------------------------------------------------
>
> Key: YARN-9730
> URL: https://issues.apache.org/jira/browse/YARN-9730
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Jonathan Hung
> Assignee: Jonathan Hung
> Priority: Major
> Labels: release-blocker
> Fix For: 2.10.0, 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9730-branch-2.001.patch, YARN-9730.001.patch,
> YARN-9730.002.patch, YARN-9730.003.patch
>
>
> Use case: queue X has all of its workload in non-default (exclusive)
> partition P (by setting the app submission context's node label to P). A node
> in partition Q != P heartbeats to the RM. The capacity scheduler loops through
> every application in X, and every scheduler key in each application, and
> fails to allocate each time since the app's requested label and the node's
> label don't match. This causes huge performance degradation when the number
> of apps in X is large.
> To fix the issue, allow RM to configure partitions as "forced-exclusive". If
> partition P is "forced-exclusive", then:
> * 1a. If app sets its submission context's node label to P, all its resource
> requests will be overridden to P
> * 1b. If app sets its submission context's node label to Q, any of its
> resource requests whose labels are P will be overridden to Q
> * 2. In the scheduler, we add apps with node label expression P to a
> separate data structure. When a node in partition P heartbeats to scheduler,
> we only try to schedule apps in this data structure. When a node in partition
> Q heartbeats to scheduler, we schedule the rest of the apps as normal.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)