[jira] [Commented] (YARN-11353) Change debug logs in FSDownload.java to info logs for better escalations debugging

2022-10-26 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624551#comment-17624551
 ] 

Akira Ajisaka commented on YARN-11353:
--

{quote}Can you please add Sanjay Kumar Sahu as contributor. {quote}

Done.

> Change debug logs in FSDownload.java to info logs for better escalations 
> debugging
> --
>
> Key: YARN-11353
> URL: https://issues.apache.org/jira/browse/YARN-11353
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, yarn
>Affects Versions: 3.4.0
>Reporter: Sanjay Kumar Sahu
>Assignee: Sanjay Kumar Sahu
>Priority: Major
>  Labels: pull-request-available
>
> The AM was stuck at the "Preparing Local resources" step, timed out, and never 
> started the driver. This happened on one customer's cluster and was only 
> resolved when that cluster was deleted and the customer started using another 
> cluster. The existing logs were not enough to investigate the issue. Adding 
> more info logs will help show when the download of the files started and 
> ended, and whether that step was actually reached, for example by adding the 
> containerId here to identify which container is downloading.
>  
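
A rough sketch of the kind of info-level logging the description asks for is
below; the class name, method, and message wording are assumptions for
illustration only, not the actual FSDownload patch.

{code:java}
import java.net.URI;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch: shows the shape of the proposed info logs around a
// resource download, keyed by the container id so the caller is identifiable.
public class ResourceDownloadLogging {
  private static final Logger LOG =
      LoggerFactory.getLogger(ResourceDownloadLogging.class);

  public void download(String containerId, URI resource) {
    long start = System.currentTimeMillis();
    LOG.info("Starting download of {} for container {}", resource, containerId);
    try {
      // ... the actual localization/copy work would happen here ...
    } finally {
      LOG.info("Finished download of {} for container {} in {} ms",
          resource, containerId, System.currentTimeMillis() - start);
    }
  }
}
{code}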






[jira] [Assigned] (YARN-11353) Change debug logs in FSDownload.java to info logs for better escalations debugging

2022-10-26 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-11353:


Assignee: Sanjay Kumar Sahu

> Change debug logs in FSDownload.java to info logs for better escalations 
> debugging
> --
>
> Key: YARN-11353
> URL: https://issues.apache.org/jira/browse/YARN-11353
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, yarn
>Affects Versions: 3.4.0
>Reporter: Sanjay Kumar Sahu
>Assignee: Sanjay Kumar Sahu
>Priority: Major
>  Labels: pull-request-available
>
> The AM was stuck at the "Preparing Local resources" step, timed out, and never 
> started the driver. This happened on one customer's cluster and was only 
> resolved when that cluster was deleted and the customer started using another 
> cluster. The existing logs were not enough to investigate the issue. Adding 
> more info logs will help show when the download of the files started and 
> ended, and whether that step was actually reached, for example by adding the 
> containerId here to identify which container is downloading.
>  






[jira] [Resolved] (YARN-11187) Remove WhiteBox in yarn module.

2022-10-06 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11187.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Committed to trunk. Thank you [~slfan1989] for your contribution.

> Remove WhiteBox in yarn module.
> ---
>
> Key: YARN-11187
> URL: https://issues.apache.org/jira/browse/YARN-11187
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.4.0, 3.3.5
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>







[jira] [Updated] (YARN-6868) Add test scope to certain entries in hadoop-yarn-server-resourcemanager pom.xml

2022-10-01 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-6868:

Fix Version/s: 2.10.0
   3.0.0-beta1

Added the missing fix versions.

> Add test scope to certain entries in hadoop-yarn-server-resourcemanager 
> pom.xml
> ---
>
> Key: YARN-6868
> URL: https://issues.apache.org/jira/browse/YARN-6868
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0-beta1
>Reporter: Ray Chiang
>Assignee: Ray Chiang
>Priority: Major
> Fix For: 3.0.0-beta1, 2.10.0, 2.9.1
>
> Attachments: YARN-6868.001.patch
>
>
> The tag
> {noformat}
> <scope>test</scope>
> {noformat}
> is missing from a few entries in the pom.xml for 
> hadoop-yarn-server-resourcemanager.






[jira] [Assigned] (YARN-10662) [JDK 11] TestTimelineReaderWebServicesHBaseStorage fails

2022-10-01 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-10662:


Assignee: (was: Akira Ajisaka)

> [JDK 11] TestTimelineReaderWebServicesHBaseStorage fails
> 
>
> Key: YARN-10662
> URL: https://issues.apache.org/jira/browse/YARN-10662
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java11-linux-x86_64/131/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-timelineservice-hbase-tests.txt]
> {noformat}
> [INFO] Running 
> org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.515 
> s <<< FAILURE! - in 
> org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage
> [ERROR] 
> org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage
>   Time elapsed: 1.514 s  <<< ERROR!
> java.lang.ExceptionInInitializerError
>   at 
> org.apache.hadoop.yarn.server.timelineservice.reader.TestTimelineReaderWebServicesHBaseStorage.setupBeforeClass(TestTimelineReaderWebServicesHBaseStorage.java:84)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Caused by: java.lang.RuntimeException: java.io.IOException: Can not attach to 
> current VM
>   at 
> mockit.internal.startup.AgentLoader.attachToRunningVM(AgentLoader.java:150)
>   at mockit.internal.startup.AgentLoader.loadAgent(AgentLoader.java:60)
>   at 
> mockit.internal.startup.Startup.verifyInitialization(Startup.java:169)
>   at mockit.MockUp.<clinit>(MockUp.java:94)
>   ... 19 more
> Caused by: java.io.IOException: Can not attach to current VM
>   at 
> jdk.attach/sun.tools.attach.HotSpotVirtualMachine.<init>(HotSpotVirtualMachine.java:75)
>   at 
> jdk.attach/sun.tools.attach.VirtualMachineImpl.<init>(VirtualMachineImpl.java:56)
>   at 
> jdk.attach/sun.tools.attach.AttachProviderImpl.attachVirtualMachine(AttachProviderImpl.java:58)
>   at 
> jdk.attach/com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:207)
>   at 
> mockit.internal.startup.AgentLoader.attachToRunningVM(AgentLoader.java:144)
>   ... 22 more
> [INFO] 
> [INFO] Results:
> [INFO] 
> [ERROR] Errors: 
> [ERROR]   TestTimelineReaderWebServicesHBaseStorage.setupBeforeClass:84 
> ExceptionInInitializer
> [INFO] 
> [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0 {noformat}






[jira] [Updated] (YARN-11301) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory after YARN-11269

2022-09-13 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11301:
-
Component/s: test

> Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory 
> after YARN-11269
> ---
>
> Key: YARN-11301
> URL: https://issues.apache.org/jira/browse/YARN-11301
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, timelineserver
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> After the Apache Hadoop YARN Timeline Plugin Storage module was upgraded from 
> junit 4 to 5, a compilation error occurred.
> {code:java}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:
> test (default-test) on project hadoop-yarn-server-timeline-pluginstorage: 
> Execution default-test of goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: 
> java.lang.NoClassDefFoundError: 
> org/junit/platform/launcher/core/LauncherFactory: 
> org.junit.platform.launcher.core.LauncherFactory -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException 
> {code}
>  






[jira] [Resolved] (YARN-11301) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory after YARN-11269

2022-09-13 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11301.
--
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to trunk. Thank you [~slfan1989] for your fix.

> Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory 
> after YARN-11269
> ---
>
> Key: YARN-11301
> URL: https://issues.apache.org/jira/browse/YARN-11301
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> After the Apache Hadoop YARN Timeline Plugin Storage module was upgraded from 
> junit 4 to 5, a compilation error occurred.
> {code:java}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:
> test (default-test) on project hadoop-yarn-server-timeline-pluginstorage: 
> Execution default-test of goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: 
> java.lang.NoClassDefFoundError: 
> org/junit/platform/launcher/core/LauncherFactory: 
> org.junit.platform.launcher.core.LauncherFactory -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException 
> {code}
>  






[jira] [Resolved] (YARN-11241) Add uncleaning option for local app log file with log-aggregation enabled

2022-09-12 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11241.
--
Fix Version/s: 3.4.0
   3.3.9
   Resolution: Fixed

Committed to trunk and branch-3.3. Thank you [~groot] for your contribution!

> Add uncleaning option for local app log file with log-aggregation enabled
> -
>
> Key: YARN-11241
> URL: https://issues.apache.org/jira/browse/YARN-11241
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: log-aggregation
>Reporter: groot
>Assignee: groot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Add uncleaning option for local app log file with log-aggregation enabled
> This will be helpful for debugging purpose.






[jira] [Commented] (YARN-6939) Upgrade JUnit from 4 to 5 in hadoop-yarn

2022-08-30 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597840#comment-17597840
 ] 

Akira Ajisaka commented on YARN-6939:
-

After discussing with groot offline, I'm now okay with adding the 
junit-platform-launcher dependency in each PR. Testing everything locally 
would take a long time.

> Upgrade JUnit from 4 to 5 in hadoop-yarn
> 
>
> Key: YARN-6939
> URL: https://issues.apache.org/jira/browse/YARN-6939
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Akira Ajisaka
>Assignee: groot
>Priority: Major
>
> Feel free to create sub-tasks for each module.






[jira] [Commented] (YARN-6939) Upgrade JUnit from 4 to 5 in hadoop-yarn

2022-08-30 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597790#comment-17597790
 ] 

Akira Ajisaka commented on YARN-6939:
-

Thank you [~groot].

In addition, a note for contributors: I want to make sure we don't introduce 
any regressions like YARN-11287. Could you run {{mvn test}} under the 
hadoop-yarn-project/hadoop-yarn directory and paste the result in the PR?

> Upgrade JUnit from 4 to 5 in hadoop-yarn
> 
>
> Key: YARN-6939
> URL: https://issues.apache.org/jira/browse/YARN-6939
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Akira Ajisaka
>Assignee: groot
>Priority: Major
>
> Feel free to create sub-tasks for each module.






[jira] [Updated] (YARN-11287) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory after YARN-10793

2022-08-30 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11287:
-
Summary: Fix NoClassDefFoundError: 
org/junit/platform/launcher/core/LauncherFactory after YARN-10793  (was: Fix 
NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory After 
YARN-10793)

> Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory 
> after YARN-10793
> ---
>
> Key: YARN-11287
> URL: https://issues.apache.org/jira/browse/YARN-11287
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: build, test
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> After executing the yarn-project global unit test, I found the following 
> error:
> {code:java}
> ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) 
> on project hadoop-yarn-server-applicationhistoryservice: Execution 
> default-test of goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: 
> java.lang.NoClassDefFoundError: 
> org/junit/platform/launcher/core/LauncherFactory: 
> org.junit.platform.launcher.core.LauncherFactory -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
> [ERROR] 
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn  -rf :hadoop-yarn-server-applicationhistoryservice {code}






[jira] [Updated] (YARN-11287) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory After YARN-10793

2022-08-30 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11287:
-
Component/s: build
 test

> Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory 
> After YARN-10793
> ---
>
> Key: YARN-11287
> URL: https://issues.apache.org/jira/browse/YARN-11287
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: build, test
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> After executing the yarn-project global unit test, I found the following 
> error:
> {code:java}
> ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) 
> on project hadoop-yarn-server-applicationhistoryservice: Execution 
> default-test of goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: 
> java.lang.NoClassDefFoundError: 
> org/junit/platform/launcher/core/LauncherFactory: 
> org.junit.platform.launcher.core.LauncherFactory -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
> [ERROR] 
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn  -rf :hadoop-yarn-server-applicationhistoryservice {code}






[jira] [Resolved] (YARN-11287) Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory After YARN-10793

2022-08-30 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11287.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Committed to trunk. Thank you [~slfan1989] for reporting and fixing this issue.

> Fix NoClassDefFoundError: org/junit/platform/launcher/core/LauncherFactory 
> After YARN-10793
> ---
>
> Key: YARN-11287
> URL: https://issues.apache.org/jira/browse/YARN-11287
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: fanshilun
>Assignee: fanshilun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> After executing the yarn-project global unit test, I found the following 
> error:
> {code:java}
> ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test (default-test) 
> on project hadoop-yarn-server-applicationhistoryservice: Execution 
> default-test of goal 
> org.apache.maven.plugins:maven-surefire-plugin:3.0.0-M1:test failed: 
> java.lang.NoClassDefFoundError: 
> org/junit/platform/launcher/core/LauncherFactory: 
> org.junit.platform.launcher.core.LauncherFactory -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
> [ERROR] 
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn  -rf :hadoop-yarn-server-applicationhistoryservice {code}






[jira] [Commented] (YARN-6939) Upgrade JUnit from 4 to 5 in hadoop-yarn

2022-08-30 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-6939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597689#comment-17597689
 ] 

Akira Ajisaka commented on YARN-6939:
-

If you have any PRs for upgrading the JUnit version, please ping or mention me 
in the PR. I can review them when I have time.

> Upgrade JUnit from 4 to 5 in hadoop-yarn
> 
>
> Key: YARN-6939
> URL: https://issues.apache.org/jira/browse/YARN-6939
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Akira Ajisaka
>Assignee: groot
>Priority: Major
>
> Feel free to create sub-tasks for each module.






[jira] [Resolved] (YARN-11254) hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager pom.xml

2022-08-20 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11254.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Committed to trunk. Thank you [~clara0]!

> hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager pom.xml
> --
>
> Key: YARN-11254
> URL: https://issues.apache.org/jira/browse/YARN-11254
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Clara Fang
>Assignee: Clara Fang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The dependency hadoop-minikdc is defined twice in 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/pom.xml
> {code:xml}
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-minikdc</artifactId>
>   <scope>test</scope>
> </dependency>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-minikdc</artifactId>
>   <scope>test</scope>
> </dependency>
> {code}






[jira] [Updated] (YARN-11254) hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager

2022-08-20 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11254:
-
Summary: hadoop-minikdc dependency duplicated in 
hadoop-yarn-server-nodemanager  (was: hadoop-minikdc dependency duplicated in 
hadoop-yarn-server-nodemanager pom.xml)

> hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager
> --
>
> Key: YARN-11254
> URL: https://issues.apache.org/jira/browse/YARN-11254
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Clara Fang
>Assignee: Clara Fang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The dependency hadoop-minikdc is defined twice in 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/pom.xml
> {code:xml}
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-minikdc</artifactId>
>   <scope>test</scope>
> </dependency>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-minikdc</artifactId>
>   <scope>test</scope>
> </dependency>
> {code}






[jira] [Assigned] (YARN-11254) hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager pom.xml

2022-08-20 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-11254:


Assignee: Clara Fang

> hadoop-minikdc dependency duplicated in hadoop-yarn-server-nodemanager pom.xml
> --
>
> Key: YARN-11254
> URL: https://issues.apache.org/jira/browse/YARN-11254
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Clara Fang
>Assignee: Clara Fang
>Priority: Minor
>  Labels: pull-request-available
>
> The dependency hadoop-minikdc is defined twice in 
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/pom.xml
> {code:xml}
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-minikdc</artifactId>
>   <scope>test</scope>
> </dependency>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-minikdc</artifactId>
>   <scope>test</scope>
> </dependency>
> {code}






[jira] [Resolved] (YARN-11257) Add junit5 dependency to hadoop-yarn-server-timeline-pluginstorage to fix few unit test failure

2022-08-20 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11257.
--
Resolution: Duplicate

This issue has been fixed by YARN-11269. Closing.

> Add junit5 dependency to hadoop-yarn-server-timeline-pluginstorage to fix few 
> unit test failure
> ---
>
> Key: YARN-11257
> URL: https://issues.apache.org/jira/browse/YARN-11257
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: groot
>Assignee: groot
>Priority: Major
>  Labels: pull-request-available
>
> We need to add Junit 5 dependency in
> {code:java}
> hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timeline-pluginstorage/pom.xml{code}
> as TestLevelDBCacheTimelineStore extends TimelineStoreTestUtils and we have 
> already upgraded from JUnit 4 to 5 in TimelineStoreTestUtils.
>  
> Failing UTs: 
> [https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/957/testReport/junit/org.apache.hadoop.yarn.server.timeline/TestLevelDBCacheTimelineStore/testGetDomains/]
>  






[jira] [Commented] (YARN-10793) Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice

2022-08-20 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582217#comment-17582217
 ] 

Akira Ajisaka commented on YARN-10793:
--

Thank you [~ayushtkn] and [~groot]. I'm sorry for that.

Hi [~groot], could you file a jira and add the junit5 dependency to 
hadoop-yarn-server-timeline-pluginstorage module to quickly fix this issue? I 
can review and commit it. Upgrading from junit 4 to 5 in the module can be done 
separately.

> Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice
> -
>
> Key: YARN-10793
> URL: https://issues.apache.org/jira/browse/YARN-10793
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: ANANDA G B
>Assignee: groot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice






[jira] [Resolved] (YARN-11248) Add unit test for FINISHED_CONTAINERS_PULLED_BY_AM event on DECOMMISSIONING

2022-08-16 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11248.
--
Fix Version/s: 3.4.0
   3.3.9
   Resolution: Fixed

Committed to trunk and branch-3.3. Thank you [~groot] for your contribution.

> Add unit test for FINISHED_CONTAINERS_PULLED_BY_AM event on DECOMMISSIONING
> ---
>
> Key: YARN-11248
> URL: https://issues.apache.org/jira/browse/YARN-11248
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: test
>Affects Versions: 3.3.3
>Reporter: groot
>Assignee: groot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> Add unit test for FINISHED_CONTAINERS_PULLED_BY_AM event on DECOMMISSIONING






[jira] [Resolved] (YARN-10793) Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice

2022-08-07 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-10793.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Committed to trunk. Thank you [~groot] for your contribution.

> Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice
> -
>
> Key: YARN-10793
> URL: https://issues.apache.org/jira/browse/YARN-10793
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: ANANDA G B
>Assignee: groot
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Upgrade Junit from 4 to 5 in hadoop-yarn-server-applicationhistoryservice






[jira] [Assigned] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception

2022-07-25 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-11210:


Assignee: Kevin Wikant

> Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration 
> exception
> --
>
> Key: YARN-11210
> URL: https://issues.apache.org/jira/browse/YARN-11210
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Description of Problem
> Applications which call the YARN RMAdminCLI (i.e. the YARN ResourceManager 
> client) synchronously can be blocked for up to 15 minutes with the default 
> configuration of "yarn.resourcemanager.connect.max-wait.ms"; this is not an 
> issue in and of itself, but a non-retryable IllegalArgumentException thrown 
> within the YARN ResourceManager client is getting swallowed and treated as a 
> retryable "connection exception", meaning that it gets retried for 15 minutes.
> The purpose of this JIRA (and PR) is to modify the YARN client so that it 
> does not retry on this non-retryable exception.
> h2. Background Information
> YARN ResourceManager client treats connection exceptions as retryable & with 
> the default value of "yarn.resourcemanager.connect.max-wait.ms" will attempt 
> to connect to the ResourceManager for up to 15 minutes when facing 
> "connection exceptions". This arguably makes sense because connection 
> exceptions are in some cases transient & can be recovered from without any 
> action needed from the client. See example below where YARN ResourceManager 
> client was able to recover from connection issues that resulted from the 
> ResourceManager process being down.
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over 
> null after 1 failover attempts. Trying to failover after sleeping for 41061ms.
> 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:28 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> invoking ResourceManagerAdministrationProtoco
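
To make the idea concrete, below is a rough sketch of how a non-retryable
exception class can be mapped to a fail-fast policy with Hadoop's
RetryPolicies; this only illustrates the concept and is not necessarily how
the actual patch implements the change.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Illustrative only: retry transient connection problems, but fail immediately
// on IllegalArgumentException (e.g. a bad kerberos configuration).
public class NonRetryableExample {
  public static RetryPolicy buildPolicy() {
    RetryPolicy connectionRetry = RetryPolicies
        .retryUpToMaximumCountWithFixedSleep(10, 1000, TimeUnit.MILLISECONDS);

    Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicy = new HashMap<>();
    // Non-retryable configuration errors should not be wrapped in retries.
    exceptionToPolicy.put(IllegalArgumentException.class,
        RetryPolicies.TRY_ONCE_THEN_FAIL);

    return RetryPolicies.exceptionDependentRetry(connectionRetry, exceptionToPolicy);
  }
}
{code}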

[jira] [Commented] (YARN-11210) Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration exception

2022-07-25 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571183#comment-17571183
 ] 

Akira Ajisaka commented on YARN-11210:
--

[~prabhujoseph] Yes. Done.

I think all Hadoop committers have the privilege to make someone a 
contributor. You can go to 
https://issues.apache.org/jira/plugins/servlet/project-config/YARN/roles and 
add Kevin as a contributor.

> Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration 
> exception
> --
>
> Key: YARN-11210
> URL: https://issues.apache.org/jira/browse/YARN-11210
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Description of Problem
> Applications which call the YARN RMAdminCLI (i.e. the YARN ResourceManager 
> client) synchronously can be blocked for up to 15 minutes with the default 
> configuration of "yarn.resourcemanager.connect.max-wait.ms"; this is not an 
> issue in and of itself, but a non-retryable IllegalArgumentException thrown 
> within the YARN ResourceManager client is getting swallowed and treated as a 
> retryable "connection exception", meaning that it gets retried for 15 minutes.
> The purpose of this JIRA (and PR) is to modify the YARN client so that it 
> does not retry on this non-retryable exception.
> h2. Background Information
> YARN ResourceManager client treats connection exceptions as retryable & with 
> the default value of "yarn.resourcemanager.connect.max-wait.ms" will attempt 
> to connect to the ResourceManager for up to 15 minutes when facing 
> "connection exceptions". This arguably makes sense because connection 
> exceptions are in some cases transient & can be recovered from without any 
> action needed from the client. See example below where YARN ResourceManager 
> client was able to recover from connection issues that resulted from the 
> ResourceManager process being down.
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at 
> /0.0.0.0:8033
> 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:40:37 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Your endpoint configuration is wrong; For more 
> details see:  [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while 
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over 
> null after 1 failover attempts. Trying to failover after sleeping for 41061ms.
> 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> ...
> 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: 
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> 22/06/28 14:41:28 INFO retry.Retr

[jira] [Assigned] (YARN-6946) Upgrade JUnit from 4 to 5 in hadoop-yarn-common

2022-07-19 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-6946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-6946:
---

Assignee: (was: Akira Ajisaka)

> Upgrade JUnit from 4 to 5 in hadoop-yarn-common
> ---
>
> Key: YARN-6946
> URL: https://issues.apache.org/jira/browse/YARN-6946
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: test
>Reporter: Akira Ajisaka
>Priority: Major
> Attachments: YARN-6946.wip001.patch
>
>







[jira] [Resolved] (YARN-11119) Backport YARN-10538 to branch-2.10

2022-06-26 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11119.
--
Fix Version/s: 2.10.3
   Resolution: Fixed

Committed to branch-2.10. Thank you [~groot] for your contribution.

> Backport YARN-10538 to branch-2.10
> --
>
> Key: YARN-11119
> URL: https://issues.apache.org/jira/browse/YARN-11119
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.10.3
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Backport YARN-10538 to branch-2.10
>  
> Add recommissioning nodes to the list of updated nodes returned to the AM






[jira] [Resolved] (YARN-10303) One yarn rest api example of yarn document is error

2022-06-23 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-10303.
--
Resolution: Fixed

> One yarn rest api example of yarn document is error
> ---
>
> Key: YARN-10303
> URL: https://issues.apache.org/jira/browse/YARN-10303
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 3.1.1, 3.2.1
>Reporter: bright.zhou
>Assignee: Ashutosh Gupta
>Priority: Minor
>  Labels: documentation, newbie, pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2020-06-02-10-27-35-020.png
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> deSelects value should be resourceRequests
> !image-2020-06-02-10-27-35-020.png!






[jira] [Reopened] (YARN-10303) One yarn rest api example of yarn document is error

2022-06-23 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reopened YARN-10303:
--

> One yarn rest api example of yarn document is error
> ---
>
> Key: YARN-10303
> URL: https://issues.apache.org/jira/browse/YARN-10303
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 3.1.1, 3.2.1
>Reporter: bright.zhou
>Assignee: Ashutosh Gupta
>Priority: Minor
>  Labels: documentation, newbie, pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2020-06-02-10-27-35-020.png
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> deSelects value should be resourceRequests
> !image-2020-06-02-10-27-35-020.png!






[jira] [Resolved] (YARN-11128) Fix comments in TestProportionalCapacityPreemptionPolicy*

2022-05-26 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11128.
--
Fix Version/s: 3.4.0
   3.3.4
   Resolution: Fixed

Committed to trunk and branch-3.3. Thanks!

> Fix comments in TestProportionalCapacityPreemptionPolicy*
> -
>
> Key: YARN-11128
> URL: https://issues.apache.org/jira/browse/YARN-11128
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, documentation
>Reporter: groot
>Assignee: groot
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> At various places, comment for appsConfig is 
> {{// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)}}
> but should be 
> {{// 
> queueName\t(priority,resource,host,expression,#repeat,reserved,pending,user)}}
>  






[jira] [Updated] (YARN-10080) Support show app id on localizer thread pool

2022-05-18 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10080:
-
Fix Version/s: 3.2.4

Backported to branch-3.2.

> Support show app id on localizer thread pool
> 
>
> Key: YARN-10080
> URL: https://issues.apache.org/jira/browse/YARN-10080
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhoukang
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
> Attachments: YARN-10080-001.patch, YARN-10080.002.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, when we are troubleshooting a container localizer issue and want 
> to analyze a jstack with thread details, we cannot figure out which thread is 
> processing a given container. So I want to add the app id to the thread name, 
> as sketched below.
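
A rough sketch of the idea follows (illustrative only, not the actual
NodeManager code): a worker thread temporarily includes the application id in
its name so a jstack dump shows what it is working on.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch: tag pool threads with the app id while they localize.
public class LocalizerThreadNaming {
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  public void localize(String appId, Runnable work) {
    pool.execute(() -> {
      String originalName = Thread.currentThread().getName();
      Thread.currentThread().setName(originalName + " for app " + appId);
      try {
        work.run();
      } finally {
        // Restore the name so reused pool threads do not keep a stale app id.
        Thread.currentThread().setName(originalName);
      }
    });
  }
}
{code}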






[jira] [Updated] (YARN-11133) YarnClient gets the wrong EffectiveMinCapacity value

2022-05-17 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11133:
-
Fix Version/s: 3.2.4
   3.3.4

Backported to branch-3.3 and branch-3.2.

> YarnClient gets the wrong EffectiveMinCapacity value
> 
>
> Key: YARN-11133
> URL: https://issues.apache.org/jira/browse/YARN-11133
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api
>Affects Versions: 3.2.3, 3.3.2
>Reporter: Zilong Zhu
>Assignee: Zilong Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> When I use the YarnClient, QueueConfigurations#getEffectiveMinCapacity 
> returns the wrong value. I found a bug in 
> QueueConfigurationsPBImpl#mergeLocalToBuilder.
> {code:java}
> private void mergeLocalToBuilder() {
>   if (this.effMinResource != null) {
> builder
> .setEffectiveMinCapacity(convertToProtoFormat(this.effMinResource));
>   }
>   if (this.effMaxResource != null) {
> builder
> .setEffectiveMaxCapacity(convertToProtoFormat(this.effMaxResource));
>   }
>   if (this.configuredMinResource != null) {
> builder.setEffectiveMinCapacity(
> convertToProtoFormat(this.configuredMinResource));
>   }
>   if (this.configuredMaxResource != null) {
> builder.setEffectiveMaxCapacity(
> convertToProtoFormat(this.configuredMaxResource));
>   }
> } {code}
> configuredMinResource was incorrectly assigned to effMinResource. This causes 
> the real effMinResource to be overwritten and configuredMinResource is null. 
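
For clarity, the intended mapping would presumably route each configured
resource to its own proto field, roughly as in the fragment below (mirroring
the quoted method); the setConfiguredMinCapacity/setConfiguredMaxCapacity
setter names are assumed for illustration and may not match the actual
generated builder API.

{code:java}
// Sketch of the corrected mapping: each local field goes to its own proto field.
private void mergeLocalToBuilder() {
  if (this.effMinResource != null) {
    builder.setEffectiveMinCapacity(convertToProtoFormat(this.effMinResource));
  }
  if (this.effMaxResource != null) {
    builder.setEffectiveMaxCapacity(convertToProtoFormat(this.effMaxResource));
  }
  if (this.configuredMinResource != null) {
    // Assumed setter name for illustration.
    builder.setConfiguredMinCapacity(convertToProtoFormat(this.configuredMinResource));
  }
  if (this.configuredMaxResource != null) {
    // Assumed setter name for illustration.
    builder.setConfiguredMaxCapacity(convertToProtoFormat(this.configuredMaxResource));
  }
}
{code}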






[jira] [Assigned] (YARN-11133) YarnClient gets the wrong EffectiveMinCapacity value

2022-05-16 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-11133:


Assignee: Zilong Zhu

> YarnClient gets the wrong EffectiveMinCapacity value
> 
>
> Key: YARN-11133
> URL: https://issues.apache.org/jira/browse/YARN-11133
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: api
>Affects Versions: 3.2.3, 3.3.2
>Reporter: Zilong Zhu
>Assignee: Zilong Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> When I use the YarnClient, QueueConfigurations#getEffectiveMinCapacity 
> returns the wrong value. I found a bug in 
> QueueConfigurationsPBImpl#mergeLocalToBuilder.
> {code:java}
> private void mergeLocalToBuilder() {
>   if (this.effMinResource != null) {
> builder
> .setEffectiveMinCapacity(convertToProtoFormat(this.effMinResource));
>   }
>   if (this.effMaxResource != null) {
> builder
> .setEffectiveMaxCapacity(convertToProtoFormat(this.effMaxResource));
>   }
>   if (this.configuredMinResource != null) {
> builder.setEffectiveMinCapacity(
> convertToProtoFormat(this.configuredMinResource));
>   }
>   if (this.configuredMaxResource != null) {
> builder.setEffectiveMaxCapacity(
> convertToProtoFormat(this.configuredMaxResource));
>   }
> } {code}
> configuredMinResource was incorrectly assigned to effMinResource. This causes 
> the real effMinResource to be overwritten and configuredMinResource is null. 






[jira] [Updated] (YARN-11092) Upgrade jquery ui to 1.13.1

2022-05-16 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11092:
-
Issue Type: Bug  (was: Improvement)

> Upgrade jquery ui to 1.13.1
> ---
>
> Key: YARN-11092
> URL: https://issues.apache.org/jira/browse/YARN-11092
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: D M Murali Krishna Reddy
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The current jquery-ui version used in trunk (1.12.1) has the following 
> vulnerabilities: CVE-2021-41182, CVE-2021-41183, and CVE-2021-41184, so we 
> need to upgrade to at least 1.13.0.
>  
> Also currently for the UI2 we are using the shims repo which is not being 
> maintained as per the discussion 
> [https://github.com/components/jqueryui/issues/70] , so if possible we should 
> move to the main jquery repo [https://github.com/jquery/jquery-ui] 






[jira] [Resolved] (YARN-11092) Upgrade jquery ui to 1.13.1

2022-05-16 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11092.
--
Fix Version/s: 3.4.0
   3.2.4
   3.3.4
   Resolution: Fixed

Committed to trunk, branch-3.3, and branch-3.2. Thank you [~dmmkr] for the 
report and thank you [~groot] for your contribution.

> Upgrade jquery ui to 1.13.1
> ---
>
> Key: YARN-11092
> URL: https://issues.apache.org/jira/browse/YARN-11092
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: D M Murali Krishna Reddy
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The current jquery-ui version used in trunk (1.12.1) has the following 
> vulnerabilities: CVE-2021-41182, CVE-2021-41183, and CVE-2021-41184, so we 
> need to upgrade to at least 1.13.0.
>  
> Also currently for the UI2 we are using the shims repo which is not being 
> maintained as per the discussion 
> [https://github.com/components/jqueryui/issues/70] , so if possible we should 
> move to the main jquery repo [https://github.com/jquery/jquery-ui] 






[jira] [Commented] (YARN-10080) Support show app id on localizer thread pool

2022-05-13 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536774#comment-17536774
 ] 

Akira Ajisaka commented on YARN-10080:
--

I want to backport to branch-3.2, but the build is now broken by HDFS-16552. 
I'll backport this after the build is fixed.

> Support show app id on localizer thread pool
> 
>
> Key: YARN-10080
> URL: https://issues.apache.org/jira/browse/YARN-10080
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Reporter: zhoukang
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.4
>
> Attachments: YARN-10080-001.patch, YARN-10080.002.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, when we are troubleshooting a container localizer issue and want 
> to analyze a jstack with thread details, we cannot figure out which thread is 
> processing a given container. So I want to add the app id to the thread name.






[jira] [Resolved] (YARN-11125) Backport YARN-6483 to branch-2.10

2022-05-13 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11125.
--
Fix Version/s: 2.10.2
   Resolution: Fixed

Merged the PR into branch-2.10. Thank you [~groot] for your contribution.

> Backport YARN-6483 to branch-2.10
> -
>
> Key: YARN-11125
> URL: https://issues.apache.org/jira/browse/YARN-11125
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.10.2
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Backport YARN-6483 to branch-2.10



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11125) Backport YARN-6483 to branch-2.10

2022-05-13 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11125:
-
Issue Type: Improvement  (was: Bug)

> Backport YARN-6483 to branch-2.10
> -
>
> Key: YARN-11125
> URL: https://issues.apache.org/jira/browse/YARN-11125
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Backport YARN-6483 to branch-2.10



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11125) Backport YARN-6483 to branch-2.10

2022-05-13 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11125:
-
Component/s: resourcemanager

> Backport YARN-6483 to branch-2.10
> -
>
> Key: YARN-11125
> URL: https://issues.apache.org/jira/browse/YARN-11125
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Backport YARN-6483 to branch-2.10



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11073) Avoid unnecessary preemption for tiny queues under certain corner cases

2022-05-13 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536750#comment-17536750
 ] 

Akira Ajisaka commented on YARN-11073:
--

Cut YARN-11469 for regression tests.

> Avoid unnecessary preemption for tiny queues under certain corner cases
> ---
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Assignee: Jian Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11073.tmp-1.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.
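As an illustration of the three fallback rules quoted above, here is a minimal, single-resource sketch of the normalization logic; it is not the actual resetCapacity implementation, and the class name, method signature, and example numbers are assumptions made for the example:

{code:java}
import java.util.Arrays;

public class ResetCapacitySketch {

  /**
   * Normalizes guaranteed shares for the active queues, following the three
   * fallback rules described in the issue (single resource for simplicity).
   *
   * @param guaranteed  per-queue guaranteed resource after the lossy long cast
   * @param minCapacity per-queue configured min capacity (fraction of the cluster)
   */
  static float[] normalize(long[] guaranteed, float[] minCapacity) {
    int n = guaranteed.length;
    float[] shares = new float[n];

    long guaranteedSum = Arrays.stream(guaranteed).sum();
    double minCapSum = 0;
    boolean zeroGuaranteedButNonZeroMinCap = false;
    for (int i = 0; i < n; i++) {
      minCapSum += minCapacity[i];
      if (guaranteed[i] == 0 && minCapacity[i] > 0) {
        zeroGuaranteedButNonZeroMinCap = true;
      }
    }

    if (minCapSum == 0 || guaranteedSum == 0) {
      // Rules 1 and 2: nothing meaningful to weight by, so split evenly.
      Arrays.fill(shares, 1.0f / n);
    } else if (zeroGuaranteedButNonZeroMinCap) {
      // Rule 3: a tiny queue was rounded down to 0, so weight by configured capacity.
      for (int i = 0; i < n; i++) {
        shares[i] = (float) (minCapacity[i] / minCapSum);
      }
    } else {
      // Normal case: weight by the guaranteed resources themselves.
      for (int i = 0; i < n; i++) {
        shares[i] = (float) guaranteed[i] / guaranteedSum;
      }
    }
    return shares;
  }

  public static void main(String[] args) {
    // queue_low (1%) rounded down to 0 while queue_mid (19%) keeps a non-zero guarantee:
    // queue_low still receives a small but fair normalized share.
    System.out.println(Arrays.toString(
        normalize(new long[]{0L, 5700L}, new float[]{0.01f, 0.19f})));
  }
}
{code}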



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-11073) Avoid unnecessary preemption for tiny queues under certain corner cases

2022-05-13 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536750#comment-17536750
 ] 

Akira Ajisaka edited comment on YARN-11073 at 5/13/22 4:16 PM:
---

Filed YARN-11149 for regression tests.


was (Author: ajisakaa):
Cut YARN-11469 for regression tests.

> Avoid unnecessary preemption for tiny queues under certain corner cases
> ---
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Assignee: Jian Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11073.tmp-1.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11149) Add regression test cases for YARN-11073

2022-05-13 Thread Akira Ajisaka (Jira)
Akira Ajisaka created YARN-11149:


 Summary: Add regression test cases for YARN-11073
 Key: YARN-11149
 URL: https://issues.apache.org/jira/browse/YARN-11149
 Project: Hadoop YARN
  Issue Type: Test
  Components: test
Reporter: Akira Ajisaka


Add regression test cases for YARN-11073



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11073) Avoid unnecessary preemption for tiny queues under certain corner cases

2022-05-13 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11073:
-
Summary: Avoid unnecessary preemption for tiny queues under certain corner 
cases  (was: CapacityScheduler DRF Preemption kicked in incorrectly for 
low-capacity queues)

> Avoid unnecessary preemption for tiny queues under certain corner cases
> ---
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Assignee: Jian Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11073.tmp-1.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-05-13 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11073.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Merged the PR into trunk. Let's add test cases in a separate JIRA.

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Assignee: Jian Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: YARN-11073.tmp-1.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11128) Fix comments in TestProportionalCapacityPreemptionPolicy*

2022-05-07 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11128:
-
Component/s: documentation
 Issue Type: Bug  (was: New Feature)

> Fix comments in TestProportionalCapacityPreemptionPolicy*
> -
>
> Key: YARN-11128
> URL: https://issues.apache.org/jira/browse/YARN-11128
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler, documentation
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Minor
>
> At various places, comment for appsConfig is 
> {{// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)}}
> but should be 
> {{// 
> queueName\t(priority,resource,host,expression,#repeat,reserved,pending,user)}}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11128) Fix comments in TestProportionalCapacityPreemptionPolicy*

2022-05-07 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11128:
-
Description: 
At various places, comment for appsConfig is 

{{// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)}}

but should be 

{{// 
queueName\t(priority,resource,host,expression,#repeat,reserved,pending,user)}}

 

  was:
At various places, comment for appsConfig is 

`// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)`

but should be 

`// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)`

 


> Fix comments in TestProportionalCapacityPreemptionPolicy*
> -
>
> Key: YARN-11128
> URL: https://issues.apache.org/jira/browse/YARN-11128
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: capacityscheduler
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Minor
>
> At various places, comment for appsConfig is 
> {{// queueName\t(priority,resource,host,expression,#repeat,reserved,pending)}}
> but should be 
> {{// 
> queueName\t(priority,resource,host,expression,#repeat,reserved,pending,user)}}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11116) Migrate Times util from SimpleDateFormat to thread-safe DateTimeFormatter class

2022-05-02 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11116.
--
Fix Version/s: 3.4.0
   3.2.4
   3.3.4
   Resolution: Fixed

Committed to trunk, branch-3.3, and branch-3.2. Thank you [~jeagles] for your 
contribution!

> Migrate Times util from SimpleDateFormat to thread-safe DateTimeFormatter 
> class
> ---
>
> Key: YARN-11116
> URL: https://issues.apache.org/jira/browse/YARN-11116
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Jonathan Turner Eagles
>Assignee: Jonathan Turner Eagles
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
> Attachments: YARN-11116.001.perftest.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Came across a stack trace with SimpleDateFormat in it, which led me to 
> investigate current practices.
>  
> {noformat}
>  6578 "IPC Server handler 29 on 8032" #797 daemon prio=5 os_prio=0 
> tid=0x7fb6527d nid=0x953b runnable [0x7fb5ba034000]
>  6579    java.lang.Thread.State: RUNNABLE
>  6580     at org.apache.hadoop.yarn.util.Times.formatISO8601(Times.java:95)
>  6581     at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:810)
>  6582     at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:396)
>  6583     at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:224)
>  6584     at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:529)
>  6585     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>  6586     at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:500)
>  6587     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1069)
>  6588     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
>  6589     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:936)
>  6590     at java.security.AccessController.doPrivileged(Native Method)
>  6591     at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2135)
>  6592     at 
> org.apache.hadoop.security.UserGroupInformation.doAsPrivileged(UserGroupInformation.java:2123)
>  6593     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2875)
>  6594 
> {noformat}
>  
> DateTimeFormatter is thread-safe, meaning there is no need to wrap it in a 
> ThreadLocal, as instances can be reused safely across threads. In addition, 
> the new classes are slightly more performant.
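As a hedged illustration of the migration described above (not the exact Times.java change; the pattern string and class name are assumptions), a thread-safe formatter can simply be shared as a constant instead of wrapping SimpleDateFormat in a ThreadLocal:

{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class Iso8601FormatExample {

  // DateTimeFormatter is immutable and thread-safe, so one shared instance can
  // replace a ThreadLocal<SimpleDateFormat>. The pattern below is illustrative.
  private static final DateTimeFormatter ISO8601 =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ")
          .withZone(ZoneId.systemDefault());

  public static String formatISO8601(long timestampMillis) {
    return ISO8601.format(Instant.ofEpochMilli(timestampMillis));
  }

  public static void main(String[] args) {
    System.out.println(formatISO8601(System.currentTimeMillis()));
  }
}
{code}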



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10187) Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained.

2022-05-02 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-10187.
--
Fix Version/s: 3.2.4
   3.3.4
   Resolution: Fixed

Backported to branch-3.3 and branch-3.2.

> Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained.
> --
>
> Key: YARN-10187
> URL: https://issues.apache.org/jira/browse/YARN-10187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: N Sanketh Reddy
>Assignee: Ashutosh Gupta
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.4
>
>   Original Estimate: 1h
>  Time Spent: 50m
>  Remaining Estimate: 10m
>
> hadoop-yarn-project/hadoop-yarn/README is not maintained and can be removed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10187) Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained.

2022-05-02 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10187:
-
Description: hadoop-yarn-project/hadoop-yarn/README is not maintained and 
can be removed.  (was: Converting a README to README.md for showcasing the 
markdown and for better readability)

> Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained.
> --
>
> Key: YARN-10187
> URL: https://issues.apache.org/jira/browse/YARN-10187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: N Sanketh Reddy
>Assignee: Ashutosh Gupta
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>   Original Estimate: 1h
>  Time Spent: 50m
>  Remaining Estimate: 10m
>
> hadoop-yarn-project/hadoop-yarn/README is not maintained and can be removed.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10187) Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained.

2022-05-02 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10187:
-
Fix Version/s: 3.4.0
   Issue Type: Bug  (was: Improvement)
 Priority: Minor  (was: Major)
  Summary: Removing hadoop-yarn-project/hadoop-yarn/README as it is no 
longer maintained.  (was: converting README to README.md)

Committed to trunk. Thank you [~groot] for your contribution!

> Removing hadoop-yarn-project/hadoop-yarn/README as it is no longer maintained.
> --
>
> Key: YARN-10187
> URL: https://issues.apache.org/jira/browse/YARN-10187
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation
>Reporter: N Sanketh Reddy
>Assignee: Ashutosh Gupta
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>   Original Estimate: 1h
>  Time Spent: 50m
>  Remaining Estimate: 10m
>
> Converting a README to README.md for showcasing the markdown and for better 
> readability



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-3224) Notify AM with containers (on decommissioning node) could be preempted after timeout.

2022-05-02 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-3224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-3224.
-
  Assignee: (was: Sunil G)
Resolution: Fixed

Duplicate of YARN-6483. Closing

> Notify AM with containers (on decommissioning node) could be preempted after 
> timeout.
> -
>
> Key: YARN-3224
> URL: https://issues.apache.org/jira/browse/YARN-3224
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful
>Reporter: Junping Du
>Priority: Major
> Attachments: 0001-YARN-3224.patch, 0002-YARN-3224.patch
>
>
> We should leverage YARN preemption framework to notify AM that some 
> containers will be preempted after a timeout.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10538) Add recommissioning nodes to the list of updated nodes returned to the AM

2022-04-27 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529215#comment-17529215
 ] 

Akira Ajisaka commented on YARN-10538:
--

Hi [~groot], I tried to backport it to branch-2.10 but ran into some conflicts. 
Would you try backporting this and creating a PR? I can review it.

> Add recommissioning nodes to the list of updated nodes returned to the AM
> -
>
> Key: YARN-10538
> URL: https://issues.apache.org/jira/browse/YARN-10538
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.9.1, 3.1.1
>Reporter: Srinivas S T
>Assignee: Srinivas S T
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> YARN-6483 added nodes that transition to the DECOMMISSIONING state to the 
> list of updated nodes returned to the AM. This allows the Spark application 
> master to gracefully decommission its containers on the decommissioning node. 
> But if the node were later recommissioned, the Spark application master would 
> not be aware of it. We propose to add recommissioned nodes to the list of 
> updated nodes sent to the AM when a recommission transition occurs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized

2022-04-22 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11111:
-
Fix Version/s: 3.4.0
   (was: 3.3.4)

Changed the fix version to 3.4.0 because it is now only in trunk.

> Recovery failure when node-label configure-type transit from 
> delegated-centralized to centralized
> -
>
> Key: YARN-11111
> URL: https://issues.apache.org/jira/browse/YARN-11111
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Junfan Zhang
>Assignee: Junfan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When I changed the configure-type from delegated-centralized to centralized in 
> yarn-site.xml and restarted the RM, it failed.
> The error stacktrace is as follows
>  
> {code:txt}
> 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901)
> at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)
> at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
> ... 4 more
> Caused by: java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61)
> at 
> org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138)
> at 
> org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76)
> at 
> org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41)
> at 
> org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120)
> at 
> org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149)
> at 
> org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252)
> at 
> org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328)
> ... 5 more
> 2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Trying to re-establish ZK session
>  {code}
> When i digging into the codebase, found that the node and labels mapping is 
> stored in the nodelabel.mirror file when configured the type of c

[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging

2022-04-18 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10720:
-
Fix Version/s: 3.4.0

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server from hanging
> --
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 2.10.2, 3.2.4, 3.3.3
>
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The following proxy server screenshot shows {color:#de350b}too many connections 
> from one client{color}; this caused the proxy server to hang, and the YARN web 
> UI could not jump to the web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> The following AM is abnormal, but the proxy server does not know that, so the 
> connections are never closed; we should add timeout support in the proxy server 
> to prevent this. A single abnormal AM may hold hundreds or even thousands of 
> connections, which is very heavy.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After I killed the abnormal AM, the proxy server became healthy again. This case 
> has happened many times in our production clusters; our clusters are huge, and 
> abnormal AMs appear regularly.
>  
> I will add timeout support to the web proxy server in this JIRA.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  
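A minimal sketch of the kind of timeout the description above calls for; it uses plain HttpURLConnection rather than the servlet's actual HTTP client, and the class name, method, and timeout values are assumptions for illustration only:

{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ProxyTimeoutExample {

  // Illustrative values; a real deployment would read these from configuration.
  private static final int CONNECT_TIMEOUT_MS = 60_000;
  private static final int READ_TIMEOUT_MS = 60_000;

  static int fetchStatus(String amTrackingUrl) throws IOException {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(amTrackingUrl).openConnection();
    // Without timeouts, a hung AM keeps the proxy thread and its socket blocked forever.
    conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
    conn.setReadTimeout(READ_TIMEOUT_MS);
    try {
      return conn.getResponseCode();
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(fetchStatus("http://example.com/"));
  }
}
{code}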



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10553) Refactor TestDistributedShell

2022-04-12 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10553:
-
Fix Version/s: 3.3.4

Backported to branch-3.3.

> Refactor TestDistributedShell
> -
>
> Key: YARN-10553
> URL: https://issues.apache.org/jira/browse/YARN-10553
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell, test
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available, refactoring, test
> Fix For: 3.4.0, 3.3.4
>
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> TestDistributedShell has grown so large over time. It has 29 tests.
>  This runs the risk of exceeding the 30-minute limit for a single unit test 
> class.
>  * The implementation has lots of code redundancy.
>  * The Jira splits TestDistributedShell into three different unit tests, one 
> for each TimeLineVersion: V1.0, 1.5, and 2.0.
>  * Fixes the broken test {{testDSShellWithEnforceExecutionType}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10553) Refactor TestDistributedShell

2022-04-11 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520365#comment-17520365
 ] 

Akira Ajisaka commented on YARN-10553:
--

Opened [https://github.com/apache/hadoop/pull/4159] for branch-3.3

> Refactor TestDistributedShell
> -
>
> Key: YARN-10553
> URL: https://issues.apache.org/jira/browse/YARN-10553
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell, test
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available, refactoring, test
> Fix For: 3.4.0
>
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> TestDistributedShell has grown so large over time. It has 29 tests.
>  This runs the risk of exceeding the 30-minute limit for a single unit test 
> class.
>  * The implementation has lots of code redundancy.
>  * The Jira splits TestDistributedShell into three different unit tests, one 
> for each TimeLineVersion: V1.0, 1.5, and 2.0.
>  * Fixes the broken test {{testDSShellWithEnforceExecutionType}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11101) Fix TestYarnConfigurationFields

2022-04-06 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11101.
--
Resolution: Duplicate

> Fix TestYarnConfigurationFields
> ---
>
> Key: YARN-11101
> URL: https://issues.apache.org/jira/browse/YARN-11101
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation, newbie
>Reporter: Akira Ajisaka
>Priority: Major
>
> yarn.resourcemanager.node-labels.am.default-node-label-expression is missing 
> in yarn-default.xml.
> {noformat}
> [INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 
> s <<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields
> [ERROR] testCompareConfigurationClassAgainstXml  Time elapsed: 0.082 s  <<< 
> FAILURE!
> java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration 
> has 1 variables missing in yarn-default.xml Entries:   
> yarn.resourcemanager.node-labels.am.default-node-label-expression 
> expected:<0> but was:<1>
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at 
> org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11101) Fix TestYarnConfigurationFields

2022-04-06 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517924#comment-17517924
 ] 

Akira Ajisaka commented on YARN-11101:
--

Thank you [~zuston] for the information. I'll close this as duplicate.

> Fix TestYarnConfigurationFields
> ---
>
> Key: YARN-11101
> URL: https://issues.apache.org/jira/browse/YARN-11101
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation, newbie
>Reporter: Akira Ajisaka
>Priority: Major
>
> yarn.resourcemanager.node-labels.am.default-node-label-expression is missing 
> in yarn-default.xml.
> {noformat}
> [INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 
> s <<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields
> [ERROR] testCompareConfigurationClassAgainstXml  Time elapsed: 0.082 s  <<< 
> FAILURE!
> java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration 
> has 1 variables missing in yarn-default.xml Entries:   
> yarn.resourcemanager.node-labels.am.default-node-label-expression 
> expected:<0> but was:<1>
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at 
> org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-03-28 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-11073:


Assignee: Jian Chen

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Assignee: Jian Chen
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-11073.tmp-1.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-03-28 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513261#comment-17513261
 ] 

Akira Ajisaka commented on YARN-11073:
--

Thank you [~jchenjc22]. Yes, I agree that testing the preemption behavior is 
difficult. For unit testing, I think it's sufficient to verify the behavior 
of the resetCapacity method that you have changed.

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-11073.tmp-1.patch
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10548) Decouple AM runner logic from SLSRunner

2022-03-27 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513012#comment-17513012
 ] 

Akira Ajisaka commented on YARN-10548:
--

Hi all the reviewers, please do not +1 when spotbugs says -1.

Hi [~snemeth] - please fix the spotbugs error in YARN-11102 as a follow-up. 
Otherwise I'll revert it, because it introduced not only a spotbugs error but 
also additional javac, whitespace, and javadoc warnings.

> Decouple AM runner logic from SLSRunner
> ---
>
> Key: YARN-10548
> URL: https://issues.apache.org/jira/browse/YARN-10548
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Fix For: 3.4.0
>
> Attachments: YARN-10548.001.patch, YARN-10548.002.patch, 
> YARN-10548.003.patch
>
>
> SLSRunner has too many responsibilities.
>  One of them is to parse the job details from the SLS input formats and 
> launch the AMs and task containers.
>  The AM runner logic could be decoupled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11102) Fix spotbugs error in hadoop-sls module

2022-03-27 Thread Akira Ajisaka (Jira)
Akira Ajisaka created YARN-11102:


 Summary: Fix spotbugs error in hadoop-sls module
 Key: YARN-11102
 URL: https://issues.apache.org/jira/browse/YARN-11102
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Akira Ajisaka


Fix the following Spotbugs error:

- org.apache.hadoop.yarn.sls.AMRunner.setInputTraces(String[]) may expose 
internal representation by storing an externally mutable object into 
AMRunner.inputTraces At AMRunner.java:by storing an externally mutable object 
into AMRunner.inputTraces At AMRunner.java:[line 267]
- Write to static field org.apache.hadoop.yarn.sls.AMRunner.REMAINING_APPS from 
instance method org.apache.hadoop.yarn.sls.AMRunner.startAM() At 
AMRunner.java:from instance method 
org.apache.hadoop.yarn.sls.AMRunner.startAM() At AMRunner.java:[line 116]
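For context, the two warnings above are common spotbugs patterns (exposing a mutable array and writing a static field from an instance method), and the usual remedies look roughly like the sketch below; this is a generic illustration with assumed names, not the eventual YARN-11102 patch:

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

public class AMRunnerSketch {

  private String[] inputTraces;

  // "May expose internal representation": keep a defensive copy instead of
  // storing the caller's mutable array directly.
  public void setInputTraces(String[] traces) {
    this.inputTraces = (traces == null) ? null : traces.clone();
  }

  public String[] getInputTraces() {
    return (inputTraces == null) ? null : inputTraces.clone();
  }

  // "Write to static field from instance method": route shared mutable state
  // through a thread-safe holder rather than assigning a bare static field.
  private static final AtomicInteger REMAINING_APPS = new AtomicInteger();

  public void startAM(int numAMs) {
    REMAINING_APPS.set(numAMs);
  }
}
{code}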



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11101) Fix TestYarnConfigurationFields

2022-03-27 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11101:
-
Component/s: newbie

> Fix TestYarnConfigurationFields
> ---
>
> Key: YARN-11101
> URL: https://issues.apache.org/jira/browse/YARN-11101
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation, newbie
>Reporter: Akira Ajisaka
>Priority: Major
>
> yarn.resourcemanager.node-labels.am.default-node-label-expression is missing 
> in yarn-default.xml.
> {noformat}
> [INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 
> s <<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields
> [ERROR] testCompareConfigurationClassAgainstXml  Time elapsed: 0.082 s  <<< 
> FAILURE!
> java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration 
> has 1 variables missing in yarn-default.xml Entries:   
> yarn.resourcemanager.node-labels.am.default-node-label-expression 
> expected:<0> but was:<1>
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at 
> org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11101) Fix TestYarnConfigurationFields

2022-03-27 Thread Akira Ajisaka (Jira)
Akira Ajisaka created YARN-11101:


 Summary: Fix TestYarnConfigurationFields
 Key: YARN-11101
 URL: https://issues.apache.org/jira/browse/YARN-11101
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Reporter: Akira Ajisaka


yarn.resourcemanager.node-labels.am.default-node-label-expression is missing in 
yarn-default.xml.
{noformat}
[INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields
[ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 s 
<<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields
[ERROR] testCompareConfigurationClassAgainstXml  Time elapsed: 0.082 s  <<< 
FAILURE!
java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration 
has 1 variables missing in yarn-default.xml Entries:   
yarn.resourcemanager.node-labels.am.default-node-label-expression expected:<0> 
but was:<1>
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:647)
at 
org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493)
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging

2022-03-24 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10720:
-
Fix Version/s: 2.10.2
   3.2.4

Backported to branch-3.2 and branch-2.10.

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server from hanging
> --
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 2.10.2, 3.2.4, 3.3.3
>
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Following is proxy server show, {color:#de350b}too many connections from one 
> client{color}, this caused the proxy server hang, and the yarn web can't jump 
> to web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> Following is the AM which is abnormal, but proxy server don't know it is 
> abnormal already, so the connections can't be closed, we should add time out 
> support in proxy server to prevent this. And one abnormal AM may cause 
> hundreds even thousands of connections, it is very heavy.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After i kill the abnormal AM, the proxy server become healthy. This case 
> happened many times in our production clusters, our clusters are huge, and 
> the abnormal AM will be existed in a regular case.
>  
> I will add timeout supported in web proxy server in this jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  
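
The description above proposes adding a connection timeout to the proxy's HTTP 
client. Purely as an illustration (this is not the committed patch, and the values 
and method names here are assumptions), a timeout on an Apache HttpClient request 
would look roughly like this:
{code:java}
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ProxyTimeoutSketch {
  // Illustrative values only; the real configuration keys and defaults are
  // defined by the YARN-10720 patch, not here.
  public static CloseableHttpClient buildClient(int connectTimeoutMs, int readTimeoutMs) {
    RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(connectTimeoutMs)   // fail fast if the AM never accepts
        .setSocketTimeout(readTimeoutMs)       // fail if the AM accepts but never responds
        .build();
    return HttpClients.custom().setDefaultRequestConfig(config).build();
  }
}
{code}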



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-03-24 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511676#comment-17511676
 ] 

Akira Ajisaka commented on YARN-11073:
--

Hello [~jchenjc22], do you want to continue the work to merge the fix into 
Apache Hadoop? If not, I'll let one of my colleagues take it over.

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Priority: Major
> Attachments: YARN-11073.tmp-1.patch
>
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging

2022-03-24 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511652#comment-17511652
 ] 

Akira Ajisaka commented on YARN-10720:
--

Opened https://github.com/apache/hadoop/pull/4103 for branch-2.10. I'm seeing 
this issue in a prod cluster, so I want to backport the fix to all the release 
branches.

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server from hanging
> --
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Following is proxy server show, {color:#de350b}too many connections from one 
> client{color}, this caused the proxy server hang, and the yarn web can't jump 
> to web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> Following is the AM which is abnormal, but proxy server don't know it is 
> abnormal already, so the connections can't be closed, we should add time out 
> support in proxy server to prevent this. And one abnormal AM may cause 
> hundreds even thousands of connections, it is very heavy.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After i kill the abnormal AM, the proxy server become healthy. This case 
> happened many times in our production clusters, our clusters are huge, and 
> the abnormal AM will be existed in a regular case.
>  
> I will add timeout supported in web proxy server in this jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging

2022-03-23 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511610#comment-17511610
 ] 

Akira Ajisaka commented on YARN-10720:
--

When cherry-picking to branch-3.2, I had to fix some conflicts. Opened 
https://github.com/apache/hadoop/pull/4102 for testing.

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server from hanging
> --
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following is proxy server show, {color:#de350b}too many connections from one 
> client{color}, this caused the proxy server hang, and the yarn web can't jump 
> to web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> Following is the AM which is abnormal, but proxy server don't know it is 
> abnormal already, so the connections can't be closed, we should add time out 
> support in proxy server to prevent this. And one abnormal AM may cause 
> hundreds even thousands of connections, it is very heavy.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After i kill the abnormal AM, the proxy server become healthy. This case 
> happened many times in our production clusters, our clusters are huge, and 
> the abnormal AM will be existed in a regular case.
>  
> I will add timeout supported in web proxy server in this jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging

2022-03-23 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10720:
-
Fix Version/s: 3.3.3

Cherry-picked to branch-3.3.

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server from hanging
> --
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Fix For: 3.4.0, 3.3.3
>
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>
> Following is proxy server show, {color:#de350b}too many connections from one 
> client{color}, this caused the proxy server hang, and the yarn web can't jump 
> to web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> Following is the AM which is abnormal, but proxy server don't know it is 
> abnormal already, so the connections can't be closed, we should add time out 
> support in proxy server to prevent this. And one abnormal AM may cause 
> hundreds even thousands of connections, it is very heavy.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After i kill the abnormal AM, the proxy server become healthy. This case 
> happened many times in our production clusters, our clusters are huge, and 
> the abnormal AM will be existed in a regular case.
>  
> I will add timeout supported in web proxy server in this jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10747) Bump YARN CSI protobuf version to 3.7.1

2022-03-21 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10747:
-
Fix Version/s: 3.3.3

Backported to branch-3.3.

> Bump YARN CSI protobuf version to 3.7.1
> ---
>
> Key: YARN-10747
> URL: https://issues.apache.org/jira/browse/YARN-10747
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Bumping YARN CSI protobuf version to 3.7.1 to keep it consistent with 
> hadoop's protobuf version.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10538) Add recommissioning nodes to the list of updated nodes returned to the AM

2022-03-09 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10538:
-
Fix Version/s: 3.2.3

Backported to branch-3.2 and branch-3.2.3.

> Add recommissioning nodes to the list of updated nodes returned to the AM
> -
>
> Key: YARN-10538
> URL: https://issues.apache.org/jira/browse/YARN-10538
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 2.9.1, 3.1.1
>Reporter: Srinivas S T
>Assignee: Srinivas S T
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> YARN-6483 introduced nodes that transitioned to DECOMMISSIONING state to the 
> list of updated nodes returned to the AM. This allows the Spark application 
> master to gracefully decommission its containers on the decommissioning node. 
> But if the node were to be recommissioned, the Spark application master would 
> not be aware of this. We propose to add recommissioned node to the list of 
> updated nodes sent to the AM when a recommission node transition occurs.
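
As a rough illustration of how an AM consumes this list (the handling logic below 
is an assumption for illustration, not Spark's actual code), the updated node 
reports arrive on each allocate response:
{code:java}
import java.util.List;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;

public class UpdatedNodesSketch {
  // Sketch: inspect node-state transitions reported to the AM on a heartbeat.
  public static void handleUpdatedNodes(AllocateResponse response) {
    List<NodeReport> updated = response.getUpdatedNodes();
    for (NodeReport report : updated) {
      if (report.getNodeState() == NodeState.DECOMMISSIONING) {
        // YARN-6483: start draining containers on this node gracefully.
      } else if (report.getNodeState() == NodeState.RUNNING) {
        // YARN-10538: the node was recommissioned; treat it as healthy again.
      }
    }
  }
}
{code}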



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11081) TestYarnConfigurationFields consistently keeps failing

2022-03-08 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11081.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Merged the PR into trunk. Thank you [~vjasani] for your contribution.

> TestYarnConfigurationFields consistently keeps failing
> --
>
> Key: YARN-11081
> URL: https://issues.apache.org/jira/browse/YARN-11081
> Project: Hadoop YARN
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> TestYarnConfigurationFields consistently keeps failing with error:
> {code:java}
> Error Messageclass org.apache.hadoop.yarn.conf.YarnConfiguration has 1 
> variables missing in yarn-default.xml Entries:   
> yarn.scheduler.app-placement-allocator.class expected:<0> but 
> was:<1>Stacktracejava.lang.AssertionError: class 
> org.apache.hadoop.yarn.conf.YarnConfiguration has 1 variables missing in 
> yarn-default.xml Entries:   yarn.scheduler.app-placement-allocator.class 
> expected:<0> but was:<1>
>   at org.junit.Assert.fail(Assert.java:89)
>   at org.junit.Assert.failNotEquals(Assert.java:835)
>   at org.junit.Assert.assertEquals(Assert.java:647)
>   at 
> org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498) {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-03-03 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500616#comment-17500616
 ] 

Akira Ajisaka commented on YARN-11073:
--

Hi [~jchenjc22], the fix looks good to me. There are some comments from my side:
- Need to add regression tests
- It's better to print a debug log when computeNormGuarFromAbsCapacity or 
computeNormGuarEvenly is used
- The patch should be targeted to trunk
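
For readers following along, a minimal sketch of the normalization fallbacks 
proposed in the description quoted below; the class, method, and parameter names 
are assumptions for illustration and are not taken from the attached patch:
{code:java}
public final class ResetCapacitySketch {
  /**
   * Sketch of the three fallback scenarios described in this issue. All names
   * here are hypothetical; the real logic lives in the preemption policy's
   * resetCapacity.
   */
  public static float normalizedGuarantee(float queueAbsCapacity,
      float sumAbsCapacityOfActiveQueues, long queueGuaranteed,
      long sumGuaranteedOfActiveQueues, int numActiveQueues) {
    if (sumAbsCapacityOfActiveQueues == 0.0f || sumGuaranteedOfActiveQueues == 0L) {
      // Scenarios 1 and 2: nothing meaningful to weight by, so split evenly.
      return 1.0f / numActiveQueues;
    }
    if (queueGuaranteed == 0L && queueAbsCapacity > 0.0f) {
      // Scenario 3: this queue's guaranteed resource was cast down to 0, but its
      // configured capacity is non-zero, so weight by configured capacity instead.
      return queueAbsCapacity / sumAbsCapacityOfActiveQueues;
    }
    // Default: weight by the pre-normalized guaranteed value, as today.
    return (float) queueGuaranteed / sumGuaranteedOfActiveQueues;
  }
}
{code}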

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Priority: Major
> Attachments: YARN-11073.tmp-1.patch
>
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-03-02 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500566#comment-17500566
 ] 

Akira Ajisaka commented on YARN-11073:
--

{quote}IMO, it's okay to round up the queue's guaranteedCapacity.
{quote}
Rethinking this, it's not okay to round up the guaranteedCapacity because rounding 
up can allocate more resources than the configured overall capacity. I'll look 
into your patch.

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Priority: Major
> Attachments: YARN-11073.tmp-1.patch
>
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10472) Backport YARN-10314 to branch-3.2

2022-02-28 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10472:
-
Fix Version/s: (was: 3.2.3)

> Backport YARN-10314 to branch-3.2
> -
>
> Key: YARN-10472
> URL: https://issues.apache.org/jira/browse/YARN-10472
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.2.2
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Blocker
> Fix For: 3.2.2
>
>
> Filing this jira to raise the following concern:
> YARN-10314 fixes a problem with the shaded jars in 3.3.0. But it is not 
> backported to branch-3.2 yet. [~weichiu] and I ([~smeng]) are looking into 
> this.
> I have submitted a PR on branch-3.2: 
> https://github.com/apache/hadoop/pull/2412
> CC [~hexiaoqiao]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7266) Timeline Server event handler threads locked

2022-02-27 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-7266:

Fix Version/s: 3.2.3
   (was: 3.2.4)

Cherry-picked to branch-3.2.3.

> Timeline Server event handler threads locked
> 
>
> Key: YARN-7266
> URL: https://issues.apache.org/jira/browse/YARN-7266
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2, timelineserver
>Affects Versions: 2.7.3
>Reporter: Venkata Puneet Ravuri
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 2.7.8, 3.3.0, 2.8.6, 2.9.3, 2.10.2, 3.2.3
>
> Attachments: YARN-7266-0005.patch, YARN-7266-001.patch, 
> YARN-7266-002.patch, YARN-7266-003.patch, YARN-7266-004.patch, 
> YARN-7266-006.patch, YARN-7266-007.patch, YARN-7266-008.patch, 
> YARN-7266-branch-2.7.001.patch, YARN-7266-branch-2.8.001.patch
>
>
> Event handlers for Timeline Server seem to take a lock while parsing HTTP 
> headers of the request. This is causing all other threads to wait and slowing 
> down the overall performance of Timeline server. We have resourcemanager 
> metrics enabled to send to timeline server. Because of the high load on 
> ResourceManager, the metrics to be sent are getting backlogged and in turn 
> increasing heap footprint of Resource Manager (due to pending metrics).
> This is the complete stack trace of a blocked thread on timeline server:-
> "2079644967@qtp-1658980982-4560" #4632 daemon prio=5 os_prio=0 
> tid=0x7f6ba490a000 nid=0x5eb waiting for monitor entry 
> [0x7f6b9142c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector.prepare(AccessorInjector.java:82)
> - waiting to lock <0x0005c0621860> (a java.lang.Class for 
> com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector)
> at 
> com.sun.xml.bind.v2.runtime.reflect.opt.OptimizedAccessorFactory.get(OptimizedAccessorFactory.java:168)
> at 
> com.sun.xml.bind.v2.runtime.reflect.Accessor$FieldReflection.optimize(Accessor.java:282)
> at 
> com.sun.xml.bind.v2.runtime.property.SingleElementNodeProperty.(SingleElementNodeProperty.java:94)
> at sun.reflect.GeneratedConstructorAccessor52.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at 
> com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at 
> com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:551)
> at 
> com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.(ArrayElementProperty.java:112)
> at 
> com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.(ArrayElementNodeProperty.java:62)
> at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at 
> com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at 
> com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:347)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170)
> at 
> com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145)
> at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.find(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at 
> com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.buildModelAndSchemas(WadlGeneratorJAXBGrammarGenerator.java:412)
> at 
> com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.createExternalGrammar(WadlGeneratorJAXBGrammarGenerator.java:352)
> at 
> com.sun.jersey.server.wadl.WadlBuilder.generat

[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-02-21 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495860#comment-17495860
 ] 

Akira Ajisaka commented on YARN-11073:
--

Thank you [~jchenjc22] for your comment. IMO, it's okay to round up the queue's 
guaranteedCapacity.

Hi [~snemeth], do you have any suggestions?

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Priority: Major
> Attachments: YARN-11073.tmp-1.patch
>
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11068) Update transitive log4j2 dependency to 2.17.1

2022-02-20 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11068.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Merged PR #3963 into trunk.

> Update transitive log4j2 dependency to 2.17.1
> -
>
> Key: YARN-11068
> URL: https://issues.apache.org/jira/browse/YARN-11068
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Similar to HADOOP-18092, we have transitive log4j2 dependency coming from 
> solr-core 8 that must be excluded.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-02-14 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492307#comment-17492307
 ] 

Akira Ajisaka commented on YARN-11073:
--

Note that you will need to rename the patch to something like 
"YARN-11073-branch-2.10-001.patch" to test it against branch-2.10. Creating a 
GitHub pull request at https://github.com/apache/hadoop is the preferred approach 
rather than attaching a patch here. In addition, the target branch must be trunk: 
all patches go to trunk first and are then backported to the other release 
branches.

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Priority: Major
> Attachments: YARN-11073.tmp-1.patch
>
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11073) CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues

2022-02-14 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492305#comment-17492305
 ] 

Akira Ajisaka commented on YARN-11073:
--

Thank you [~jchenjc22] for your report. I'm +1 to always roundUp when 
calculating a queue's guaranteed capacity because it is a very simple fix.

Hi [~wangda], do you have any suggestions?

> CapacityScheduler DRF Preemption kicked in incorrectly for low-capacity queues
> --
>
> Key: YARN-11073
> URL: https://issues.apache.org/jira/browse/YARN-11073
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.10.1
>Reporter: Jian Chen
>Priority: Major
> Attachments: YARN-11073.tmp-1.patch
>
>
> When running a Hive job in a low-capacity queue on an idle cluster, 
> preemption kicked in to preempt job containers even though there's no other 
> job running and competing for resources. 
> Let's take this scenario as an example:
>  * cluster resource : 
>  ** {_}*queue_low*{_}: min_capacity 1%
>  ** queue_mid: min_capacity 19%
>  ** queue_high: min_capacity 80%
>  * CapacityScheduler with DRF
> During the fifo preemption candidates selection process, the 
> _preemptableAmountCalculator_ needs to first "{_}computeIdealAllocation{_}" 
> which depends on each queue's guaranteed/min capacity. A queue's guaranteed 
> capacity is currently calculated as 
> "Resources.multiply(totalPartitionResource, absCapacity)", so the guaranteed 
> capacity of queue_low is:
>  * {_}*queue_low*{_}:  = 
> , but since the Resource object takes only Long 
> values, these Doubles values get casted into Long, and then the final result 
> becomes **
> Because the guaranteed capacity of queue_low is 0, its normalized guaranteed 
> capacity based on active queues is also 0 based on the current algorithm in 
> "{_}resetCapacity{_}". This eventually leads to the continuous preemption of 
> job containers running in {_}*queue_low*{_}. 
> In order to work around this corner case, I made a small patch (for my own 
> use case) around "{_}resetCapacity{_}" to consider a couple new scenarios: 
>  * if the sum of absoluteCapacity/minCapacity of all active queues is zero, 
> we should normalize their guaranteed capacity evenly
> {code:java}
> 1.0f / num_of_queues{code}
>  * if the sum of pre-normalized guaranteed capacity values ({_}MB or 
> VCores{_}) of all active queues is zero, meaning we might have several queues 
> like queue_low whose capacity value got casted into 0, we should normalize 
> evenly as well like the first scenario (if they are all tiny, it really makes 
> no big difference, for example, 1% vs 1.2%).
>  * if one of the active queues has a zero pre-normalized guaranteed capacity 
> value but its absoluteCapacity/minCapacity is *not* zero, then we should 
> normalize based on the weight of their configured queue 
> absoluteCapacity/minCapacity. This is to make sure _*queue_low*_ gets a small 
> but fair normalized value when _*queue_mid*_ is also active. 
> {code:java}
> minCapacity / (sum_of_min_capacity_of_active_queues)
> {code}
>  
> This is how I currently work around this issue, it might need someone who's 
> more familiar in this component to do a systematic review of the entire 
> preemption process to fix it properly. Maybe we can always apply the 
> weight-based approach using absoluteCapacity, or rewrite the code of Resource 
> to remove the casting, or always roundUp when calculating a queue's 
> guaranteed capacity, etc.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10788) TestCsiClient fails

2022-02-13 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491702#comment-17491702
 ] 

Akira Ajisaka commented on YARN-10788:
--

Thank you [~ayushtkn] for the investigation. I created a PR to reduce the path 
length.
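
For background (an assumption stated here, not in the JIRA text): AF_UNIX socket 
paths on Linux are limited to roughly 108 bytes, so a deeply nested build 
directory can push the test's socket path over the limit and make bind() fail. A 
small illustrative check, with hypothetical names:
{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SocketPathSketch {
  // Roughly the sun_path limit for AF_UNIX sockets on Linux; illustrative constant.
  private static final int AF_UNIX_PATH_LIMIT = 108;

  // Hypothetical helper: fail early with a readable message instead of a bare
  // "Failed to bind" when the chosen socket path is too long.
  public static File checkedSocketPath(File dir, String name) throws IOException {
    File sock = new File(dir, name);
    int len = sock.getAbsolutePath().getBytes(StandardCharsets.UTF_8).length;
    if (len >= AF_UNIX_PATH_LIMIT) {
      throw new IOException("Unix domain socket path too long (" + len
          + " bytes): " + sock.getAbsolutePath());
    }
    return sock;
  }
}
{code}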

> TestCsiClient fails
> ---
>
> Key: YARN-10788
> URL: https://issues.apache.org/jira/browse/YARN-10788
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
> Attachments: 
> patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-csi.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TestCsiClient fails to bind to unix domain socket.
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/518/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-csi.txt
> {noformat}
> [INFO] Running org.apache.hadoop.yarn.csi.client.TestCsiClient
> [ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 0.67 
> s <<< FAILURE! - in org.apache.hadoop.yarn.csi.client.TestCsiClient
> [ERROR] testIdentityService(org.apache.hadoop.yarn.csi.client.TestCsiClient)  
> Time elapsed: 0.457 s  <<< ERROR!
> java.io.IOException: Failed to bind
>   at io.grpc.netty.NettyServer.start(NettyServer.java:257)
>   at io.grpc.internal.ServerImpl.start(ServerImpl.java:184)
>   at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
>   at 
> org.apache.hadoop.yarn.csi.client.FakeCsiDriver.start(FakeCsiDriver.java:56)
>   at 
> org.apache.hadoop.yarn.csi.client.TestCsiClient.testIdentityService(TestCsiClient.java:72)
>  {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10788) TestCsiClient fails

2022-02-13 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-10788:


Assignee: Akira Ajisaka

> TestCsiClient fails
> ---
>
> Key: YARN-10788
> URL: https://issues.apache.org/jira/browse/YARN-10788
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
> Attachments: 
> patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-csi.txt
>
>
> TestCsiClient fails to bind to unix domain socket.
> https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/518/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-csi.txt
> {noformat}
> [INFO] Running org.apache.hadoop.yarn.csi.client.TestCsiClient
> [ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 0.67 
> s <<< FAILURE! - in org.apache.hadoop.yarn.csi.client.TestCsiClient
> [ERROR] testIdentityService(org.apache.hadoop.yarn.csi.client.TestCsiClient)  
> Time elapsed: 0.457 s  <<< ERROR!
> java.io.IOException: Failed to bind
>   at io.grpc.netty.NettyServer.start(NettyServer.java:257)
>   at io.grpc.internal.ServerImpl.start(ServerImpl.java:184)
>   at io.grpc.internal.ServerImpl.start(ServerImpl.java:90)
>   at 
> org.apache.hadoop.yarn.csi.client.FakeCsiDriver.start(FakeCsiDriver.java:56)
>   at 
> org.apache.hadoop.yarn.csi.client.TestCsiClient.testIdentityService(TestCsiClient.java:72)
>  {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10561) Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application catalog webapp

2022-01-28 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10561:
-
Summary: Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application 
catalog webapp  (was: Upgrade node.js to at least 12.x in YARN application 
catalog webapp)

> Upgrade node.js to 12.22.1 and yarn to 1.22.5 in YARN application catalog 
> webapp
> 
>
> Key: YARN-10561
> URL: https://issues.apache.org/jira/browse/YARN-10561
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> YARN application catalog webapp is using node.js 8.11.3, and 8.x are already 
> EoL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10561) Upgrade node.js to at least 12.x in YARN application catalog webapp

2022-01-27 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10561:
-
Target Version/s: 3.4.0, 3.3.2  (was: 3.4.0)

> Upgrade node.js to at least 12.x in YARN application catalog webapp
> ---
>
> Key: YARN-10561
> URL: https://issues.apache.org/jira/browse/YARN-10561
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> YARN application catalog webapp is using node.js 8.11.3, and 8.x are already 
> EoL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10561) Upgrade node.js to at least 12.x in YARN application catalog webapp

2022-01-27 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-10561:
-
Priority: Critical  (was: Major)

> Upgrade node.js to at least 12.x in YARN application catalog webapp
> ---
>
> Key: YARN-10561
> URL: https://issues.apache.org/jira/browse/YARN-10561
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> YARN application catalog webapp is using node.js 8.11.3, and 8.x are already 
> EoL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10561) Upgrade node.js to at least 12.x in YARN application catalog webapp

2022-01-27 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483550#comment-17483550
 ] 

Akira Ajisaka commented on YARN-10561:
--

I'm facing this error when building, so I'm raising the priority:
{code}
[INFO] --- frontend-maven-plugin:1.11.2:yarn (yarn install) @ 
hadoop-yarn-applications-catalog-webapp ---
[INFO] testFailureIgnore property is ignored in non test phases
[INFO] Running 'yarn ' in 
/home/runner/work/hadoop-document/hadoop-document/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-webapp/target
[INFO] yarn install v1.7.0
[INFO] info No lockfile found.
[INFO] [1/4] Resolving packages...
[INFO] [2/4] Fetching packages...
[INFO] error safe-stable-stringify@2.3.1: The engine "node" is incompatible 
with this module. Expected version ">=10".
[INFO] error Found incompatible module
[INFO] info Visit https://yarnpkg.com/en/docs/cli/install for documentation 
about this command.
{code}

> Upgrade node.js to at least 12.x in YARN application catalog webapp
> ---
>
> Key: YARN-10561
> URL: https://issues.apache.org/jira/browse/YARN-10561
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: webapp
>Reporter: Akira Ajisaka
>Assignee: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> YARN application catalog webapp is using node.js 8.11.3, and 8.x are already 
> EoL.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11068) Exclude transitive log4j2 dependency coming from solr 8

2022-01-27 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483223#comment-17483223
 ] 

Akira Ajisaka commented on YARN-11068:
--

Hadoop 3.3 has the same issue.
{code}
[INFO] +- org.apache.solr:solr-core:jar:7.7.0:test
[INFO] |  +- org.apache.logging.log4j:log4j-1.2-api:jar:2.11.0:test
[INFO] |  +- org.apache.logging.log4j:log4j-api:jar:2.11.0:test
[INFO] |  +- org.apache.logging.log4j:log4j-core:jar:2.11.0:test
[INFO] |  +- org.apache.logging.log4j:log4j-slf4j-impl:jar:2.11.0:test
{code}
BTW, I don't think this issue is a blocker because the scope is test-only and the 
jar files are not in the binary tarball.

> Exclude transitive log4j2 dependency coming from solr 8
> ---
>
> Key: YARN-11068
> URL: https://issues.apache.org/jira/browse/YARN-11068
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.4.0
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Similar to HADOOP-18092, we have transitive log4j2 dependency coming from 
> solr-core 8 that must be excluded.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11065) Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui

2022-01-20 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11065:
-
Fix Version/s: 3.3.3

Backported to branch-3.3.

> Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui
> -
>
> Key: YARN-11065
> URL: https://issues.apache.org/jira/browse/YARN-11065
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Upgrade follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11065) Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui

2022-01-20 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11065.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

> Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui
> -
>
> Key: YARN-11065
> URL: https://issues.apache.org/jira/browse/YARN-11065
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Upgrade follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11065) Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui

2022-01-20 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479336#comment-17479336
 ] 

Akira Ajisaka commented on YARN-11065:
--

Merged [https://github.com/apache/hadoop/pull/3890] into trunk. The PR was 
created by dependabot, so I left the assignee empty.

> Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui
> -
>
> Key: YARN-11065
> URL: https://issues.apache.org/jira/browse/YARN-11065
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Akira Ajisaka
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Upgrade follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11065) Bump follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui

2022-01-20 Thread Akira Ajisaka (Jira)
Akira Ajisaka created YARN-11065:


 Summary: Bump follow-redirects from 1.13.3 to 1.14.7 in 
hadoop-yarn-ui
 Key: YARN-11065
 URL: https://issues.apache.org/jira/browse/YARN-11065
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-ui-v2
Reporter: Akira Ajisaka


Upgrade follow-redirects from 1.13.3 to 1.14.7 in hadoop-yarn-ui.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11055) In cgroups-operations.c some fprintf format strings don't end with "\n"

2022-01-16 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11055.
--
Fix Version/s: 3.4.0
   3.2.4
   3.3.3
   Resolution: Fixed

Committed to trunk, branch-3.3, and branch-3.2. Thank you [~jira.shegalov] for 
your contribution.

> In cgroups-operations.c some fprintf format strings don't end with "\n" 
> 
>
> Key: YARN-11055
> URL: https://issues.apache.org/jira/browse/YARN-11055
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0, 3.3.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Minor
>  Labels: cgroups, easyfix, pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> In cgroups-operations.c some {{fprintf}} format strings are missing a newline 
> character at the end, leading to hard-to-parse error message output. 
> Example: 
> https://github.com/apache/hadoop/blame/b225287913ac366a531eacfa0266adbdf03d883e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/cgroups/cgroups-operations.c#L130
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11053) AuxService should not use class name as default system classes

2022-01-03 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-11053:
-
Fix Version/s: 3.3.2
   (was: 3.3.3)

Cherry-picked to branch-3.3.2.

> AuxService should not use class name as default system classes
> --
>
> Key: YARN-11053
> URL: https://issues.apache.org/jira/browse/YARN-11053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: auxservices
>Affects Versions: 3.3.1
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Following the Apache Spark documentation to configure the Spark Shuffle Service 
> as a YARN AuxService,
> [https://spark.apache.org/docs/3.2.0/running-on-yarn.html#running-multiple-versions-of-the-spark-shuffle-service]
>  
> {code:java}
>   <property>
>     <name>yarn.nodemanager.aux-services</name>
>     <value>spark_shuffle</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
>     <value>/opt/apache/spark/yarn/*</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>     <value>org.apache.spark.network.yarn.YarnShuffleService</value>
>   </property>
> {code}
>  but it failed with an exception
> {code:java}
> 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: classpath: 
> [file:/opt/apache/spark/yarn/spark-3.2.0-yarn-shuffle.jar]
> 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: system classes: 
> [org.apache.spark.network.yarn.YarnShuffleService]
> 2021-12-02 15:34:00,887 INFO service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in state INITED
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
> at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
> at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:165)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452)
> ... 10 more
> {code}
> A workaround is adding
> {code:java}
> <property>
>   <name>yarn.nodemanager.aux-services.spark_shuffle.system-classes</name>
>   <value>not.existed.class</value>
> </property>
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9967) Fix NodeManager failing to start when Hdfs Auxillary Jar is set

2021-12-23 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-9967.
-
Resolution: Duplicate

Fixed by YARN-11053. Closing as duplicate.

> Fix NodeManager failing to start when Hdfs Auxillary Jar is set
> ---
>
> Key: YARN-9967
> URL: https://issues.apache.org/jira/browse/YARN-9967
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: auxservices, nodemanager
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> Loading an auxiliary jar from an HDFS location on a node manager fails with a 
> ClassNotFoundException
> {code:java}
> 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> classpath: []
> 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> system classes: [java., javax.accessibility., javax.activation., 
> javax.activity., javax.annotation., javax.annotation.processing., 
> javax.crypto., javax.imageio., javax.jws., javax.lang.model., 
> -javax.management.j2ee., javax.management., javax.naming., javax.net., 
> javax.print., javax.rmi., javax.script., -javax.security.auth.message., 
> javax.security.auth., javax.security.cert., javax.security.sasl., 
> javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., 
> -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., 
> org.xml.sax., org.apache.commons.logging., org.apache.log4j., 
> -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, 
> hdfs-default.xml, mapred-default.xml, yarn-default.xml]
> 2019-11-08 03:59:49,257 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in state INITED
> java.lang.ClassNotFoundException: org.apache.auxtest.AuxServiceFromHDFS
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:270)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:321)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:478)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:936)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1016)
> {code}
> *Repro:*
> {code:java}
> 1. Prepare a custom auxiliary service jar and place it on hdfs
> [hdfs@yarndocker-1 yarn]$ cat TestShuffleHandler2.java 
> package org;
> import org.apache.hadoop.yarn.server.api.AuxiliaryService;
> import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
> import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext;
> import java.nio.ByteBuffer;
> public class TestShuffleHandler2 extends AuxiliaryService {
> public static final String MAPREDUCE_TEST_SHUFFLE_SERVICEID = 
> "test_shuffle2";
> public TestShuffleHandler2() {
>   super("testshuffle2");
> }
> @Override
> public void initializeApplication(ApplicationInitializationContext 
> context) {
> }
> @Override
> public void stopApplication(ApplicationTerminationContext context) {
> }
> @Override
> public synchronized ByteBuffer getMetaData() {
>   return ByteBuffer.allocate(0); 
> }
>   }
>   
> [hdfs@yarndocker-1 yarn]$ javac -d . -cp `hadoop classpath` 
> TestShuffleHandler2.java 
> [hdfs@yarndocker-1 yarn]$ jar cvf auxhdfs.jar org/
> [hdfs@yarndocker-1 mapreduce]$ hadoop fs -mkdir

[jira] [Resolved] (YARN-11053) AuxService should not use class name as default system classes

2021-12-23 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka resolved YARN-11053.
--
Fix Version/s: 3.4.0
   3.3.3
   Resolution: Fixed

Committed to trunk and branch-3.3.

> AuxService should not use class name as default system classes
> --
>
> Key: YARN-11053
> URL: https://issues.apache.org/jira/browse/YARN-11053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: auxservices
>Affects Versions: 3.3.1
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Following the Apache Spark documentation to configure the Spark Shuffle Service 
> as a YARN AuxService,
> [https://spark.apache.org/docs/3.2.0/running-on-yarn.html#running-multiple-versions-of-the-spark-shuffle-service]
>  
> {code:java}
>   <property>
>     <name>yarn.nodemanager.aux-services</name>
>     <value>spark_shuffle</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
>     <value>/opt/apache/spark/yarn/*</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>     <value>org.apache.spark.network.yarn.YarnShuffleService</value>
>   </property>
> {code}
>  but it failed with an exception
> {code:java}
> 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: classpath: 
> [file:/opt/apache/spark/yarn/spark-3.2.0-yarn-shuffle.jar]
> 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: system classes: 
> [org.apache.spark.network.yarn.YarnShuffleService]
> 2021-12-02 15:34:00,887 INFO service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in state INITED
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
> at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
> at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:165)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452)
> ... 10 more
> {code}
> A workaround is adding
> {code:java}
> <property>
>   <name>yarn.nodemanager.aux-services.spark_shuffle.system-classes</name>
>   <value>not.existed.class</value>
> </property>
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11030) ClassNotFoundException when aux service class is loaded from customized classpath

2021-12-23 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464861#comment-17464861
 ] 

Akira Ajisaka commented on YARN-11030:
--

Thank you for your response. I'll merge the PR shortly.

> ClassNotFoundException when aux service class is loaded from customized 
> classpath
> -
>
> Key: YARN-11030
> URL: https://issues.apache.org/jira/browse/YARN-11030
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Hiroyuki Adachi
>Priority: Minor
>
> NodeManager failed to load the aux service with ClassNotFoundException while 
> loading the class from the customized classpath.
> {noformat}
> 
>   
>    value="org.apache.spark.network.yarn.YarnShuffleService"/>
>    value="/tmp/spark-3.1.2-yarn-shuffle.jar"/>
>   
>  {noformat}
> {noformat}
> 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> classpath: [file:/tmp/spark-3.1.2-yarn-shuffle.jar]
> 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> system classes: [org.apache.spark.network.yarn.YarnShuffleService]
> 2021-12-06 15:32:09,169 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in
>  state INITED
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>         at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>         at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>         at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
>         at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:348)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.ja
> va:165)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452)
>         ... 10 more
> 2021-12-06 15:32:09,172 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  
> failed in state INITED{noformat}
>  
> YARN-9075 may cause this problem. The default system classes were changed by 
> this patch.
> Before YARN-9075: isSystemClass() returns false since the system classes do 
> not contain the aux service class itself, and the class is loaded from 
> the customized classpath.
> [https://github.com/apache/hadoop/blob/rel/release-3.3.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ApplicationClassLoader.java#L176]
> {noformat}
> 2021-12-06 15:50:21,332 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> classpath: [file:/tmp/spark-3.1.2-yarn-shuffle.jar]
> 2021-12-06 15:50:21,332 INFO org.apache.hadoop.util.ApplicationClas
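
A rough, hypothetical sketch of the parent-delegation rule described in the 
quoted description above. The class and method names here (SimpleAuxClassLoader, 
isSystemClass) are illustrative only and this is not Hadoop's actual 
ApplicationClassLoader implementation:

{code:java}
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

// Hypothetical, simplified sketch; not Hadoop's actual ApplicationClassLoader.
public class SimpleAuxClassLoader extends URLClassLoader {

  private final List<String> systemClasses;

  public SimpleAuxClassLoader(URL[] customClasspath, ClassLoader parent,
      List<String> systemClasses) {
    super(customClasspath, parent);
    this.systemClasses = systemClasses;
  }

  // A class whose name matches a "system classes" entry is always delegated
  // to the parent loader.
  private boolean isSystemClass(String name) {
    return systemClasses.stream().anyMatch(name::startsWith);
  }

  @Override
  protected Class<?> loadClass(String name, boolean resolve)
      throws ClassNotFoundException {
    if (isSystemClass(name)) {
      // If the parent classpath does not contain the class (e.g. the aux
      // service jar only exists on the custom classpath), this throws
      // ClassNotFoundException, which is the failure mode quoted above.
      return getParent().loadClass(name);
    }
    synchronized (getClassLoadingLock(name)) {
      Class<?> c = findLoadedClass(name);
      if (c == null) {
        try {
          // Non-system classes are looked up on the custom classpath first.
          c = findClass(name);
        } catch (ClassNotFoundException e) {
          c = getParent().loadClass(name);
        }
      }
      if (resolve) {
        resolveClass(c);
      }
      return c;
    }
  }
}
{code}

Under a rule like this, listing the aux service class itself as a system class 
forces the lookup onto the NodeManager's own classpath, which does not contain 
the custom jar, hence the ClassNotFoundException.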

[jira] [Assigned] (YARN-11053) AuxService should not use class name as default system classes

2021-12-23 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka reassigned YARN-11053:


Assignee: Cheng Pan

> AuxService should not use class name as default system classes
> --
>
> Key: YARN-11053
> URL: https://issues.apache.org/jira/browse/YARN-11053
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: auxservices
>Affects Versions: 3.3.1
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Following the Apache Spark documentation to configure the Spark Shuffle Service 
> as a YARN AuxService,
> [https://spark.apache.org/docs/3.2.0/running-on-yarn.html#running-multiple-versions-of-the-spark-shuffle-service]
>  
> {code:java}
>   <property>
>     <name>yarn.nodemanager.aux-services</name>
>     <value>spark_shuffle</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
>     <value>/opt/apache/spark/yarn/*</value>
>   </property>
>   <property>
>     <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
>     <value>org.apache.spark.network.yarn.YarnShuffleService</value>
>   </property>
> {code}
>  but it failed with an exception
> {code:java}
> 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: classpath: 
> [file:/opt/apache/spark/yarn/spark-3.2.0-yarn-shuffle.jar]
> 2021-12-02 15:34:00,886 INFO util.ApplicationClassLoader: system classes: 
> [org.apache.spark.network.yarn.YarnShuffleService]
> 2021-12-02 15:34:00,887 INFO service.AbstractService: Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in state INITED
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
> at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
> at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:165)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452)
> ... 10 more
> {code}
> A workaround is adding
> {code:java}
> <property>
>   <name>yarn.nodemanager.aux-services.spark_shuffle.system-classes</name>
>   <value>not.existed.class</value>
> </property>
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9967) Fix NodeManager failing to start when Hdfs Auxillary Jar is set

2021-12-23 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464429#comment-17464429
 ] 

Akira Ajisaka commented on YARN-9967:
-

Hi [~minni31], there is already a PR: https://github.com/apache/hadoop/pull/3816
Would you check this?

> Fix NodeManager failing to start when Hdfs Auxillary Jar is set
> ---
>
> Key: YARN-9967
> URL: https://issues.apache.org/jira/browse/YARN-9967
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: auxservices, nodemanager
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> Loading an auxiliary jar from an HDFS location on a node manager fails with a 
> ClassNotFoundException
> {code:java}
> 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> classpath: []
> 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> system classes: [java., javax.accessibility., javax.activation., 
> javax.activity., javax.annotation., javax.annotation.processing., 
> javax.crypto., javax.imageio., javax.jws., javax.lang.model., 
> -javax.management.j2ee., javax.management., javax.naming., javax.net., 
> javax.print., javax.rmi., javax.script., -javax.security.auth.message., 
> javax.security.auth., javax.security.cert., javax.security.sasl., 
> javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., 
> -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., 
> org.xml.sax., org.apache.commons.logging., org.apache.log4j., 
> -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, 
> hdfs-default.xml, mapred-default.xml, yarn-default.xml]
> 2019-11-08 03:59:49,257 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in state INITED
> java.lang.ClassNotFoundException: org.apache.auxtest.AuxServiceFromHDFS
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:270)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:321)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:478)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:936)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1016)
> {code}
> *Repro:*
> {code:java}
> 1. Prepare a custom auxiliary service jar and place it on hdfs
> [hdfs@yarndocker-1 yarn]$ cat TestShuffleHandler2.java 
> package org;
> import org.apache.hadoop.yarn.server.api.AuxiliaryService;
> import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
> import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext;
> import java.nio.ByteBuffer;
> public class TestShuffleHandler2 extends AuxiliaryService {
> public static final String MAPREDUCE_TEST_SHUFFLE_SERVICEID = 
> "test_shuffle2";
> public TestShuffleHandler2() {
>   super("testshuffle2");
> }
> @Override
> public void initializeApplication(ApplicationInitializationContext 
> context) {
> }
> @Override
> public void stopApplication(ApplicationTerminationContext context) {
> }
> @Override
> public synchronized ByteBuffer getMetaData() {
>   return ByteBuffer.allocate(0); 
> }
>   }
>   
> [hdfs@yarndocker-1 yarn]$ javac -d . -cp `hadoop classpath` 
> TestShuffleHandler2.java 
> [hdfs@yarnd

[jira] [Commented] (YARN-11030) ClassNotFoundException when aux service class is loaded from customized classpath

2021-12-23 Thread Akira Ajisaka (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464426#comment-17464426
 ] 

Akira Ajisaka commented on YARN-11030:
--

I found that YARN-11053 duplicates this issue, and there is a PR: 
https://github.com/apache/hadoop/pull/3816

Hi [~hadachi], does the PR fix this problem?

> ClassNotFoundException when aux service class is loaded from customized 
> classpath
> -
>
> Key: YARN-11030
> URL: https://issues.apache.org/jira/browse/YARN-11030
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.3.1
>Reporter: Hiroyuki Adachi
>Priority: Minor
>
> NodeManager failed to load the aux service with ClassNotFoundException while 
> loading the class from the customized classpath.
> {noformat}
> 
>   
>    value="org.apache.spark.network.yarn.YarnShuffleService"/>
>    value="/tmp/spark-3.1.2-yarn-shuffle.jar"/>
>   
>  {noformat}
> {noformat}
> 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> classpath: [file:/tmp/spark-3.1.2-yarn-shuffle.jar]
> 2021-12-06 15:32:09,168 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> system classes: [org.apache.spark.network.yarn.YarnShuffleService]
> 2021-12-06 15:32:09,169 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in
>  state INITED
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:482)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:761)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>         at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:327)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>         at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:494)
>         at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:962)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1042)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.network.yarn.YarnShuffleService
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>         at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
>         at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:348)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.ja
> va:165)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxServiceFromLocalClasspath(AuxServices.java:242)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.createAuxService(AuxServices.java:271)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.initAuxService(AuxServices.java:452)
>         ... 10 more
> 2021-12-06 15:32:09,172 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  
> failed in state INITED{noformat}
>  
> YARN-9075 may cause this problem. The default system classes were changed by 
> this patch.
> Before YARN-9075: isSystemClass() returns false since the system classes do 
> not contain the aux service class itself, and the class is loaded from 
> the customized classpath.
> [https://github.com/apache/hadoop/blob/rel/release-3.3.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ApplicationClassLoader.java#L176]
> {noformat}
> 2021-12-06 15:50:21,332 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> classpath: [file:/tmp/spark-3

[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2021-12-23 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-8234:

Release Note: When Timeline Service V1 or V1.5 is used, if 
"yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch" 
is set to true, ResourceManager sends timeline events in batches. The default 
value is false. If this functionality is enabled, the maximum number of events 
published in a batch is configured by 
"yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size". 
The default value is 1000. The interval of publishing events can be configured 
by 
"yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds".
 By default, it is set to 60 seconds.  (was: If 
"yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch" 
is set to true, ResourceManager sends timeline events in batches. The default 
value is false. If this functionality is enabled, the maximum number of events 
published in a batch is configured by 
"yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size". 
The default value is 1000. The interval of publishing events can be configured 
by 
"yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds".
 By default, it is set to 60 seconds.)
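
For illustration, a minimal sketch of setting these keys programmatically. Only 
the key names and default values come from the release note above; the class 
name below is an assumption:

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch only: the key names come from the release note above; the values are
// the defaults it mentions. The class name is illustrative.
public class EnableBatchPublishingExample {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setBoolean(
        "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch",
        true);
    conf.setInt(
        "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size",
        1000);
    conf.setInt(
        "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds",
        60);
    System.out.println("batch publishing enabled: " + conf.getBoolean(
        "yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch",
        false));
  }
}
{code}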

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Ashutosh Gupta
>Priority: Critical
>  Labels: pull-request-available
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, 
> YARN-8234.003.patch, YARN-8234.004.patch
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> When the system metrics publisher is enabled, the RM pushes events to the 
> timeline server via its RESTful API. If the cluster load is heavy, many events 
> are sent to the timeline server and the timeline server's event handler thread 
> gets locked. YARN-7266 discussed the details of this problem. Because of the 
> lock, the timeline server can't receive events as fast as they are generated 
> in the RM, and lots of timeline events stay in the RM's memory. Eventually 
> those events consume all of the RM's memory and the RM starts a full GC (which 
> causes a JVM stop-the-world pause and a timeout from the RM to ZooKeeper) or 
> even hits an OOM.
> The main problem here is that the timeline server can't receive events as fast 
> as they are generated. Currently the RM system metrics publisher puts only one 
> event in a request, and most of the time is spent handling HTTP headers or the 
> network connection on the timeline side. Only a small fraction of the time is 
> spent on the timeline event itself, which is the truly valuable part.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches via one request. With 
> the batch size set to 1000, in our experiment the speed at which the timeline 
> server receives events improved 100x. We have implemented this function in our 
> production environment, which accepts 2 apps in one hour, and it works fine.
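
A rough sketch of the buffering idea described above, assuming a simple bounded 
batch with a periodic flush. This is illustrative only, not the actual 
SystemMetricsPublisher implementation:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the batching idea; not the actual Hadoop code.
public class BatchingPublisherSketch {

  private final LinkedBlockingQueue<String> buffer = new LinkedBlockingQueue<>();
  private final ScheduledExecutorService flusher =
      Executors.newSingleThreadScheduledExecutor();
  private final int batchSize;

  public BatchingPublisherSketch(int batchSize, long intervalSeconds) {
    this.batchSize = batchSize;
    // Periodic flush so events are not held in the buffer for too long.
    flusher.scheduleAtFixedRate(this::flush, intervalSeconds, intervalSeconds,
        TimeUnit.SECONDS);
  }

  public void putEvent(String event) {
    buffer.add(event);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  private synchronized void flush() {
    List<String> batch = new ArrayList<>();
    buffer.drainTo(batch, batchSize);
    if (!batch.isEmpty()) {
      // One request per batch instead of one request per event.
      sendToTimelineServer(batch);
    }
  }

  // Stand-in for the REST call to the timeline server.
  private void sendToTimelineServer(List<String> batch) {
    System.out.println("PUT " + batch.size() + " events in one request");
  }

  public void stop() throws InterruptedException {
    flusher.shutdown();
    flusher.awaitTermination(5, TimeUnit.SECONDS);
    flush();
  }
}
{code}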



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2021-12-23 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-8234:

Release Note: If 
"yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.enable-batch" 
is set to true, ResourceManager sends timeline events in batches. The default 
value is false. If this functionality is enabled, the maximum number of events 
published in a batch is configured by 
"yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.batch-size". 
The default value is 1000. The interval of publishing events can be configured 
by 
"yarn.resourcemanager.system-metrics-publisher.timeline-server-v1.interval-seconds".
 By default, it is set to 60 seconds.

> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Ashutosh Gupta
>Priority: Critical
>  Labels: pull-request-available
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, 
> YARN-8234.003.patch, YARN-8234.004.patch
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> When the system metrics publisher is enabled, the RM pushes events to the 
> timeline server via its RESTful API. If the cluster load is heavy, many events 
> are sent to the timeline server and the timeline server's event handler thread 
> gets locked. YARN-7266 discussed the details of this problem. Because of the 
> lock, the timeline server can't receive events as fast as they are generated 
> in the RM, and lots of timeline events stay in the RM's memory. Eventually 
> those events consume all of the RM's memory and the RM starts a full GC (which 
> causes a JVM stop-the-world pause and a timeout from the RM to ZooKeeper) or 
> even hits an OOM.
> The main problem here is that the timeline server can't receive events as fast 
> as they are generated. Currently the RM system metrics publisher puts only one 
> event in a request, and most of the time is spent handling HTTP headers or the 
> network connection on the timeline side. Only a small fraction of the time is 
> spent on the timeline event itself, which is the truly valuable part.
> In this issue, we add a buffer to the system metrics publisher and let the 
> publisher send events to the timeline server in batches via one request. With 
> the batch size set to 1000, in our experiment the speed at which the timeline 
> server receives events improved 100x. We have implemented this function in our 
> production environment, which accepts 2 apps in one hour, and it works fine.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch

2021-12-23 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated YARN-8234:

Description: 
When the system metrics publisher is enabled, the RM pushes events to the 
timeline server via its RESTful API. If the cluster load is heavy, many events 
are sent to the timeline server and the timeline server's event handler thread 
gets locked. YARN-7266 discussed the details of this problem. Because of the 
lock, the timeline server can't receive events as fast as they are generated in 
the RM, and lots of timeline events stay in the RM's memory. Eventually those 
events consume all of the RM's memory and the RM starts a full GC (which causes 
a JVM stop-the-world pause and a timeout from the RM to ZooKeeper) or even hits 
an OOM.

The main problem here is that the timeline server can't receive events as fast 
as they are generated. Currently the RM system metrics publisher puts only one 
event in a request, and most of the time is spent handling HTTP headers or the 
network connection on the timeline side. Only a small fraction of the time is 
spent on the timeline event itself, which is the truly valuable part.

In this issue, we add a buffer to the system metrics publisher and let the 
publisher send events to the timeline server in batches via one request. With 
the batch size set to 1000, in our experiment the speed at which the timeline 
server receives events improved 100x. We have implemented this function in our 
production environment, which accepts 2 apps in one hour, and it works fine.

  was:
When the system metrics publisher is enabled, the RM pushes events to the 
timeline server via its RESTful API. If the cluster load is heavy, many events 
are sent to the timeline server and the timeline server's event handler thread 
gets locked. YARN-7266 discussed the details of this problem. Because of the 
lock, the timeline server can't receive events as fast as they are generated in 
the RM, and lots of timeline events stay in the RM's memory. Eventually those 
events consume all of the RM's memory and the RM starts a full GC (which causes 
a JVM stop-the-world pause and a timeout from the RM to ZooKeeper) or even hits 
an OOM.

The main problem here is that the timeline server can't receive events as fast 
as they are generated. Currently the RM system metrics publisher puts only one 
event in a request, and most of the time is spent handling HTTP headers or the 
network connection on the timeline side. Only a small fraction of the time is 
spent on the timeline event itself, which is the truly valuable part.

In this issue, we add a buffer to the system metrics publisher and let the 
publisher send events to the timeline server in batches via one request. With 
the batch size set to 1000, in our experiment the speed at which the timeline 
server receives events improved 100x. We have implemented this function in our 
production environment, which accepts 2 apps in one hour, and it works fine.

We add the following configuration:
 * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of 
events the system metrics publisher sends in one request. The default value is 
1000.
 * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the 
event buffer in the system metrics publisher.
 * yarn.resourcemanager.system-metrics-publisher.interval-seconds: When batch 
publishing is enabled, we must avoid having the publisher wait for a batch to 
fill up and hold events in the buffer for a long time. So we add another thread 
that sends the events in the buffer periodically. This config sets the interval 
of that periodic sending thread. The default value is 60s.
 


> Improve RM system metrics publisher's performance by pushing events to 
> timeline server in batch
> ---
>
> Key: YARN-8234
> URL: https://issues.apache.org/jira/browse/YARN-8234
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager, timelineserver
>Affects Versions: 2.8.3
>Reporter: Hu Ziqian
>Assignee: Ashutosh Gupta
>Priority: Critical
>  Labels: pull-request-available
> Attachments: YARN-8234-branch-2.8.3.001.patch, 
> YARN-8234-branch-2.8.3.002.patch, YARN-8234-branch-2.8.3.003.patch, 
> YARN-8234-branch-2.8.3.004.patch, YARN-8234.001.patch, YARN-8234.002.patch, 
> YARN-8234.003.patch, YARN-8234.004.patch
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> When the system metrics publisher is enabled, the RM pushes events to the 
> timeline server via its RESTful API. If the cluster load is heavy, many events 
> are sent to the timeline server and the timeline server's event handler thread 
> gets locked. YARN-7266 discussed the details of this problem. Because of the 
> lock, the timeline server can't receive events as fast as they are generated 
> in the RM, and lots of timeline events stay in the RM's memory. Eventually 
> those events consume all of the RM's memory and the RM starts a full GC (which causes a JVM st
