[jira] [Created] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty
Jim Brennan created YARN-10855: -- Summary: yarn logs cli fails to retrieve logs if any TFile is corrupt or empty Key: YARN-10855 URL: https://issues.apache.org/jira/browse/YARN-10855 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.3.1, 2.10.1, 3.2.2, 3.4.0 Reporter: Jim Brennan When attempting to retrieve yarn logs via the CLI command, it failed with the following stack trace (on branch-2.10): {noformat} yarn logs -applicationId application_1591017890475_1049740 > logs 20/06/05 19:15:50 INFO client.RMProxy: Connecting to ResourceManager 20/06/05 19:15:51 INFO client.AHSProxy: Connecting to Application History server Exception in thread "main" java.io.EOFException: Cannot seek to negative offset at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1701) at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65) at org.apache.hadoop.io.file.tfile.BCFile$Reader.(BCFile.java:624) at org.apache.hadoop.io.file.tfile.TFile$Reader.(TFile.java:804) at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.(AggregatedLogFormat.java:503) at org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:227) at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:333) at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:367) {noformat} The problem was that there was a zero-length TFile for one of the containers in the application aggregated log directory in hdfs. When we removed the zero length file, {{yarn logs}} was able to retrieve the logs. A corrupt or zero length TFile for one container should not prevent loading logs for the rest of the application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
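A minimal sketch of the per-container isolation the description asks for (not the actual LogCLIHelpers code; dumpSingleNodeFile is a hypothetical stand-in for the TFile reader): skip zero-length files up front and catch read errors per file, so one bad TFile only loses that node's logs instead of the whole application's.
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AggregatedLogDumper {
  /**
   * Dump logs for every per-node file under the application's aggregated log
   * directory, skipping empty or unreadable TFiles instead of aborting the
   * whole application.
   */
  public static void dumpAllContainerLogs(Configuration conf, Path appLogDir)
      throws IOException {
    FileSystem fs = appLogDir.getFileSystem(conf);
    for (FileStatus nodeFile : fs.listStatus(appLogDir)) {
      if (nodeFile.getLen() == 0) {
        System.err.println("Skipping empty aggregated log file: " + nodeFile.getPath());
        continue;
      }
      try {
        dumpSingleNodeFile(fs, nodeFile.getPath());
      } catch (IOException e) {
        // A corrupt TFile should only lose this node's logs, not the whole app.
        System.err.println("Skipping corrupt log file " + nodeFile.getPath() + ": " + e);
      }
    }
  }

  // Hypothetical helper: in the real CLI this would wrap AggregatedLogFormat.LogReader.
  private static void dumpSingleNodeFile(FileSystem fs, Path file) throws IOException {
  }
}
{code}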
[jira] [Resolved] (YARN-10733) TimelineService Hbase tests are failing with timeout error on branch-2.10
[ https://issues.apache.org/jira/browse/YARN-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan resolved YARN-10733. Fix Version/s: 2.10.2 Resolution: Fixed Thanks [~ahussein], I have committed this to branch-2.10. > TimelineService Hbase tests are failing with timeout error on branch-2.10 > - > > Key: YARN-10733 > URL: https://issues.apache.org/jira/browse/YARN-10733 > Project: Hadoop YARN > Issue Type: Bug > Components: test, timelineserver, yarn >Affects Versions: 2.10.0 >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Labels: pull-request-available > Fix For: 2.10.2 > > Attachments: 2021-04-12T12-40-21_403-jvmRun1.dump, > 2021-04-12T12-40-58_857.dumpstream, > org.apache.hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRunCompaction-output.txt.zip > > Time Spent: 0.5h > Remaining Estimate: 0h > > {code:bash} > 03:54:41 [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on > project hadoop-yarn-server-timelineservice-hbase-tests: There was a timeout > or other error in the fork -> [Help 1] > 03:54:41 [ERROR] > 03:54:41 [ERROR] To see the full stack trace of the errors, re-run Maven with > the -e switch. > 03:54:41 [ERROR] Re-run Maven using the -X switch to enable full debug > logging. > 03:54:41 [ERROR] > 03:54:41 [ERROR] For more information about the errors and possible > solutions, please read the following articles: > 03:54:41 [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException > 03:54:41 [ERROR] > 03:54:41 [ERROR] After correcting the problems, you can resume the build with > the command > 03:54:41 [ERROR] mvn -rf > :hadoop-yarn-server-timelineservice-hbase-tests > {code} > Failure of the tests is due to test unit > {{TestHBaseStorageFlowRunCompaction}} getting stuck. > Upon checking the surefire reports, I found several Class no Found Exceptions. > {code:bash} > Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/CanUnbuffer > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:763) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) > at java.net.URLClassLoader$1.run(URLClassLoader.java:363) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:362) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.hadoop.hbase.regionserver.StoreFileInfo.(StoreFileInfo.java:66) > at > org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698) > at > org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895) > at > org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009) > at > org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638) > ... 
33 more > Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanUnbuffer > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 51 more > {code} > and > {code:bash} > Caused by: java.lang.NoClassDefFoundError: Could not initialize class > org.apache.hadoop.hbase.regionserver.StoreFileInfo > at > org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698) > at > org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895) > at > org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009) > at > org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523) > at > org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638) > ... 10 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail:
[jira] [Created] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor
Jim Brennan created YARN-10702: -- Summary: Add cluster metric for amount of CPU used by RM Event Processor Key: YARN-10702 URL: https://issues.apache.org/jira/browse/YARN-10702 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 2.10.1, 3.4.0 Reporter: Jim Brennan Assignee: Jim Brennan Add a cluster metric to track the cpu usage of the ResourceManager Event Processing thread. This lets us know when the critical path of the RM is running out of headroom. This feature was originally added for us internally by [~nroberts] and we've been running with it on production clusters for nearly four years. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10665) TestContainerManagerRecovery sometimes fails
Jim Brennan created YARN-10665: -- Summary: TestContainerManagerRecovery sometimes fails Key: YARN-10665 URL: https://issues.apache.org/jira/browse/YARN-10665 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.4.0 Reporter: Jim Brennan Assignee: Jim Brennan TestContainerManagerRecovery sometimes fails when I run it on the Mac because it cannot bind to a port. I believe this is because it calls getPort with a hard-coded port number (49160) instead of just passing zero. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
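For reference, a test can let the OS pick a free port by binding to port 0 instead of a hard-coded number; a minimal sketch of that idea (the test's own getPort helper is not shown here):
{code:java}
import java.io.IOException;
import java.net.ServerSocket;

public class EphemeralPort {
  /** Ask the OS for a currently free port by binding to port 0, then release it. */
  public static int findFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      return socket.getLocalPort();
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("Free port: " + findFreePort());
  }
}
{code}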
[jira] [Created] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV
Jim Brennan created YARN-10664: -- Summary: Allow parameter expansion in NM_ADMIN_USER_ENV Key: YARN-10664 URL: https://issues.apache.org/jira/browse/YARN-10664 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 2.10.1, 3.4.0 Reporter: Jim Brennan Assignee: Jim Brennan Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter expansion. That is, you cannot reference an environment variable such as {{JAVA_HOME}} in the value and have it expanded to {{$JAVA_HOME}} inside the container. We have a need for this to specify different java gc options for java processes running inside yarn containers based on which version of java is being used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
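A minimal sketch of what such expansion could look like (the {{VAR}} syntax, regex, and helper are illustrative assumptions, not the committed implementation): rewrite variable references using the nodemanager's environment, falling back to shell-style $VAR so the container shell can expand anything the NM does not know about.
{code:java}
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AdminEnvExpander {
  private static final Pattern VAR_REF = Pattern.compile("\\{\\{(\\w+)\\}\\}");

  /** Replace {{VAR}} references with values from the given environment,
   *  leaving unknown variables as $VAR so the shell can expand them later. */
  public static String expand(String value, Map<String, String> env) {
    Matcher m = VAR_REF.matcher(value);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      String name = m.group(1);
      String replacement = env.getOrDefault(name, "$" + name);
      m.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(expand("-Xmx256m -Djava.home={{JAVA_HOME}}",
        Map.of("JAVA_HOME", "/usr/lib/jvm/java-11")));
  }
}
{code}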
[jira] [Resolved] (YARN-5853) TestDelegationTokenRenewer#testRMRestartWithExpiredToken fails intermittently on Power
[ https://issues.apache.org/jira/browse/YARN-5853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan resolved YARN-5853. --- Resolution: Duplicate This is fixed by YARN-10500 > TestDelegationTokenRenewer#testRMRestartWithExpiredToken fails intermittently > on Power > -- > > Key: YARN-5853 > URL: https://issues.apache.org/jira/browse/YARN-5853 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0-alpha1 > Environment: # uname -a > Linux pts00452-vm10 3.10.0-327.el7.ppc64le #1 SMP Thu Oct 29 17:31:13 EDT > 2015 ppc64le ppc64le ppc64le GNU/Linux > # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: Yussuf Shaikh >Priority: Major > > The test testRMRestartWithExpiredToken fails intermittently with the > following error: > Stacktrace: > java.lang.AssertionError: null > at org.junit.Assert.fail(Assert.java:86) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertNotNull(Assert.java:621) > at org.junit.Assert.assertNotNull(Assert.java:631) > at > org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken(TestDelegationTokenRenewer.java:1060) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race
Jim Brennan created YARN-10562: -- Summary: Alternate fix for DirectoryCollection.checkDirs() race Key: YARN-10562 URL: https://issues.apache.org/jira/browse/YARN-10562 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.4.0 Reporter: Jim Brennan Assignee: Jim Brennan YARN-9833 addressed a race condition in DirectoryCollection: {{getGoodDirs()}} and related methods were returning an unmodifiable view of the lists. These accesses were protected by read/write locks, but because the lists are CopyOnWriteArrayLists, subsequent changes to the list, even when done under the write lock, were exposed when a caller started iterating the list view. CopyOnWriteArrayLists cache the current underlying list in the iterator, so it is safe to iterate them even while they are being changed - at least the view will be consistent. The problem was that checkDirs() was clearing the lists and rebuilding them from scratch every time, so if a caller called getGoodDirs() just before checkDirs cleared it, and then started iterating right after the clear, they could get an empty list. The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to return a copy of the list, which definitely fixes the race condition. The disadvantage is that now we create a new copy of these lists every time we launch a container. The advantage of using CopyOnWriteArrayList was that the lists should rarely ever change, and we can avoid all the copying. Unfortunately, the way checkDirs() was written, it guaranteed that it would modify those lists multiple times every time it ran. So this Jira proposes an alternate solution for YARN-9833, which mainly just rewrites checkDirs() to minimize the changes to the underlying lists. There are still some small windows where a disk will have been added to one list but not yet removed from another if you hit it just right, but I think these should be pretty rare and relatively harmless, and in the vast majority of cases I suspect only one disk will be moving from one list to another at any time. The question is whether this type of inconsistency (which was always there before YARN-9833) is worth it to avoid all the copying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
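A plain-JDK sketch of the difference (not the DirectoryCollection code itself): the first method reproduces the clear-then-rebuild pattern that can expose an empty snapshot to a concurrent reader, while the second applies only the delta, so the list is never empty unless disks really were removed.
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CheckDirsSketch {
  private final List<String> goodDirs =
      new CopyOnWriteArrayList<>(List.of("/disk1", "/disk2"));

  /** Racy variant: a reader that grabs an iterator between clear() and addAll()
   *  sees an empty snapshot even though no disk actually went bad. */
  void rebuildFromScratch(List<String> stillGood) {
    goodDirs.clear();
    goodDirs.addAll(stillGood);
  }

  /** Alternate approach: only apply the delta, so the visible snapshots are
   *  never emptier than the real state of the disks. */
  void applyDelta(List<String> stillGood) {
    goodDirs.retainAll(stillGood);            // drop dirs that went bad
    for (String dir : stillGood) {
      if (!goodDirs.contains(dir)) {
        goodDirs.add(dir);                    // add dirs that recovered
      }
    }
  }
}
{code}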
[jira] [Created] (YARN-10542) Node Utilization on UI is misleading if nodes don't report utilization
Jim Brennan created YARN-10542: -- Summary: Node Utilization on UI is misleading if nodes don't report utilization Key: YARN-10542 URL: https://issues.apache.org/jira/browse/YARN-10542 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Jim Brennan Assignee: Jim Brennan As reported in YARN-10540, if the ResourceCalculatorPlugin fails to initialize, the nodes will report no utilization. This makes the RM UI misleading, because it presents cluster-wide and per node utilization as 0 instead of indicating that it is not being tracked. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Resolved] (YARN-10485) TimelineConnector swallows InterruptedException
[ https://issues.apache.org/jira/browse/YARN-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan resolved YARN-10485. Fix Version/s: 3.2.3 3.4.1 3.1.5 3.3.1 Resolution: Fixed Thanks for the contribution [~ahussein] and [~daryn]! I have committed this to trunk, branch-3.3, branch-3.2, and branch-3.1. > TimelineConnector swallows InterruptedException > --- > > Key: YARN-10485 > URL: https://issues.apache.org/jira/browse/YARN-10485 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ahmed Hussein >Assignee: Ahmed Hussein >Priority: Major > Fix For: 3.3.1, 3.1.5, 3.4.1, 3.2.3 > > > Some tests timeout or take excessively long to shutdown because the > {{TimelineConnector}} will catch InterruptedException and go into a retry > loop instead of aborting. > [~daryn] reported that this makes debugging more difficult and he suggests > the exception to be thrown. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
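For illustration, a generic retry loop that aborts on interrupt rather than swallowing it; this is a sketch of the pattern, not the TimelineConnector code.
{code:java}
import java.util.concurrent.Callable;

public class InterruptAwareRetry {
  /** Retry an operation, but stop promptly if the thread is interrupted
   *  instead of catching the InterruptedException and looping again. */
  public static <T> T retry(Callable<T> op, int attempts, long sleepMs) throws Exception {
    Exception last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        return op.call();
      } catch (InterruptedException ie) {
        // Restore the interrupt flag and abort so shutdown is not delayed.
        Thread.currentThread().interrupt();
        throw ie;
      } catch (Exception e) {
        last = e;
        Thread.sleep(sleepMs);   // throws InterruptedException, which also aborts the loop
      }
    }
    throw last;
  }
}
{code}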
[jira] [Created] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions
Jim Brennan created YARN-10479: -- Summary: RMProxy should retry on SocketTimeout Exceptions Key: YARN-10479 URL: https://issues.apache.org/jira/browse/YARN-10479 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 2.10.1, 3.4.1 Reporter: Jim Brennan Assignee: Jim Brennan During an incident involving a DNS outage, a large number of nodemanagers failed to come back into service because they hit a socket timeout when trying to re-register with the RM. SocketTimeoutException is not currently one of the exceptions that the RMProxy will retry. Based on this incident, it seems like it should be. We made this change internally about a year ago and it has been running in production since. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
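A plain-Java sketch of the intended behavior (not the actual RMProxy retry policy): treat SocketTimeoutException like ConnectException during registration and keep retrying with backoff.
{code:java}
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public class RegistrationRetry {
  /** Keep retrying on connect and socket-timeout failures (e.g. during a DNS
   *  outage) instead of giving up on the first SocketTimeoutException. */
  public static <T> T callWithRetry(Callable<T> rpc, int maxAttempts, long backoffMs)
      throws Exception {
    IOException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return rpc.call();
      } catch (SocketTimeoutException | ConnectException e) {
        last = e;                          // transient in this scenario: retry
        Thread.sleep(backoffMs * attempt); // simple linear backoff
      }
    }
    throw last;
  }
}
{code}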
[jira] [Created] (YARN-10478) Make RM-NM heartbeat scaling calculator pluggable
Jim Brennan created YARN-10478: -- Summary: Make RM-NM heartbeat scaling calculator pluggable Key: YARN-10478 URL: https://issues.apache.org/jira/browse/YARN-10478 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Jim Brennan [YARN-10475] adds a feature to enable scaling the interval for heartbeats between the RM and NM based on CPU utilization. [~bibinchundatt] suggested that we make this pluggable so that other calculations can be used if desired. The configuration properties added in [YARN-10475] should be applicable to any heartbeat calculator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
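One possible shape for such a plug-in point; the interface name and signature below are assumptions for illustration, not the committed API.
{code:java}
/** Hypothetical plug-in point: given the configured intervals and current
 *  utilization, return the next heartbeat interval for a node in milliseconds. */
public interface HeartbeatIntervalCalculator {
  long nextInterval(long baseIntervalMs, long minIntervalMs, long maxIntervalMs,
                    float nodeCpuUtilization, float clusterCpuUtilization);
}

/** Trivial implementation that ignores utilization and keeps the fixed interval. */
class FixedIntervalCalculator implements HeartbeatIntervalCalculator {
  @Override
  public long nextInterval(long baseIntervalMs, long minIntervalMs, long maxIntervalMs,
                           float nodeCpuUtilization, float clusterCpuUtilization) {
    return baseIntervalMs;
  }
}
{code}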
[jira] [Resolved] (YARN-10477) runc launch failure should not cause nodemanager to go unhealthy
[ https://issues.apache.org/jira/browse/YARN-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan resolved YARN-10477. Resolution: Invalid Closing this as invalid. The problem was only there in our internal version of container-executor. I should have checked the code in trunk before filing. > runc launch failure should not cause nodemanager to go unhealthy > > > Key: YARN-10477 > URL: https://issues.apache.org/jira/browse/YARN-10477 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.3.1, 3.4.1 >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > We have observed some failures when launching containers with runc. We have > not yet identified the root cause of those failures, but a side-effect of > these failures was the Nodemanager marked itself unhealthy. Since these are > rare failures that only affect a single launch, they should not cause the > Nodemanager to be marked unhealthy. > Here is an example RM log: > {noformat} > resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event > dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with > details: Linux Container Executor reached unrecoverable exception > {noformat} > And here is an example of the NM log: > {noformat} > 2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO > runtime.RuncContainerRuntime: Launch container failed for > container_e25_1601602719874_10691_01_001723 > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: > ExitCodeException exitCode=24: OCI command has bad/missing local dire > ctories > {noformat} > The problem is that the runc code in container-executor is re-using exit code > 24 (INVALID_CONFIG_FILE) which is intended for problems with the > container-executor.cfg file, and those failures are fatal for the NM. We > should use a different exit code for these. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10477) runc launch failure should not cause nodemanager to go unhealthy
Jim Brennan created YARN-10477: -- Summary: runc launch failure should not cause nodemanager to go unhealthy Key: YARN-10477 URL: https://issues.apache.org/jira/browse/YARN-10477 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.3.1, 3.4.1 Reporter: Jim Brennan Assignee: Jim Brennan We have observed some failures when launching containers with runc. We have not yet identified the root cause of those failures, but a side-effect of these failures was the Nodemanager marked itself unhealthy. Since these are rare failures that only affect a single launch, they should not cause the Nodemanager to be marked unhealthy. Here is an example RM log: {noformat} resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with details: Linux Container Executor reached unrecoverable exception {noformat} And here is an example of the NM log: {noformat} 2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO runtime.RuncContainerRuntime: Launch container failed for container_e25_1601602719874_10691_01_001723 org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=24: OCI command has bad/missing local dire ctories {noformat} The problem is that the runc code in container-executor is re-using exit code 24 (INVALID_CONFIG_FILE) which is intended for problems with the container-executor.cfg file, and those failures are fatal for the NM. We should use a different exit code for these. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization
Jim Brennan created YARN-10475: -- Summary: Scale RM-NM heartbeat interval based on node utilization Key: YARN-10475 URL: https://issues.apache.org/jira/browse/YARN-10475 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 2.10.1, 3.4.1 Reporter: Jim Brennan Assignee: Jim Brennan Add the ability to scale the RM-NM heartbeat interval based on node cpu utilization compared to overall cluster cpu utilization. If a node is over-utilized compared to the rest of the cluster, its heartbeat interval slows down. If it is under-utilized compared to the rest of the cluster, its heartbeat interval speeds up. This is a feature we have been running with internally in production for several years. It was developed by [~nroberts], based on the observation that larger, faster nodes on our cluster were under-utilized compared to smaller, slower nodes. This feature is dependent on [YARN-10450], which added cluster-wide utilization metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
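A minimal sketch of one possible scaling rule, assuming the interval scales with the ratio of node CPU utilization to the cluster-wide average and is clamped to configured bounds; this is illustrative only, not the exact YARN-10475 algorithm or configuration.
{code:java}
public class HeartbeatScaling {
  /** Scale the heartbeat interval by the ratio of node CPU utilization to the
   *  cluster-wide average, clamped to [minMs, maxMs]. Over-utilized nodes
   *  heartbeat less often; under-utilized nodes heartbeat more often. */
  public static long scaledInterval(long baseMs, long minMs, long maxMs,
                                    float nodeCpu, float clusterCpu) {
    if (clusterCpu <= 0f) {
      return baseMs;                        // no cluster data yet: keep the default
    }
    double ratio = nodeCpu / clusterCpu;    // > 1 means busier than average
    long interval = Math.round(baseMs * ratio);
    return Math.max(minMs, Math.min(maxMs, interval));
  }

  public static void main(String[] args) {
    // Busy node (80% CPU) on a 40%-utilized cluster: interval doubles from 1000ms.
    System.out.println(scaledInterval(1000, 100, 3000, 0.8f, 0.4f));
  }
}
{code}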
[jira] [Created] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics
Jim Brennan created YARN-10450: -- Summary: Add cpu and memory utilization per node and cluster-wide metrics Key: YARN-10450 URL: https://issues.apache.org/jira/browse/YARN-10450 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.3.1 Reporter: Jim Brennan Assignee: Jim Brennan Add metrics to show actual cpu and memory utilization for each node and aggregated for the entire cluster. This information is already passed from the NM to the RM in the node status update. We have been running with this internally for quite a while and found it useful to be able to quickly see the actual cpu/memory utilization on the node/cluster. It's especially useful if some form of overcommit is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10369) Make NMTokenSecretManagerInRM sending NMToken for nodeId DEBUG
Jim Brennan created YARN-10369: -- Summary: Make NMTokenSecretManagerInRM sending NMToken for nodeId DEBUG Key: YARN-10369 URL: https://issues.apache.org/jira/browse/YARN-10369 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.4.0 Reporter: Jim Brennan This message is logged at the info level, but it doesn't really add much information. We changed this to DEBUG internally years ago and haven't missed it. {noformat} 2020-07-27 21:51:29,027 INFO [RM Event dispatcher] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for nodeId : localhost.localdomain:45454 for container : container_1595886659189_0001_01_01 {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10363) TestRMAdminCLI.testHelp is failing in branch-2.10
Jim Brennan created YARN-10363: -- Summary: TestRMAdminCLI.testHelp is failing in branch-2.10 Key: YARN-10363 URL: https://issues.apache.org/jira/browse/YARN-10363 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.10.1 Reporter: Jim Brennan TestRMAdminCLI.testHelp is failing in branch-2.10. Example failure: {noformat} --- Test set: org.apache.hadoop.yarn.client.cli.TestRMAdminCLI --- Tests run: 31, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 18.668 s <<< FAILURE! - in org.apache.hadoop.yarn.client.cli.TestRMAdminCLI testHelp(org.apache.hadoop.yarn.client.cli.TestRMAdminCLI) Time elapsed: 0.043 s <<< FAILURE! java.lang.AssertionError: Expected error message: Usage: yarn rmadmin [-failover [--forcefence] [--forceactive] ] is not included in messages: Usage: yarn rmadmin -refreshQueues -refreshNodes [-g|graceful [timeout in seconds] -client|server] -refreshNodesResources -refreshSuperUserGroupsConfiguration -refreshUserToGroupsMappings -refreshAdminAcls -refreshServiceAcl -getGroups [username] -addToClusterNodeLabels <"label1(exclusive=true),label2(exclusive=false),label3"> -removeFromClusterNodeLabels (label splitted by ",") -replaceLabelsOnNode <"node1[:port]=label1,label2 node2[:port]=label1,label2"> [-failOnUnknownNodes] -directlyAccessNodeLabelStore -refreshClusterMaxPriority -updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]) -help [cmd] Generic options supported are: -conf specify an application configuration file -Ddefine a value for a given property -fs specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations. -jt specify a ResourceManager -files specify a comma-separated list of files to be copied to the map reduce cluster -libjarsspecify a comma-separated list of jar files to be included in the classpath -archives specify a comma-separated list of archives to be unarchived on the compute machines The general command line syntax is: command [genericOptions] [commandOptions] at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testError(TestRMAdminCLI.java:859) at org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:585) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at 
org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
[jira] [Created] (YARN-10353) Log vcores used and cumulative cpu in containers monitor
Jim Brennan created YARN-10353: -- Summary: Log vcores used and cumulative cpu in containers monitor Key: YARN-10353 URL: https://issues.apache.org/jira/browse/YARN-10353 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.4.0 Reporter: Jim Brennan Assignee: Jim Brennan We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the Containers Monitor log. It would be useful to also log vcores used vs vcores assigned, and total accumulated CPU time. For example, currently we have an audit log that looks like this: {noformat} 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 CPU/core:35.772625 {noformat} The proposal is to add two more fields to show vCores and Cumulative CPU ms: {noformat} 2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit (ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 CPU/core:35.772625 vCores:2/1 CPU-ms:4180 {noformat} This is a snippet of a log from one of our clusters running branch-2.8 with a similar change. {noformat} 2020-07-16 21:00:02,240 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 5267 for container-id container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of 10 CPU vCores used. Cumulative CPU time: 157410 2020-07-16 21:00:02,269 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 18801 for container-id container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 of 10 CPU vCores used. Cumulative CPU time: 113830 2020-07-16 21:00:02,298 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 5279 for container-id container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of 10 CPU vCores used. Cumulative CPU time: 128630 2020-07-16 21:00:02,339 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 24189 for container-id container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 of 10 CPU vCores used. Cumulative CPU time: 96060 2020-07-16 21:00:02,367 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 6751 for container-id container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB physical memory used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of 10 CPU vCores used. Cumulative CPU time: 116820 2020-07-16 21:00:02,396 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 12138 for container-id container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB physical memory used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of 10 CPU vCores used. 
Cumulative CPU time: 45900 2020-07-16 21:00:02,424 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 101918 for container-id container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB physical memory used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of 10 CPU vCores used. Cumulative CPU time: 2572390 2020-07-16 21:00:02,456 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 26596 for container-id container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 GB physical memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 of 10 CPU vCores used. Cumulative CPU time: 101210 {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10348) Allow RM to always cancel tokens after app completes
Jim Brennan created YARN-10348: -- Summary: Allow RM to always cancel tokens after app completes Key: YARN-10348 URL: https://issues.apache.org/jira/browse/YARN-10348 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.1.3, 2.10.0 Reporter: Jim Brennan Assignee: Jim Brennan (Note: this change was originally done on our internal branch by [~daryn]). The RM currently has an option for a client to specify disabling token cancellation when a job completes. This feature was an initial attempt to address the use case of a job launching sub-jobs (i.e. the oozie launcher) and the original job finishing prior to the sub-job(s) completing - e.g. original job completion triggered premature cancellation of tokens needed by the sub-jobs. Many years ago, [~daryn] added a more robust implementation to ref count tokens ([YARN-3055]). This prevented premature cancellation of the token until all apps using the token complete, and invalidated the need for a client to specify cancel=false. Unfortunately the config option was not removed. We have seen cases where oozie "java actions" and some users were explicitly disabling token cancellation. This can lead to a buildup of defunct tokens that may overwhelm the ZK buffer used by the KMS's backing store, at which point the KMS fails to connect to ZK and is unable to issue/validate new tokens - rendering the KMS only able to authenticate pre-existing tokens. Production incidents have occurred due to the buffer size issue. To avoid these issues, the RM should have the option to ignore/override the client's request to not cancel tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10312) Add support for yarn logs -logFile to retain backward compatibility
Jim Brennan created YARN-10312: -- Summary: Add support for yarn logs -logFile to retain backward compatibility Key: YARN-10312 URL: https://issues.apache.org/jira/browse/YARN-10312 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.10.0, 3.4.1 Reporter: Jim Brennan The YARN CLI logs command line option {{-logFiles}} was changed to {{-log_files}} in 2.9 and later releases. This change was made as part of YARN-5363. Verizon Media is in the process of moving from Hadoop-2.8 to Hadoop-2.10, and while testing integration with Spark, we ran into this issue. We are concerned that we will run into more cases of this as we roll out to production, and rather than break user scripts, we'd prefer to add {{-logFiles}} as an alias of {{-log_files}}. If both are provided, {{-logFiles}} will be ignored. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
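One simple way to honor the old spelling is to rewrite the argument list before parsing; a sketch under that assumption (not the actual LogsCLI patch):
{code:java}
import java.util.ArrayList;
import java.util.List;

public class LogsCliArgs {
  /** Map the pre-2.9 option name -logFiles onto -log_files so old scripts keep
   *  working; if both are present, the legacy spelling is dropped. */
  public static String[] normalize(String[] args) {
    boolean hasNew = List.of(args).contains("-log_files");
    List<String> out = new ArrayList<>();
    for (int i = 0; i < args.length; i++) {
      if (args[i].equals("-logFiles")) {
        if (hasNew) {
          i++;                 // skip the legacy option and its value
          continue;
        }
        out.add("-log_files"); // translate the legacy spelling; its value follows as-is
      } else {
        out.add(args[i]);
      }
    }
    return out.toArray(new String[0]);
  }
}
{code}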
[jira] [Created] (YARN-10227) Pull YARN-8242 back to branch-2.10
Jim Brennan created YARN-10227: -- Summary: Pull YARN-8242 back to branch-2.10 Key: YARN-10227 URL: https://issues.apache.org/jira/browse/YARN-10227 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 2.10.0, 2.10.1 Reporter: Jim Brennan Assignee: Jim Brennan We have recently seen the nodemanager OOM issue reported in YARN-8242 during a rolling upgrade. Our code is currently based on branch-2.8, but we are in the process of moving to 2.10. I checked and YARN-8242 pulls back to branch-2.10 pretty cleanly. The only conflict was a minor one in TestNMLeveldbStateStoreService.java. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT
Jim Brennan created YARN-10161: -- Summary: TestRouterWebServicesREST is corrupting STDOUT Key: YARN-10161 URL: https://issues.apache.org/jira/browse/YARN-10161 Project: Hadoop YARN Issue Type: Test Components: yarn Affects Versions: 2.10.0 Reporter: Jim Brennan TestRouterWebServicesREST is creating processes that inherit stdin/stdout from the current process, so the output from those jobs goes into the standard output of mvn test. Here's an example from a recent build: {noformat} [WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 1. See FAQ web page and the dump file /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream [INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 41.644 s - in org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST [WARNING] ForkStarter IOException: 506 INFO [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 522 INFO [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - registered UNIX signal handlers for [TERM, HUP, INT] 876 INFO [main] conf.Configuration (Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not found 879 INFO [main] security.Groups (Groups.java:refresh(402)) - clearing userToGroupsMap cache 930 INFO [main] conf.Configuration (Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml not found 930 INFO [main] resource.ResourceUtils (ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 'resource-types.xml'. 940 INFO [main] resource.ResourceUtils (ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE 940 INFO [main] resource.ResourceUtils (ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name = vcores, units = , type = COUNTABLE 974 INFO [main] conf.Configuration (Configuration.java:getConfResourceAsInputStream(2591)) - found resource yarn-site.xml at file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml 001 INFO [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - Registering class org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher 053 INFO [main] security.NMTokenSecretManagerInRM (NMTokenSecretManagerInRM.java:(75)) - NMTokenKeyRollingInterval: 8640ms and NMTokenKeyActivationDelay: 90ms 060 INFO [main] security.RMContainerTokenSecretManager (RMContainerTokenSecretManager.java:(79)) - ContainerTokenKeyRollingInterval: 8640ms and ContainerTokenKeyActivationDelay: 90ms ... {noformat} It seems like these processes should be rerouting stdout/stderr to a file instead of dumping it to the console. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
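A minimal sketch of the suggested fix: launch the helper processes with stdout/stderr redirected to files via ProcessBuilder instead of inheriting the forked JVM's streams. The class and file names here are illustrative, not the test's actual code.
{code:java}
import java.io.File;
import java.io.IOException;

public class QuietSubProcess {
  /** Launch a helper daemon with its stdout/stderr captured in files under the
   *  test's target directory instead of inheriting the surefire JVM's streams. */
  public static Process launch(File logDir, String... command) throws IOException {
    ProcessBuilder pb = new ProcessBuilder(command);
    pb.redirectOutput(new File(logDir, "stdout.log"));
    pb.redirectError(new File(logDir, "stderr.log"));
    return pb.start();
  }
}
{code}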
[jira] [Created] (YARN-10072) TestCSAllocateCustomResource failures
Jim Brennan created YARN-10072: -- Summary: TestCSAllocateCustomResource failures Key: YARN-10072 URL: https://issues.apache.org/jira/browse/YARN-10072 Project: Hadoop YARN Issue Type: Test Components: yarn Affects Versions: 2.10.0 Reporter: Jim Brennan This test is failing for us consistently in our internal 2.10 based branch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks
Jim Brennan created YARN-9914: - Summary: Use separate configs for free disk space checking for full and not-full disks Key: YARN-9914 URL: https://issues.apache.org/jira/browse/YARN-9914 Project: Hadoop YARN Issue Type: Improvement Components: yarn Reporter: Jim Brennan Assignee: Jim Brennan [YARN-3943] added separate configurations for the nodemanager health check disk utilization full disk check: {{max-disk-utilization-per-disk-percentage}} - threshold for marking a good disk full {{disk-utilization-watermark-low-per-disk-percentage}} - threshold for marking a full disk as not full. On our clusters, we do not use these configs. We instead use {{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of percent of utilization. We have observed the same oscillation behavior as described in [YARN-3943] with this parameter. I would like to add an optional config to specify a separate threshold for marking a full disk as not full: {{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full {{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full disk is marked good. So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which would cause a disk to be marked full when free space goes below 5GB, and {{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the full state until free space goes above 10GB. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
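A small sketch of the proposed two-threshold behavior; the logic is illustrative only, with the second config name being the one proposed above.
{code:java}
public class DiskFullHysteresis {
  private final long minFreeBytes;        // min-free-space-per-disk-mb
  private final long highWatermarkBytes;  // disk-free-space-per-disk-high-watermark-mb
  private boolean full;

  public DiskFullHysteresis(long minFreeBytes, long highWatermarkBytes) {
    this.minFreeBytes = minFreeBytes;
    this.highWatermarkBytes = highWatermarkBytes;
  }

  /** A good disk is marked full below minFreeBytes, and is only marked good
   *  again once free space climbs above highWatermarkBytes, so it cannot
   *  oscillate around a single threshold. */
  public boolean update(long freeBytes) {
    if (!full && freeBytes < minFreeBytes) {
      full = true;
    } else if (full && freeBytes > highWatermarkBytes) {
      full = false;
    }
    return full;
  }
}
{code}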
[jira] [Resolved] (YARN-9906) When setting multi volumes through the "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS" setting is not valid
[ https://issues.apache.org/jira/browse/YARN-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Brennan resolved YARN-9906. --- Resolution: Invalid > When setting multi volumes throurh the "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS" > setting is not valid > --- > > Key: YARN-9906 > URL: https://issues.apache.org/jira/browse/YARN-9906 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: lynn >Priority: Major > Attachments: docker_volume_mounts.patch > > > As > [https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html#Application_Submission] > described, when I set the item "{{YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS" to > multi volumes mounts, the value is a comma-separated list of mounts.}} > > {quote}vars="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker, > > YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro;/etc/hadoop/conf:/etc/hadoop/conf" > hadoop jar hadoop-examples.jar pi -Dyarn.app.mapreduce.am.env=$vars \ > -Dmapreduce.map.env=$vars -Dmapreduce.reduce.env=$vars 10 100{quote} > I found the docker container can mount the first volume, so it can't be > running successfully without report error! > The code of > [DockerLinuxContainerRuntime.java|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java] > as follows: > {quote}if (environment.containsKey(ENV_DOCKER_CONTAINER_MOUNTS)) { > Matcher parsedMounts = USER_MOUNT_PATTERN.matcher( > environment.get(ENV_DOCKER_CONTAINER_MOUNTS)); > if (!parsedMounts.find()) { > throw new ContainerExecutionException( > "Unable to parse user supplied mount list: " > + environment.get(ENV_DOCKER_CONTAINER_MOUNTS)); > }{quote} > The regex pattern is in > [OCIContainerRuntime|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/OCIContainerRuntime.java] > as follows > {quote}static final Pattern USER_MOUNT_PATTERN = Pattern.compile( > "(?<=^|,)([^:\\x00]+):([^:\\x00]+)" + > "(:(r[ow]|(r[ow][+])?(r?shared|r?slave|r?private)))?(?:,|$)");{quote} > it is seperated by comma indeed, but when i read the code of submit the jar > to yarn , i find the code > [Apps.java|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Apps.java] > {quote}private static final Pattern VARVAL_SPLITTER = Pattern.compile( > "(?<=^|,)"// preceded by ',' or line begin > + '(' + Shell.ENV_NAME_REGEX + ')' // var group > + '=' > + "([^,]*)" // val group > ); > {quote} > It is sepearted by comma as the same. > So, I just modify the comma to semicolon(";") of the item > "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2
Jim Brennan created YARN-9844: - Summary: TestCapacitySchedulerPerf test errors in branch-2 Key: YARN-9844 URL: https://issues.apache.org/jira/browse/YARN-9844 Project: Hadoop YARN Issue Type: Bug Components: test, yarn Affects Versions: 2.10.0 Reporter: Jim Brennan These TestCapacitySchedulerPerf throughput tests are failing in branch-2: {{[ERROR] TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114 » ArrayIndexOutOfBounds}} {{[ERROR] TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114 » ArrayIndexOutOfBounds}} {{[ERROR] TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114 » ArrayIndexOutOfBounds}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file
Jim Brennan created YARN-9527: - Summary: Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file Key: YARN-9527 URL: https://issues.apache.org/jira/browse/YARN-9527 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.1.2, 2.8.5 Reporter: Jim Brennan A rogue ContainerLocalizer can get stuck in a loop continuously downloading the same file while generating an "Invalid event: LOCALIZED at LOCALIZED" exception on each iteration. Sometimes this continues long enough that it fills up a disk or depletes available inodes for the filesystem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-9442) container working directory has group read permissions
Jim Brennan created YARN-9442: - Summary: container working directory has group read permissions Key: YARN-9442 URL: https://issues.apache.org/jira/browse/YARN-9442 Project: Hadoop YARN Issue Type: Improvement Components: yarn Affects Versions: 3.2.2 Reporter: Jim Brennan Container working directories are currently created with permissions 0750, owned by the user and with the group set to the node manager group. Is there any reason why these directories need group read permissions? I have been testing with group read permissions removed and so far I haven't encountered any problems. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8656) container-executor should not write cgroup tasks files for docker containers
Jim Brennan created YARN-8656: - Summary: container-executor should not write cgroup tasks files for docker containers Key: YARN-8656 URL: https://issues.apache.org/jira/browse/YARN-8656 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker run}} to ensure that all processes for the container are placed into a cgroup under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. Docker creates a cgroup there with the docker container id as the name and all of the processes in the container go into that cgroup. container-executor has code in {{launch_docker_container_as_user()}} that then cherry-picks the PID of the docker container (usually the launch shell) and writes that into the {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively moving it from {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. So you end up with one process out of the container in the {{container_id}} cgroup, and the rest in the {{container_id/docker_container_id}} cgroup. Since we are passing the {{--cgroup-parent}} to docker, there is no need to manually write the container pid to the tasks file - we can just remove the code that does this in the docker case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8648) Container cgroups are leaked when using docker
Jim Brennan created YARN-8648: - Summary: Container cgroups are leaked when using docker Key: YARN-8648 URL: https://issues.apache.org/jira/browse/YARN-8648 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan Assignee: Jim Brennan When you run with docker and enable cgroups for cpu, docker creates cgroups for all resources on the system, not just for cpu. For instance, if the {{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, the nodemanager will create a cgroup for each container under {{/sys/fs/cgroup/cpu/hadoop-yarn}}. In the docker case, we pass this path via the {{--cgroup-parent}} command line argument. Docker then creates a cgroup for the docker container under that, for instance: {{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}. When the container exits, docker cleans up the {{docker_container_id}} cgroup, and the nodemanager cleans up the {{container_id}} cgroup, All is good under {{/sys/fs/cgroup/hadoop-yarn}}. The problem is that docker also creates that same hierarchy under every resource under {{/sys/fs/cgroup}}. On the rhel7 system I am using, these are: blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, perf_event, and systemd.So for instance, docker creates {{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but it only cleans up the leaf cgroup {{docker_container_id}}. Nobody cleans up the {{container_id}} cgroups for these other resources. On one of our busy clusters, we found > 100,000 of these leaked cgroups. I found this in our 2.8-based version of hadoop, but I have been able to repro with current hadoop. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8640) Restore previous state in container-executor if write_exit_code_file_as_nm fails
Jim Brennan created YARN-8640: - Summary: Restore previous state in container-executor if write_exit_code_file_as_nm fails Key: YARN-8640 URL: https://issues.apache.org/jira/browse/YARN-8640 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan Assignee: Jim Brennan The container-executor function {{write_exit_code_file_as_nm}} had a number of failure conditions where it just returns -1 without restoring previous state. This is not a problem in any of the places where it is currently called, but it could be a problem if future code changes call it before code that depends on the previous state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8518) test-container-executor test_is_empty() is broken
Jim Brennan created YARN-8518: - Summary: test-container-executor test_is_empty() is broken Key: YARN-8518 URL: https://issues.apache.org/jira/browse/YARN-8518 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan A new test was recently added to test-container-executor.c that has some problems. It is attempting to mkdir() a hard-coded path: /tmp/2938rf2983hcqnw8ud/emptydir This fails because the base directory is not there. These directories are not being cleaned up either. It should be using TEST_ROOT. I don't know what Jira this change was made under - the git commit from July 9 2018 does not reference a Jira. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
Jim Brennan created YARN-8515: - Summary: container-executor can crash with SIGPIPE after nodemanager restart Key: YARN-8515 URL: https://issues.apache.org/jira/browse/YARN-8515 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan Assignee: Jim Brennan When running with docker on large clusters, we have noticed that sometimes docker containers are not removed - they remain in the exited state, and the corresponding container-executor is no longer running. Upon investigation, we noticed that this always seemed to happen after a nodemanager restart. The sequence leading to the stranded docker containers is: # Nodemanager restarts # Containers are recovered and then run for a while # Containers are killed for some (legitimate) reason # Container-executor exits without removing the docker container. After reproducing this on a test cluster, we found that the container-executor was exiting due to a SIGPIPE. What is happening is that the shell command executor that is used to start container-executor has threads reading from c-e's stdout and stderr. When the NM is restarted, these threads are killed. Then when the container-executor continues executing after the container exits with error, it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is not handled, this crashes the container-executor before it can actually remove the docker container. We ran into this in branch 2.8. The way docker containers are removed has been completely redesigned in trunk, so I don't think it will lead to this exact failure, but after an NM restart, potentially any write to stderr or stdout in the container-executor could cause it to crash. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value
Jim Brennan created YARN-8444: - Summary: NodeResourceMonitor crashes on bad swapFree value Key: YARN-8444 URL: https://issues.apache.org/jira/browse/YARN-8444 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.2, 2.8.3 Reporter: Jim Brennan Assignee: Jim Brennan Saw this on a node that was having difficulty preempting containers. Can't have NodeResourceMonitor exiting. System was above 99% memory used at the time so it may only be something that happens when normal preemption isn't work right, but we should fix since this is a critical monitor to the health of the node. {noformat} 2018-06-04 14:28:08,539 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 110564 for container-id container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB physical memory used; 5.0 GB of 7.3 GB virtual memory used 2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource Monitor,5,main] threw an Exception. java.lang.NumberFormatException: For input string: "18446744073709551596" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Long.parseLong(Long.java:592) at java.lang.Long.parseLong(Long.java:631) at org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257) at org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591) at org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601) at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74) at org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193) 2018-06-04 14:28:30,747 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9330ms {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
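The failing value 18446744073709551596 is 2^64 - 20, i.e. a negative kernel counter printed as an unsigned 64-bit number, which Long.parseLong cannot represent. A defensive sketch of handling it (not the actual SysInfoLinux fix):
{code:java}
public class MemInfoValue {
  /** Parse a /proc/meminfo counter that the kernel may print as an unsigned
   *  64-bit value. Values above Long.MAX_VALUE are really small negative
   *  numbers, so treat them as 0 rather than crashing the monitor thread. */
  public static long parseKb(String raw) {
    try {
      long value = Long.parseUnsignedLong(raw.trim());
      return value < 0 ? 0L : value;   // negative means the unsigned value overflowed long
    } catch (NumberFormatException e) {
      return 0L;                       // malformed line: ignore rather than kill the monitor
    }
  }

  public static void main(String[] args) {
    System.out.println(parseKb("18446744073709551596"));  // prints 0 instead of throwing
  }
}
{code}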
[jira] [Created] (YARN-8071) Provide Spark-like API for setting Environment Variables to enable vars with commas
Jim Brennan created YARN-8071: - Summary: Provide Spark-like API for setting Environment Variables to enable vars with commas Key: YARN-8071 URL: https://issues.apache.org/jira/browse/YARN-8071 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.0.0 Reporter: Jim Brennan Assignee: Jim Brennan

YARN-6830 describes a problem where environment variables that contain commas cannot be specified via {{-Dmapreduce.map.env}}. For example, {{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}} will set {{MOUNTS}} to {{/tmp/foo}}.

In that Jira, [~aw] suggested that we change the API to provide a way to specify environment variables individually, the same way that Spark does.
{quote}
Rather than fight with a regex why not redefine the API instead?
-Dmapreduce.map.env.MODE=bar
-Dmapreduce.map.env.IMAGE_NAME=foo
-Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar
...
e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar
This greatly simplifies the input validation needed and makes it clear what is actually being defined.
{quote}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
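As a rough illustration of the suggested scheme, here is a self-contained sketch of how per-variable keys sidestep the comma problem; the prefix constant and helper method are assumptions for illustration only, not the eventual MapReduce/YARN implementation.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class PerVariableEnvSketch {

  // Keys of the form "mapreduce.map.env.<VAR>" map to one env variable each,
  // so values containing commas need no escaping or regex handling.
  static final String PREFIX = "mapreduce.map.env.";

  /** Collect every "<PREFIX><VAR>=<value>" property into an env map. */
  static Map<String, String> envFromProps(Map<String, String> props) {
    Map<String, String> env = new HashMap<>();
    for (Map.Entry<String, String> e : props.entrySet()) {
      if (e.getKey().startsWith(PREFIX)) {
        env.put(e.getKey().substring(PREFIX.length()), e.getValue());
      }
    }
    return env;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put(PREFIX + "MODE", "bar");
    props.put(PREFIX + "IMAGE_NAME", "foo");
    props.put(PREFIX + "MOUNTS", "/tmp/foo,/tmp/bar"); // commas survive intact
    System.out.println(envFromProps(props));
  }
}
{code}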
[jira] [Created] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators
Jim Brennan created YARN-8029: - Summary: YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators Key: YARN-8029 URL: https://issues.apache.org/jira/browse/YARN-8029 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.0.0 Reporter: Jim Brennan

The following docker-related environment variables specify a comma-separated list of mounts:
YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS
YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS

This is a problem because the hadoop -Dmapreduce.map.env option and related options use a comma as the delimiter. So if I put more than one mount in YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS, the commas in the variable's value will be treated as delimiters for the hadoop command line option, and all but the first mount will be ignored.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
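A tiny demonstration of the collision (the split logic and mount strings below are assumed for illustration; this is not the actual MapReduce option-parsing code): the outer split on ',' cannot tell the variable separator apart from the commas inside the mount list itself.
{code:java}
import java.util.Arrays;

public class CommaSplitCollision {
  public static void main(String[] args) {
    String mapEnv =
        "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/tmp/foo:/tmp/foo:ro,/tmp/bar:/tmp/bar:ro";

    // Naive "VAR=val,VAR=val" parsing, as a -Dmapreduce.map.env value is split:
    for (String kv : mapEnv.split(",")) {
      System.out.println(Arrays.toString(kv.split("=", 2)));
    }
    // Prints:
    //   [YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS, /tmp/foo:/tmp/foo:ro]
    //   [/tmp/bar:/tmp/bar:ro]   <- second mount is no longer attached to the variable
  }
}
{code}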
[jira] [Created] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13
Jim Brennan created YARN-8027: - Summary: Setting hostname of docker container breaks for --net=host in docker 1.13 Key: YARN-8027 URL: https://issues.apache.org/jira/browse/YARN-8027 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.0.0 Reporter: Jim Brennan Assignee: Jim Brennan

In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname argument to the docker run command to set the hostname in the container to something like ctr-e84-1520889172376-0001-01-01. This does not work when combined with the --net=host command line option in Docker 1.13.1: it causes multiple failures because clients cannot resolve that hostname. We haven't seen this before because we were using docker 1.12.6, which seems to ignore --hostname when you are using --net=host.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
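One possible mitigation, sketched hypothetically below, is to skip --hostname whenever the host network namespace is requested. This is not the real DockerLinuxContainerRuntime/DockerRunCommand code; the class, method, and image name are made up for illustration.
{code:java}
import java.util.ArrayList;
import java.util.List;

public class DockerRunArgsSketch {

  static List<String> buildRunArgs(String image, String network, String hostname) {
    List<String> args = new ArrayList<>();
    args.add("docker");
    args.add("run");
    args.add("--net=" + network);
    // With --net=host the container already shares the host's hostname;
    // forcing a YARN-generated one breaks name resolution on docker 1.13.
    if (!"host".equals(network) && hostname != null) {
      args.add("--hostname=" + hostname);
    }
    args.add(image);
    return args;
  }

  public static void main(String[] args) {
    System.out.println(buildRunArgs("hadoop-docker:latest", "host",
        "ctr-e84-1520889172376-0001-01-01"));   // no --hostname
    System.out.println(buildRunArgs("hadoop-docker:latest", "bridge",
        "ctr-e84-1520889172376-0001-01-01"));   // --hostname included
  }
}
{code}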
[jira] [Created] (YARN-7857) -fstack-check compilation flag causes binary incompatibility for container-executor between RHEL 6 and RHEL 7
Jim Brennan created YARN-7857: - Summary: -fstack-check compilation flag causes binary incompatibility for container-executor between RHEL 6 and RHEL 7 Key: YARN-7857 URL: https://issues.apache.org/jira/browse/YARN-7857 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0 Reporter: Jim Brennan Assignee: Jim Brennan

The segmentation fault in container-executor reported in [YARN-7796] appears to be due to a binary compatibility issue with the {{-fstack-check}} flag that was added in [YARN-6721]. Based on my testing, a container-executor (without the patch from [YARN-7796]) compiled on RHEL 6 with the -fstack-check flag always hits this segmentation fault when run on RHEL 7. But if you compile without this flag, the container-executor runs on RHEL 7 with no problems. I also verified this with a simple program that just does the copy_file. I think we need to either remove this flag or find a suitable alternative.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
[jira] [Created] (YARN-7678) Logging of container memory stats is missing in 2.8
Jim Brennan created YARN-7678: - Summary: Logging of container memory stats is missing in 2.8 Key: YARN-7678 URL: https://issues.apache.org/jira/browse/YARN-7678 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 3.0.0, 2.8.0 Reporter: Jim Brennan Assignee: Jim Brennan

YARN-3424 changed the logging of memory stats in ContainersMonitorImpl from INFO to DEBUG. We have found these log messages to be useful information in Out-of-Memory situations - they provide detail that helps show the memory profile of the container over time, which can be helpful in determining root cause. Here's an example message from YARN-3424:
{noformat}
2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used
{noformat}
Propose to change this to use a separate logger for this message, so that we can enable debug logging for it without enabling all of the other debug logging for ContainersMonitorImpl.

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
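A minimal sketch of the proposal, assuming commons-logging as used on branch-2. The ".audit" logger-name suffix mirrors the "ContainersMonitorImpl.audit" logger visible in the YARN-8444 log excerpt earlier in this digest, but the names and method below are illustrative only, not the eventual patch.
{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ContainersMonitorLoggingSketch {

  // Normal class logger for the monitor's regular, lower-volume messages.
  private static final Log LOG =
      LogFactory.getLog(ContainersMonitorLoggingSketch.class);

  // Dedicated logger for the high-volume per-container memory-usage message,
  // so operators can enable it at DEBUG without turning on every other DEBUG
  // message in the monitor.
  private static final Log AUDITLOG =
      LogFactory.getLog(ContainersMonitorLoggingSketch.class.getName() + ".audit");

  static void logMemoryUsage(String containerId, String usageSummary) {
    if (AUDITLOG.isDebugEnabled()) {
      AUDITLOG.debug("Memory usage of ProcessTree for container-id "
          + containerId + ": " + usageSummary);
    }
  }

  public static void main(String[] args) {
    logMemoryUsage("container_1427462602546_0002_01_08",
        "189.8 MB of 1 GB physical memory used");
    LOG.info("Finished one monitoring pass");
  }
}
{code}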