[jira] [Created] (YARN-10855) yarn logs cli fails to retrieve logs if any TFile is corrupt or empty

2021-07-15 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10855:
--

 Summary: yarn logs cli fails to retrieve logs if any TFile is 
corrupt or empty
 Key: YARN-10855
 URL: https://issues.apache.org/jira/browse/YARN-10855
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.3.1, 2.10.1, 3.2.2, 3.4.0
Reporter: Jim Brennan


When attempting to retrieve yarn logs via the CLI, the command failed with the 
following stack trace (on branch-2.10):
{noformat}
yarn logs -applicationId application_1591017890475_1049740 > logs
20/06/05 19:15:50 INFO client.RMProxy: Connecting to ResourceManager 
20/06/05 19:15:51 INFO client.AHSProxy: Connecting to Application History 
server 
Exception in thread "main" java.io.EOFException: Cannot seek to negative offset
at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1701)
at 
org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:624)
at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
at 
org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:503)
at 
org.apache.hadoop.yarn.logaggregation.LogCLIHelpers.dumpAllContainersLogs(LogCLIHelpers.java:227)
at org.apache.hadoop.yarn.client.cli.LogsCLI.run(LogsCLI.java:333)
at org.apache.hadoop.yarn.client.cli.LogsCLI.main(LogsCLI.java:367) 
{noformat}
The problem was that there was a zero-length TFile for one of the containers in 
the application aggregated log directory in hdfs.  When we removed the zero 
length file, {{yarn logs}} was able to retrieve the logs.

A corrupt or zero length TFile for one container should not prevent loading 
logs for the rest of the application.
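
For illustration only, here is a minimal sketch of the kind of change being proposed (not the actual patch); the class and the {{dumpContainerLogsFromReader()}} helper are made-up stand-ins for the existing loop in {{LogCLIHelpers.dumpAllContainersLogs()}}:
{code:java}
// Hedged sketch, not the committed fix: tolerate a corrupt or zero-length per-node
// TFile instead of letting it abort the whole dump.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat;

class SkipBadTFilesSketch {
  static void dumpAllNodeFiles(Configuration conf, RemoteIterator<FileStatus> nodeFiles)
      throws IOException {
    while (nodeFiles.hasNext()) {
      FileStatus nodeFile = nodeFiles.next();
      if (nodeFile.getLen() == 0) {
        // zero-length TFile: nothing to read, skip it rather than fail the dump
        System.err.println("Skipping empty aggregated log file " + nodeFile.getPath());
        continue;
      }
      AggregatedLogFormat.LogReader reader = null;
      try {
        reader = new AggregatedLogFormat.LogReader(conf, nodeFile.getPath());
        dumpContainerLogsFromReader(reader);          // hypothetical helper
      } catch (IOException ioe) {
        // corrupt TFile: report it and continue with the remaining node files
        System.err.println("Error opening " + nodeFile.getPath() + ": " + ioe);
      } finally {
        if (reader != null) {
          reader.close();
        }
      }
    }
  }

  private static void dumpContainerLogsFromReader(AggregatedLogFormat.LogReader reader) {
    // placeholder for the existing per-container dump logic
  }
}
{code}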






[jira] [Resolved] (YARN-10733) TimelineService Hbase tests are failing with timeout error on branch-2.10

2021-04-14 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-10733.

Fix Version/s: 2.10.2
   Resolution: Fixed

Thanks [~ahussein], I have committed this to branch-2.10.



> TimelineService Hbase tests are failing with timeout error on branch-2.10
> -
>
> Key: YARN-10733
> URL: https://issues.apache.org/jira/browse/YARN-10733
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: test, timelineserver, yarn
>Affects Versions: 2.10.0
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.10.2
>
> Attachments: 2021-04-12T12-40-21_403-jvmRun1.dump, 
> 2021-04-12T12-40-58_857.dumpstream, 
> org.apache.hadoop.yarn.server.timelineservice.storage.flow.TestHBaseStorageFlowRunCompaction-output.txt.zip
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> {code:bash}
> 03:54:41 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.22.2:test (default-test) on 
> project hadoop-yarn-server-timelineservice-hbase-tests: There was a timeout 
> or other error in the fork -> [Help 1]
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] To see the full stack trace of the errors, re-run Maven with 
> the -e switch.
> 03:54:41 [ERROR] Re-run Maven using the -X switch to enable full debug 
> logging.
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] For more information about the errors and possible 
> solutions, please read the following articles:
> 03:54:41 [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> 03:54:41 [ERROR] 
> 03:54:41 [ERROR] After correcting the problems, you can resume the build with 
> the command
> 03:54:41 [ERROR]   mvn  -rf 
> :hadoop-yarn-server-timelineservice-hbase-tests
> {code}
> The test failures are due to the test unit 
> {{TestHBaseStorageFlowRunCompaction}} getting stuck.
> Upon checking the surefire reports, I found several ClassNotFoundExceptions.
> {code:bash}
> Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/fs/CanUnbuffer
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo.<init>(StoreFileInfo.java:66)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
>   ... 33 more
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.CanUnbuffer
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 51 more
> {code}
> and 
> {code:bash}
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.hadoop.hbase.regionserver.StoreFileInfo
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.createStoreFileAndReader(HStore.java:698)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.validateStoreFile(HStore.java:1895)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1009)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2523)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2638)
>   ... 10 more
> {code}




[jira] [Created] (YARN-10702) Add cluster metric for amount of CPU used by RM Event Processor

2021-03-17 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10702:
--

 Summary: Add cluster metric for amount of CPU used by RM Event 
Processor
 Key: YARN-10702
 URL: https://issues.apache.org/jira/browse/YARN-10702
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


Add a cluster metric to track the cpu usage of the ResourceManager Event 
Processing thread.   This lets us know when the critical path of the RM is 
running out of headroom.
This feature was originally added for us internally by [~nroberts] and we've 
been running with it on production clusters for nearly four years.
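
For illustration, a rough sketch of one way such a thread's CPU could be sampled (not the actual patch; the class name and sampling approach are assumptions, and the real value would be published through the RM's cluster metrics):
{code:java}
// Hedged sketch: sample the CPU used by a single dispatcher thread via JMX and
// report it as a percentage of one core.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class EventProcessorCpuSampler {
  private final ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
  private long lastCpuNanos;
  private long lastSampleNanos;

  /** Returns percent of one core used by the given thread since the previous call. */
  synchronized float sample(long threadId) {
    long nowNanos = System.nanoTime();
    long cpuNanos = threadBean.getThreadCpuTime(threadId);   // -1 if unsupported
    float pct = 0f;
    if (cpuNanos >= 0 && lastCpuNanos >= 0 && lastSampleNanos != 0
        && nowNanos > lastSampleNanos) {
      pct = 100f * (cpuNanos - lastCpuNanos) / (nowNanos - lastSampleNanos);
    }
    lastCpuNanos = cpuNanos;
    lastSampleNanos = nowNanos;
    return pct;   // the caller would set this value on a gauge each sampling period
  }
}
{code}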






[jira] [Created] (YARN-10665) TestContainerManagerRecovery sometimes fails

2021-03-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10665:
--

 Summary: TestContainerManagerRecovery sometimes fails
 Key: YARN-10665
 URL: https://issues.apache.org/jira/browse/YARN-10665
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


TestContainerManagerRecovery sometimes fails when I run it on a Mac because 
it cannot bind to a port.  I believe this is because it calls getPort with a 
hard-coded port number (49160) instead of just passing zero.
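
A minimal sketch of the alternative (not the actual test change): binding to port 0 lets the OS pick a free ephemeral port, which avoids bind failures from a hard-coded port.
{code:java}
// Hedged sketch: ask the OS for any free ephemeral port instead of hard-coding 49160.
import java.io.IOException;
import java.net.ServerSocket;

class EphemeralPort {
  static int pickFreePort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {   // 0 = any free port
      return socket.getLocalPort();
    }
  }
}
{code}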








[jira] [Created] (YARN-10664) Allow parameter expansion in NM_ADMIN_USER_ENV

2021-03-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10664:
--

 Summary: Allow parameter expansion in NM_ADMIN_USER_ENV
 Key: YARN-10664
 URL: https://issues.apache.org/jira/browse/YARN-10664
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


Currently, {{YarnConfiguration.NM_ADMIN_USER_ENV}} does not do parameter 
expansion.  That is, you cannot specify an environment variable such as 
{code}{{JAVA_HOME}}{code} and have it be expanded to {{$JAVA_HOME}} inside the 
container.

We need this in order to specify different Java GC options for Java 
processes running inside YARN containers, based on which version of Java is 
being used.
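
As an illustration only, the expansion being requested could look something like the sketch below (class, method, and regex are assumptions, not the committed patch). For example, an admin env value containing {{JAVA_HOME}} (double braces) would be rewritten to $JAVA_HOME before being placed in the container's launch environment.
{code:java}
// Hedged sketch: rewrite the {{VAR}} markers used by YARN container launch into
// $VAR (Unix) before the admin env is applied.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AdminEnvExpansionSketch {
  private static final Pattern VAR =
      Pattern.compile("\\{\\{([A-Za-z_][A-Za-z0-9_]*)\\}\\}");

  static String expandForUnix(String value) {
    Matcher m = VAR.matcher(value);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      // {{JAVA_HOME}} -> $JAVA_HOME
      m.appendReplacement(sb, Matcher.quoteReplacement("$" + m.group(1)));
    }
    m.appendTail(sb);
    return sb.toString();
  }
}
{code}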







[jira] [Resolved] (YARN-5853) TestDelegationTokenRenewer#testRMRestartWithExpiredToken fails intermittently on Power

2021-02-11 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-5853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-5853.
---
Resolution: Duplicate

This is fixed by YARN-10500

> TestDelegationTokenRenewer#testRMRestartWithExpiredToken fails intermittently 
> on Power
> --
>
> Key: YARN-5853
> URL: https://issues.apache.org/jira/browse/YARN-5853
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha1
> Environment: # uname -a
> Linux pts00452-vm10 3.10.0-327.el7.ppc64le #1 SMP Thu Oct 29 17:31:13 EDT 
> 2015 ppc64le ppc64le ppc64le GNU/Linux
> # cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: Yussuf Shaikh
>Priority: Major
>
> The test testRMRestartWithExpiredToken fails intermittently with the 
> following error:
> Stacktrace:
> java.lang.AssertionError: null
> at org.junit.Assert.fail(Assert.java:86)
> at org.junit.Assert.assertTrue(Assert.java:41)
> at org.junit.Assert.assertNotNull(Assert.java:621)
> at org.junit.Assert.assertNotNull(Assert.java:631)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer.testRMRestartWithExpiredToken(TestDelegationTokenRenewer.java:1060)






[jira] [Created] (YARN-10562) Alternate fix for DirectoryCollection.checkDirs() race

2021-01-06 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10562:
--

 Summary: Alternate fix for DirectoryCollection.checkDirs() race
 Key: YARN-10562
 URL: https://issues.apache.org/jira/browse/YARN-10562
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


YARN-9833 fixed a race condition in DirectoryCollection: {{getGoodDirs()}} and 
related methods were returning an unmodifiable view of the lists. These 
accesses were protected by read/write locks, but because the lists are 
CopyOnWriteArrayLists, subsequent changes to the list, even when done under the 
write lock, were exposed when a caller started iterating the list view. 
A CopyOnWriteArrayList caches the current underlying array in its iterator, so it 
is safe to iterate even while the list is being changed - at least the view 
will be consistent.

The problem was that checkDirs() was clearing the lists and rebuilding them 
from scratch every time, so if a caller called getGoodDirs() just before 
checkDirs cleared it, and then started iterating right after the clear, they 
could get an empty list.

The fix in YARN-9833 was to change {{getGoodDirs()}} and related methods to 
return a copy of the list, which definitely fixes the race condition. The 
disadvantage is that now we create a new copy of these lists every time we 
launch a container. The advantage using CopyOnWriteArrayList was that the lists 
should rarely ever change, and we can avoid all the copying. Unfortunately, the 
way checkDirs() was written, it guaranteed that it would modify those lists 
multiple times every time.

So this Jira proposes an alternate solution for YARN-9833, which mainly just 
rewrites checkDirs() to minimize the changes to the underlying lists. There are 
still some small windows where a disk will have been added to one list, but not 
yet removed from another if you hit it just right, but I think these should be 
pretty rare and relatively harmless, and in the vast majority of cases I 
suspect only one disk will be moving from one list to another at any time.   
The question is whether this type of inconsistency (which was always there 
before YARN-9833) is acceptable in exchange for avoiding all the copying.
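
For illustration, a minimal sketch of the diff-style update described above (not the actual patch): compute the desired contents off to the side, then apply only the differences so readers never observe a cleared list.
{code:java}
// Hedged sketch: apply only the adds/removes to the CopyOnWriteArrayList instead of
// clearing and rebuilding it.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class MinimalDiffUpdateSketch {
  static void applyDiff(CopyOnWriteArrayList<String> current, List<String> desired) {
    List<String> toRemove = new ArrayList<>(current);
    toRemove.removeAll(desired);            // entries that should no longer be present
    List<String> toAdd = new ArrayList<>(desired);
    toAdd.removeAll(current);               // entries that are newly present
    current.removeAll(toRemove);            // each bulk op publishes a consistent copy
    current.addAllAbsent(toAdd);
  }
}
{code}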






[jira] [Created] (YARN-10542) Node Utilization on UI is misleading if nodes don't report utilization

2020-12-21 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10542:
--

 Summary: Node Utilization on UI is misleading if nodes don't 
report utilization
 Key: YARN-10542
 URL: https://issues.apache.org/jira/browse/YARN-10542
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Jim Brennan
Assignee: Jim Brennan


As reported in YARN-10540, if the ResourceCalculatorPlugin fails to initialize, 
the nodes will report no utilization.  This makes the RM UI misleading, because 
it presents cluster-wide and per node utilization as 0 instead of indicating 
that it is not being tracked.






[jira] [Resolved] (YARN-10485) TimelineConnector swallows InterruptedException

2020-11-16 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-10485.

Fix Version/s: 3.2.3
   3.4.1
   3.1.5
   3.3.1
   Resolution: Fixed

Thanks for the contribution [~ahussein] and [~daryn]!
I have committed this to trunk, branch-3.3, branch-3.2, and branch-3.1.

> TimelineConnector swallows InterruptedException
> ---
>
> Key: YARN-10485
> URL: https://issues.apache.org/jira/browse/YARN-10485
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.3.1, 3.1.5, 3.4.1, 3.2.3
>
>
> Some tests timeout or take excessively long to shutdown because the 
> {{TimelineConnector}} will catch InterruptedException and go into a retry 
> loop instead of aborting.
> [~daryn] reported that this makes debugging more difficult and he suggests 
> the exception to be thrown.






[jira] [Resolved] (YARN-10485) TimelineConnector swallows InterruptedException

2020-11-13 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-10485.

Resolution: Fixed

> TimelineConnector swallows InterruptedException
> ---
>
> Key: YARN-10485
> URL: https://issues.apache.org/jira/browse/YARN-10485
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Ahmed Hussein
>Assignee: Ahmed Hussein
>Priority: Major
> Fix For: 3.3.1, 3.1.5, 3.4.1, 3.2.3
>
>
> Some tests timeout or take excessively long to shutdown because the 
> {{TimelineConnector}} will catch InterruptedException and go into a retry 
> loop instead of aborting.
> [~daryn] reported that this makes debugging more difficult and he suggests 
> the exception to be thrown.






[jira] [Created] (YARN-10479) RMProxy should retry on SocketTimeout Exceptions

2020-11-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10479:
--

 Summary: RMProxy should retry on SocketTimeout Exceptions
 Key: YARN-10479
 URL: https://issues.apache.org/jira/browse/YARN-10479
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.1
Reporter: Jim Brennan
Assignee: Jim Brennan


During an incident involving a DNS outage, a large number of nodemanagers 
failed to come back into service because they hit a socket timeout when trying 
to re-register with the RM.

SocketTimeoutException is not currently one of the exceptions that the RMProxy 
will retry.  Based on this incident, it seems like it should be.  We made this 
change internally about a year ago and it has been running in production since.
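
The shape of the change could look roughly like the sketch below (modeled on how RMProxy builds its exception-to-policy map; variable names and the surrounding policy construction are assumptions, not the committed patch):
{code:java}
// Hedged sketch: include SocketTimeoutException among the exceptions the RM proxy
// retry policy treats as retriable.
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

class RmProxyRetrySketch {
  static RetryPolicy create(long maxWaitMs, long retryIntervalMs) {
    RetryPolicy basePolicy = RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        maxWaitMs, retryIntervalMs, TimeUnit.MILLISECONDS);
    Map<Class<? extends Exception>, RetryPolicy> exceptionToPolicy = new HashMap<>();
    exceptionToPolicy.put(ConnectException.class, basePolicy);
    exceptionToPolicy.put(SocketTimeoutException.class, basePolicy);  // proposed addition
    return RetryPolicies.retryByException(
        RetryPolicies.TRY_ONCE_THEN_FAIL, exceptionToPolicy);
  }
}
{code}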







[jira] [Created] (YARN-10478) Make RM-NM heartbeat scaling calculator pluggable

2020-11-02 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10478:
--

 Summary: Make RM-NM heartbeat scaling calculator pluggable
 Key: YARN-10478
 URL: https://issues.apache.org/jira/browse/YARN-10478
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Jim Brennan


[YARN-10475] adds a feature to enable scaling the interval for heartbeats 
between the RM and NM based on CPU utilization.  [~bibinchundatt] suggested 
that we make this pluggable so that other calculations can be used if desired.

The configuration properties added in [YARN-10475] should be applicable to any 
heartbeat calculator.
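
For illustration, a sketch of what "pluggable" could look like (the interface, method signature, and property key below are assumptions, not a committed API):
{code:java}
// Hedged sketch: an interface the RM loads by class name from configuration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

interface HeartbeatIntervalCalculator {
  /** Next heartbeat interval (ms) for a node, given node and cluster CPU utilization. */
  long getNextInterval(float nodeCpuUtil, float clusterCpuUtil,
      long defaultIntervalMs, long minIntervalMs, long maxIntervalMs);
}

class HeartbeatCalculatorFactory {
  // assumed property name, not an existing YARN configuration key
  static final String CALCULATOR_CLASS_KEY =
      "yarn.resourcemanager.nodemanagers.heartbeat-interval.calculator.class";

  static HeartbeatIntervalCalculator create(Configuration conf) {
    Class<? extends HeartbeatIntervalCalculator> clazz = conf.getClass(
        CALCULATOR_CLASS_KEY, null, HeartbeatIntervalCalculator.class);
    if (clazz == null) {
      // no calculator configured: fall back to the fixed default interval
      return (nodeCpu, clusterCpu, def, min, max) -> def;
    }
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}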






[jira] [Resolved] (YARN-10477) runc launch failure should not cause nodemanager to go unhealthy

2020-10-28 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-10477.

Resolution: Invalid

Closing this as invalid.  The problem was only there in our internal version of 
container-executor.  I should have checked the code in trunk before filing.


> runc launch failure should not cause nodemanager to go unhealthy
> 
>
> Key: YARN-10477
> URL: https://issues.apache.org/jira/browse/YARN-10477
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.1, 3.4.1
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
>
> We have observed some failures when launching containers with runc.  We have 
> not yet identified the root cause of those failures, but a side-effect of 
> these failures was that the Nodemanager marked itself unhealthy.  Since these are 
> rare failures that only affect a single launch, they should not cause the 
> Nodemanager to be marked unhealthy.
> Here is an example RM log:
> {noformat}
> resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event 
> dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with 
> details: Linux Container Executor reached unrecoverable exception
> {noformat}
> And here is an example of the NM log:
> {noformat}
> 2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO 
> runtime.RuncContainerRuntime: Launch container failed for 
> container_e25_1601602719874_10691_01_001723
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
>  ExitCodeException exitCode=24: OCI command has bad/missing local directories
> {noformat}
> The problem is that the runc code in container-executor is re-using exit code 
> 24 (INVALID_CONFIG_FILE) which is intended for problems with the 
> container-executor.cfg file, and those failures are fatal for the NM.  We 
> should use a different exit code for these.






[jira] [Created] (YARN-10477) runc launch failure should not cause nodemanager to go unhealthy

2020-10-28 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10477:
--

 Summary: runc launch failure should not cause nodemanager to go 
unhealthy
 Key: YARN-10477
 URL: https://issues.apache.org/jira/browse/YARN-10477
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.3.1, 3.4.1
Reporter: Jim Brennan
Assignee: Jim Brennan


We have observed some failures when launching containers with runc.  We have 
not yet identified the root cause of those failures, but a side-effect of these 
failures was that the Nodemanager marked itself unhealthy.  Since these are rare 
failures that only affect a single launch, they should not cause the 
Nodemanager to be marked unhealthy.

Here is an example RM log:
{noformat}
resourcemanager.log.2020-10-02-03.bz2:2020-10-02 03:20:10,255 [RM Event 
dispatcher] INFO rmnode.RMNodeImpl: Node node:8041 reported UNHEALTHY with 
details: Linux Container Executor reached unrecoverable exception
{noformat}
And here is an example of the NM log:
{noformat}
2020-10-02 03:20:02,033 [ContainersLauncher #434] INFO 
runtime.RuncContainerRuntime: Launch container failed for 
container_e25_1601602719874_10691_01_001723
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException:
 ExitCodeException exitCode=24: OCI command has bad/missing local directories
{noformat}

The problem is that the runc code in container-executor is re-using exit code 
24 (INVALID_CONFIG_FILE) which is intended for problems with the 
container-executor.cfg file, and those failures are fatal for the NM.  We 
should use a different exit code for these.






[jira] [Created] (YARN-10475) Scale RM-NM heartbeat interval based on node utilization

2020-10-27 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10475:
--

 Summary: Scale RM-NM heartbeat interval based on node utilization
 Key: YARN-10475
 URL: https://issues.apache.org/jira/browse/YARN-10475
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.10.1, 3.4.1
Reporter: Jim Brennan
Assignee: Jim Brennan


Add the ability to scale the RM-NM heartbeat interval based on node CPU 
utilization compared to overall cluster CPU utilization.  If a node is 
over-utilized compared to the rest of the cluster, its heartbeat interval 
slows down.  If it is under-utilized compared to the rest of the cluster, its 
heartbeat interval speeds up.
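
Roughly, the scaling could look like the sketch below (illustration only; the formula in the actual patch may differ):
{code:java}
// Hedged sketch: nodes busier than the cluster average heartbeat less often, idler
// nodes heartbeat more often, clamped to configured min/max intervals.
class HeartbeatScalingSketch {
  static long scaleInterval(long defaultMs, long minMs, long maxMs,
      float nodeCpuUtil, float clusterCpuUtil,
      float speedupFactor, float slowdownFactor) {
    if (clusterCpuUtil <= 0f) {
      return defaultMs;                            // no cluster-wide data yet
    }
    float ratio = nodeCpuUtil / clusterCpuUtil;    // > 1 means busier than average
    long interval;
    if (ratio > 1f) {
      interval = (long) (defaultMs * (1f + (ratio - 1f) * slowdownFactor));
    } else {
      interval = (long) (defaultMs * (1f - (1f - ratio) * speedupFactor));
    }
    return Math.max(minMs, Math.min(maxMs, interval));
  }
}
{code}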

This is a feature we have been running with internally in production for 
several years.  It was developed by [~nroberts], based on the observation that 
larger faster nodes on our cluster were under-utilized compared to smaller 
slower nodes. 

This feature is dependent on [YARN-10450], which added cluster-wide utilization 
metrics.






[jira] [Created] (YARN-10450) Add cpu and memory utilization per node and cluster-wide metrics

2020-09-29 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10450:
--

 Summary: Add cpu and memory utilization per node and cluster-wide 
metrics
 Key: YARN-10450
 URL: https://issues.apache.org/jira/browse/YARN-10450
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.3.1
Reporter: Jim Brennan
Assignee: Jim Brennan


Add metrics to show actual CPU and memory utilization for each node and 
aggregated for the entire cluster.  This information is already passed from 
NM to RM in the node status update.
We have been running with this internally for quite a while and found it useful 
to be able to quickly see the actual cpu/memory utilization on the 
node/cluster.  It's especially useful if some form of overcommit is used.







[jira] [Created] (YARN-10369) Make NMTokenSecretManagerInRM sending NMToken for nodeId DEBUG

2020-07-27 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10369:
--

 Summary: Make NMTokenSecretManagerInRM sending NMToken for nodeId 
DEBUG
 Key: YARN-10369
 URL: https://issues.apache.org/jira/browse/YARN-10369
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Jim Brennan


This message is logged at the info level, but it doesn't really add much 
information.
We changed this to DEBUG internally years ago and haven't missed it.
{noformat}
2020-07-27 21:51:29,027 INFO  [RM Event dispatcher] 
security.NMTokenSecretManagerInRM 
(NMTokenSecretManagerInRM.java:createAndGetNMToken(200)) - Sending NMToken for 
nodeId : localhost.localdomain:45454 for container : 
container_1595886659189_0001_01_01
{noformat}







[jira] [Created] (YARN-10363) TestRMAdminCLI.testHelp is failing in branch-2.10

2020-07-22 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10363:
--

 Summary: TestRMAdminCLI.testHelp is failing in branch-2.10
 Key: YARN-10363
 URL: https://issues.apache.org/jira/browse/YARN-10363
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.10.1
Reporter: Jim Brennan


TestRMAdminCLI.testHelp is failing in branch-2.10.

Example failure:
{noformat}
---
Test set: org.apache.hadoop.yarn.client.cli.TestRMAdminCLI
---
Tests run: 31, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 18.668 s <<< 
FAILURE! - in org.apache.hadoop.yarn.client.cli.TestRMAdminCLI
testHelp(org.apache.hadoop.yarn.client.cli.TestRMAdminCLI)  Time elapsed: 0.043 
s  <<< FAILURE!
java.lang.AssertionError: 
Expected error message: 
Usage: yarn rmadmin [-failover [--forcefence] [--forceactive] <serviceId> <serviceId>] 
is not included in messages: 
Usage: yarn rmadmin
   -refreshQueues 
   -refreshNodes [-g|graceful [timeout in seconds] -client|server]
   -refreshNodesResources 
   -refreshSuperUserGroupsConfiguration 
   -refreshUserToGroupsMappings 
   -refreshAdminAcls 
   -refreshServiceAcl 
   -getGroups [username]
   -addToClusterNodeLabels 
<"label1(exclusive=true),label2(exclusive=false),label3">
   -removeFromClusterNodeLabels <label1,label2,label3> (label splitted by ",")
   -replaceLabelsOnNode <"node1[:port]=label1,label2 
node2[:port]=label1,label2"> [-failOnUnknownNodes] 
   -directlyAccessNodeLabelStore 
   -refreshClusterMaxPriority 
   -updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout])
   -help [cmd]

Generic options supported are:
-conf <configuration file>          specify an application configuration file
-D <property=value>                 define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, 
overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <file1,...>                  specify a comma-separated list of files to be 
copied to the map reduce cluster
-libjars <jar1,...>                 specify a comma-separated list of jar files 
to be included in the classpath
-archives <archive1,...>            specify a comma-separated list of archives to 
be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]


at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testError(TestRMAdminCLI.java:859)
at 
org.apache.hadoop.yarn.client.cli.TestRMAdminCLI.testHelp(TestRMAdminCLI.java:585)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)


[jira] [Created] (YARN-10353) Log vcores used and cumulative cpu in containers monitor

2020-07-16 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10353:
--

 Summary: Log vcores used and cumulative cpu in containers monitor
 Key: YARN-10353
 URL: https://issues.apache.org/jira/browse/YARN-10353
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


We currently log the percentage/cpu and percentage/cpus-used-by-yarn in the 
Containers Monitor log. It would be useful to also log vcores used vs vcores 
assigned, and total accumulated CPU time.

For example, currently we have an audit log that looks like this:
{noformat}
2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
(ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
CPU/core:35.772625
{noformat}
The proposal is to add two more fields to show vCores and Cumulative CPU ms:
{noformat}
2020-07-16 20:33:51,550 DEBUG [Container Monitor] ContainersMonitorImpl.audit 
(ContainersMonitorImpl.java:recordUsage(651)) - Resource usage of ProcessTree 
809 for container-id container_1594931466123_0002_01_07: 309.5 MB of 2 GB 
physical memory used; 2.8 GB of 4.2 GB virtual memory used CPU:143.0905 
CPU/core:35.772625 vCores:2/1 CPU-ms:4180
{noformat}
This is a snippet of a log from one of our clusters running branch-2.8 with a 
similar change.
{noformat}
2020-07-16 21:00:02,240 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 5267 for container-id 
container_e04_1594079801456_1397450_01_001992: 1.6 GB of 2.5 GB physical memory 
used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 18 of 10 CPU vCores 
used. Cumulative CPU time: 157410
2020-07-16 21:00:02,269 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 18801 for container-id 
container_e04_1594079801456_1390375_01_19: 413.2 MB of 2.5 GB physical 
memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 of 10 CPU 
vCores used. Cumulative CPU time: 113830
2020-07-16 21:00:02,298 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 5279 for container-id 
container_e04_1594079801456_1397450_01_001991: 2.2 GB of 2.5 GB physical memory 
used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 17 of 10 CPU vCores 
used. Cumulative CPU time: 128630
2020-07-16 21:00:02,339 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 24189 for container-id 
container_e04_1594079801456_1390430_01_000415: 392.7 MB of 2.5 GB physical 
memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 of 10 CPU 
vCores used. Cumulative CPU time: 96060
2020-07-16 21:00:02,367 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 6751 for container-id 
container_e04_1594079801456_1397923_01_003248: 1.3 GB of 3 GB physical memory 
used; 4.3 GB of 6.3 GB virtual memory used. CPU usage: 12 of 10 CPU vCores 
used. Cumulative CPU time: 116820
2020-07-16 21:00:02,396 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 12138 for container-id 
container_e04_1594079801456_1397760_01_44: 4.4 GB of 6 GB physical memory 
used; 6.9 GB of 12.6 GB virtual memory used. CPU usage: 15 of 10 CPU vCores 
used. Cumulative CPU time: 45900
2020-07-16 21:00:02,424 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 101918 for container-id 
container_e04_1594079801456_1391130_01_002378: 2.4 GB of 4 GB physical memory 
used; 5.8 GB of 8.4 GB virtual memory used. CPU usage: 13 of 10 CPU vCores 
used. Cumulative CPU time: 2572390
2020-07-16 21:00:02,456 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 26596 for container-id 
container_e04_1594079801456_1390446_01_000665: 418.6 MB of 2.5 GB physical 
memory used; 3.8 GB of 5.3 GB virtual memory used. CPU usage: 0 of 10 CPU 
vCores used. Cumulative CPU time: 101210
{noformat}






[jira] [Created] (YARN-10348) Allow RM to always cancel tokens after app completes

2020-07-08 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10348:
--

 Summary: Allow RM to always cancel tokens after app completes
 Key: YARN-10348
 URL: https://issues.apache.org/jira/browse/YARN-10348
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.1.3, 2.10.0
Reporter: Jim Brennan
Assignee: Jim Brennan


(Note: this change was originally done on our internal branch by [~daryn]).

The RM currently has an option for a client to specify disabling token 
cancellation when a job completes. This feature was an initial attempt to 
address the use case of a job launching sub-jobs (ie. oozie launcher) and the 
original job finishing prior to the sub-job(s) completion - ex. original job 
completion triggered premature cancellation of tokens needed by the sub-jobs.

Many years ago, [~daryn] added a more robust implementation to ref count tokens 
([YARN-3055]). This prevented premature cancellation of the token until all 
apps using the token complete, and invalidated the need for a client to specify 
cancel=false. Unfortunately the config option was not removed.

We have seen cases where oozie "java actions" and some users were explicitly 
disabling token cancellation. This can lead to a buildup of defunct tokens that 
may overwhelm the ZK buffer used by the KDC's backing store. At which point the 
KMS fails to connect to ZK and is unable to issue/validate new tokens - 
rendering the KDC only able to authenticate pre-existing tokens. Production 
incidents have occurred due to the buffer size issue.

To avoid these issues, the RM should have the option to ignore/override the 
client's request to not cancel tokens.
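
A minimal sketch of the proposed override is below (the property name is an assumption, not necessarily the key the patch introduces):
{code:java}
// Hedged sketch: if the RM-side switch is on, tokens are cancelled at app completion
// regardless of what the client requested at submission time.
import org.apache.hadoop.conf.Configuration;

class TokenCancelDecisionSketch {
  static final String ALWAYS_CANCEL_KEY =
      "yarn.resourcemanager.delegation-token.always-cancel";   // assumed key

  static boolean shouldCancelTokens(Configuration conf, boolean clientRequestedCancel) {
    return conf.getBoolean(ALWAYS_CANCEL_KEY, false) || clientRequestedCancel;
  }
}
{code}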






[jira] [Created] (YARN-10312) Add support for yarn logs -logFile to retain backward compatibility

2020-06-11 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10312:
--

 Summary: Add support for yarn logs -logFile to retain backward 
compatibility
 Key: YARN-10312
 URL: https://issues.apache.org/jira/browse/YARN-10312
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.10.0, 3.4.1
Reporter: Jim Brennan


The YARN CLI logs command line option {{-logFiles}} was changed to 
{{-log_files}}  in 2.9 and later releases.   This change was made as part of 
YARN-5363.

Verizon Media is in the process of moving from Hadoop-2.8 to Hadoop-2.10, and 
while testing integration with Spark, we ran into this issue.   We are 
concerned that we will run into more cases of this as we roll out to 
production, and rather than break user scripts, we'd prefer to add 
{{-logFiles}} as an alias of {{-log_files}}.  If both are provided, 
{{-logFiles}} will be ignored.
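
For illustration, a sketch of the backward-compatible alias using commons-cli (option descriptions and helper names are assumptions, not the committed patch):
{code:java}
// Hedged sketch: register the old -logFiles spelling alongside -log_files and prefer
// the new one when both are present.
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;

class LogFilesAliasSketch {
  static void addLogFileOptions(Options opts) {
    Option newOpt = new Option("log_files", true,
        "Specify comma-separated value of log files");
    newOpt.setArgs(Option.UNLIMITED_VALUES);
    opts.addOption(newOpt);

    Option legacyOpt = new Option("logFiles", true,
        "Deprecated alias for -log_files, kept for backward compatibility");
    legacyOpt.setArgs(Option.UNLIMITED_VALUES);
    opts.addOption(legacyOpt);
  }

  static String[] getRequestedLogFiles(CommandLine cl) {
    if (cl.hasOption("log_files")) {
      return cl.getOptionValues("log_files");   // new option wins if both are given
    }
    return cl.getOptionValues("logFiles");      // may be null if neither was given
  }
}
{code}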







[jira] [Created] (YARN-10227) Pull YARN-8242 back to branch-2.10

2020-04-08 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10227:
--

 Summary: Pull YARN-8242 back to branch-2.10
 Key: YARN-10227
 URL: https://issues.apache.org/jira/browse/YARN-10227
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.10.0, 2.10.1
Reporter: Jim Brennan
Assignee: Jim Brennan


We have recently seen the nodemanager OOM issue reported in YARN-8242 during a 
rolling upgrade.  Our code is currently based on branch-2.8, but we are in the 
process of moving to 2.10.  I checked and YARN-8242 pulls back to branch-2.10 
pretty cleanly.  The only conflict was a minor one in 
TestNMLeveldbStateStoreService.java.






[jira] [Created] (YARN-10161) TestRouterWebServicesREST is corrupting STDOUT

2020-02-24 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10161:
--

 Summary: TestRouterWebServicesREST is corrupting STDOUT
 Key: YARN-10161
 URL: https://issues.apache.org/jira/browse/YARN-10161
 Project: Hadoop YARN
  Issue Type: Test
  Components: yarn
Affects Versions: 2.10.0
Reporter: Jim Brennan


TestRouterWebServicesREST is creating processes that inherit stdin/stdout from 
the current process, so the output from those jobs goes into the standard 
output of mvn test.

Here's an example from a recent build:
{noformat}
[WARNING] Corrupted STDOUT by directly writing to native stream in forked JVM 
1. See FAQ web page and the dump file 
/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/surefire-reports/2020-02-24T08-00-54_776-jvmRun1.dumpstream
[INFO] Tests run: 41, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 41.644 
s - in org.apache.hadoop.yarn.server.router.webapp.TestRouterWebServicesREST
[WARNING] ForkStarter IOException: 506 INFO  [main] 
resourcemanager.ResourceManager (LogAdapter.java:info(49)) - STARTUP_MSG: 
522 INFO  [main] resourcemanager.ResourceManager (LogAdapter.java:info(49)) - 
registered UNIX signal handlers for [TERM, HUP, INT]
876 INFO  [main] conf.Configuration 
(Configuration.java:getConfResourceAsInputStream(2588)) - core-site.xml not 
found
879 INFO  [main] security.Groups (Groups.java:refresh(402)) - clearing 
userToGroupsMap cache
930 INFO  [main] conf.Configuration 
(Configuration.java:getConfResourceAsInputStream(2588)) - resource-types.xml 
not found
930 INFO  [main] resource.ResourceUtils 
(ResourceUtils.java:addResourcesFileToConf(421)) - Unable to find 
'resource-types.xml'.
940 INFO  [main] resource.ResourceUtils 
(ResourceUtils.java:addMandatoryResources(126)) - Adding resource type - name = 
memory-mb, units = Mi, type = COUNTABLE
940 INFO  [main] resource.ResourceUtils 
(ResourceUtils.java:addMandatoryResources(135)) - Adding resource type - name = 
vcores, units = , type = COUNTABLE
974 INFO  [main] conf.Configuration 
(Configuration.java:getConfResourceAsInputStream(2591)) - found resource 
yarn-site.xml at 
file:/testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-router/target/test-classes/yarn-site.xml
001 INFO  [main] event.AsyncDispatcher (AsyncDispatcher.java:register(227)) - 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
053 INFO  [main] security.NMTokenSecretManagerInRM 
(NMTokenSecretManagerInRM.java:<init>(75)) - NMTokenKeyRollingInterval: 
8640ms and NMTokenKeyActivationDelay: 90ms
060 INFO  [main] security.RMContainerTokenSecretManager 
(RMContainerTokenSecretManager.java:<init>(79)) - 
ContainerTokenKeyRollingInterval: 8640ms and 
ContainerTokenKeyActivationDelay: 90ms
... {noformat}
It seems like these processes should be rerouting stdout/stderr to a file 
instead of dumping it to the console.
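
A minimal sketch of the suggested redirection (names are illustrative, not the committed test change):
{code:java}
// Hedged sketch: launch the forked processes with stdout/stderr redirected to files
// instead of inheriting the parent's streams, so the surefire-controlled stdout is
// not corrupted.
import java.io.File;
import java.io.IOException;

class ForkedProcessLauncherSketch {
  static Process launch(String[] cmd, File logDir, String name) throws IOException {
    ProcessBuilder pb = new ProcessBuilder(cmd);
    pb.redirectOutput(new File(logDir, name + ".out"));  // instead of Redirect.INHERIT
    pb.redirectError(new File(logDir, name + ".err"));
    return pb.start();
  }
}
{code}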






[jira] [Created] (YARN-10072) TestCSAllocateCustomResource failures

2020-01-07 Thread Jim Brennan (Jira)
Jim Brennan created YARN-10072:
--

 Summary: TestCSAllocateCustomResource failures
 Key: YARN-10072
 URL: https://issues.apache.org/jira/browse/YARN-10072
 Project: Hadoop YARN
  Issue Type: Test
  Components: yarn
Affects Versions: 2.10.0
Reporter: Jim Brennan


This test is failing for us consistently in our internal 2.10 based branch.






[jira] [Created] (YARN-9914) Use separate configs for free disk space checking for full and not-full disks

2019-10-18 Thread Jim Brennan (Jira)
Jim Brennan created YARN-9914:
-

 Summary: Use separate configs for free disk space checking for 
full and not-full disks
 Key: YARN-9914
 URL: https://issues.apache.org/jira/browse/YARN-9914
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Reporter: Jim Brennan
Assignee: Jim Brennan


[YARN-3943] added separate configurations for the nodemanager health check's 
disk-utilization (full disk) check:

{{max-disk-utilization-per-disk-percentage}} - threshold for marking a good 
disk full

{{disk-utilization-watermark-low-per-disk-percentage}} - threshold for marking 
a full disk as not full.

On our clusters, we do not use these configs. We instead use 
{{min-free-space-per-disk-mb}} so we can specify the limit in mb instead of 
percent of utilization. We have observed the same oscillation behavior as 
described in [YARN-3943] with this parameter. I would like to add an optional 
config to specify a separate threshold for marking a full disk as not full:

{{min-free-space-per-disk-mb}} - threshold at which a good disk is marked full

{{disk-free-space-per-disk-high-watermark-mb}} - threshold at which a full disk 
is marked good.

So for example, we could set {{min-free-space-per-disk-mb = 5GB}}, which would 
cause a disk to be marked full when free space goes below 5GB, and 
{{disk-free-space-per-disk-high-watermark-mb = 10GB}} to keep the disk in the 
full state until free space goes above 10GB.
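
For illustration, a sketch of the hysteresis being proposed (method and parameter names are assumptions). With the example values above (5GB low mark, 10GB high watermark), a disk hovering around 5GB of free space stays marked full until it frees up to 10GB.
{code:java}
// Hedged sketch: a disk marked full at the low threshold only returns to good once
// free space climbs above the high watermark, which prevents oscillation at the
// boundary.
class FreeSpaceHysteresisSketch {
  static boolean isDiskFull(boolean currentlyFull, long freeSpaceMb,
      long minFreeSpaceMb, long highWatermarkMb) {
    if (currentlyFull) {
      return freeSpaceMb < highWatermarkMb;   // stay full until above the high mark
    }
    return freeSpaceMb < minFreeSpaceMb;      // good disks only flip at the low mark
  }
}
{code}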






[jira] [Resolved] (YARN-9906) When setting multiple volumes through "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS", the setting is not valid

2019-10-16 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan resolved YARN-9906.
---
Resolution: Invalid

> When setting multiple volumes through "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS", 
> the setting is not valid
> ---
>
> Key: YARN-9906
> URL: https://issues.apache.org/jira/browse/YARN-9906
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: lynn
>Priority: Major
> Attachments: docker_volume_mounts.patch
>
>
> As 
> [https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html#Application_Submission]
>  described, when I set the item {{YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS}} to 
> multiple volume mounts, the value is a comma-separated list of mounts.
>  
> {quote}vars="YARN_CONTAINER_RUNTIME_TYPE=docker,YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=hadoop-docker,
>  
> YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro;/etc/hadoop/conf:/etc/hadoop/conf"
>  hadoop jar hadoop-examples.jar pi -Dyarn.app.mapreduce.am.env=$vars \
>  -Dmapreduce.map.env=$vars -Dmapreduce.reduce.env=$vars 10 100{quote}
> I found that the docker container only mounts the first volume, so the job 
> cannot run successfully, and no error is reported!
> The code of 
> [DockerLinuxContainerRuntime.java|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java]
>  as follows:
> {quote}if (environment.containsKey(ENV_DOCKER_CONTAINER_MOUNTS)) {
>   Matcher parsedMounts = USER_MOUNT_PATTERN.matcher(
>   environment.get(ENV_DOCKER_CONTAINER_MOUNTS));
>   if (!parsedMounts.find()) {
> throw new ContainerExecutionException(
> "Unable to parse user supplied mount list: "
> + environment.get(ENV_DOCKER_CONTAINER_MOUNTS));
>   }{quote}
> The regex pattern is in 
> [OCIContainerRuntime|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/OCIContainerRuntime.java]
>  as follows
> {quote}static final Pattern USER_MOUNT_PATTERN = Pattern.compile(
>   "(?<=^|,)([^:\\x00]+):([^:\\x00]+)" +
>   "(:(r[ow]|(r[ow][+])?(r?shared|r?slave|r?private)))?(?:,|$)");{quote}
> It is indeed separated by commas, but when I read the code that submits the jar 
> to yarn, I found the code in 
> [Apps.java|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Apps.java]
> {quote}private static final Pattern VARVAL_SPLITTER = Pattern.compile(
> "(?<=^|,)"// preceded by ',' or line begin
>   + '(' + Shell.ENV_NAME_REGEX + ')'  // var group
>   + '='
>   + "([^,]*)" // val group
>   );
> {quote}
> It is separated by commas in the same way.
> So I just changed the separator from comma to semicolon (";") for the item 
> "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS".






[jira] [Created] (YARN-9844) TestCapacitySchedulerPerf test errors in branch-2

2019-09-19 Thread Jim Brennan (Jira)
Jim Brennan created YARN-9844:
-

 Summary: TestCapacitySchedulerPerf test errors in branch-2
 Key: YARN-9844
 URL: https://issues.apache.org/jira/browse/YARN-9844
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test, yarn
Affects Versions: 2.10.0
Reporter: Jim Brennan


These TestCapacitySchedulerPerf throughput tests are failing in branch-2:

{{[ERROR] TestCapacitySchedulerPerf.testUserLimitThroughputForFiveResources:263->testUserLimitThroughputWithNumberOfResourceTypes:114 » ArrayIndexOutOfBounds}}
{{[ERROR] TestCapacitySchedulerPerf.testUserLimitThroughputForFourResources:258->testUserLimitThroughputWithNumberOfResourceTypes:114 » ArrayIndexOutOfBounds}}
{{[ERROR] TestCapacitySchedulerPerf.testUserLimitThroughputForThreeResources:253->testUserLimitThroughputWithNumberOfResourceTypes:114 » ArrayIndexOutOfBounds}}






[jira] [Created] (YARN-9527) Rogue LocalizerRunner/ContainerLocalizer repeatedly downloading same file

2019-05-02 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-9527:
-

 Summary: Rogue LocalizerRunner/ContainerLocalizer repeatedly 
downloading same file
 Key: YARN-9527
 URL: https://issues.apache.org/jira/browse/YARN-9527
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.1.2, 2.8.5
Reporter: Jim Brennan


A rogue ContainerLocalizer can get stuck in a loop continuously downloading the 
same file while generating an "Invalid event: LOCALIZED at LOCALIZED" exception 
on each iteration.  Sometimes this continues long enough that it fills up a 
disk or depletes available inodes for the filesystem.






[jira] [Created] (YARN-9442) container working directory has group read permissions

2019-04-04 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-9442:
-

 Summary: container working directory has group read permissions
 Key: YARN-9442
 URL: https://issues.apache.org/jira/browse/YARN-9442
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 3.2.2
Reporter: Jim Brennan


Container working directories are currently created with permissions 0750, 
owned by the user and with the group set to the node manager group.

Is there any reason why these directories need group read permissions?

I have been testing with group read permissions removed and so far I haven't 
encountered any problems.






[jira] [Created] (YARN-8656) container-executor should not write cgroup tasks files for docker containers

2018-08-13 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8656:
-

 Summary: container-executor should not write cgroup tasks files 
for docker containers
 Key: YARN-8656
 URL: https://issues.apache.org/jira/browse/YARN-8656
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan


If cgroups are enabled, we pass the {{--cgroup-parent}} option to {{docker 
run}} to ensure that all processes for the container are placed into a cgroup 
under (for example) {{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}. 
Docker creates a cgroup there with the docker container id as the name and all 
of the processes in the container go into that cgroup.

container-executor has code in {{launch_docker_container_as_user()}} that then 
cherry-picks the PID of the docker container (usually the launch shell) and 
writes that into the 
{{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/tasks}} file, effectively 
moving it from 
{{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id/docker_container_id}} to 
{{/sys/fs/cgroup/cpu/cgroups.hierarchy/container_id}}.  So you end up with one 
process out of the container in the {{container_id}} cgroup, and the rest in 
the {{container_id/docker_container_id}} cgroup.

Since we are passing the {{--cgroup-parent}} to docker, there is no need to 
manually write the container pid to the tasks file - we can just remove the 
code that does this in the docker case.






[jira] [Created] (YARN-8648) Container cgroups are leaked when using docker

2018-08-10 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8648:
-

 Summary: Container cgroups are leaked when using docker
 Key: YARN-8648
 URL: https://issues.apache.org/jira/browse/YARN-8648
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan
Assignee: Jim Brennan


When you run with docker and enable cgroups for cpu, docker creates cgroups for 
all resources on the system, not just for cpu.  For instance, if the 
{{yarn.nodemanager.linux-container-executor.cgroups.hierarchy=/hadoop-yarn}}, 
the nodemanager will create a cgroup for each container under 
{{/sys/fs/cgroup/cpu/hadoop-yarn}}.  In the docker case, we pass this path via 
the {{--cgroup-parent}} command line argument.   Docker then creates a cgroup 
for the docker container under that, for instance: 
{{/sys/fs/cgroup/cpu/hadoop-yarn/container_id/docker_container_id}}.

When the container exits, docker cleans up the {{docker_container_id}} cgroup, 
and the nodemanager cleans up the {{container_id}} cgroup.  All is good under 
{{/sys/fs/cgroup/cpu/hadoop-yarn}}.

The problem is that docker also creates that same hierarchy under every 
resource under {{/sys/fs/cgroup}}.  On the rhel7 system I am using, these are: 
blkio, cpuset, devices, freezer, hugetlb, memory, net_cls, net_prio, 
perf_event, and systemd.  So for instance, docker creates 
{{/sys/fs/cgroup/cpuset/hadoop-yarn/container_id/docker_container_id}}, but it 
only cleans up the leaf cgroup {{docker_container_id}}.  Nobody cleans up the 
{{container_id}} cgroups for these other resources.  On one of our busy 
clusters, we found > 100,000 of these leaked cgroups.

I found this in our 2.8-based version of hadoop, but I have been able to repro 
with current hadoop.







[jira] [Created] (YARN-8640) Restore previous state in container-executor if write_exit_code_file_as_nm fails

2018-08-09 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8640:
-

 Summary: Restore previous state in container-executor if 
write_exit_code_file_as_nm fails
 Key: YARN-8640
 URL: https://issues.apache.org/jira/browse/YARN-8640
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan
Assignee: Jim Brennan


The container-executor function {{write_exit_code_file_as_nm}} has a number of 
failure conditions where it just returns -1 without restoring the previous state.
This is not a problem in any of the places where it is currently called, but it 
could be a problem if future code changes call it before code that depends on 
the previous state.







[jira] [Created] (YARN-8518) test-container-executor test_is_empty() is broken

2018-07-11 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8518:
-

 Summary: test-container-executor test_is_empty() is broken
 Key: YARN-8518
 URL: https://issues.apache.org/jira/browse/YARN-8518
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan


A new test was recently added to test-container-executor.c that has some 
problems.

It is attempting to mkdir() a hard-coded path: /tmp/2938rf2983hcqnw8ud/emptydir

This fails because the base directory is not there.  These directories are not 
being cleaned up either.

It should be using TEST_ROOT.

I don't know which Jira this change was made under; the git commit from July 9, 
2018 does not reference one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart

2018-07-10 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8515:
-

 Summary: container-executor can crash with SIGPIPE after 
nodemanager restart
 Key: YARN-8515
 URL: https://issues.apache.org/jira/browse/YARN-8515
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jim Brennan
Assignee: Jim Brennan


When running with docker on large clusters, we have noticed that sometimes 
docker containers are not removed - they remain in the exited state, and the 
corresponding container-executor is no longer running.  Upon investigation, we 
noticed that this always seemed to happen after a nodemanager restart.   The 
sequence leading to the stranded docker containers is:
 # Nodemanager restarts
 # Containers are recovered and then run for a while
 # Containers are killed for some (legitimate) reason
 # Container-executor exits without removing the docker container.

After reproducing this on a test cluster, we found that the container-executor 
was exiting due to a SIGPIPE.

What is happening is that the shell command executor used to start the 
container-executor has threads reading from c-e's stdout and stderr.  When the 
NM is restarted, those threads are killed.  When the container-executor then 
continues executing after the container exits with an error, it tries to write to 
stderr (ERRORFILE) and receives a SIGPIPE.  Since SIGPIPE is not handled, this 
crashes the container-executor before it can actually remove the docker 
container.

We ran into this in branch 2.8.  The way docker containers are removed has been 
completely redesigned in trunk, so I don't think it will lead to this exact 
failure, but after an NM restart, potentially any write to stderr or stdout in 
the container-executor could cause it to crash.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value

2018-06-20 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8444:
-

 Summary: NodeResourceMonitor crashes on bad swapFree value
 Key: YARN-8444
 URL: https://issues.apache.org/jira/browse/YARN-8444
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.2, 2.8.3
Reporter: Jim Brennan
Assignee: Jim Brennan


Saw this on a node that was having difficulty preempting containers. We can't have 
the NodeResourceMonitor exiting. The system was above 99% memory used at the time, so 
this may only happen when normal preemption isn't working right, but we should fix it 
since this monitor is critical to the health of the node.

 

{noformat}
2018-06-04 14:28:08,539 [Container Monitor] DEBUG ContainersMonitorImpl.audit: 
Memory usage of ProcessTree 110564 for container-id 
container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB physical memory 
used; 5.0 GB of 7.3 GB virtual memory used
2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR 
yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource Monitor,5,main] 
threw an Exception.
java.lang.NumberFormatException: For input string: "18446744073709551596"
 at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Long.parseLong(Long.java:592)
 at java.lang.Long.parseLong(Long.java:631)
 at 
org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257)
 at 
org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591)
 at 
org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601)
 at 
org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74)
 at 
org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193)
2018-06-04 14:28:30,747 
[org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO 
util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of 
approximately 9330ms
{noformat}
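
The failing string 18446744073709551596 is 2^64 - 20, i.e. what looks like a 
small negative SwapFree value exposed by the kernel as an unsigned 64-bit number, 
which {{Long.parseLong}} cannot handle.  As a sketch only (not the committed fix), 
a more tolerant parse of such a value could look like this standalone illustration:
{noformat}
import java.math.BigInteger;

public class SwapFreeParseSketch {
  private static final BigInteger LONG_MAX = BigInteger.valueOf(Long.MAX_VALUE);

  /**
   * Sketch only, not the committed fix: clamp /proc/meminfo values that do not
   * fit in a signed long instead of letting NumberFormatException escape and
   * kill the Node Resource Monitor thread.
   */
  static long parseMemInfoValue(String raw) {
    try {
      return Long.parseLong(raw.trim());
    } catch (NumberFormatException e) {
      // 18446744073709551596 is 2^64 - 20: an unsigned view of a tiny negative
      // number, so treating it as "no swap free" (0) is a reasonable reading here.
      BigInteger v = new BigInteger(raw.trim());
      return v.compareTo(LONG_MAX) > 0 ? 0L : v.longValue();
    }
  }

  public static void main(String[] args) {
    System.out.println(parseMemInfoValue("18446744073709551596"));  // prints 0
  }
}
{noformat}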



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8071) Provide Spark-like API for setting Environment Variables to enable vars with commas

2018-03-23 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8071:
-

 Summary: Provide Spark-like API for setting Environment Variables 
to enable vars with commas
 Key: YARN-8071
 URL: https://issues.apache.org/jira/browse/YARN-8071
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.0.0
Reporter: Jim Brennan
Assignee: Jim Brennan


YARN-6830 describes a problem where environment variables that contain commas 
cannot be specified via {{-Dmapreduce.map.env}}.

For example:

{{-Dmapreduce.map.env="MODE=bar,IMAGE_NAME=foo,MOUNTS=/tmp/foo,/tmp/bar"}}

will set {{MOUNTS}} to just {{/tmp/foo}}.

In that Jira, [~aw] suggested that we change the API to provide a way to 
specify environment variables individually, the same way that Spark does.
{quote}Rather than fight with a regex why not redefine the API instead?

 

-Dmapreduce.map.env.MODE=bar
 -Dmapreduce.map.env.IMAGE_NAME=foo
 -Dmapreduce.map.env.MOUNTS=/tmp/foo,/tmp/bar

...

e.g, mapreduce.map.env.[foo]=bar gets turned into foo=bar

This greatly simplifies the input validation needed and makes it clear what is 
actually being defined.
{quote}
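
For illustration, here is a minimal sketch of how such per-variable properties 
could be collected into an environment map.  The property names follow the 
proposal above; the helper itself is hypothetical, not a committed implementation.
{noformat}
import java.util.HashMap;
import java.util.Map;

public class PrefixedEnvSketch {
  /**
   * Sketch of the proposed API only: collect every property of the form
   * <prefix>.<VAR>=<value> into an environment map, so values are free to
   * contain commas.
   */
  static Map<String, String> envFromProperties(Map<String, String> props, String prefix) {
    String dotted = prefix + ".";                       // e.g. "mapreduce.map.env."
    Map<String, String> env = new HashMap<>();
    for (Map.Entry<String, String> e : props.entrySet()) {
      if (e.getKey().startsWith(dotted)) {
        env.put(e.getKey().substring(dotted.length()), e.getValue());
      }
    }
    return env;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put("mapreduce.map.env.MODE", "bar");
    props.put("mapreduce.map.env.IMAGE_NAME", "foo");
    props.put("mapreduce.map.env.MOUNTS", "/tmp/foo,/tmp/bar");   // the comma survives
    System.out.println(envFromProperties(props, "mapreduce.map.env"));
  }
}
{noformat}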



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8029) YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use commas as separators

2018-03-14 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8029:
-

 Summary: YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS should not use 
commas as separators
 Key: YARN-8029
 URL: https://issues.apache.org/jira/browse/YARN-8029
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.0.0
Reporter: Jim Brennan


The following docker-related environment variables specify a comma-separated 
list of mounts:

YARN_CONTAINER_RUNTIME_DOCKER_LOCAL_RESOURCE_MOUNTS
YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS

This is a problem because hadoop -Dmapreduce.map.env and related options use a 
comma as the delimiter.  So if I put more than one mount in 
YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS, the commas in the variable's value are treated 
as delimiters for the hadoop command line option and all but the first mount 
are ignored.
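
A toy illustration of the clash (not the actual MapReduce parsing code, and the 
mount syntax shown is only an example): splitting the option value on commas 
before splitting each entry on '=' silently drops everything after the first 
comma in the mount list.
{noformat}
import java.util.LinkedHashMap;
import java.util.Map;

public class CommaClashDemo {
  /** Toy version of comma-delimited KEY=VALUE parsing, for illustration only. */
  static Map<String, String> parse(String optionValue) {
    Map<String, String> env = new LinkedHashMap<>();
    for (String entry : optionValue.split(",")) {    // commas delimit entries ...
      String[] kv = entry.split("=", 2);
      if (kv.length == 2) {
        env.put(kv[0], kv[1]);
      }                                              // ... so ",/tmp/bar:/tmp/bar:ro" is silently dropped
    }
    return env;
  }

  public static void main(String[] args) {
    String value =
        "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/tmp/foo:/tmp/foo:ro,/tmp/bar:/tmp/bar:ro";
    // Prints {YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/tmp/foo:/tmp/foo:ro}: the second mount is lost
    System.out.println(parse(value));
  }
}
{noformat}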




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8027) Setting hostname of docker container breaks for --net=host in docker 1.13

2018-03-12 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-8027:
-

 Summary: Setting hostname of docker container breaks for 
--net=host in docker 1.13
 Key: YARN-8027
 URL: https://issues.apache.org/jira/browse/YARN-8027
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.0.0
Reporter: Jim Brennan
Assignee: Jim Brennan


In DockerLinuxContainerRuntime:launchContainer, we are adding the --hostname 
argument to the docker run command to set the hostname in the container to 
something like:  ctr-e84-1520889172376-0001-01-01.

This does not work when combined with the --net=host command line option in 
Docker 1.13.1.  It causes multiple failures because clients cannot resolve 
that hostname.

We haven't seen this before because we were using docker 1.12.6 which seems to 
ignore --hostname when you are using --net=host.
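
One possible direction, shown here only as a sketch and not as the agreed fix, 
is to skip the --hostname argument entirely when the host network is requested, 
since the container shares the host's own name in that case anyway.  The helper 
below is hypothetical.
{noformat}
import java.util.ArrayList;
import java.util.List;

public class DockerRunArgsSketch {
  /** Hypothetical helper, for illustration only: add --hostname only for non-host networks. */
  static List<String> runArgs(String image, String network, String hostname) {
    List<String> args = new ArrayList<>();
    args.add("docker");
    args.add("run");
    args.add("--net=" + network);
    if (!"host".equals(network)) {        // with --net=host the container keeps the host's own name
      args.add("--hostname=" + hostname);
    }
    args.add(image);
    return args;
  }

  public static void main(String[] args) {
    System.out.println(runArgs("hadoop-docker-image", "host",
        "ctr-e84-1520889172376-0001-01-01"));
  }
}
{noformat}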



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7857) -fstack-check compilation flag causes binary incompatibility for container-executor between RHEL 6 and RHEL 7

2018-01-30 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-7857:
-

 Summary: -fstack-check compilation flag causes binary 
incompatibility for container-executor between RHEL 6 and RHEL 7
 Key: YARN-7857
 URL: https://issues.apache.org/jira/browse/YARN-7857
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0
Reporter: Jim Brennan
Assignee: Jim Brennan


The segmentation fault in container-executor reported in [YARN-7796] appears 
to be due to a binary compatibility issue with the {{-fstack-check}} flag that 
was added in [YARN-6721].

Based on my testing, a container-executor (without the patch from [YARN-7796]) 
compiled on RHEL 6 with the -fstack-check flag always hits this segmentation 
fault when run on RHEL 7.  But if you compile without this flag, the 
container-executor runs on RHEL 7 with no problems.  I also verified this with 
a simple program that just does the copy_file.

I think we need to either remove this flag, or find a suitable alternative.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7678) Logging of container memory stats is missing in 2.8

2017-12-21 Thread Jim Brennan (JIRA)
Jim Brennan created YARN-7678:
-

 Summary: Logging of container memory stats is missing in 2.8
 Key: YARN-7678
 URL: https://issues.apache.org/jira/browse/YARN-7678
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 3.0.0, 2.8.0
Reporter: Jim Brennan
Assignee: Jim Brennan


YARN-3424 changed the logging of memory stats in ContainersMonitorImpl from INFO to 
DEBUG.
We have found these log messages to be useful in Out-of-Memory 
situations - they provide detail that shows the memory profile of the 
container over time, which can be helpful in determining the root cause.

Here's an example message from YARN-3424:
{noformat}
2015-03-27 09:32:48,905 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Memory usage of ProcessTree 9215 for container-id 
container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 
2.6 GB of 2.1 GB virtual memory used
{noformat}

I propose changing this to use a separate logger for this message, so that we 
can enable debug logging for it without enabling all of the other debug 
logging for ContainersMonitorImpl.
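
For illustration, a minimal sketch of the dedicated-logger idea.  The logger name 
"ContainersMonitorImpl.audit" matches what shows up in our logs, but the exact 
name and the use of slf4j here are assumptions, not necessarily what would be 
committed.
{noformat}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ContainerMemoryAuditSketch {
  // A dedicated logger name lets its level be raised to DEBUG on its own,
  // without turning on all of the other ContainersMonitorImpl debug output.
  private static final Logger AUDIT =
      LoggerFactory.getLogger("ContainersMonitorImpl.audit");

  static void logUsage(int pid, String containerId, String physUsed, String physLimit,
                       String virtUsed, String virtLimit) {
    if (AUDIT.isDebugEnabled()) {
      AUDIT.debug("Memory usage of ProcessTree {} for container-id {}: {} of {} physical"
          + " memory used; {} of {} virtual memory used",
          pid, containerId, physUsed, physLimit, virtUsed, virtLimit);
    }
  }
}
{noformat}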



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org