[jira] [Updated] (YARN-9507) Fix NPE if NM fails to init
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-9507: Priority: Minor (was: Major) > Fix NPE if NM fails to init > --- > > Key: YARN-9507 > URL: https://issues.apache.org/jira/browse/YARN-9507 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-9507-001.patch > > > 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When > stopping the service NodeManager > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
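The stack trace above shows serviceStop() dereferencing state that was never initialized: AbstractService.init() calls ServiceOperations.stopQuietly() when serviceInit() fails partway through, so stop runs against half-built fields. A minimal, self-contained sketch of the usual null-guard pattern follows; the class and field names are illustrative only and are not taken from the NodeManager code or the attached patch.
{code}
// Illustrative only: guard fields that may still be null when stop() runs
// because init() failed before assigning them.
public class SafeStopExample {
  private AutoCloseable resource;   // assigned in init(); may be null on early failure

  public void init() throws Exception {
    // Imagine this throws before 'resource' is assigned.
    resource = openResource();
  }

  public void stop() throws Exception {
    if (resource != null) {         // null-check avoids the NPE seen in the stack trace
      resource.close();
    }
  }

  private AutoCloseable openResource() throws Exception {
    throw new Exception("init failure before assignment");
  }

  public static void main(String[] args) throws Exception {
    SafeStopExample s = new SafeStopExample();
    try {
      s.init();
    } catch (Exception e) {
      // init failed; stop() is still called, mirroring stopQuietly() after a failed init
    }
    s.stop();
    System.out.println("stop() completed without NullPointerException");
  }
}
{code}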
[jira] [Commented] (YARN-6272) TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently
[ https://issues.apache.org/jira/browse/YARN-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825705#comment-16825705 ] Prabhu Joseph commented on YARN-6272: - [~giovanni.fumarola] Could you please review this jira when you get time. This fixes intermittent failure of {{TestAMRMClient#testAMRMClientWithContainerResourceChange}}. Thanks. > TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently > - > > Key: YARN-6272 > URL: https://issues.apache.org/jira/browse/YARN-6272 > Project: Hadoop YARN > Issue Type: Test > Components: yarn >Affects Versions: 3.0.0-alpha4 >Reporter: Ray Chiang >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-6272-001.patch > > > I'm seeing this unit test fail fairly often in trunk: > testAMRMClientWithContainerResourceChange(org.apache.hadoop.yarn.client.api.impl.TestAMRMClient) > Time elapsed: 5.113 sec <<< FAILURE! > java.lang.AssertionError: expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.doContainerResourceChange(TestAMRMClient.java:1087) > at > org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.testAMRMClientWithContainerResourceChange(TestAMRMClient.java:963) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9440) Improve diagnostics for scheduler and app activities
[ https://issues.apache.org/jira/browse/YARN-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825704#comment-16825704 ] Tao Yang commented on YARN-9440: Thanks [~cheersyang] for the review. I think we should weigh the collection approach against two factors, performance and invasiveness, and find a better balance between them. The intent of the current approach (passing a collector option into the final validation methods, and making it non-empty only when we actually need to record diagnostics for the current allocation) is to keep the overhead of activity recording as low as possible. The cost is higher invasiveness: basic classes such as ResourceCalculator and PlacementConstraintsUtils, and their calling stacks, have to change.
{quote} Right now a lot of changes are due to passing an instance in the method signature. Can we use a singleton instead? {quote} I think it is hard to manage and propagate the diagnostics from the basic classes up to the top-level callers with a singleton, since those classes may be called by multiple threads belonging to different processes such as allocation, commit and preemption. Another option is to return the diagnostics in the result object, but that seems even more invasive.
{quote} However, precheck might fail with other reasons, not just PC violation. {quote} Yes, there can be other reasons, such as "partition doesn't match" and "request doesn't exist yet", which PCDiagnosticsCollector cannot track. Perhaps we should replace PCDiagnosticsCollector with a more generic collector that tracks common diagnostics covering all of these reasons. Thoughts?
{quote} So we can put a detail error message in the exception. And use that for logging the activity too. {quote} I think exceptions are too expensive to use as a medium for passing diagnostics, especially on the most frequent scheduling path.
{quote} We should keep RC class as it is, without adding {{ResourceDiagnosticsCollector}} to any of method signatures. We can collect info outside of this class. Same comment applies to {{Resources}} too. {quote} Given the performance concern, collecting outside of these classes would require recalculating, whenever the activity actually needs to be recorded, just to obtain the diagnostics. Is that acceptable?
> Improve diagnostics for scheduler and app activities > > > Key: YARN-9440 > URL: https://issues.apache.org/jira/browse/YARN-9440 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9440.001.patch, YARN-9440.002.patch > > > [Design doc > #4.1|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.cyw6zeehzqmx] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
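For reference, a minimal sketch of the thread-local variant of the "singleton" idea discussed above: the collector is a no-op unless recording is enabled for the current allocation attempt, so basic classes such as ResourceCalculator would not need new method parameters. This is only an illustration of the trade-off, not the approach taken in the attached patches, and all class and method names below are made up.
{code}
// Illustrative sketch only: a thread-local diagnostics collector that is a
// cheap no-op unless activity recording was enabled for the current attempt.
import java.util.ArrayList;
import java.util.List;

public final class DiagnosticsCollectorContext {
  private static final ThreadLocal<List<String>> CURRENT = new ThreadLocal<>();

  private DiagnosticsCollectorContext() { }

  // Called by the scheduler only when this allocation attempt should be recorded.
  public static void start() {
    CURRENT.set(new ArrayList<>());
  }

  // Hot-path call: does nothing unless start() was called on this thread.
  public static void collect(String diagnostic) {
    List<String> sink = CURRENT.get();
    if (sink != null) {
      sink.add(diagnostic);
    }
  }

  // Ends recording and returns whatever was collected on this thread.
  public static List<String> finish() {
    List<String> sink = CURRENT.get();
    CURRENT.remove();
    return sink == null ? new ArrayList<>() : sink;
  }

  public static void main(String[] args) {
    // Recording disabled: collect() is a no-op.
    DiagnosticsCollectorContext.collect("partition doesn't match");
    // Recording enabled for one allocation attempt:
    DiagnosticsCollectorContext.start();
    DiagnosticsCollectorContext.collect("insufficient resource: memory");
    System.out.println(DiagnosticsCollectorContext.finish());
  }
}
{code}
A thread-local would keep the basic classes free of extra parameters and sidestep the multi-thread concern (each allocation, commit or preemption thread sees its own collector), at the cost of having to set and clear it reliably around every attempt.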
[jira] [Assigned] (YARN-6325) ParentQueue and LeafQueue with same name can cause queue name based operations to fail
[ https://issues.apache.org/jira/browse/YARN-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu reassigned YARN-6325: Assignee: Yuanbo Liu > ParentQueue and LeafQueue with same name can cause queue name based > operations to fail > -- > > Key: YARN-6325 > URL: https://issues.apache.org/jira/browse/YARN-6325 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Jonathan Hung >Assignee: Yuanbo Liu >Priority: Major > Attachments: Screen Shot 2017-03-13 at 2.28.30 PM.png, > capacity-scheduler.xml > > > For example, configure capacity scheduler with two leaf queues: {{root.a.a1}} > and {{root.b.a}}, with {{yarn.scheduler.capacity.root.queues}} as {{b,a}} (in > that order). > Then add a mapping e.g. {{u:username:a}} to {{capacity-scheduler.xml}} and > call {{refreshQueues}}. Operation fails with {noformat}refreshQueues: > java.io.IOException: Failed to re-init queues : mapping contains invalid or > non-leaf queue a > at > org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.logAndWrapException(AdminService.java:866) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:391) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2653) > Caused by: java.io.IOException: Failed to re-init queues : mapping contains > invalid or non-leaf queue a > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:404) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:396) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:386) > ... 10 more > Caused by: java.io.IOException: mapping contains invalid or non-leaf queue a > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getUserGroupMappingPlacementRule(CapacityScheduler.java:547) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updatePlacementRules(CapacityScheduler.java:571) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:595) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:400) > ... 12 more > {noformat} > Part of the issue is that the {{queues}} map in > {{CapacitySchedulerQueueManager}} stores queues by queue name. We could do > one of a few things: > # Disallow ParentQueues and LeafQueues to have the same queue name. 
(this > breaks compatibility) > # Store queues by queue path instead of queue name. But this might require > changes in lots of places, e.g. in this case the queue-mappings would have to > map to a queue path instead of a queue name (which also breaks compatibility) > and possibly others. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
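A sketch of the configuration described in the report, written as plain key/value settings so the short-name collision is easy to see. The queue-mappings property name is assumed from the standard capacity-scheduler keys rather than quoted from the report, and the snippet only prints the properties; it does not start a scheduler.
{code}
// Illustrative only: the capacity-scheduler settings from the report, where the
// short name "a" refers both to the ParentQueue root.a and the LeafQueue root.b.a.
import java.util.LinkedHashMap;
import java.util.Map;

public class AmbiguousQueueNameExample {
  public static void main(String[] args) {
    Map<String, String> capacitySchedulerXml = new LinkedHashMap<>();
    capacitySchedulerXml.put("yarn.scheduler.capacity.root.queues", "b,a");
    capacitySchedulerXml.put("yarn.scheduler.capacity.root.a.queues", "a1");  // leaf root.a.a1
    capacitySchedulerXml.put("yarn.scheduler.capacity.root.b.queues", "a");   // leaf root.b.a
    // Mapping by short queue name: "a" resolves to the ParentQueue root.a,
    // so refreshQueues fails with "mapping contains invalid or non-leaf queue a".
    capacitySchedulerXml.put("yarn.scheduler.capacity.queue-mappings", "u:username:a");
    capacitySchedulerXml.forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
{code}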
[jira] [Commented] (YARN-6325) ParentQueue and LeafQueue with same name can cause queue name based operations to fail
[ https://issues.apache.org/jira/browse/YARN-6325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825656#comment-16825656 ] Yuanbo Liu commented on YARN-6325: -- [~leftnoteasy] we have such kind of issue in our environment. I'd like to patch it. Any further comment will be welcome. > ParentQueue and LeafQueue with same name can cause queue name based > operations to fail > -- > > Key: YARN-6325 > URL: https://issues.apache.org/jira/browse/YARN-6325 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Jonathan Hung >Priority: Major > Attachments: Screen Shot 2017-03-13 at 2.28.30 PM.png, > capacity-scheduler.xml > > > For example, configure capacity scheduler with two leaf queues: {{root.a.a1}} > and {{root.b.a}}, with {{yarn.scheduler.capacity.root.queues}} as {{b,a}} (in > that order). > Then add a mapping e.g. {{u:username:a}} to {{capacity-scheduler.xml}} and > call {{refreshQueues}}. Operation fails with {noformat}refreshQueues: > java.io.IOException: Failed to re-init queues : mapping contains invalid or > non-leaf queue a > at > org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.logAndWrapException(AdminService.java:866) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:391) > at > org.apache.hadoop.yarn.server.api.impl.pb.service.ResourceManagerAdministrationProtocolPBServiceImpl.refreshQueues(ResourceManagerAdministrationProtocolPBServiceImpl.java:114) > at > org.apache.hadoop.yarn.proto.ResourceManagerAdministrationProtocol$ResourceManagerAdministrationProtocolService$2.callBlockingMethod(ResourceManagerAdministrationProtocol.java:271) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:522) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:867) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:813) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1857) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2653) > Caused by: java.io.IOException: Failed to re-init queues : mapping contains > invalid or non-leaf queue a > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:404) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:396) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:386) > ... 10 more > Caused by: java.io.IOException: mapping contains invalid or non-leaf queue a > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getUserGroupMappingPlacementRule(CapacityScheduler.java:547) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updatePlacementRules(CapacityScheduler.java:571) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:595) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:400) > ... 
12 more > {noformat} > Part of the issue is that the {{queues}} map in > {{CapacitySchedulerQueueManager}} stores queues by queue name. We could do > one of a few things: > # Disallow ParentQueues and LeafQueues to have the same queue name. (this > breaks compatibility) > # Store queues by queue path instead of queue name. But this might require > changes in lots of places, e.g. in this case the queue-mappings would have to > map to a queue path instead of a queue name (which also breaks compatibility) > and possibly others. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825581#comment-16825581 ] Hadoop QA commented on YARN-9486: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 7s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 25s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 24s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 70m 16s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-9486 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966949/YARN-9486.004.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 9d1db0582e98 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / a703dae | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24020/testReport/ | | Max. process+thread count | 446 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24020/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > Docker container exited with failure does not get clean up
[jira] [Updated] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-9486: Attachment: YARN-9486.004.patch > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch, > YARN-9486.003.patch, YARN-9486.004.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM 
context: [container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code} > There is no docker rm command performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825534#comment-16825534 ] Eric Yang commented on YARN-9486: - I also tried: {code} boolean alreadyLaunched = launch.isLaunchCompleted(); {code} This prevents the container relaunch from happening. The completed flag is not set if we relaunch a container that is still running. As a result, we need to check both markedLaunched and isLaunchCompleted to get a better picture of whether the container failed to launch, is still running, or has not started at all. > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch, > YARN-9486.003.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. 
No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: [container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code} > There is no docker rm command performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
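A simplified sketch of the cleanup decision being discussed: cleanup runs if the launch either actually started or its launch call ran to completion, so a relaunch that failed before launching still gets cleaned up. The class below only mirrors the flag names from the comments (markedLaunched, isLaunchCompleted); it is not the actual ContainerCleanup code.
{code}
// Illustrative only: decide whether cleanup is needed from the two flags
// discussed in the comments above.
public class CleanupDecisionExample {

  static final class LaunchState {
    final boolean markedLaunched;    // set once the launch actually started
    final boolean launchCompleted;   // set when the launch call finished, even on failure
    LaunchState(boolean markedLaunched, boolean launchCompleted) {
      this.markedLaunched = markedLaunched;
      this.launchCompleted = launchCompleted;
    }
  }

  // Skip cleanup only when the launch neither started nor reached the end of
  // its launch call; a relaunch that failed early still gets container removal.
  static boolean needsCleanup(LaunchState s) {
    return s.markedLaunched || s.launchCompleted;
  }

  public static void main(String[] args) {
    System.out.println(needsCleanup(new LaunchState(false, false))); // never launched -> false
    System.out.println(needsCleanup(new LaunchState(false, true)));  // failed relaunch -> true
    System.out.println(needsCleanup(new LaunchState(true, true)));   // normal exit -> true
  }
}
{code}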
[jira] [Updated] (YARN-9473) [Umbrella] Support Vector Engine ( a new accelerator hardware) based on pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-9473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9473: --- Component/s: nodemanager > [Umbrella] Support Vector Engine ( a new accelerator hardware) based on > pluggable device framework > -- > > Key: YARN-9473 > URL: https://issues.apache.org/jira/browse/YARN-9473 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Zhankun Tang >Assignee: Peter Bacsko >Priority: Major > > As the heterogeneous computation trend rises, new acceleration hardware such as GPUs and FPGAs is being used to satisfy various requirements. > The Vector Engine (VE), a new accelerator released by NEC, is another example. The VE is similar to a GPU but has different characteristics: it is well suited to machine learning and HPC thanks to better memory bandwidth and the absence of a PCIe bottleneck. > Please check here for more VE details: > [https://www.nextplatform.com/2017/11/22/deep-dive-necs-aurora-vector-engine/] > [https://www.hotchips.org/hc30/2conf/2.14_NEC_vector_NEC_SXAurora_TSUBASA_HotChips30_finalb.pdf] > As we know, YARN-8851 is a pluggable device framework which provides an easy way to develop a plugin for such new accelerators. This JIRA proposes to develop a new VE plugin based on that framework, implemented in the same way as the current GPU plugin "NvidiaGPUPluginForRuntimeV2". > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
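As background, a vendor plugin in a pluggable device framework typically reports its devices and reacts to allocation and release callbacks. The sketch below only imitates that shape with invented interface and method names; it does not use the real YARN-8851 API, and the resource name and device discovery are placeholders.
{code}
// Illustrative skeleton only: a made-up plugin interface and a fake VE plugin,
// showing the kind of hooks a device plugin usually provides.
import java.util.LinkedHashSet;
import java.util.Set;

interface AcceleratorPlugin {
  String resourceName();                 // hypothetical resource name, e.g. "nec.com/ve"
  Set<Integer> discoverDevices();        // device ids found on the node
  void onAllocated(Set<Integer> devices);
  void onReleased(Set<Integer> devices);
}

public class VectorEnginePluginSketch implements AcceleratorPlugin {
  @Override public String resourceName() { return "nec.com/ve"; }

  @Override public Set<Integer> discoverDevices() {
    // A real plugin would probe the node with a vendor tool; here we fake two VEs.
    Set<Integer> devices = new LinkedHashSet<>();
    devices.add(0);
    devices.add(1);
    return devices;
  }

  @Override public void onAllocated(Set<Integer> devices) {
    System.out.println("Isolate and expose VE devices: " + devices);
  }

  @Override public void onReleased(Set<Integer> devices) {
    System.out.println("Release VE devices: " + devices);
  }

  public static void main(String[] args) {
    AcceleratorPlugin plugin = new VectorEnginePluginSketch();
    Set<Integer> devs = plugin.discoverDevices();
    plugin.onAllocated(devs);
    plugin.onReleased(devs);
  }
}
{code}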
[jira] [Commented] (YARN-6272) TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently
[ https://issues.apache.org/jira/browse/YARN-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825509#comment-16825509 ] Hadoop QA commented on YARN-6272: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 50s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 35s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 26m 20s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 72m 15s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-6272 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966939/YARN-6272-001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux c04708fe4810 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / a703dae | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24019/testReport/ | | Max. process+thread count | 693 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24019/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently >
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825503#comment-16825503 ] Eric Yang commented on YARN-9486: - {quote}The only option I can think of other than adding the isLaunchCompleted check in ContainerCleanup would be to call markLaunched() when you catch an exception in ContainerRelaunch.call(). That's a little unexpected, so you'd need to add a comment to say we need to mark isLaunched in this case to ensure the original container is cleaned up.{quote} Tried this, and this approach creates another problem. If the container relaunch fails, the container is marked as launched. No reattempt happens on the failed container, and the container reports that it is running. The decision to launch a container is based on the containerAlreadyLaunched flag, so manually changing the state of this flag can create undesired side effects. For cleanup, maybe it is cleaner to rely on isLaunchCompleted because it is always set, even if the container failed to launch. Thoughts? > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch, > YARN-9486.003.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. 
No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: [container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code}
[jira] [Commented] (YARN-6272) TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently
[ https://issues.apache.org/jira/browse/YARN-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825473#comment-16825473 ] Prabhu Joseph commented on YARN-6272: - There are two issues:
1. The test fails with the error below when both the increase and the decrease request are processed in the same heartbeat. {code} java.lang.AssertionError: expected:<1> but was:<2> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) {code}
2. The test fails when the updated container is allocated on some other NM. The requests are added back and reallocated on subsequent node updates. {code} ContainerUpdateContext.java // Allocation happened on NM on the same host, but not on the NM // we need.. We need to signal that this container has to be released. // We also need to add these requests back.. to be reallocated. {code}
I have added a wait-and-retry that triggers node updates from the container's allocated node. The test case fails consistently when run in a for loop; with the fix it works fine, with allocateAttempts going up to 50 in a few runs. The issue was fixed earlier by YARN-5537 but was missed during the YARN-5221 feature merge.
> TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently > - > > Key: YARN-6272 > URL: https://issues.apache.org/jira/browse/YARN-6272 > Project: Hadoop YARN > Issue Type: Test > Components: yarn >Affects Versions: 3.0.0-alpha4 >Reporter: Ray Chiang >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-6272-001.patch > > > I'm seeing this unit test fail fairly often in trunk: > testAMRMClientWithContainerResourceChange(org.apache.hadoop.yarn.client.api.impl.TestAMRMClient) > Time elapsed: 5.113 sec <<< FAILURE! > java.lang.AssertionError: expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.doContainerResourceChange(TestAMRMClient.java:1087) > at > org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.testAMRMClientWithContainerResourceChange(TestAMRMClient.java:963) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
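The wait-and-retry mentioned above can be pictured as a small polling loop around node updates from the NM that holds the container, instead of asserting after a single heartbeat. The helper names below (triggerNodeUpdate, getUpdatedContainerCount) are placeholders, not the actual TestAMRMClient or MiniYARNCluster methods.
{code}
// Illustrative sketch of a wait-and-retry loop for a flaky allocation check.
public class WaitAndRetryExample {
  interface Cluster {
    void triggerNodeUpdate();           // heartbeat from the container's NM (placeholder)
    int getUpdatedContainerCount();     // containers reported as updated so far (placeholder)
  }

  static boolean waitForUpdatedContainers(Cluster cluster, int expected,
      int maxAttempts, long sleepMillis) throws InterruptedException {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (cluster.getUpdatedContainerCount() >= expected) {
        return true;                    // allocation showed up; stop retrying
      }
      cluster.triggerNodeUpdate();      // give the scheduler another node update
      Thread.sleep(sleepMillis);
    }
    return false;
  }

  public static void main(String[] args) throws InterruptedException {
    final int[] count = {0};
    Cluster fake = new Cluster() {
      public void triggerNodeUpdate() { count[0]++; }        // pretend each update allocates one
      public int getUpdatedContainerCount() { return count[0]; }
    };
    System.out.println(waitForUpdatedContainers(fake, 1, 50, 10L));
  }
}
{code}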
[jira] [Updated] (YARN-6272) TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently
[ https://issues.apache.org/jira/browse/YARN-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-6272: Attachment: YARN-6272-001.patch > TestAMRMClient#testAMRMClientWithContainerResourceChange fails intermittently > - > > Key: YARN-6272 > URL: https://issues.apache.org/jira/browse/YARN-6272 > Project: Hadoop YARN > Issue Type: Test > Components: yarn >Affects Versions: 3.0.0-alpha4 >Reporter: Ray Chiang >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-6272-001.patch > > > I'm seeing this unit test fail fairly often in trunk: > testAMRMClientWithContainerResourceChange(org.apache.hadoop.yarn.client.api.impl.TestAMRMClient) > Time elapsed: 5.113 sec <<< FAILURE! > java.lang.AssertionError: expected:<1> but was:<0> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:743) > at org.junit.Assert.assertEquals(Assert.java:118) > at org.junit.Assert.assertEquals(Assert.java:555) > at org.junit.Assert.assertEquals(Assert.java:542) > at > org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.doContainerResourceChange(TestAMRMClient.java:1087) > at > org.apache.hadoop.yarn.client.api.impl.TestAMRMClient.testAMRMClientWithContainerResourceChange(TestAMRMClient.java:963) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825465#comment-16825465 ] Jim Brennan commented on YARN-9486: --- {quote}Patch 003 added the safe guard for missing pid file, and reverted the isLaunchCompleted logic. If IOException is thrown by disk health check, it will leave containers behind. Is that ok? I feel safer to check isLaunchCompleted flag to catch the corner cases, but I understand it may not be helpful in code readability. {quote} Yeah - really anything that throws before you actually call relaunchContainer() will put you in that state - the new call to getLocalPathForWrite() can throw IOException as well. I don't think it's ok to leave containers behind. The only option I can think of other than adding the isLaunchCompleted check in ContainerCleanup would be to call markLaunched() when you catch an exception in ContainerRelaunch.call(). That's a little unexpected, so you'd need to add a comment to say we need to mark isLaunched in this case to ensure the original container is cleaned up. My concern about the isLaunchCompleted check is that we always set that in the finally clause for ContainerLaunch.call(), so any failure before the launchContainer() call will now cause a cleanup where it didn't before (like if we fail on the areDisksHealthy() check like you mentioned for the relaunch case. > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch, > YARN-9486.003.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. 
No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO >
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825460#comment-16825460 ] Hadoop QA commented on YARN-9486: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 35s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 18s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 3 new + 48 unchanged - 0 fixed = 51 total (was 48) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 9s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 6s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 32s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 69m 42s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-9486 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966930/YARN-9486.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 6f9c9c79b953 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / a703dae | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/24018/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24018/testReport/ | | Max. process+thread count | 412 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825416#comment-16825416 ] Eric Yang commented on YARN-9486: - [~Jim_Brennan] Patch 003 added the safeguard for the missing pid file and reverted the isLaunchCompleted logic. If an IOException is thrown by the disk health check, it will leave containers behind. Is that ok? I feel it is safer to check the isLaunchCompleted flag to catch the corner cases, but I understand it may not help code readability. > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch, > YARN-9486.003.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. 
No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: [container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code} > There is no docker rm command performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9507) Fix NPE if NM fails to init
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola updated YARN-9507: --- Summary: Fix NPE if NM fails to init (was: NPE when u stop NM if NM Init failed) > Fix NPE if NM fails to init > --- > > Key: YARN-9507 > URL: https://issues.apache.org/jira/browse/YARN-9507 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-9507-001.patch > > > 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When > stopping the service NodeManager > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-9486: Attachment: YARN-9486.003.patch > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch, > YARN-9486.003.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: 
[container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code} > There is no docker rm command performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825404#comment-16825404 ] Jim Brennan commented on YARN-9486: --- [~eyang] {quote} The right logic is probably try to locate it first, if it is not found, then create a new path. {quote} I agree. I think it we fix this, we won't need to change the cleanup logic. > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , 
ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: [container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code} > There is no docker rm command performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825344#comment-16825344 ] Eric Yang edited comment on YARN-9486 at 4/24/19 5:11 PM: -- [~Jim_Brennan] getLocalPathForWrite will choose the first disk that can write, where getLocalPathForRead will locate the file, if it exists or throw IOException if the file does not exist. If it is changed to use getLocalPathForWrite, then we may end up with pid files on multiple disks. The right logic is probably try to locate it first, if it is not found, then create a new path. was (Author: eyang): [~Jim_Brennan] getLocalPathForWrite will choose the first disk that can write, where getLocalPathForRead will locate the file, if it exists or throw IOException if the file does not exist. If it is changed to use getLocalPathForWrite, then we may end up with pid files on multiple disks. > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. 
No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: [container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code} > There is no docker rm command performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
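A minimal sketch of the "locate it first, if it is not found then create a new path" approach described above, reusing the two LocalDirsHandlerService calls quoted elsewhere in this thread (getLocalPathForRead throws an IOException when the file is on no disk). The enclosing method is assumed for illustration and is not taken from an attached patch.
{code}
// Sketch only: prefer the pid file from the previous launch, fall back to a writable dir.
private Path getPidFilePath(LocalDirsHandlerService dirsHandler, String pidFileSubpath)
    throws IOException {
  try {
    // Reuse the existing pid file if it is still present on one of the local dirs.
    return dirsHandler.getLocalPathForRead(pidFileSubpath);
  } catch (IOException e) {
    // Nothing to read (e.g. the container never started); choose a writable dir instead of
    // failing the relaunch, which also keeps the pid file on a single disk per container.
    return dirsHandler.getLocalPathForWrite(pidFileSubpath);
  }
}
{code}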
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825344#comment-16825344 ] Eric Yang commented on YARN-9486: - [~Jim_Brennan] getLocalPathForWrite will choose the first disk that can write, where getLocalPathForRead will locate the file, if it exists or throw IOException if the file does not exist. If it is changed to use getLocalPathForWrite, then we may end up with pid files on multiple disks. > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, 
Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: [container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code} > There is no docker rm command performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9509) Capped cpu usage with cgroup strict-resource-usage based on a multiplier
[ https://issues.apache.org/jira/browse/YARN-9509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825338#comment-16825338 ] Hadoop QA commented on YARN-9509: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 26s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 3m 2s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 5s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 14s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 5s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 6 new + 220 unchanged - 0 fixed = 226 total (was 220) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 56s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 0m 49s{color} | {color:red} hadoop-yarn-api in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 10s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 38s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}101m 2s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | TEST-TestYarnConfigurationFields | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/hadoop-multibranch/job/PR-766/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/766 | | JIRA Issue | YARN-9509 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux cf20ec8f1d32 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / e1c5ddf | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs |
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825333#comment-16825333 ] Jim Brennan commented on YARN-9486: --- [~eyang] I am not too familiar with the ContainerRelaunch path, but why is it using getLocalPathForRead() ? Doesn't it need to overwrite that file? ContainerLaunch is using: {noformat} String pidFileSubpath = getPidFileSubpath(appIdStr, containerIdStr); pidFilePath = dirsHandler.getLocalPathForWrite(pidFileSubpath); {noformat} > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. 
No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: [container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code} > There is no docker rm command performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825312#comment-16825312 ] Eric Yang commented on YARN-9486: - [~Jim_Brennan] This stacktrace tells the whole story: {code} 2019-04-23 22:34:08,919 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch: Failed to relaunch container. java.io.IOException: Could not find nmPrivate/application_1556058714621_0001/container_1556058714621_0001_01_02//container_1556058714621_0001_01_02.pid in any of the directories at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getPathToRead(LocalDirsHandlerService.java:597) at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getLocalPathForRead(LocalDirsHandlerService.java:612) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.getPidFilePath(ContainerRelaunch.java:200) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:90) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} It got a IOException because pidFilePath does not exist, and this causes relaunchContainer logic to skip prepareForLaunch and setContainerCompletedStatus to true. This means if pidFile does not exist, relaunch logic can not work. This is problematic for container that fail to start, and relaunch would not retry. It looks like we may want to put a empty pid file to allow pidPathFile finder to work, even if no pid file could be found in Docker. We may want to remove docker container in method cleanupContainerFiles in ContainerLaunch class. Otherwise, the existence of previous docker container will prevent the relaunch from happening as well. > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. 
No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15
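To make the container-removal suggestion above concrete, here is a generic best-effort sketch that shells out to docker rm so a stale container does not block a relaunch. It is only an illustration; the real NodeManager would go through its Docker runtime classes rather than ProcessBuilder.
{code}
// Illustrative only: remove a leftover Docker container by name, best effort.
import java.io.IOException;

public final class DockerContainerRemover {

  public static void removeIfPresent(String containerName) {
    try {
      Process p = new ProcessBuilder("docker", "rm", "-f", containerName)
          .redirectErrorStream(true)
          .start();
      int rc = p.waitFor();
      if (rc != 0) {
        System.err.println("docker rm exited with " + rc + " for " + containerName);
      }
    } catch (IOException e) {
      // Best effort: a cleanup failure should not mask the original container error.
      System.err.println("failed to run docker rm: " + e.getMessage());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
{code}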
[jira] [Commented] (YARN-9504) [UI2] Fair scheduler queue view page is broken
[ https://issues.apache.org/jira/browse/YARN-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825295#comment-16825295 ] Hadoop QA commented on YARN-9504: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 27m 38s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 4s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 40m 0s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-9504 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966893/YARN-9504.002.patch | | Optional Tests | dupname asflicense shadedclient | | uname | Linux a67882fde946 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / e1c5ddf | | maven | version: Apache Maven 3.3.9 | | Max. process+thread count | 412 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24017/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > [UI2] Fair scheduler queue view page is broken > -- > > Key: YARN-9504 > URL: https://issues.apache.org/jira/browse/YARN-9504 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, yarn-ui-v2 >Affects Versions: 3.2.0, 3.3.0, 3.2.1 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: Screenshot 2019-04-23 at 14.52.57.png, Screenshot > 2019-04-23 at 14.59.35.png, YARN-9504.001.patch, YARN-9504.002.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > UI2 queue page currently displays white screen for Fair Scheduler. > > In src/main/webapp/app/components/tree-selector.js:377 (getUsedCapacity) code > refers to > queueData.get("partitionMap") which is null for fair scheduler queue. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9486) Docker container exited with failure does not get clean up correctly
[ https://issues.apache.org/jira/browse/YARN-9486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825222#comment-16825222 ] Jim Brennan commented on YARN-9486: --- [~eyang] I'm not sure I agree. This suggests that containerAlreadyLaunched has not been set yet when we get here. It seems to me that the bug is in the relaunch case - shouldn't we be marking the container launched when we relaunch it? It looks like the ContainerLaunch.relaunchContainer() calls prepareForLaunch(), which should set it. Do you know why this is not happening in this case? > Docker container exited with failure does not get clean up correctly > > > Key: YARN-9486 > URL: https://issues.apache.org/jira/browse/YARN-9486 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-9486.001.patch, YARN-9486.002.patch > > > When docker container encounters error and exit prematurely > (EXITED_WITH_FAILURE), ContainerCleanup does not remove container, instead we > get messages that look like this: > {code} > java.io.IOException: Could not find > nmPrivate/application_1555111445937_0008/container_1555111445937_0008_01_07//container_1555111445937_0008_01_07.pid > in any of the directories > 2019-04-15 20:42:16,454 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > RELAUNCHING to EXITED_WITH_FAILURE > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Cleaning up container container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,455 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerCleanup: > Container container_1555111445937_0008_01_07 not launched. 
No cleanup > needed to be done > 2019-04-15 20:42:16,455 WARN > org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hbase > OPERATION=Container Finished - Failed TARGET=ContainerImpl > RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE > APPID=application_1555111445937_0008 > CONTAINERID=container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Container container_1555111445937_0008_01_07 transitioned from > EXITED_WITH_FAILURE to DONE > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: > Removing container_1555111445937_0008_01_07 from application > application_1555111445937_0008 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: > Stopping resource-monitoring for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,458 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: > Considering container container_1555111445937_0008_01_07 for > log-aggregation > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting container-status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Getting localization status for container_1555111445937_0008_01_07 > 2019-04-15 20:42:16,804 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Returning ContainerStatus: [ContainerId: > container_1555111445937_0008_01_07, ExecutionType: GUARANTEED, State: > COMPLETE, Capability: , Diagnostics: ..., ExitStatus: > -1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE] > 2019-04-15 20:42:18,464 INFO > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed > completed containers from NM context: [container_1555111445937_0008_01_07] > 2019-04-15 20:43:50,476 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: > Stopping container with container Id: container_1555111445937_0008_01_07 > {code} > There is no docker rm command performed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
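A compressed sketch of the ordering being discussed: if prepareForLaunch() runs before the executor call during a relaunch, containerAlreadyLaunched is set and a failed relaunch no longer falls into the "not launched, no cleanup needed" branch. The method bodies and the launchOnExecutor() name are illustrative, not the actual ContainerLaunch code.
{code}
// Illustrative ordering only, not the real ContainerLaunch implementation.
int relaunchContainer() throws IOException {
  prepareForLaunch();          // marks containerAlreadyLaunched before anything can fail
  return launchOnExecutor();   // if this throws, cleanup now sees a launched container
}
{code}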
[jira] [Commented] (YARN-9504) [UI2] Fair scheduler queue view page is broken
[ https://issues.apache.org/jira/browse/YARN-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825158#comment-16825158 ] Zoltan Siegl commented on YARN-9504: [~shuzirra] Thank you for the review. Changes done, new patch on the way. > [UI2] Fair scheduler queue view page is broken > -- > > Key: YARN-9504 > URL: https://issues.apache.org/jira/browse/YARN-9504 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, yarn-ui-v2 >Affects Versions: 3.2.0, 3.3.0, 3.2.1 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: Screenshot 2019-04-23 at 14.52.57.png, Screenshot > 2019-04-23 at 14.59.35.png, YARN-9504.001.patch, YARN-9504.002.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > UI2 queue page currently displays white screen for Fair Scheduler. > > In src/main/webapp/app/components/tree-selector.js:377 (getUsedCapacity) code > refers to > queueData.get("partitionMap") which is null for fair scheduler queue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9504) [UI2] Fair scheduler queue view page is broken
[ https://issues.apache.org/jira/browse/YARN-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Siegl updated YARN-9504: --- Attachment: YARN-9504.002.patch > [UI2] Fair scheduler queue view page is broken > -- > > Key: YARN-9504 > URL: https://issues.apache.org/jira/browse/YARN-9504 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, yarn-ui-v2 >Affects Versions: 3.2.0, 3.3.0, 3.2.1 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: Screenshot 2019-04-23 at 14.52.57.png, Screenshot > 2019-04-23 at 14.59.35.png, YARN-9504.001.patch, YARN-9504.002.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > UI2 queue page currently displays white screen for Fair Scheduler. > > In src/main/webapp/app/components/tree-selector.js:377 (getUsedCapacity) code > refers to > queueData.get("partitionMap") which is null for fair scheduler queue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9504) [UI2] Fair scheduler queue view page is broken
[ https://issues.apache.org/jira/browse/YARN-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825154#comment-16825154 ] Gergely Pollak commented on YARN-9504: -- [~zsiegl] Thank you for the patch! I see you've added some data validation, which improves the stability greatly. On line 385 you check null == partitionMap, but it won't ensure you can access partitionMap[filter], or partitionMap[filter].absoluteUsedCapacity. So those calculations are not protected by the condition. Otherwise LGTM+1 (non-binding). > [UI2] Fair scheduler queue view page is broken > -- > > Key: YARN-9504 > URL: https://issues.apache.org/jira/browse/YARN-9504 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, yarn-ui-v2 >Affects Versions: 3.2.0, 3.3.0, 3.2.1 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: Screenshot 2019-04-23 at 14.52.57.png, Screenshot > 2019-04-23 at 14.59.35.png, YARN-9504.001.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > UI2 queue page currently displays white screen for Fair Scheduler. > > In src/main/webapp/app/components/tree-selector.js:377 (getUsedCapacity) code > refers to > queueData.get("partitionMap") which is null for fair scheduler queue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9509) Capped cpu usage with cgroup strict-resource-usage based on a multiplier
Nicolas Fraison created YARN-9509: - Summary: Capped cpu usage with cgroup strict-resource-usage based on a multiplier Key: YARN-9509 URL: https://issues.apache.org/jira/browse/YARN-9509 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Nicolas Fraison Add a multiplier configuration on strict resource usage to allow containers to use spare cpu up to a limit. Currently with strict resource usage you can't get more than what you request, which is sometimes not good for jobs that don't have constant cpu usage (for example, spark jobs with multiple stages). But without strict resource usage we have seen some bad behaviour from users who don't tune their requests at all, which leads to containers requesting 2 vcores but constantly using 20. The idea here is to still allow containers to get more cpu than they request if some is free, but also to avoid too big a difference, so job SLAs are not breached if the cluster is full (at least the increase in runtime is contained). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
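A back-of-the-envelope sketch of how such a multiplier could translate into a cgroups CFS quota, assuming the usual 100 ms CFS period. The method and constant names are illustrative and are not part of the attached pull request.
{code}
// Sketch of the proposed cap: the container's CFS quota is its vcore share of the node,
// scaled by a configurable multiplier (1.0 reproduces today's strict behaviour).
public final class CappedCpuSketch {

  static final long CFS_PERIOD_US = 100_000;  // 100 ms scheduling period

  /**
   * @param containerVCores vcores requested by the container (e.g. 2)
   * @param nodeVCores      vcores advertised by the node (e.g. 48)
   * @param multiplier      hypothetical strict-resource-usage multiplier
   * @return value to write into the container cgroup's cpu.cfs_quota_us
   */
  static long cfsQuotaUs(int containerVCores, int nodeVCores, float multiplier) {
    double share = (double) containerVCores / nodeVCores;    // fair share of the node
    double cappedShare = Math.min(1.0, share * multiplier);  // allow bursting up to the cap
    return (long) (cappedShare * nodeVCores * CFS_PERIOD_US);
  }

  public static void main(String[] args) {
    // A 2-vcore container on a 48-vcore node with multiplier 2.0 may burst up to 4 cores'
    // worth of CPU time instead of being hard-capped at 2 (strict) or unbounded (non-strict).
    System.out.println(cfsQuotaUs(2, 48, 2.0f));  // prints 400000, i.e. 4 cores per period
  }
}
{code}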
[jira] [Commented] (YARN-9440) Improve diagnostics for scheduler and app activities
[ https://issues.apache.org/jira/browse/YARN-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825072#comment-16825072 ] Weiwei Yang commented on YARN-9440: --- Hi [~Tao Yang] Some more comments RegularContainerAllocator#preCheckForNodeCandidateSet it gets a PCDiagnosticsCollector and uses that to collect diaglosis info from #precheckNode. However, precheck might fail with other reasons, not just PC violation. I think it makes more sense to let precheckNode throw exception instead of returning a bool value. So we can put a detail error message in the exception. And use that for logging the activity too. This patch modifies \{{ResourceCalculator}} base class, let's not do that. We should keep RC class as it is, without adding {{ResourceDiagnosticsCollector}} to any of method signatures. We can collect info outside of this class. Same comment applies to \{{Resources}} too. Thanks > Improve diagnostics for scheduler and app activities > > > Key: YARN-9440 > URL: https://issues.apache.org/jira/browse/YARN-9440 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9440.001.patch, YARN-9440.002.patch > > > [Design doc > #4.1|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.cyw6zeehzqmx] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
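To make the exception suggestion above concrete, here is a rough sketch (not the YARN-9440 patch) of a precheck that throws with a descriptive reason instead of returning a boolean, so one message covers both PC and non-PC failures and can be reused as the activity diagnostic. The exception class and the helper predicates are hypothetical.
{code}
// Hypothetical sketch only; SchedulingPreCheckException and the helper predicates below do
// not exist in the patch, they illustrate "throw with a detailed message, log it as activity".
class SchedulingPreCheckException extends Exception {
  SchedulingPreCheckException(String reason) {
    super(reason);
  }
}

void precheckNode(SchedulerRequestKey schedulerKey, FiCaSchedulerNode node)
    throws SchedulingPreCheckException {
  if (!requestExistsFor(schedulerKey)) {
    throw new SchedulingPreCheckException("no outstanding request for " + schedulerKey);
  }
  if (!partitionMatches(schedulerKey, node)) {
    throw new SchedulingPreCheckException(
        "node partition " + node.getPartition() + " does not match the request");
  }
  if (!placementConstraintSatisfied(schedulerKey, node)) {
    throw new SchedulingPreCheckException(
        "placement constraint violated on node " + node.getNodeID());
  }
}

// In the allocator, the message becomes the recorded diagnostic:
// try {
//   precheckNode(schedulerKey, node);
// } catch (SchedulingPreCheckException e) {
//   recordSkippedActivity(node, e.getMessage());  // hypothetical recording call
// }
{code}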
[jira] [Commented] (YARN-9476) Create unit tests for VE plugin
[ https://issues.apache.org/jira/browse/YARN-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825041#comment-16825041 ] Hadoop QA commented on YARN-9476: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 23s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 20s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 57s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 2s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 31s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 30s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 27s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 35s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 72m 59s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-9476 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966869/YARN-9476-003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux ada94d66fd4e 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 64f30da | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24016/testReport/ | | Max. process+thread count | 468 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24016/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > Create unit tests for VE plugin >
[jira] [Commented] (YARN-9490) applicationresourceusagereport return wrong number of reserved containers
[ https://issues.apache.org/jira/browse/YARN-9490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825010#comment-16825010 ] Tao Yang commented on YARN-9490: Thanks [~cheersyang] for the review. The comments make sense to me. [~zyb], could you please update the patch according to above comments? > applicationresourceusagereport return wrong number of reserved containers > - > > Key: YARN-9490 > URL: https://issues.apache.org/jira/browse/YARN-9490 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.3.0 >Reporter: yanbing zhang >Assignee: yanbing zhang >Priority: Minor > Fix For: 3.3.0 > > Attachments: YARN-9490.002.patch, YARN-9490.patch, > YARN-9490.patch1.patch > > > when getting an ApplicationResourceUsageReport instance from the class of > SchedulerApplicationAttempt, I found the input constructor > parameter(reservedContainers.size()) is wrong. because the type of this > variable is Map>, so > "reservedContainer.size()" is not the number of containers, but the number of > SchedulerRequestKey. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
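For context, the counting fix implied by the description is a one-liner: sum the inner maps rather than take the size of the outer map (which counts SchedulerRequestKeys). The Map<SchedulerRequestKey, Map<NodeId, RMContainer>> shape is assumed from the description, whose generics were stripped by the mail formatting.
{code}
// Sketch: count reserved containers, not request keys.
int numReservedContainers = 0;
for (Map<NodeId, RMContainer> reservedOnNodes : reservedContainers.values()) {
  numReservedContainers += reservedOnNodes.size();
}
// equivalent stream form:
// int numReservedContainers = reservedContainers.values().stream().mapToInt(Map::size).sum();
{code}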
[jira] [Commented] (YARN-9507) NPE when u stop NM if NM Init failed
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825000#comment-16825000 ] Tao Yang commented on YARN-9507: Thanks [~BilwaST] for fixing this NPE. LGTM, +1 for the patch. > NPE when u stop NM if NM Init failed > > > Key: YARN-9507 > URL: https://issues.apache.org/jira/browse/YARN-9507 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-9507-001.patch > > > 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When > stopping the service NodeManager > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
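For readers who have not opened the patch, the stack trace in the description shows AbstractService.init() calling stopQuietly() after a failed init, so serviceStop() can run on a NodeManager whose fields were never created. A generic illustration of the required guard is below; the field name is only an example and this is not the YARN-9507 patch itself.
{code}
// Generic illustration of the guard, not the actual fix.
@Override
protected void serviceStop() throws Exception {
  try {
    if (nodeStatusUpdater != null) {   // example of a field a failed serviceInit() never set
      nodeStatusUpdater.stop();
    }
  } finally {
    super.serviceStop();               // still stop whatever children were added
  }
}
{code}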
[jira] [Assigned] (YARN-9477) Implement VE discovery using libudev
[ https://issues.apache.org/jira/browse/YARN-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-9477: -- Assignee: Peter Bacsko > Implement VE discovery using libudev > > > Key: YARN-9477 > URL: https://issues.apache.org/jira/browse/YARN-9477 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > Right now we have a Python script which is able to discover VE cards using > pyudev: https://pyudev.readthedocs.io/en/latest/ > Java does not officially support libudev. There are some projects on Github > (example: https://github.com/Zubnix/udev-java-bindings) but they're not > available as Maven artifacts. > However it's not that difficult to create a minimal layer around libudev > using JNA. We don't have to wrap every function, we need to call 4-5 methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9473) [Umbrella] Support Vector Engine (a new accelerator hardware) based on pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-9473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-9473: -- Assignee: Peter Bacsko > [Umbrella] Support Vector Engine (a new accelerator hardware) based on > pluggable device framework > -- > > Key: YARN-9473 > URL: https://issues.apache.org/jira/browse/YARN-9473 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Zhankun Tang >Assignee: Peter Bacsko >Priority: Major > > As the heterogeneous computation trend rises, new acceleration hardware such as GPUs and FPGAs is used to satisfy various requirements. > The Vector Engine (VE), released by NEC, is another example. The VE is similar to a GPU but has different characteristics; it is well suited for machine learning and HPC thanks to better memory bandwidth and no PCIe bottleneck. > Please check these links for more VE details: > [https://www.nextplatform.com/2017/11/22/deep-dive-necs-aurora-vector-engine/] > [https://www.hotchips.org/hc30/2conf/2.14_NEC_vector_NEC_SXAurora_TSUBASA_HotChips30_finalb.pdf] > As we know, YARN-8851 provides a pluggable device framework that makes it easy to develop a plugin for such new accelerators. This JIRA proposes to develop a new VE plugin based on that framework, implemented along the lines of the current GPU "NvidiaGPUPluginForRuntimeV2" plugin. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9477) Implement VE discovery using libudev
[ https://issues.apache.org/jira/browse/YARN-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9477: --- Summary: Implement VE discovery using libudev (was: Investigate device discovery mechanisms) > Implement VE discovery using libudev > > > Key: YARN-9477 > URL: https://issues.apache.org/jira/browse/YARN-9477 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Priority: Major > > Right now we have a Python script which is able to discover VE cards using > pyudev: https://pyudev.readthedocs.io/en/latest/ > Java does not officially support libudev. There are some projects on Github > (example: https://github.com/Zubnix/udev-java-bindings) but they're not > available as Maven artifacts. > However it's not that difficult to create a minimal layer around libudev > using JNA. We don't have to wrap every function, we need to call 4-5 methods. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9476) Create unit tests for VE plugin
[ https://issues.apache.org/jira/browse/YARN-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824986#comment-16824986 ] Peter Bacsko commented on YARN-9476: Thanks [~snemeth] updated the patch with your suggestions. > Create unit tests for VE plugin > --- > > Key: YARN-9476 > URL: https://issues.apache.org/jira/browse/YARN-9476 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9476-001.patch, YARN-9476-002.patch, > YARN-9476-003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9476) Create unit tests for VE plugin
[ https://issues.apache.org/jira/browse/YARN-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9476: --- Attachment: YARN-9476-003.patch > Create unit tests for VE plugin > --- > > Key: YARN-9476 > URL: https://issues.apache.org/jira/browse/YARN-9476 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9476-001.patch, YARN-9476-002.patch, > YARN-9476-003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9507) NPE when you stop NM if NM Init failed
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824970#comment-16824970 ] Hadoop QA commented on YARN-9507: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 42s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 35m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 50s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 48s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 7s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 92m 48s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-9507 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966854/YARN-9507-001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux d0c4a0503102 4.4.0-144-generic #170~14.04.1-Ubuntu SMP Mon Mar 18 15:02:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 64f30da | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24015/testReport/ | | Max. process+thread count | 306 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24015/console | | Powered by | Apache Yetus 0.8.0
[jira] [Commented] (YARN-9440) Improve diagnostics for scheduler and app activities
[ https://issues.apache.org/jira/browse/YARN-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824960#comment-16824960 ] Weiwei Yang commented on YARN-9440: --- Hi [~Tao Yang] I just did a high-level review about the patch. One thought is we probably can simplify the invocation of DiagnosticsCollector classes. Right now a lot of changes are due to passing an instance in the method signature. Can we use a singleton instead? Please take a look, thanks > Improve diagnostics for scheduler and app activities > > > Key: YARN-9440 > URL: https://issues.apache.org/jira/browse/YARN-9440 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9440.001.patch, YARN-9440.002.patch > > > [Design doc > #4.1|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.cyw6zeehzqmx] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
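[Editor's note] One way to read the "singleton" suggestion above is a thread-local holder that deep scheduler code can write to without a collector appearing in every method signature. The sketch below is purely illustrative; the class and method names are hypothetical and not part of the YARN-9440 patches.
{code:java}
public final class DiagnosticsCollectorHolder {

  // One collection buffer per scheduling thread; null means "not recording".
  private static final ThreadLocal<StringBuilder> CURRENT = new ThreadLocal<>();

  private DiagnosticsCollectorHolder() { }

  // Enable collection at the start of an allocation attempt worth recording.
  public static void start() {
    CURRENT.set(new StringBuilder());
  }

  // Deep scheduler code can call this unconditionally; it is a no-op when
  // collection has not been started for the current thread.
  public static void collect(String reason) {
    StringBuilder sb = CURRENT.get();
    if (sb != null) {
      sb.append(reason).append('\n');
    }
  }

  // Drain and clear the buffer at the end of the allocation attempt.
  public static String finish() {
    StringBuilder sb = CURRENT.get();
    CURRENT.remove();
    return sb == null ? "" : sb.toString();
  }
}
{code}
A thread-local keeps concurrent allocation, commit, and preemption threads isolated from each other, at the cost of having to start and drain the holder around every scheduling attempt.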
[jira] [Commented] (YARN-9432) Reserved containers leak after their requests have been cancelled or satisfied when multi-nodes enabled
[ https://issues.apache.org/jira/browse/YARN-9432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824912#comment-16824912 ] Weiwei Yang commented on YARN-9432: --- Hi [~Tao Yang] Sorry for the late response. I just looked at the description and the patch. I have some doubts regarding the patch. Looking at the calling stack HB triggered scheduling * CS#NodeUpdate * CS#allocateContainersToNode * allocateContainerOnSingleNode * LeafQueue#assignContainers * LeafQueue#allocateFromReservedContainer Async * CS#allocateContainersOnMultiNodes * CS#allocateOrReserveNewContainers * LeafQueue#assignContainers * LeafQueue#allocateFromReservedContainer They call the same method to allocate resource for reserved containers. Why such leak happens in async mode only? Let me know if I miss anything Thank you. > Reserved containers leak after its request has been cancelled or satisfied > when multi-nodes enabled > --- > > Key: YARN-9432 > URL: https://issues.apache.org/jira/browse/YARN-9432 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9432.001.patch, YARN-9432.002.patch > > > Reserved containers may change to be excess after its request has been > cancelled or satisfied, excess reserved containers need to be unreserved > quickly to release resource for others. > For multi-nodes disabled scenario, excess reserved containers can be quickly > released in next node heartbeat, the calling stack is > CapacityScheduler#nodeUpdate --> CapacityScheduler#allocateContainersToNode > --> CapacityScheduler#allocateContainerOnSingleNode. > But for multi-nodes enabled scenario, excess reserved containers have chance > to be released only in allocation process, key phase of the calling stack is > LeafQueue#assignContainers --> LeafQueue#allocateFromReservedContainer. > According to this, excess reserved containers may not be released until its > queue has pending request and has chance to be allocated, and the worst is > that excess reserved containers will never be released and keep holding > resource if there is no additional pending request for this queue. > To solve this problem, my opinion is to directly kill excess reserved > containers when request is satisfied (in FiCaSchedulerApp#apply) or the > allocation number of resource-requests/scheduling-requests is updated to be 0 > (in SchedulerApplicationAttempt#updateResourceRequests / > SchedulerApplicationAttempt#updateSchedulingRequests). > Please feel free to give your suggestions. Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
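[Editor's note] A rough sketch of the proposal in the description above, i.e. releasing excess reservations as soon as the outstanding ask for a scheduler key drops to zero instead of waiting for a later allocation pass. All names here (schedulerKey, Reservation-style type parameter, releaseReservation) are placeholders, not the actual classes touched by YARN-9432.
{code:java}
import java.util.List;
import java.util.Map;

public abstract class ExcessReservationCleaner<K, R> {

  // Called whenever the pending ask for a scheduler key is updated
  // (request cancelled or fully satisfied).
  public void onRequestUpdated(K schedulerKey, int remainingAsk,
      Map<K, List<R>> reservedByKey) {
    if (remainingAsk > 0) {
      return; // demand still outstanding, reservations are not excess yet
    }
    List<R> excess = reservedByKey.remove(schedulerKey);
    if (excess != null) {
      for (R reservation : excess) {
        // Unreserve immediately so the node's resource is freed for others.
        releaseReservation(reservation);
      }
    }
  }

  protected abstract void releaseReservation(R reservation);
}
{code}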
[jira] [Commented] (YARN-9504) [UI2] Fair scheduler queue view page is broken
[ https://issues.apache.org/jira/browse/YARN-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824909#comment-16824909 ] Hadoop QA commented on YARN-9504: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 32m 8s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 2s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 32s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 47m 43s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-9504 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12966851/YARN-9504.001.patch | | Optional Tests | dupname asflicense shadedclient | | uname | Linux b5e3b5db1bb0 4.4.0-143-generic #169~14.04.2-Ubuntu SMP Wed Feb 13 15:00:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 64f30da | | maven | version: Apache Maven 3.3.9 | | Max. process+thread count | 342 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/24014/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > [UI2] Fair scheduler queue view page is broken > -- > > Key: YARN-9504 > URL: https://issues.apache.org/jira/browse/YARN-9504 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, yarn-ui-v2 >Affects Versions: 3.2.0, 3.3.0, 3.2.1 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: Screenshot 2019-04-23 at 14.52.57.png, Screenshot > 2019-04-23 at 14.59.35.png, YARN-9504.001.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > UI2 queue page currently displays white screen for Fair Scheduler. > > In src/main/webapp/app/components/tree-selector.js:377 (getUsedCapacity) code > refers to > queueData.get("partitionMap") which is null for fair scheduler queue. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9504) [UI2] Fair scheduler queue view page is broken
[ https://issues.apache.org/jira/browse/YARN-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Siegl updated YARN-9504: --- Attachment: (was: YARN-9504.001.patch) > [UI2] Fair scheduler queue view page is broken > -- > > Key: YARN-9504 > URL: https://issues.apache.org/jira/browse/YARN-9504 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, yarn-ui-v2 >Affects Versions: 3.2.0, 3.3.0, 3.2.1 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: Screenshot 2019-04-23 at 14.52.57.png, Screenshot > 2019-04-23 at 14.59.35.png, YARN-9504.001.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > UI2 queue page currently displays white screen for Fair Scheduler. > > In src/main/webapp/app/components/tree-selector.js:377 (getUsedCapacity) code > refers to > queueData.get("partitionMap") which is null for fair scheduler queue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9504) [UI2] Fair scheduler queue view page is broken
[ https://issues.apache.org/jira/browse/YARN-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824903#comment-16824903 ] Zoltan Siegl commented on YARN-9504: I have reapplied the patch. [https://builds.apache.org/job/PreCommit-YARN-Build/24011/console] is +1 overall from jenkins, not sure why it is not showing up here. > [UI2] Fair scheduler queue view page is broken > -- > > Key: YARN-9504 > URL: https://issues.apache.org/jira/browse/YARN-9504 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, yarn-ui-v2 >Affects Versions: 3.2.0, 3.3.0, 3.2.1 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: Screenshot 2019-04-23 at 14.52.57.png, Screenshot > 2019-04-23 at 14.59.35.png, YARN-9504.001.patch, YARN-9504.001.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > UI2 queue page currently displays white screen for Fair Scheduler. > > In src/main/webapp/app/components/tree-selector.js:377 (getUsedCapacity) code > refers to > queueData.get("partitionMap") which is null for fair scheduler queue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9508) YarnConfiguration areNodeLabelsEnabled is costly in allocation flow
[ https://issues.apache.org/jira/browse/YARN-9508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned YARN-9508: --- Assignee: Bilwa S T > YarnConfiguration areNodeLabel enabled is costly in allocation flow > --- > > Key: YARN-9508 > URL: https://issues.apache.org/jira/browse/YARN-9508 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Critical > > For every allocate request locking can be avoided. Improving performance > {noformat} > "pool-6-thread-300" #624 prio=5 os_prio=0 tid=0x7f2f91152800 nid=0x8ec5 > waiting for monitor entry [0x7f1ec6a8d000] > java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) > - waiting to lock <0x7f1f8107c748> (a > org.apache.hadoop.yarn.conf.YarnConfiguration) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) > at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) > at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1674) > at > org.apache.hadoop.yarn.conf.YarnConfiguration.areNodeLabelsEnabled(YarnConfiguration.java:3646) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:274) > at > org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:261) > at > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:242) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > at > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:427) > - locked <0x7f24dd3f9e40> (a > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock) > at > org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:352) > at > org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:349) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.sendContainerRequest(MRAMSimulator.java:348) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:94) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
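[Editor's note] A hedged sketch of the kind of fix the thread dump suggests: read the node-labels flag from the configuration once and cache it, so the per-allocate path no longer contends on the synchronized Configuration.getProps(). The holder class below is hypothetical; the actual YARN-9508 change may cache the value elsewhere (for example in the scheduler or RM context), and any cache needs to be reset on configuration refresh.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class NodeLabelsEnabledCache {

  // Cached after the first read; a configuration refresh would need to reset it.
  private static volatile Boolean cached;

  private NodeLabelsEnabledCache() { }

  public static boolean areNodeLabelsEnabled(Configuration conf) {
    Boolean value = cached;
    if (value == null) {
      // Single slow read; subsequent allocate calls skip the synchronized
      // Configuration.getProps() seen in the blocked threads above.
      value = conf.getBoolean(YarnConfiguration.NODE_LABELS_ENABLED,
          YarnConfiguration.DEFAULT_NODE_LABELS_ENABLED);
      cached = value;
    }
    return value;
  }
}
{code}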
[jira] [Created] (YARN-9508) YarnConfiguration areNodeLabelsEnabled is costly in allocation flow
Bibin A Chundatt created YARN-9508: -- Summary: YarnConfiguration areNodeLabel enabled is costly in allocation flow Key: YARN-9508 URL: https://issues.apache.org/jira/browse/YARN-9508 Project: Hadoop YARN Issue Type: Bug Reporter: Bibin A Chundatt For every allocate request locking can be avoided. Improving performance {noformat} "pool-6-thread-300" #624 prio=5 os_prio=0 tid=0x7f2f91152800 nid=0x8ec5 waiting for monitor entry [0x7f1ec6a8d000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2841) - waiting to lock <0x7f1f8107c748> (a org.apache.hadoop.yarn.conf.YarnConfiguration) at org.apache.hadoop.conf.Configuration.get(Configuration.java:1214) at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1268) at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1674) at org.apache.hadoop.yarn.conf.YarnConfiguration.areNodeLabelsEnabled(YarnConfiguration.java:3646) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:234) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:274) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:261) at org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:242) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) at org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:427) - locked <0x7f24dd3f9e40> (a org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService$AllocateResponseLock) at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:352) at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:349) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) at org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.sendContainerRequest(MRAMSimulator.java:348) at org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212) at org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:94) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9507) NPE when you stop NM if NM Init failed
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-9507: Attachment: YARN-9507-001.patch > NPE when u stop NM if NM Init failed > > > Key: YARN-9507 > URL: https://issues.apache.org/jira/browse/YARN-9507 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-9507-001.patch > > > 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When > stopping the service NodeManager > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9504) [UI2] Fair scheduler queue view page is broken
[ https://issues.apache.org/jira/browse/YARN-9504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zoltan Siegl updated YARN-9504: --- Attachment: YARN-9504.001.patch > [UI2] Fair scheduler queue view page is broken > -- > > Key: YARN-9504 > URL: https://issues.apache.org/jira/browse/YARN-9504 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler, yarn-ui-v2 >Affects Versions: 3.2.0, 3.3.0, 3.2.1 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: Screenshot 2019-04-23 at 14.52.57.png, Screenshot > 2019-04-23 at 14.59.35.png, YARN-9504.001.patch, YARN-9504.001.patch > > Original Estimate: 24h > Remaining Estimate: 24h > > UI2 queue page currently displays white screen for Fair Scheduler. > > In src/main/webapp/app/components/tree-selector.js:377 (getUsedCapacity) code > refers to > queueData.get("partitionMap") which is null for fair scheduler queue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9507) NPE when you stop NM if NM Init failed
[ https://issues.apache.org/jira/browse/YARN-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T updated YARN-9507: Summary: NPE when u stop NM if NM Init failed (was: NPE when u stop NM if context is null) > NPE when u stop NM if NM Init failed > > > Key: YARN-9507 > URL: https://issues.apache.org/jira/browse/YARN-9507 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Major > > 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When > stopping the service NodeManager > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) > at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) > at > org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) > at > org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) > at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9507) NPE when you stop NM if context is null
Bilwa S T created YARN-9507: --- Summary: NPE when u stop NM if context is null Key: YARN-9507 URL: https://issues.apache.org/jira/browse/YARN-9507 Project: Hadoop YARN Issue Type: Bug Reporter: Bilwa S T Assignee: Bilwa S T 2019-04-24 14:06:44,101 WARN org.apache.hadoop.service.AbstractService: When stopping the service NodeManager java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStop(NodeManager.java:492) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:220) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:54) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:102) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:947) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1018) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org