[jira] [Commented] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602466#comment-14602466 ] Varun Saxena commented on YARN-3850: There seems to be some issue with the whitespace check. The line it shows in the result doesn't have any whitespace. The one below it does, but that hasn't been added by me. NM fails to read files from full disks which can lead to container logs being lost and other issues --- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation, nodemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Attachments: YARN-3850.01.patch, YARN-3850.02.patch *Container logs* can be lost if a disk has become full (~90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via a call to {{LocalDirsHandlerService#getLogDirs}}, which returns nothing when disks are full. So none of the container logs are aggregated and uploaded. But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory, which contains the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither aggregated logs for the app nor the individual container logs for the app. In addition to this, there are two more issues: # {{ContainerLogsUtil#getContainerLogDirs}} does not consider full disks, so the NM will fail to serve up logs from full disks via its web interfaces. # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full disks, so it is possible that on container recovery the PID file is not found. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
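For context, a minimal sketch of the asymmetry between the two directory lookups described above (illustrative only; the class and field names are simplified assumptions, not the actual LocalDirsHandlerService source):
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the asymmetry described in YARN-3850; simplified,
// not the real LocalDirsHandlerService implementation.
class LocalDirsHandlerSketch {
  private final List<String> goodLogDirs = new ArrayList<>(); // passed the disk check
  private final List<String> fullLogDirs = new ArrayList<>(); // excluded, ~90% full

  // Used by log aggregation: full disks are excluded, so when every disk is
  // full this returns an empty list and nothing gets uploaded.
  List<String> getLogDirs() {
    return new ArrayList<>(goodLogDirs);
  }

  // Used by post-aggregation cleanup: full disks ARE included, so the app
  // log directories on full disks are deleted even though they were never
  // aggregated -- the logs are lost on both paths.
  List<String> getLogDirsForCleanup() {
    List<String> dirs = new ArrayList<>(goodLogDirs);
    dirs.addAll(fullLogDirs);
    return dirs;
  }
}
{code}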
[jira] [Commented] (YARN-3508) Preemption processing occurring on the main RM dispatcher
[ https://issues.apache.org/jira/browse/YARN-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602474#comment-14602474 ] Varun Saxena commented on YARN-3508: [~leftnoteasy], OK, if that's the consensus, I will do so. Preemption processing occurring on the main RM dispatcher Key: YARN-3508 URL: https://issues.apache.org/jira/browse/YARN-3508 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Varun Saxena Attachments: YARN-3508.002.patch, YARN-3508.01.patch We recently saw the RM for a large cluster lag far behind on the AsyncDispatcher event queue. The AsyncDispatcher thread was consistently blocked on the highly-contended CapacityScheduler lock trying to dispatch preemption-related events for RMContainerPreemptEventDispatcher. Preemption processing should occur on the scheduler event dispatcher thread or a separate thread to avoid delaying the processing of other events in the primary dispatcher queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
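A minimal sketch of the direction the issue proposes (purely illustrative, not the actual patch; the class and method names below are invented for the example): handle preemption events on their own thread so the contended scheduler lock is never taken on the main dispatcher.
{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a dedicated dispatcher thread for preemption events,
// so the contended scheduler lock is only taken here, never on the main
// AsyncDispatcher thread.
class PreemptionDispatcherSketch {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Thread worker = new Thread(() -> {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        queue.take().run(); // may block on the scheduler lock, but only here
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    }
  }, "preemption-dispatcher");

  void start() { worker.start(); }

  // Called from the main dispatcher thread: enqueue and return immediately.
  void handle(Runnable preemptionEvent) { queue.offer(preemptionEvent); }
}
{code}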
[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrence of SESSIONEXPIRED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602483#comment-14602483 ] Tsuyoshi Ozawa commented on YARN-3798: -- Sure. After fixing this, I'd like to release 2.7.2 soon. ZKRMStateStore shouldn't create new session without occurrence of SESSIONEXPIRED --- Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-2.7.002.patch, YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch The RM goes down with a NoNode exception during creation of the znode for an app attempt. *Please find the exception logs*
{code}
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926) at
[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrence of SESSIONEXPIRED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602487#comment-14602487 ] Tsuyoshi Ozawa commented on YARN-3798: -- [~zxu] thanks for your explanation. Based on your log, when ZKRMStateStore meets SessionMovedException, I think we should close the session and fail over to another RM as a workaround, since we cannot recover from the exception. If we close and open a new session without fencing, the same issue Bibin reported will come up. I'll create a patch to go to standby mode when ZKRMStateStore meets SessionMovedException. Please let me know if I am missing something. ZKRMStateStore shouldn't create new session without occurrence of SESSIONEXPIRED --- Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-2.7.002.patch, YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch The RM goes down with a NoNode exception during creation of the znode for an app attempt. *Please find the exception logs*
{code}
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
2015-06-09 10:09:44,887 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error updating appAttempt: appattempt_1433764310492_7152_01
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode at
[jira] [Created] (YARN-3856) YARN should allocate container that is closest to the data
jaehoon ko created YARN-3856: Summary: YARN should allocate container that is closest to the data Key: YARN-3856 URL: https://issues.apache.org/jira/browse/YARN-3856 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.7.0 Environment: Hadoop cluster with multi-level network hierarchy Reporter: jaehoon ko Currently, given a Container request for a host, the ResourceManager allocates a Container with the following priorities (RMContainerAllocator.java): - the requested host - a host in the same rack as the requested host - any host This can lead to a sub-optimal allocation if the Hadoop cluster is deployed on multi-level networked hosts (which is typical). For example, suppose a network architecture with one core switch, two aggregate switches, four ToR switches, and 8 hosts. Each switch has two downlinks. Rack IDs of the hosts are as follows: h1, h2: /c/a1/t1 h3, h4: /c/a1/t2 h5, h6: /c/a2/t3 h7, h8: /c/a2/t4 To allocate a container for data in h1, Hadoop first tries h1 itself, then h2, then any of h3 ~ h8. Clearly, h3 or h4 is better than h5~h8 in terms of network distance and bandwidth. However, the current implementation chooses one of h3~h8 with equal probability. This limitation is more obvious when considering Hadoop clusters deployed on VMs or containers. In this case, only the VMs or containers running on the same physical host are considered rack-local, and actual rack-local hosts are chosen with the same probability as distant hosts. The root cause of this limitation is that RMContainerAllocator.java performs exact matching on the rack ID to find a rack-local host. Alternatively, we can perform longest-prefix matching to find the closest host. Using the same network architecture as above, with longest-prefix matching, hosts are selected with the following priorities: h1 h2 h3 or h4 h5 or h6 or h7 or h8 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
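A minimal sketch of the longest-prefix matching being proposed (illustrative only; the names below are invented for the example, not taken from the attached patch):
{code}
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of longest-prefix matching on topology paths such as
// "/c/a1/t2"; not the actual RMContainerAllocator change.
class TopologyMatchSketch {
  // Number of leading path components two rack IDs share.
  static int commonPrefixLength(String a, String b) {
    String[] pa = a.split("/"), pb = b.split("/");
    int n = 0;
    while (n < pa.length && n < pb.length && pa[n].equals(pb[n])) {
      n++;
    }
    return n;
  }

  // Pick the candidate whose rack ID shares the longest prefix with the
  // requested host's rack ID, i.e. the topologically closest host.
  static String closestRack(String requestedRack, List<String> candidates) {
    return candidates.stream()
        .max(Comparator.comparingInt(c -> commonPrefixLength(requestedRack, c)))
        .orElse(null);
  }
}
{code}
With the rack IDs above, a request for /c/a1/t1 prefers a /c/a1/t2 candidate (shared prefix /c/a1) over /c/a2/t3 or /c/a2/t4 (shared prefix only /c), matching the priority order in the description.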
[jira] [Commented] (YARN-2369) Environment variable handling assumes values should be appended
[ https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602500#comment-14602500 ] Hadoop QA commented on YARN-2369: - \\ \\
| (x) *{color:red}-1 overall{color}* | \\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 20m 12s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 33s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 48s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 2m 14s | The applied patch generated 1 new checkstyle issues (total was 176, now 173). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 5m 56s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | common tests | 22m 5s | Tests passed in hadoop-common. |
| {color:green}+1{color} | mapreduce tests | 0m 46s | Tests passed in hadoop-mapreduce-client-common. |
| {color:green}+1{color} | mapreduce tests | 1m 42s | Tests passed in hadoop-mapreduce-client-core. |
| {color:green}+1{color} | yarn tests | 1m 56s | Tests passed in hadoop-yarn-common. |
| | | 75m 45s | | \\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12742041/YARN-2369-6.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 8ef07f7 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/diffcheckstylehadoop-common.txt |
| hadoop-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/testrun_hadoop-common.txt |
| hadoop-mapreduce-client-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/testrun_hadoop-mapreduce-client-common.txt |
| hadoop-mapreduce-client-core test log | https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/testrun_hadoop-mapreduce-client-core.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8355/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8355/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8355/console |
This message was automatically generated.
Environment variable handling assumes values should be appended --- Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Dustin Cote Attachments: YARN-2369-1.patch, YARN-2369-2.patch, YARN-2369-3.patch, YARN-2369-4.patch, YARN-2369-5.patch, YARN-2369-6.patch When processing environment variables for a container context, the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc., but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
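A hedged sketch of the append-by-default behavior being described (illustrative only; this is not the actual Hadoop environment-handling code):
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of append-vs-replace semantics for container
// environment variables; not the actual Hadoop implementation.
public class EnvAppendSketch {
  // Appending is sensible for path-like variables (PATH, CLASSPATH, ...).
  static void append(Map<String, String> env, String key, String val) {
    String existing = env.get(key);
    env.put(key, existing == null ? val : existing + ":" + val);
  }

  public static void main(String[] args) {
    Map<String, String> env = new HashMap<>();
    env.put("JAVA_HOME", "/usr/lib/jvm/default");
    // Applying append semantics to a scalar variable produces a value no
    // consumer expects -- the harmful case described in the issue:
    append(env, "JAVA_HOME", "/opt/jdk8");
    System.out.println(env.get("JAVA_HOME"));
    // prints "/usr/lib/jvm/default:/opt/jdk8"
  }
}
{code}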
[jira] [Updated] (YARN-3856) YARN should allocate container that is closest to the data
[ https://issues.apache.org/jira/browse/YARN-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jaehoon ko updated YARN-3856: - Attachment: YARN-3856.001.patch This patch changes RMContainerAllocator's behaviour so that longest-prefix matching on the rack ID is performed to find a rack-local host. YARN should allocate container that is closest to the data - Key: YARN-3856 URL: https://issues.apache.org/jira/browse/YARN-3856 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.7.0 Environment: Hadoop cluster with multi-level network hierarchy Reporter: jaehoon ko Attachments: YARN-3856.001.patch Currently, given a Container request for a host, the ResourceManager allocates a Container with the following priorities (RMContainerAllocator.java): - the requested host - a host in the same rack as the requested host - any host This can lead to a sub-optimal allocation if the Hadoop cluster is deployed on multi-level networked hosts (which is typical). For example, suppose a network architecture with one core switch, two aggregate switches, four ToR switches, and 8 hosts. Each switch has two downlinks. Rack IDs of the hosts are as follows: h1, h2: /c/a1/t1 h3, h4: /c/a1/t2 h5, h6: /c/a2/t3 h7, h8: /c/a2/t4 To allocate a container for data in h1, Hadoop first tries h1 itself, then h2, then any of h3 ~ h8. Clearly, h3 or h4 is better than h5~h8 in terms of network distance and bandwidth. However, the current implementation chooses one of h3~h8 with equal probability. This limitation is more obvious when considering Hadoop clusters deployed on VMs or containers. In this case, only the VMs or containers running on the same physical host are considered rack-local, and actual rack-local hosts are chosen with the same probability as distant hosts. The root cause of this limitation is that RMContainerAllocator.java performs exact matching on the rack ID to find a rack-local host. Alternatively, we can perform longest-prefix matching to find the closest host. Using the same network architecture as above, with longest-prefix matching, hosts are selected with the following priorities: h1 h2 h3 or h4 h5 or h6 or h7 or h8 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3855) If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup
[ https://issues.apache.org/jira/browse/YARN-3855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602512#comment-14602512 ] Hadoop QA commented on YARN-3855: - \\ \\
| (x) *{color:red}-1 overall{color}* | \\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 35s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 26s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 20s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 10s | The applied patch generated 5 new checkstyle issues (total was 53, now 51). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 2m 11s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | mapreduce tests | 5m 44s | Tests passed in hadoop-mapreduce-client-hs. |
| {color:green}+1{color} | yarn tests | 50m 34s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 95m 41s | | \\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12742043/YARN-3855.2.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 8ef07f7 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8354/artifact/patchprocess/diffcheckstylehadoop-yarn-server-resourcemanager.txt |
| hadoop-mapreduce-client-hs test log | https://builds.apache.org/job/PreCommit-YARN-Build/8354/artifact/patchprocess/testrun_hadoop-mapreduce-client-hs.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8354/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8354/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8354/console |
This message was automatically generated. If acl is enabled and http.authentication.type is simple, user cannot view the app page in default setup Key: YARN-3855 URL: https://issues.apache.org/jira/browse/YARN-3855 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-3855.1.patch, YARN-3855.2.patch If all ACLs (admin acl, queue-admin-acls etc.) are set up properly and http.authentication.type is 'simple' in secure mode, the user cannot view the application web page in the default setup because the incoming user is always considered to be dr.who. The user also cannot pass user.name to indicate the incoming user name, because AuthenticationFilterInitializer is not enabled by default. This is inconvenient from the user's perspective. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3856) YARN should allocate container that is closest to the data
[ https://issues.apache.org/jira/browse/YARN-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602550#comment-14602550 ] Hadoop QA commented on YARN-3856: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 16m 54s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 36s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 16s | The applied patch generated 19 new checkstyle issues (total was 0, now 19). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:red}-1{color} | findbugs | 2m 41s | The patch appears to introduce 1 new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | mapreduce tests | 9m 3s | Tests passed in hadoop-mapreduce-client-app. | | {color:green}+1{color} | yarn tests | 1m 56s | Tests passed in hadoop-yarn-common. | | | | 51m 33s | | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-common | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12742067/YARN-3856.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8ef07f7 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/trunkFindbugsWarningshadoop-mapreduce-client-app.html | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt | | Findbugs warnings | https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html | | hadoop-mapreduce-client-app test log | https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/testrun_hadoop-mapreduce-client-app.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8356/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8356/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8356/console | This message was automatically generated. 
YARN should allocate container that is closest to the data - Key: YARN-3856 URL: https://issues.apache.org/jira/browse/YARN-3856 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.7.0 Environment: Hadoop cluster with multi-level network hierarchy Reporter: jaehoon ko Attachments: YARN-3856.001.patch Currently, given a Container request for a host, the ResourceManager allocates a Container with the following priorities (RMContainerAllocator.java): - the requested host - a host in the same rack as the requested host - any host This can lead to a sub-optimal allocation if the Hadoop cluster is deployed on multi-level networked hosts (which is typical). For example, suppose a network architecture with one core switch, two aggregate switches, four ToR switches, and 8 hosts. Each switch has two downlinks. Rack IDs of the hosts are as follows: h1, h2: /c/a1/t1 h3, h4: /c/a1/t2 h5, h6: /c/a2/t3 h7, h8: /c/a2/t4 To allocate a container for data in h1, Hadoop first tries h1 itself, then h2, then any of h3 ~ h8. Clearly, h3 or h4 is better than h5~h8 in terms of network distance and bandwidth. However, the current implementation chooses one of h3~h8 with equal probability. This limitation is more obvious when considering Hadoop clusters deployed on VMs or containers. In this case, only the VMs or containers running on the same physical host are considered rack-local, and actual rack-local hosts are chosen with the same probability as distant hosts. The root cause of this limitation is that RMContainerAllocator.java performs exact matching on the rack ID to find a rack-local host.
[jira] [Commented] (YARN-3798) ZKRMStateStore shouldn't create new session without occurrence of SESSIONEXPIRED
[ https://issues.apache.org/jira/browse/YARN-3798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602546#comment-14602546 ] zhihai xu commented on YARN-3798: - [~ozawa], thanks for the information. For SessionMovedException, most likely we can work around it by increasing the session timeout. For example, if we increase the session timeout from 10 seconds to 30 seconds, the timeout for a connection will be increased from 3.3 seconds to 10 seconds, which is calculated by {{connectTimeout = negotiatedSessionTimeout / hostProvider.size();}}. The above SessionMovedException can't happen, because the Leader processed the request from the client after 5 seconds, which is less than the 10-second timeout. One question: for SessionExpiredException, we will close and open a new session without fencing. Why won't the issue Bibin reported come up for SessionExpiredException? ZKRMStateStore shouldn't create new session without occurrence of SESSIONEXPIRED --- Key: YARN-3798 URL: https://issues.apache.org/jira/browse/YARN-3798 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Environment: Suse 11 Sp3 Reporter: Bibin A Chundatt Assignee: Varun Saxena Priority: Blocker Attachments: RM.log, YARN-3798-2.7.002.patch, YARN-3798-branch-2.7.002.patch, YARN-3798-branch-2.7.patch The RM goes down with a NoNode exception during creation of the znode for an app attempt. *Please find the exception logs*
{code}
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session connected
2015-06-09 10:09:44,732 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZKRMStateStore Session restored
2015-06-09 10:09:44,886 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1405)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:1310)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:926)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:923)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1101)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1122)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:923)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:937)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:970)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:671)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:275)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppAttemptTransition.transition(RMStateStore.java:260)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:837)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:900)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:895)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2015-06-09 10:09:44,887 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
2015-06-09 10:09:44,887 ERROR
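To make the timeout arithmetic in zhihai xu's comment concrete, here is a worked example assuming a 3-host ZooKeeper ensemble (which matches the quoted 10s -> ~3.3s figure):
{code}
// Worked example of connectTimeout = negotiatedSessionTimeout / hostProvider.size(),
// assuming a 3-host ZooKeeper ensemble.
public class ConnectTimeoutExample {
  public static void main(String[] args) {
    int hosts = 3;
    int[] sessionTimeoutsMs = {10_000, 30_000};
    for (int sessionTimeoutMs : sessionTimeoutsMs) {
      int connectTimeoutMs = sessionTimeoutMs / hosts;
      System.out.println(sessionTimeoutMs + " ms session -> "
          + connectTimeoutMs + " ms per-host connect timeout");
    }
    // 10000 ms -> 3333 ms; 30000 ms -> 10000 ms, so a Leader that takes
    // 5 seconds to process the request no longer exceeds the timeout.
  }
}
{code}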
[jira] [Commented] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603086#comment-14603086 ] Hudson commented on YARN-3850: -- FAILURE: Integrated in Hadoop-trunk-Commit #8072 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8072/]) YARN-3850. NM fails to read files from full disks which can lead to container logs being lost and other issues. Contributed by Varun Saxena (jlowe: rev 40b256949ad6f6e0dbdd248f2d257b05899f4332)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/TestLogAggregationService.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/logaggregation/AppLogAggregatorImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/webapp/TestContainerLogsPage.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/RecoveredContainerLaunch.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/webapp/ContainerLogsUtils.java
NM fails to read files from full disks which can lead to container logs being lost and other issues --- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation, nodemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3850.01.patch, YARN-3850.02.patch *Container logs* can be lost if a disk has become full (~90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via a call to {{LocalDirsHandlerService#getLogDirs}}, which returns nothing when disks are full. So none of the container logs are aggregated and uploaded. But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory, which contains the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither aggregated logs for the app nor the individual container logs for the app. In addition to this, there are two more issues: # {{ContainerLogsUtil#getContainerLogDirs}} does not consider full disks, so the NM will fail to serve up logs from full disks via its web interfaces. # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full disks, so it is possible that on container recovery the PID file is not found. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603068#comment-14603068 ] Jason Lowe commented on YARN-3850: -- +1 lgtm. Committing this. NM fails to read files from full disks which can lead to container logs being lost and other issues --- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation, nodemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Attachments: YARN-3850.01.patch, YARN-3850.02.patch *Container logs* can be lost if a disk has become full (~90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via a call to {{LocalDirsHandlerService#getLogDirs}}, which returns nothing when disks are full. So none of the container logs are aggregated and uploaded. But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory, which contains the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither aggregated logs for the app nor the individual container logs for the app. In addition to this, there are two more issues: # {{ContainerLogsUtil#getContainerLogDirs}} does not consider full disks, so the NM will fail to serve up logs from full disks via its web interfaces. # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full disks, so it is possible that on container recovery the PID file is not found. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603064#comment-14603064 ] Hadoop QA commented on YARN-3644: - \\ \\
| (x) *{color:red}-1 overall{color}* | \\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 18m 30s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 44s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 4m 18s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests | 6m 16s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 53m 37s | | \\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12742125/YARN-3644.003.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 8ef07f7 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8357/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8357/console |
This message was automatically generated. Node manager shuts down if unable to connect with RM Key: YARN-3644 URL: https://issues.apache.org/jira/browse/YARN-3644 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Srikanth Sundarrajan Assignee: Raju Bairishetti Attachments: YARN-3644.001.patch, YARN-3644.001.patch, YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch When NM is unable to connect to RM, NM shuts itself down.
{code}
} catch (ConnectException e) {
  //catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
{code}
In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring the NMs back up. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
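One possible direction, sketched under assumptions (this is not the actual YARN-3644 patch; the interface and method names below are invented for illustration): keep retrying registration with a bounded backoff instead of dispatching SHUTDOWN.
{code}
import java.net.ConnectException;

// Hypothetical sketch only -- not the actual YARN-3644 patch. The idea under
// discussion: retry registration with backoff instead of shutting the NM down
// on ConnectException.
class RegistrationRetrySketch {
  interface ResourceTrackerCall { void run() throws ConnectException; }

  static void registerWithRetry(ResourceTrackerCall register, long maxBackoffMs)
      throws InterruptedException {
    long backoffMs = 1000;
    while (true) {
      try {
        register.run();
        return;                       // registered successfully
      } catch (ConnectException e) {
        // RM unreachable (e.g. down for maintenance): wait and retry rather
        // than dispatching NodeManagerEventType.SHUTDOWN.
        Thread.sleep(backoffMs);
        backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
      }
    }
  }
}
{code}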
[jira] [Commented] (YARN-3409) Add constraint node labels
[ https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602942#comment-14602942 ] Lei Guo commented on YARN-3409: --- [~xinxianyin], topology-related information could be another type of server attribute; if we look at YARN-3856, the topology could be more complicated than a rack. Node labels may not be a great option when we are facing an environment with thousands of nodes. And for YARN-1042, the complexity is more in the relationship mapping among containers, and in how YARN knows the way the AM uses containers, especially when we talk about affinity. Node labels may not help in that area. Add constraint node labels -- Key: YARN-3409 URL: https://issues.apache.org/jira/browse/YARN-3409 Project: Hadoop YARN Issue Type: Sub-task Components: api, capacityscheduler, client Reporter: Wangda Tan Assignee: Wangda Tan Specifying only one label for each node (in other words, partitioning a cluster) is a way to determine how the resources of a special set of nodes can be shared by a group of entities (like teams, departments, etc.). Partitions of a cluster have the following characteristics: - The cluster is divided into several disjoint sub-clusters. - ACLs/priority can apply to a partition (e.g., only the market team has priority to use the partition). - Percentages of capacity can apply to a partition (the market team has 40% minimum capacity and the dev team has 60% minimum capacity of the partition). Constraints are orthogonal to partitions; they describe attributes of a node's hardware/software just for affinity. Some examples of constraints: - glibc version - JDK version - Type of CPU (x86_64/i686) - Type of OS (windows, linux, etc.) With this, an application can ask for a resource that has (glibc.version = 2.20, JDK.version = 8u20, x86_64). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
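A tiny sketch of how such a constraint expression might be evaluated against node attributes (purely illustrative; YARN-3409 does not define this API, and the matching here is a bare equality check):
{code}
import java.util.Map;

// Purely illustrative: checking a constraint such as
// (glibc.version = 2.20, JDK.version = 8u20) against a node's attributes.
// YARN-3409 does not define this API.
class ConstraintMatchSketch {
  static boolean matches(Map<String, String> nodeAttrs,
      Map<String, String> required) {
    // Every required attribute must be present with the exact value.
    return required.entrySet().stream()
        .allMatch(e -> e.getValue().equals(nodeAttrs.get(e.getKey())));
  }

  public static void main(String[] args) {
    Map<String, String> node = Map.of(
        "glibc.version", "2.20", "JDK.version", "8u20", "arch", "x86_64");
    System.out.println(matches(node,
        Map.of("glibc.version", "2.20", "arch", "x86_64"))); // true
  }
}
{code}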
[jira] [Updated] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3644: --- Attachment: YARN-3644.003.patch Fixed the test case to match the newly added changes in trunk: overrode the unRegisterNodeManager(request) method in the MyResourceTracker8 class. Node manager shuts down if unable to connect with RM Key: YARN-3644 URL: https://issues.apache.org/jira/browse/YARN-3644 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Srikanth Sundarrajan Assignee: Raju Bairishetti Attachments: YARN-3644.001.patch, YARN-3644.001.patch, YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch When NM is unable to connect to RM, NM shuts itself down.
{code}
} catch (ConnectException e) {
  //catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
{code}
In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring the NMs back up. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603005#comment-14603005 ] Sunil G commented on YARN-3849: --- Thank you [~leftnoteasy] and [~ka...@cloudera.com] [~kasha], we have tested this only in CS. And the issue looks to be in DominantResourceCalculator. I will analyze whether this will happen in Fair. [~leftnoteasy], I have understood your point. I can explain the scenario based on a few key code snippets. Please feel free to point out any issues in my analysis. CSQueueUtils#updateUsedCapacity has the below code to calculate absoluteUsedCapacity:
{code}
absoluteUsedCapacity = Resources.divide(rc, totalPartitionResource, usedResource, totalPartitionResource);
{code}
This results in a call to DominantResourceCalculator:
{code}
public float divide(Resource clusterResource, Resource numerator, Resource denominator) {
  return getResourceAsValue(clusterResource, numerator, true)
      / getResourceAsValue(clusterResource, denominator, true);
}
{code}
In our cluster, the resource allocation is as follows: usedResource = 10GB, 95 cores; totalPartitionResource = 100GB, 100 cores. Since we use dominance, absoluteUsedCapacity will come close to 1 even though only 10% of memory is used. In ProportionalCapacityPreemptionPolicy, we use it like below:
{code}
float absUsed = qc.getAbsoluteUsedCapacity(partitionToLookAt);
Resource current = Resources.multiply(partitionResource, absUsed);
{code}
So *current - guaranteed* will give us toBePreempted, which will be close to 50GB, 45 cores. Actually, the memory here should have been 5GB. Now in our cluster, each container is 1GB, 10 cores. Hence the *cores* will reach 0 after 5 container kills, but toBePreempted will still have memory left. And as mentioned in the above comment, preemption will continue to kill other containers too. Too much preemption activity causing continuous killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used. 1. An app is submitted in QueueA, which consumes the full cluster capacity 2. After submitting an app in QueueB, there is some demand, invoking preemption in QueueA 3. Instead of killing the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are getting killed in QueueA 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is some updated demand from the app in QueueA, which lost its containers earlier, and preemption is kicked off in QueueB now. The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the apps are completing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
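A worked sketch of the dominant-share arithmetic above, using the cluster numbers from the comment (illustrative only, not the Hadoop implementation):
{code}
// Worked example of the dominant-resource arithmetic described in the comment
// (illustrative, not the Hadoop implementation).
public class DominantShareExample {
  public static void main(String[] args) {
    double usedMemGB = 10, usedCores = 95;
    double totalMemGB = 100, totalCores = 100;

    // DominantResourceCalculator takes the max share across resources.
    double absUsed = Math.max(usedMemGB / totalMemGB, usedCores / totalCores);
    System.out.println("absoluteUsedCapacity = " + absUsed);   // 0.95

    // current = partitionResource * absUsed inflates BOTH dimensions:
    double currentMemGB = totalMemGB * absUsed;  // 95 GB, though only 10 GB used
    double currentCores = totalCores * absUsed;  // 95 cores

    // With a 0.5 guaranteed capacity, current - guaranteed over-reports the
    // memory to preempt, even though memory is barely used.
    System.out.println("toBePreempted ~= " + (currentMemGB - totalMemGB * 0.5)
        + " GB, " + (currentCores - totalCores * 0.5) + " cores");
  }
}
{code}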
[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603373#comment-14603373 ] Wangda Tan commented on YARN-3849: -- Makes sense; please try to run the test with and without the change. And if you have time, could you add a test for node partition preemption as well? Thanks, Wangda Too much preemption activity causing continuous killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used. 1. An app is submitted in QueueA, which consumes the full cluster capacity 2. After submitting an app in QueueB, there is some demand, invoking preemption in QueueA 3. Instead of killing the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are getting killed in QueueA 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is some updated demand from the app in QueueA, which lost its containers earlier, and preemption is kicked off in QueueB now. The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the apps are completing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603389#comment-14603389 ] Sunil G commented on YARN-3849: --- Yes [~leftnoteasy] and [~rohithsharma]. Thank you for the updates. It seems we cannot give CPU to the tests as of now. We can update that by changing buildPolicy. Meanwhile, once this is handled, I will add a case for node partitions too. Too much preemption activity causing continuous killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used. 1. An app is submitted in QueueA, which consumes the full cluster capacity 2. After submitting an app in QueueB, there is some demand, invoking preemption in QueueA 3. Instead of killing the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are getting killed in QueueA 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is some updated demand from the app in QueueA, which lost its containers earlier, and preemption is kicked off in QueueB now. The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the apps are completing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603349#comment-14603349 ] Wangda Tan commented on YARN-3849: -- I understand now; this is a bad issue when DRF is enabled. Thanks for the explanation from [~sunilg] and [~rohithsharma]. Let me take a look at how to solve this issue. Too much preemption activity causing continuous killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used. 1. An app is submitted in QueueA, which consumes the full cluster capacity 2. After submitting an app in QueueB, there is some demand, invoking preemption in QueueA 3. Instead of killing the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are getting killed in QueueA 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is some updated demand from the app in QueueA, which lost its containers earlier, and preemption is kicked off in QueueB now. The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the apps are completing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603381#comment-14603381 ] Wangda Tan commented on YARN-3849: -- Good suggestion [~rohithsharma], but the more urgent issue we need to solve now is that currently we cannot specify CPU in the tests. I think we can file a separate ticket for the parameterized test class. Too much preemption activity causing continuous killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used. 1. An app is submitted in QueueA, which consumes the full cluster capacity 2. After submitting an app in QueueB, there is some demand, invoking preemption in QueueA 3. Instead of killing the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are getting killed in QueueA 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is some updated demand from the app in QueueA, which lost its containers earlier, and preemption is kicked off in QueueB now. The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the apps are completing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603382#comment-14603382 ] Rohith Sharma K S commented on YARN-3849: - I mean for TestProportionalCapacityPreemptionPolicy. Too much preemption activity causing continuous killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used. 1. An app is submitted in QueueA, which consumes the full cluster capacity 2. After submitting an app in QueueB, there is some demand, invoking preemption in QueueA 3. Instead of killing the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are getting killed in QueueA 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is some updated demand from the app in QueueA, which lost its containers earlier, and preemption is kicked off in QueueB now. The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the apps are completing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603308#comment-14603308 ] Rohith Sharma K S commented on YARN-3849: - Below is the log trace for the issue. In our cluster there are 3 NodeManagers, each with resource {{memory:327680, vCores:35}}. The total cluster resource is {{clusterResource: memory:983040, vCores:105}}, with the CapacityScheduler configured with queues named *default* and *QueueA*. # Application app-1 is submitted to queue default and starts running with 10 containers, each with {{resource: memory:1024, vCores:10}}, so the total used is {{usedResources=memory:10240, vCores:91}} {noformat} default user=spark used=memory:10240, vCores:91 numContainers=10 headroom = memory:1024, vCores:10 user-resources=memory:10240, vCores:91 Re-sorting assigned queue: root.default stats: default: capacity=0.5, absoluteCapacity=0.5, usedResources=memory:10240, vCores:91, usedCapacity=1.733, absoluteUsedCapacity=0.867, numApps=1, numContainers=10 {noformat} *NOTE: Resource allocation is CPU DOMINANT* After the 10 containers are running, the available NodeManager resources are {noformat} linux-174, available: memory:323584, vCores:4 linux-175, available: memory:324608, vCores:5 linux-223, available: memory:324608, vCores:5 {noformat} # Application app-2 is submitted to QueueA. Its ApplicationMaster container starts running, and the NodeManager's availability becomes {{available: memory:322560, vCores:3}} {noformat} Assigned container container_1435072598099_0002_01_01 of capacity memory:1024, vCores:1 on host linux-174:26009, which has 5 containers, memory:5120, vCores:32 used and memory:322560, vCores:3 available after allocation | SchedulerNode.java:154 linux-174, available: memory:322560, vCores:3 {noformat} # The preemption policy does the calculation below {noformat} 2015-06-23 23:20:51,127 NAME: QueueA CUR: memory:0, vCores:0 PEN: memory:0, vCores:0 GAR: memory:491520, vCores:52 NORM: NaN IDEAL_ASSIGNED: memory:0, vCores:0 IDEAL_PREEMPT: memory:0, vCores:0 ACTUAL_PREEMPT: memory:0, vCores:0 UNTOUCHABLE: memory:0, vCores:0 PREEMPTABLE: memory:0, vCores:0 2015-06-23 23:20:51,128 NAME: default CUR: memory:851968, vCores:91 PEN: memory:0, vCores:0 GAR: memory:491520, vCores:52 NORM: 1.0 IDEAL_ASSIGNED: memory:851968, vCores:91 IDEAL_PREEMPT: memory:0, vCores:0 ACTUAL_PREEMPT: memory:0, vCores:0 UNTOUCHABLE: memory:0, vCores:0 PREEMPTABLE: memory:360448, vCores:39 {noformat} In the above log, observe that for the queue default *CUR is memory:851968, vCores:91*, while actually *usedResources=memory:10240, vCores:91*: only the CPU matches, not the MEMORY. CUR is calculated with the formula below: #* CUR = {{clusterResource: memory:983040, vCores:105}} * {{absoluteUsedCapacity(0.867)}} = {{memory:851968, vCores:91}} #* GAR = {{clusterResource: memory:983040, vCores:105}} * {{absoluteCapacity(0.5)}} = {{memory:491520, vCores:52}} #* PREEMPTABLE = CUR - GAR = {{memory:360448, vCores:39}} # App-2 requests containers with {{resource: memory:1024, vCores:10}}.
So the preemption cycle computes how much memory toBePreempt: {noformat} 2015-06-23 23:21:03,131 | DEBUG | SchedulingMonitor (ProportionalCapacityPreemptionPolicy) | 1435072863131: NAME: default CUR: memory:851968, vCores:91 PEN: memory:0, vCores:0 GAR: memory:491520, vCores:52 NORM: NaN IDEAL_ASSIGNED: memory:491520, vCores:52 IDEAL_PREEMPT: memory:97043, vCores:10 ACTUAL_PREEMPT: memory:0, vCores:0 UNTOUCHABLE: memory:0, vCores:0 PREEMPTABLE: memory:360448, vCores:39 {noformat} Observe *IDEAL_PREEMPT: memory:97043, vCores:10*: app-2 in QueueA only needs 10 vCores to be preempted, yet 97043 of memory is also marked for preemption even though memory is sufficiently available. Below are the calculations that produce IDEAL_PREEMPT: #* totalPreemptionAllowed = clusterResource: memory:983040, vCores:105 * 0.1 = memory:98304, vCores:10.5 #* totPreemptionNeeded = CUR - IDEAL_ASSIGNED = CUR: memory:851968, vCores:91 #* scalingFactor = Resources.divide(drc, memory:491520, vCores:52, memory:98304, vCores:10.5, memory:851968, vCores:91); scalingFactor = 0.114285715 #* toBePreempted = CUR: memory:851968, vCores:91 * scalingFactor(0.1139045128455529) = memory:97368, vCores:10 {{resource-to-obtain = memory:97043, vCores:10}} *So the problem is in one of the steps below:* # As [~sunilg] said, usedResources=memory:10240, vCores:91, but the preemption policy wrongly calculates the current used capacity as {{memory:851968, vCores:91}}. This is mainly because the preemption policy uses the absolute-capacity fraction to derive current usage, which always gives a wrong result for one of the resources when the DominantResourceCalculator is used. I think the fraction should not be used, since it causes this problem under DRC (multi-dimensional resources); instead we should use usedResource from CSQueue. # Even bypassing
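To make the arithmetic above concrete, here is a minimal standalone sketch (not the actual ProportionalCapacityPreemptionPolicy code; the numbers are taken from the log excerpt above) showing how deriving CUR from the single dominant-share fraction inflates the memory dimension:
{code}
public class CurFromFractionDemo {
  public static void main(String[] args) {
    // Figures from the log excerpt above.
    long clusterMem = 983040, clusterCores = 105; // total cluster resource
    long usedMem = 10240, usedCores = 91;         // what queue "default" really uses

    // DominantResourceCalculator-style used capacity: the max per-resource share.
    double absUsedCap = Math.max((double) usedMem / clusterMem,       // ~0.0104
                                 (double) usedCores / clusterCores);  // ~0.867 (dominant)

    // CUR = clusterResource * absoluteUsedCapacity, applied to BOTH dimensions.
    System.out.printf("CUR memory: %.0f (actual used: %d)%n",
        clusterMem * absUsedCap, usedMem);     // ~851968, though only 10240 is used
    System.out.printf("CUR vCores: %.0f (actual used: %d)%n",
        clusterCores * absUsedCap, usedCores); // ~91, which happens to match
  }
}
{code}
This reproduces the log: CUR's vCores agree with actual usage, while CUR's memory is inflated to ~851968, which is what makes the memory component of IDEAL_PREEMPT nonzero.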
[jira] [Updated] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2005: Attachment: YARN-2005.002.patch Addressed the test failure. An unmanaged AM also executes the AMLaunchedTransition, which was causing the allocate call that removes the AM blacklist. Changed it so it does not execute for unmanaged AMs. Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603357#comment-14603357 ] Wangda Tan commented on YARN-3849: -- I think the correct fix should be: instead of using absUsed to compute current, we should use getQueueResourceUsage().getUsed(...) to get the current usage. And add some tests; that should be enough. {code} QueueCapacities qc = curQueue.getQueueCapacities(); float absUsed = qc.getAbsoluteUsedCapacity(partitionToLookAt); float absCap = qc.getAbsoluteCapacity(partitionToLookAt); float absMaxCap = qc.getAbsoluteMaximumCapacity(partitionToLookAt); boolean preemptionDisabled = curQueue.getPreemptionDisabled(); Resource current = Resources.multiply(partitionResource, absUsed); Resource guaranteed = Resources.multiply(partitionResource, absCap); Resource maxCapacity = Resources.multiply(partitionResource, absMaxCap); {code} [~sunilg], do you want to take a shot at this? Too much of preemption activity causing continuos killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used, each given a capacity of 0.5, and the Dominant Resource policy is used. 1. An app is submitted in QueueA and consumes the full cluster capacity. 2. After an app is submitted in QueueB, its demand invokes preemption in QueueA. 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM get killed in QueueA. 4. Now the app in QueueB tries to take over the cluster with the freed space, but updated demand from the app in QueueA, which lost its containers earlier, kicks preemption in QueueB. Steps 3 and 4 keep repeating in a loop, so none of the apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
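As a rough sketch of the change proposed above (this assumes the {{CSQueue#getQueueResourceUsage()}} accessor and is not the committed patch), the fraction-based line would be replaced by a read of the queue's real multi-dimensional usage:
{code}
// Before (from the snippet above): both dimensions derived from one fraction.
// Resource current = Resources.multiply(partitionResource, absUsed);

// After (sketch): take memory and vCores from the usage the queue tracks,
// so the non-dominant resource is no longer inflated under DRF.
Resource current = Resources.clone(
    curQueue.getQueueResourceUsage().getUsed(partitionToLookAt));
{code}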
[jira] [Commented] (YARN-3409) Add constraint node labels
[ https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603279#comment-14603279 ] Wangda Tan commented on YARN-3409: -- Thanks for the comments, [~xinxianyin], [~grey]. Actually, you're not the first person who wants to make node labels a uniform solution to the problem of locality / affinity / blacklisting, etc.; [~curino] and [~vinodkv] have both suggested this. Personally I think this is a good direction; otherwise we will have separate implementations / APIs for all of them, which is not clean enough. We're also looking at possibilities to put them together in the design doc; hopefully it will not take too much time. Add constraint node labels -- Key: YARN-3409 URL: https://issues.apache.org/jira/browse/YARN-3409 Project: Hadoop YARN Issue Type: Sub-task Components: api, capacityscheduler, client Reporter: Wangda Tan Assignee: Wangda Tan Specifying only one label for each node (in other words, partitioning a cluster) is a way to determine how the resources of a special set of nodes can be shared by a group of entities (like teams, departments, etc.). Partitions of a cluster have the following characteristics: - The cluster is divided into several disjoint sub-clusters. - ACLs/priority can apply to a partition (only the market team has priority to use the partition). - Percentages of capacity can apply to a partition (the market team has a 40% minimum capacity and the dev team has a 60% minimum capacity of the partition). Constraints are orthogonal to partitions; they describe attributes of a node's hardware/software, just for affinity. Some examples of constraints: - glibc version - JDK version - Type of CPU (x86_64/i686) - Type of OS (Windows, Linux, etc.) With this, an application can ask for a resource that has (glibc.version = 2.20 JDK.version = 8u20 x86_64). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603362#comment-14603362 ] Sunil G commented on YARN-3849: --- Thank you [~leftnoteasy] for the pointer. Yes, it looks to me like the root cause of the issue is the use of the absoluteCapacity fraction in the proportional preemption policy. We could try directly using the real usage there, as you mentioned. I will add some tests and post a patch. :) Too much of preemption activity causing continuos killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used, each given a capacity of 0.5, and the Dominant Resource policy is used. 1. An app is submitted in QueueA and consumes the full cluster capacity. 2. After an app is submitted in QueueB, its demand invokes preemption in QueueA. 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM get killed in QueueA. 4. Now the app in QueueB tries to take over the cluster with the freed space, but updated demand from the app in QueueA, which lost its containers earlier, kicks preemption in QueueB. Steps 3 and 4 keep repeating in a loop, so none of the apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603363#comment-14603363 ] Raju Bairishetti commented on YARN-3644: It seems the checkstyle error was not introduced by this patch; the file already had more than 2,000 lines :). *Checkstyle error:* YarnConfiguration.java:1: File length is 2,036 lines (max allowed is 2,000). Node manager shuts down if unable to connect with RM Key: YARN-3644 URL: https://issues.apache.org/jira/browse/YARN-3644 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Srikanth Sundarrajan Assignee: Raju Bairishetti Attachments: YARN-3644.001.patch, YARN-3644.001.patch, YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch When the NM is unable to connect to the RM, the NM shuts itself down. {code} } catch (ConnectException e) { //catch and throw the exception if tried MAX wait time to connect RM dispatcher.getEventHandler().handle( new NodeManagerEvent(NodeManagerEventType.SHUTDOWN)); throw new YarnRuntimeException(e); {code} In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring them back up. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
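For illustration, here is a minimal sketch of the kind of bounded retry policy these settings feed (the property names are from the description; the exact wiring inside RMProxy may differ):
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

public class RmConnectRetrySketch {
  public static RetryPolicy buildPolicy(long maxWaitMs, long retryIntervalMs) {
    // Bounded retry: keep retrying the RM connection for up to maxWaitMs
    // (yarn.resourcemanager.connect.wait-ms), sleeping retryIntervalMs
    // between attempts. A wait of -1 selects retry-forever instead, which
    // is the side effect on all YarnClients noted in the description above.
    return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        maxWaitMs, retryIntervalMs, TimeUnit.MILLISECONDS);
  }
}
{code}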
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603375#comment-14603375 ] Rohith Sharma K S commented on YARN-3849: - For the test, how about using a parameterized test class that runs with both the default RC and the dominant RC? Too much of preemption activity causing continuos killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used, each given a capacity of 0.5, and the Dominant Resource policy is used. 1. An app is submitted in QueueA and consumes the full cluster capacity. 2. After an app is submitted in QueueB, its demand invokes preemption in QueueA. 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM get killed in QueueA. 4. Now the app in QueueB tries to take over the cluster with the freed space, but updated demand from the app in QueueA, which lost its containers earlier, kicks preemption in QueueB. Steps 3 and 4 keep repeating in a loop, so none of the apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603396#comment-14603396 ] Wangda Tan commented on YARN-3849: -- Makes sense, [~sunilg]. Too much of preemption activity causing continuos killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used, each given a capacity of 0.5, and the Dominant Resource policy is used. 1. An app is submitted in QueueA and consumes the full cluster capacity. 2. After an app is submitted in QueueB, its demand invokes preemption in QueueA. 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM get killed in QueueA. 4. Now the app in QueueB tries to take over the cluster with the freed space, but updated demand from the app in QueueA, which lost its containers earlier, kicks preemption in QueueB. Steps 3 and 4 keep repeating in a loop, so none of the apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603303#comment-14603303 ] zhihai xu commented on YARN-3857: - Hi [~mujunchao], thanks for reporting and working on this issue. It is a nice catch, and I see why this is a critical issue: for a non-secure cluster, the more jobs complete, the more entries with null values are left in {{ClientToAMTokenSecretManagerInRM#masterKeys}}. Your patch makes sense to me: since we only call {{unRegisterApplication}} in secure mode, we should also call {{registerApplication}} only in secure mode to match {{unRegisterApplication}}. Could you add a test case to your patch? You can do something similar to {{TestRMAppAttemptTransitions#testGetClientToken}} for non-secure mode. Memory leak in ResourceManager with SIMPLE mode --- Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Priority: Critical Attachments: hadoop-yarn-server-resourcemanager.patch We register the ClientTokenMasterKey to avoid a client holding an invalid ClientToken after the RM restarts. In SIMPLE mode we register an (ApplicationAttemptId, null) pair, but we never remove it from the HashMap, as unregister only runs in security mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
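A minimal sketch of the guard being discussed (the surrounding method and variable names are assumed for illustration, not copied from the attached patch):
{code}
import org.apache.hadoop.security.UserGroupInformation;

// Only track the client-to-AM master key when security is enabled, mirroring
// unRegisterApplication; in SIMPLE mode no entry is ever added, so the map
// cannot accumulate null values for completed applications.
if (UserGroupInformation.isSecurityEnabled()) {
  rmContext.getClientToAMTokenSecretManager()
      .registerApplication(appAttemptId, clientTokenMasterKey);
}
{code}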
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603309#comment-14603309 ] zhihai xu commented on YARN-2871: - Hi [~xgong], the latest patch passed the Jenkins test. Could you review it? Thanks. TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Attachments: YARN-2871.000.patch, YARN-2871.001.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
[ https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3695: --- Attachment: YARN-3695.patch ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception. -- Key: YARN-3695 URL: https://issues.apache.org/jira/browse/YARN-3695 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Raju Bairishetti Attachments: YARN-3695.patch YARN-3646 fixed the retry-forever policy in RMProxy so that it applies only to a limited set of exceptions rather than all exceptions. Here, we may need the same fix for ServerProxy (NMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3858) Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart
[ https://issues.apache.org/jira/browse/YARN-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603614#comment-14603614 ] Alok Lal commented on YARN-3858: As can be seen from the log, the distributed shell app did not finish even though all containers had finished successfully. Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart - Key: YARN-3858 URL: https://issues.apache.org/jira/browse/YARN-3858 Project: Hadoop YARN Issue Type: Bug Environment: secure CentOS 6 Reporter: Alok Lal Attachments: yarn-yarn-resourcemanager-c7-jun24-10.log Attached is the resource manager log. This was on a 10-node cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3858) Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart
[ https://issues.apache.org/jira/browse/YARN-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alok Lal updated YARN-3858: --- Attachment: yarn-yarn-resourcemanager-c7-jun24-10.log Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart - Key: YARN-3858 URL: https://issues.apache.org/jira/browse/YARN-3858 Project: Hadoop YARN Issue Type: Bug Environment: secure CentOS 6 Reporter: Alok Lal Attachments: yarn-yarn-resourcemanager-c7-jun24-10.log Attached is the resource manager log. This was on a 10-node cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3858) Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart
[ https://issues.apache.org/jira/browse/YARN-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-3858: Assignee: Varun Vasudev Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart - Key: YARN-3858 URL: https://issues.apache.org/jira/browse/YARN-3858 Project: Hadoop YARN Issue Type: Bug Environment: secure CentOS 6 Reporter: Alok Lal Assignee: Varun Vasudev Attachments: yarn-yarn-resourcemanager-c7-jun24-10.log Attached is the resource manager log. This was on a 10-node cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603541#comment-14603541 ] Hadoop QA commented on YARN-2005: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 20s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 3 new or modified test files. | | {color:green}+1{color} | javac | 7m 39s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 38s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 30s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). | | {color:green}+1{color} | whitespace | 0m 5s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 3m 49s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | tools/hadoop tests | 0m 52s | Tests passed in hadoop-sls. | | {color:green}+1{color} | yarn tests | 0m 22s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 51m 4s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 95m 8s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12742183/YARN-2005.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 60b858b | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8358/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt | | hadoop-sls test log | https://builds.apache.org/job/PreCommit-YARN-Build/8358/artifact/patchprocess/testrun_hadoop-sls.txt | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8358/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8358/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8358/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8358/console | This message was automatically generated. Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1965) Interrupted exception when closing YarnClient
[ https://issues.apache.org/jira/browse/YARN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603103#comment-14603103 ] Mit Desai commented on YARN-1965: - Overall the patch looks good. A few minor nits: * There should be a space between () and { here: {{public static final ExecutorService getClientExecutor(){}} * In testStandAloneClient(), we need spaces near the brackets. Change {{}finally{}} to {{} finally {}} * In testConnectionIdleTimeouts(), we need a space near the brackets. Change {{}finally{}} to {{} finally {}} * testInterrupted needs to be indented. * In doErrorTest and testRTEDuringConnectionSetup, stopping the client before the server makes more sense; swap the stop calls in the finally block. * In testSocketFactoryException and testIpcConnectTimeout, {{client.stop()}} should be within the finally block. * Is there a need to move {{Client client = new Client(LongWritable.class, conf, spyFactory);}} in testRTEDuringConnectionSetup? Interrupted exception when closing YarnClient - Key: YARN-1965 URL: https://issues.apache.org/jira/browse/YARN-1965 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.3.0 Reporter: Oleg Zhurakousky Assignee: Kuhu Shukla Priority: Minor Labels: newbie Attachments: YARN-1965-v2.patch, YARN-1965.patch It's more of a nuisance than a bug, but nevertheless {code} 16:16:48,709 ERROR pool-1-thread-1 ipc.Client:195 - Interrupted while waiting for clientExecutor to stop java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2072) at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1468) at org.apache.hadoop.ipc.Client$ClientExecutorServiceFactory.unrefAndCleanup(Client.java:191) at org.apache.hadoop.ipc.Client.stop(Client.java:1235) at org.apache.hadoop.ipc.ClientCache.stopClient(ClientCache.java:100) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.close(ProtobufRpcEngine.java:251) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.close(ApplicationClientProtocolPBClientImpl.java:112) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:621) at org.apache.hadoop.io.retry.DefaultFailoverProxyProvider.close(DefaultFailoverProxyProvider.java:57) at org.apache.hadoop.io.retry.RetryInvocationHandler.close(RetryInvocationHandler.java:206) at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:626) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStop(YarnClientImpl.java:124) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) . . . {code} It happens sporadically when stopping YarnClient. Looking at the code in Client's 'unrefAndCleanup', it's not immediately obvious why and who throws the interrupt, but in any event it should not be logged as ERROR; probably a WARN with no stack trace. Also, for consistency and correctness, you may want to interrupt the current thread as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
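A generic sketch of the cleanup ordering these nits ask for (the actual test bodies differ; this only shows the shape being requested):
{code}
try {
  // ... exercise the RPC client/server under test ...
} finally {
  client.stop();  // stop the client first, as suggested above
  server.stop();  // then stop the server it was talking to
}
{code}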
[jira] [Commented] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
[ https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603555#comment-14603555 ] Hadoop QA commented on YARN-3695: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 20m 33s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 8m 32s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 51s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 14s | The applied patch generated 1 new checkstyle issues (total was 3, now 4). | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 40s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 47s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 58s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 6m 18s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 54m 11s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12742186/YARN-3695.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / aa07dea | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8359/artifact/patchprocess/diffcheckstylehadoop-yarn-common.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8359/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8359/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8359/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8359/console | This message was automatically generated. ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception. -- Key: YARN-3695 URL: https://issues.apache.org/jira/browse/YARN-3695 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Raju Bairishetti Attachments: YARN-3695.patch YARN-3646 fix the retry forever policy in RMProxy that it only applies on limited exceptions rather than all exceptions. Here, we may need the same fix for ServerProxy (NMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603603#comment-14603603 ] Eric Payne commented on YARN-2004: -- Thanks, [~sunilg], for this fix. - {{SchedulerApplicationAttempt.java}}: {code} if (!getApplicationPriority().equals( ((SchedulerApplicationAttempt) other).getApplicationPriority())) { return getApplicationPriority().compareTo( ((SchedulerApplicationAttempt) other).getApplicationPriority()); } {code} -- Can {{getApplicationPriority}} return null? I see that {{SchedulerApplicationAttempt}} initializes {{appPriority}} to null. - {{CapacityScheduler.java}}: {code} if (!a1.getApplicationPriority().equals(a2.getApplicationPriority())) { return a1.getApplicationPriority().compareTo( a2.getApplicationPriority()); } {code} -- Same question about {{getApplicationPriority}} returning null. -- Also, can {{updateApplicationPriority}} call {{authenticateApplicationPriority}}? Seems like duplicate code to me. Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 0006-YARN-2004.patch, 0007-YARN-2004.patch Based on the priority of the application, the Capacity Scheduler should be able to give preference to applications while scheduling. The Comparator<FiCaSchedulerApp> applicationComparator can be changed as below: 1. Check for application priority; if a priority is available, return the highest-priority job first. 2. Otherwise continue with the existing logic, such as App ID comparison and then timestamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
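For illustration, a null-safe variant of the comparison being questioned above (a sketch only; whether null is actually reachable depends on how {{appPriority}} gets initialized):
{code}
// Compare by priority only when both attempts have one; otherwise fall back
// to the existing application-id ordering, so a null priority cannot throw
// inside the comparator.
Priority p1 = a1.getApplicationPriority();
Priority p2 = a2.getApplicationPriority();
if (p1 != null && p2 != null && !p1.equals(p2)) {
  return p1.compareTo(p2);
}
return a1.getApplicationId().compareTo(a2.getApplicationId());
{code}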
[jira] [Created] (YARN-3858) Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart
Alok Lal created YARN-3858: -- Summary: Distributed shell app master becomes unresponsive sometimes even when there hasn't been an AM restart Key: YARN-3858 URL: https://issues.apache.org/jira/browse/YARN-3858 Project: Hadoop YARN Issue Type: Bug Environment: secure CentOS 6 Reporter: Alok Lal Attached is the resource manager log. This was on a 10-node cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2003) Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side]
[ https://issues.apache.org/jira/browse/YARN-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603425#comment-14603425 ] Wangda Tan commented on YARN-2003: -- Hi [~sunilg], thanks for updating. Some comments: 1) API for YarnScheduler: - For updateApplicationPriority, we don't need to pass user/queueName, since the scheduler should know them; authenticate is different, since the scheduler may not have the application information at that time. - It may be better to throw YarnException instead of IOException. 2) RMAppManager: - Is this check necessary: {{rmContext.getScheduler() != null}}? If this is for test cases, I think it's better to fix the tests. Support to process Job priority from Submission Context in AppAttemptAddedSchedulerEvent [RM side] -- Key: YARN-2003 URL: https://issues.apache.org/jira/browse/YARN-2003 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Sunil G Labels: BB2015-05-TBR Attachments: 0001-YARN-2003.patch, 00010-YARN-2003.patch, 0002-YARN-2003.patch, 0003-YARN-2003.patch, 0004-YARN-2003.patch, 0005-YARN-2003.patch, 0006-YARN-2003.patch, 0007-YARN-2003.patch, 0008-YARN-2003.patch, 0009-YARN-2003.patch, 0011-YARN-2003.patch, 0012-YARN-2003.patch AppAttemptAddedSchedulerEvent should be able to receive the Job Priority from the Submission Context and store it. Later this can be used by the Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603484#comment-14603484 ] Masatake Iwasaki commented on YARN-2871: Thanks for the investigation, [~zxu]. I found that [org.mockito.Mockito.timeout|http://docs.mockito.googlecode.com/hg/1.8.5/org/mockito/Mockito.html#timeout(int)] is used in some other tests using Mockito. It could be used here too. TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Attachments: YARN-2871.000.patch, YARN-2871.001.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
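For reference, a sketch of how the racy verification could use the timeout mode (the 5-second bound is illustrative):
{code}
import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.timeout;
import static org.mockito.Mockito.verify;

// Wait up to 5 seconds for the async dispatcher to deliver the third event,
// instead of verifying immediately and racing against the dispatcher thread.
verify(rmAppManager, timeout(5000).times(3)).logApplicationSummary(
    isA(org.apache.hadoop.yarn.api.records.ApplicationId.class));
{code}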
[jira] [Commented] (YARN-2004) Priority scheduling support in Capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603516#comment-14603516 ] Wangda Tan commented on YARN-2004: -- Thanks for updating, [~sunilg]. A quick comment before posting others: I think most of the code to check/update application priority can be reused by other schedulers. [~kasha], could you take a quick look at this patch to see if it is also needed for the Fair Scheduler? Priority scheduling support in Capacity scheduler - Key: YARN-2004 URL: https://issues.apache.org/jira/browse/YARN-2004 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-2004.patch, 0002-YARN-2004.patch, 0003-YARN-2004.patch, 0004-YARN-2004.patch, 0005-YARN-2004.patch, 0006-YARN-2004.patch, 0007-YARN-2004.patch Based on the priority of the application, the Capacity Scheduler should be able to give preference to applications while scheduling. The Comparator<FiCaSchedulerApp> applicationComparator can be changed as below: 1. Check for application priority; if a priority is available, return the highest-priority job first. 2. Otherwise continue with the existing logic, such as App ID comparison and then timestamp comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2871: Attachment: YARN-2871.002.patch TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Attachments: YARN-2871.000.patch, YARN-2871.001.patch, YARN-2871.002.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603538#comment-14603538 ] zhihai xu commented on YARN-2871: - [~iwasakims], thanks for the suggestion, it should work. I uploaded a new patch YARN-2871.002.patch based on your suggestion. Please review it. TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Attachments: YARN-2871.000.patch, YARN-2871.001.patch, YARN-2871.002.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3045) [Event producers] Implement NM writing container lifecycle events to ATS
[ https://issues.apache.org/jira/browse/YARN-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602627#comment-14602627 ] Naganarasimha G R commented on YARN-3045: - Thanks for reviewing, [~djp] [~sjlee0], and sorry for the late response, as I was a little held up. Thanks for confirming the consolidation, [~djp]; I will try to get that done by the next patch. bq. if need separated event queue later to make sure container metrics boom I have already created an async dispatcher for timeline publishing; if required, we can create another dispatcher for container metrics only. Is this what you meant? bq. For corner case that NM publisher delay too long time (queue is busy) to publish event, it still get chance to fail (very low chance should be acceptable here). OK, will leave the lifecycle management of the app collector out of this JIRA; maybe we can handle them (including multiple attempts, as [~sangjin] specified) in another JIRA. bq. APPLICATION_CREATED_EVENT might be seeing the race condition Yes, there seems to be another race condition, but this time not between the source and the test but within the source itself. {quote} java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:276) {quote} I had seen this only once earlier but was not able to get the logs; now I can analyze this further. bq. I'm a bit puzzled by the hashCode override; is it necessary? My mistake; I think it is residual code from the initial version, which I may have added while trying out a multi-async dispatcher where events of one app need to go to one handler. It is not required any more; I will remove it. I will take care of [~sjlee0]'s other comments and will try to provide the patch at the earliest. [Event producers] Implement NM writing container lifecycle events to ATS Key: YARN-3045 URL: https://issues.apache.org/jira/browse/YARN-3045 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Sangjin Lee Assignee: Naganarasimha G R Attachments: YARN-3045-YARN-2928.002.patch, YARN-3045-YARN-2928.003.patch, YARN-3045-YARN-2928.004.patch, YARN-3045.20150420-1.patch Per design in YARN-2928, implement NM writing container lifecycle events and container system metrics to ATS. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
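For context, a rough sketch of the dedicated-dispatcher idea mentioned above (the event type and handler names here are hypothetical, not the ones in the patch):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.AsyncDispatcher;

// A dispatcher reserved for timeline publishing, so slow ATS writes queue up
// here instead of backing up the NM's main event queue; a second instance
// could serve container metrics only, as discussed above.
AsyncDispatcher timelineDispatcher = new AsyncDispatcher();
timelineDispatcher.register(TimelinePublishEventType.class,  // hypothetical enum
    event -> publishToTimelineService(event));               // hypothetical handler
timelineDispatcher.init(new Configuration());
timelineDispatcher.start();
{code}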
[jira] [Created] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
mujunchao created YARN-3857: --- Summary: Memory leak in ResourceManager with SIMPLE mode Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Priority: Critical We register the ClientTokenMasterKey to avoid a client holding an invalid ClientToken after the RM restarts. In SIMPLE mode we register an (ApplicationAttemptId, null) pair, but we never remove it from the HashMap, as unregister only runs in security mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mujunchao updated YARN-3857: Remaining Estimate: (was: 24h) Original Estimate: (was: 24h) Memory leak in ResourceManager with SIMPLE mode --- Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Priority: Critical We register the ClientTokenMasterKey to avoid a client holding an invalid ClientToken after the RM restarts. In SIMPLE mode we register an (ApplicationAttemptId, null) pair, but we never remove it from the HashMap, as unregister only runs in security mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3409) Add constraint node labels
[ https://issues.apache.org/jira/browse/YARN-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603869#comment-14603869 ] Xianyin Xin commented on YARN-3409: --- Thanks for the comments, [~grey]. IMO, topology may be hard to handle with node labels, as node labels describe the attributes of a node while topology is an attribute of the whole cluster. You remind me that YARN-1042 may not be as simple as I thought. Looking forward to your design doc, [~leftnoteasy]. Add constraint node labels -- Key: YARN-3409 URL: https://issues.apache.org/jira/browse/YARN-3409 Project: Hadoop YARN Issue Type: Sub-task Components: api, capacityscheduler, client Reporter: Wangda Tan Assignee: Wangda Tan Specifying only one label for each node (in other words, partitioning a cluster) is a way to determine how the resources of a special set of nodes can be shared by a group of entities (like teams, departments, etc.). Partitions of a cluster have the following characteristics: - The cluster is divided into several disjoint sub-clusters. - ACLs/priority can apply to a partition (only the market team has priority to use the partition). - Percentages of capacity can apply to a partition (the market team has a 40% minimum capacity and the dev team has a 60% minimum capacity of the partition). Constraints are orthogonal to partitions; they describe attributes of a node's hardware/software, just for affinity. Some examples of constraints: - glibc version - JDK version - Type of CPU (x86_64/i686) - Type of OS (Windows, Linux, etc.) With this, an application can ask for a resource that has (glibc.version = 2.20 JDK.version = 8u20 x86_64). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1449) Protocol changes in NM side to support change container resource
[ https://issues.apache.org/jira/browse/YARN-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603802#comment-14603802 ] MENG DING commented on YARN-1449: - The patch is way too big for review. I will split it into several JIRAs. Protocol changes in NM side to support change container resource Key: YARN-1449 URL: https://issues.apache.org/jira/browse/YARN-1449 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan (No longer used) Assignee: MENG DING Attachments: YARN-1449.1.patch, YARN-1449.2.patch, yarn-1449.1.patch, yarn-1449.3.patch, yarn-1449.4.patch, yarn-1449.5.patch As described in YARN-1197, we need to add the following API/implementation changes: 1) add a changeContainersResources method to ContainerManagementProtocol; 2) return the succeeded/failed increased/decreased containers in the response of changeContainersResources; 3) add a new decreased-containers field to NodeStatus, which can help the NM notify the RM of such changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603679#comment-14603679 ] Hadoop QA commented on YARN-2871: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 6m 43s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. | | {color:green}+1{color} | release audit | 0m 18s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 0m 46s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 1m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 51m 2s | Tests passed in hadoop-yarn-server-resourcemanager. | | | | 70m 10s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12742218/YARN-2871.002.patch | | Optional Tests | javac unit findbugs checkstyle | | git revision | trunk / aa07dea | | hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8360/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8360/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8360/console | This message was automatically generated. TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Attachments: YARN-2871.000.patch, YARN-2871.001.patch, YARN-2871.002.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
[ https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603710#comment-14603710 ] Jian He commented on YARN-3695: --- Looks good, +1. ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception. -- Key: YARN-3695 URL: https://issues.apache.org/jira/browse/YARN-3695 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Raju Bairishetti Attachments: YARN-3695.patch YARN-3646 fixed the retry-forever policy in RMProxy so that it applies only to a limited set of exceptions rather than all exceptions. Here, we may need the same fix for ServerProxy (NMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1449) Protocol changes in NM side to support change container resource
[ https://issues.apache.org/jira/browse/YARN-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] MENG DING updated YARN-1449: Attachment: YARN-1449.2.patch Attaching an updated patch for review, which includes: * All protocol changes (AM-RM, AM-NM, NM-RM) as described in the design doc (see YARN-1197). * ContainerManager logic * NodeStatusUpdater logic * NodeManager recovery logic * New and updated unit test cases The ContainersMonitor logic is covered in YARN-1643, and a patch will be posted for review early next week. Protocol changes in NM side to support change container resource Key: YARN-1449 URL: https://issues.apache.org/jira/browse/YARN-1449 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Wangda Tan (No longer used) Assignee: MENG DING Attachments: YARN-1449.1.patch, YARN-1449.2.patch, yarn-1449.1.patch, yarn-1449.3.patch, yarn-1449.4.patch, yarn-1449.5.patch As described in YARN-1197, we need to add the following API/implementation changes: 1) add a changeContainersResources method to ContainerManagementProtocol; 2) return the succeeded/failed increased/decreased containers in the response of changeContainersResources; 3) add a new decreased-containers field to NodeStatus, which can help the NM notify the RM of such changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
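To illustrate the shape of point 1 in the description, a hypothetical sketch of the protocol addition (the request/response class names are assumed for illustration; the final API may differ):
{code}
// Hypothetical AM-NM protocol addition per the description: one call that
// asks the NM to change the resources of several containers, with the
// succeeded and failed increases/decreases reported in the response.
public interface ContainerManagementProtocol {
  ChangeContainersResourcesResponse changeContainersResources(
      ChangeContainersResourcesRequest request) throws YarnException, IOException;
}
{code}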
[jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
[ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603767#comment-14603767 ] Anubhav Dhoot commented on YARN-2005: - The checkstyle error is unavoidable (preexisting). [~jlowe] [~sunilg], this is as per the discussion here and is ready for your review. [~jianhe] [~kasha], I would appreciate your review as well. Blacklisting support for scheduling AMs --- Key: YARN-2005 URL: https://issues.apache.org/jira/browse/YARN-2005 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Anubhav Dhoot Attachments: YARN-2005.001.patch, YARN-2005.002.patch It would be nice if the RM supported blacklisting a node for an AM launch after the same node fails a configurable number of AM attempts. This would be similar to the blacklisting support for scheduling task attempts in the MapReduce AM but for scheduling AM attempts on the RM side. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3705) forcemanual transitionToStandby in RM-HA automatic-failover mode should change elector state
[ https://issues.apache.org/jira/browse/YARN-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-3705: --- Attachment: YARN-3705.006.patch bq. If we call resetLeaderElection inside the rmadmin.transitionToStandby(), it will cause an infinite loop. You are right. I need to make sure that resetLeaderElection is not called when EmbeddedElectorService#becomeStandby calls transitionToStandby. Thanks for the good catch, [~xgong]. I attached 006. Though I verified with a patched jar that manually starting RM-HA does not cause the loop, it is difficult to test that in a unit test. forcemanual transitionToStandby in RM-HA automatic-failover mode should change elector state Key: YARN-3705 URL: https://issues.apache.org/jira/browse/YARN-3705 Project: Hadoop YARN Issue Type: Sub-task Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: YARN-3705.001.patch, YARN-3705.002.patch, YARN-3705.003.patch, YARN-3705.004.patch, YARN-3705.005.patch, YARN-3705.006.patch Executing {{rmadmin -transitionToStandby --forcemanual}} in automatic-failover.enabled mode makes the ResourceManager standby while keeping the state of the ActiveStandbyElector. It should make the elector quit and rejoin in order to enable other candidates to be promoted; otherwise, forcemanual transition should not be allowed in automatic-failover mode, in order to avoid confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3850) NM fails to read files from full disks which can lead to container logs being lost and other issues
[ https://issues.apache.org/jira/browse/YARN-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603728#comment-14603728 ] Varun Saxena commented on YARN-3850: Thanks for the review and commit, [~jlowe]. NM fails to read files from full disks which can lead to container logs being lost and other issues --- Key: YARN-3850 URL: https://issues.apache.org/jira/browse/YARN-3850 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation, nodemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Blocker Fix For: 2.7.1 Attachments: YARN-3850.01.patch, YARN-3850.02.patch *Container logs* can be lost if the disk has become full (~90% full). When an application finishes, we upload logs after aggregation by calling {{AppLogAggregatorImpl#uploadLogsForContainers}}. But this call in turn checks the eligible directories via a call to {{LocalDirsHandlerService#getLogDirs}}, which in the disk-full case would return nothing, so none of the container logs are aggregated and uploaded. But on application finish, we also call {{AppLogAggregatorImpl#doAppLogAggregationPostCleanUp()}}. This deletes the application directory, which contains the container logs, because it calls {{LocalDirsHandlerService#getLogDirsForCleanup}}, which returns the full disks as well. So we are left with neither aggregated logs for the app nor the individual container logs for the app. In addition to this, there are 2 more issues: # {{ContainerLogsUtil#getContainerLogDirs}} does not consider full disks, so the NM will fail to serve up logs from full disks through its web interfaces. # {{RecoveredContainerLaunch#locatePidFile}} also does not consider full disks, so it is possible that on container recovery the PID file is not found. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603727#comment-14603727 ] Masatake Iwasaki commented on YARN-2871: I'm +1 (non-binding) on this. TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Attachments: YARN-2871.000.patch, YARN-2871.001.patch, YARN-2871.002.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
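The failure mode shown in the {code} block above, a count-based Mockito verification losing a race with the RM's async dispatcher, is commonly de-flaked by letting the verification wait for the background thread instead of counting immediately. A sketch of that general technique (not necessarily what YARN-2871.002.patch does):
{code}
import static org.mockito.Mockito.isA;
import static org.mockito.Mockito.timeout;
import static org.mockito.Mockito.verify;

import org.apache.hadoop.yarn.api.records.ApplicationId;

// Inside the test, after the third application has been submitted: wait up
// to five seconds for the dispatcher to deliver the last event before
// asserting the invocation count.
verify(rMAppManager, timeout(5000).times(3))
    .logApplicationSummary(isA(ApplicationId.class));
{code}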
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603917#comment-14603917 ] Xuan Gong commented on YARN-2871: - +1 LGTM. Checking this in TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Attachments: YARN-2871.000.patch, YARN-2871.001.patch, YARN-2871.002.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603919#comment-14603919 ] Xuan Gong commented on YARN-2871: - Committed into trunk/branch-2. Thanks, zhihai. TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Fix For: 2.8.0 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, YARN-2871.002.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
[ https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raju Bairishetti updated YARN-3695: --- Attachment: YARN-3695.01.patch [~jianhe] Thanks for the review. Moved the Precondition checks before creating the RetryPolicy, so that we avoid creating the policy if the connection timeout values are invalid. ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception. -- Key: YARN-3695 URL: https://issues.apache.org/jira/browse/YARN-3695 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Raju Bairishetti Attachments: YARN-3695.01.patch, YARN-3695.patch YARN-3646 fixed the retry-forever policy in RMProxy so that it applies only to limited exceptions rather than all exceptions. Here, we may need the same fix for ServerProxy (NMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
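A sketch of the reordering described in the comment; the configuration key names and default values are placeholders, and the policy shown is just one of the standard {{RetryPolicies}} factories, not necessarily the one the patch uses:
{code}
import java.util.concurrent.TimeUnit;

import com.google.common.base.Preconditions;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Sketch of the ordering change: validate first, build the policy second.
// The key names and defaults below are placeholders, not real YARN properties.
public final class RetryPolicySketch {
  static RetryPolicy createRetryPolicy(Configuration conf) {
    long maxWaitTime = conf.getLong("sketch.max-wait-ms", 15 * 60 * 1000L);
    long retryInterval = conf.getLong("sketch.retry-interval-ms", 10 * 1000L);

    // Fail fast on invalid configuration before any RetryPolicy is created.
    Preconditions.checkArgument(maxWaitTime > 0,
        "Max wait time should be positive: %s", maxWaitTime);
    Preconditions.checkArgument(retryInterval > 0,
        "Retry interval should be positive: %s", retryInterval);

    return RetryPolicies.retryUpToMaximumTimeWithFixedSleep(
        maxWaitTime, retryInterval, TimeUnit.MILLISECONDS);
  }
}
{code}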
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603945#comment-14603945 ] Hudson commented on YARN-2871: -- FAILURE: Integrated in Hadoop-trunk-Commit #8076 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8076/]) YARN-2871. TestRMRestart#testRMRestartGetApplicationList sometime fails (xgong: rev fe6c1bd73aee188ed58df4d33bbc2d2fe0779a97) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Fix For: 2.8.0 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, YARN-2871.002.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2871) TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603957#comment-14603957 ] zhihai xu commented on YARN-2871: - Thanks [~iwasakims] for the review! Thanks [~xgong] for the review and for committing the patch! TestRMRestart#testRMRestartGetApplicationList sometime fails in trunk - Key: YARN-2871 URL: https://issues.apache.org/jira/browse/YARN-2871 Project: Hadoop YARN Issue Type: Test Reporter: Ted Yu Assignee: zhihai xu Priority: Minor Fix For: 2.8.0 Attachments: YARN-2871.000.patch, YARN-2871.001.patch, YARN-2871.002.patch From trunk build #746 (https://builds.apache.org/job/Hadoop-Yarn-trunk/746): {code} Failed tests: TestRMRestart.testRMRestartGetApplicationList:957 rMAppManager.logApplicationSummary( isA(org.apache.hadoop.yarn.api.records.ApplicationId) ); Wanted 3 times: - at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:957) But was 2 times: - at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.handle(RMAppManager.java:66) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3695) ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception.
[ https://issues.apache.org/jira/browse/YARN-3695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603971#comment-14603971 ] Hadoop QA commented on YARN-3695: - \\ \\ | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 17m 51s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. | | {color:green}+1{color} | javac | 7m 49s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 56s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 1m 30s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 36s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 46s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. | | {color:green}+1{color} | yarn tests | 6m 18s | Tests passed in hadoop-yarn-server-nodemanager. | | | | 50m 44s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12742286/YARN-3695.01.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / fe6c1bd | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8361/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8361/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8361/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8361/console | This message was automatically generated. ServerProxy (NMProxy, etc.) shouldn't retry forever for non network exception. -- Key: YARN-3695 URL: https://issues.apache.org/jira/browse/YARN-3695 Project: Hadoop YARN Issue Type: Bug Reporter: Junping Du Assignee: Raju Bairishetti Attachments: YARN-3695.01.patch, YARN-3695.patch YARN-3646 fixed the retry-forever policy in RMProxy so that it applies only to limited exceptions rather than all exceptions. Here, we may need the same fix for ServerProxy (NMProxy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3837) javadocs of TimelineAuthenticationFilterInitializer give wrong prefix for auth options
[ https://issues.apache.org/jira/browse/YARN-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602822#comment-14602822 ] Bibin A Chundatt commented on YARN-3837: The whitespace error does not seem correct to me. javadocs of TimelineAuthenticationFilterInitializer give wrong prefix for auth options -- Key: YARN-3837 URL: https://issues.apache.org/jira/browse/YARN-3837 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.8.0 Reporter: Steve Loughran Assignee: Bibin A Chundatt Priority: Minor Attachments: 0001-YARN-3837.patch, 0002-YARN-3837.patch Original Estimate: 0.5h Remaining Estimate: 0.5h The javadocs for {{TimelineAuthenticationFilterInitializer}} talk about the prefix {{yarn.timeline-service.authentication.}}, but the code uses {{yarn.timeline-service.http-authentication.}} as the prefix. Best to use {{@value}} and let the javadocs sort it out for themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
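The suggested {{@value}} fix looks roughly like this; the constant name is an assumption for illustration, not quoted from the actual class:
{code}
// Sketch: keep the prefix in a constant and let javadoc inline it via
// {@value}, so the documented prefix can never drift from the code.
// PREFIX is an assumed constant name, not quoted from the real initializer.
public class TimelineAuthenticationFilterInitializer {

  /** The configuration prefix for timeline HTTP authentication: {@value}. */
  public static final String PREFIX = "yarn.timeline-service.http-authentication.";
}
{code}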
[jira] [Updated] (YARN-3857) Memory leak in ResourceManager with SIMPLE mode
[ https://issues.apache.org/jira/browse/YARN-3857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mujunchao updated YARN-3857: Attachment: hadoop-yarn-server-resourcemanager.patch Never register entries in org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.masterKeys while the SecretKey is null. Memory leak in ResourceManager with SIMPLE mode --- Key: YARN-3857 URL: https://issues.apache.org/jira/browse/YARN-3857 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: mujunchao Priority: Critical Attachments: hadoop-yarn-server-resourcemanager.patch We register the ClientTokenMasterKey to avoid the client holding an invalid ClientToken after the RM restarts. In SIMPLE mode, we register the pair <ApplicationAttemptId, null>, but we never remove it from the HashMap, as unregister only runs in secure mode, so a memory leak results. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
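A simplified sketch of the guard the patch summary describes; only the null check is the point, and the class mirrors the shape of {{ClientToAMTokenSecretManagerInRM}} without being its actual code:
{code}
import java.util.HashMap;
import java.util.Map;
import javax.crypto.SecretKey;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;

// Sketch only: skip the map insertion entirely when there is no key, so
// SIMPLE-mode attempts never enter masterKeys and nothing is left behind
// when unregistration never runs.
public class ClientToAMKeyRegistrySketch {
  private final Map<ApplicationAttemptId, SecretKey> masterKeys =
      new HashMap<ApplicationAttemptId, SecretKey>();

  public synchronized void registerApplication(
      ApplicationAttemptId attempt, SecretKey key) {
    if (key == null) {
      return; // SIMPLE mode: no ClientToAMToken key to track, nothing to leak
    }
    masterKeys.put(attempt, key);
  }
}
{code}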
[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602747#comment-14602747 ] Hudson commented on YARN-3826: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #240 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/240/]) YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other hand, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
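Given that the commit touches {{YarnServerBuilderUtils}}, the direction of the fix is presumably to build a fresh response per heartbeat instead of mutating shared statics. A fragment along those lines, where the factory signature is an assumption rather than a quote from the patch:
{code}
import org.apache.hadoop.yarn.server.api.protocolrecords.NodeHeartbeatResponse;
import org.apache.hadoop.yarn.server.api.records.NodeAction;
import org.apache.hadoop.yarn.server.utils.YarnServerBuilderUtils;

// Instead of mutating a shared static response (racy when nodeHeartbeat runs
// concurrently), build a new response per call so each heartbeat carries its
// own diagnostics. nodeId would come from the incoming heartbeat; the factory
// signature is hypothetical.
NodeHeartbeatResponse response = YarnServerBuilderUtils.newNodeHeartbeatResponse(
    NodeAction.RESYNC, "Node not found, resyncing " + nodeId);
return response;
{code}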
[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor
[ https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602750#comment-14602750 ] Hudson commented on YARN-3745: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #240 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/240/]) YARN-3745. SerializedException should also try to instantiate internal (devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java SerializedException should also try to instantiate internal exception with the default constructor -- Key: YARN-3745 URL: https://issues.apache.org/jira/browse/YARN-3745 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Fix For: 2.8.0 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, YARN-3745.patch While deserialising a SerializedException, it tries to create the internal exception in instantiateException() with cn = cls.getConstructor(String.class). If cls does not have a constructor with a String parameter, it throws NoSuchMethodException (for example, the ClosedChannelException class). We should also try to instantiate the exception with the default constructor so that the inner exception can be propagated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
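The fallback described above amounts to something like the following simplified sketch, which is not the committed {{SerializedExceptionPBImpl}} code:
{code}
import java.lang.reflect.Constructor;

// Sketch of the two-step instantiation: prefer the (String) constructor,
// fall back to the default constructor when it does not exist.
public final class ExceptionInstantiationSketch {
  static Throwable instantiate(Class<? extends Throwable> cls, String message)
      throws Exception {
    try {
      // Preferred path: most Throwable classes declare a (String) constructor.
      Constructor<? extends Throwable> cn = cls.getConstructor(String.class);
      cn.setAccessible(true);
      return cn.newInstance(message);
    } catch (NoSuchMethodException e) {
      // Fallback for classes like ClosedChannelException that only declare a
      // default constructor; the inner exception can still be propagated.
      Constructor<? extends Throwable> cn = cls.getConstructor();
      cn.setAccessible(true);
      return cn.newInstance();
    }
  }
}
{code}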
[jira] [Commented] (YARN-3745) SerializedException should also try to instantiate internal exception with the default constructor
[ https://issues.apache.org/jira/browse/YARN-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602765#comment-14602765 ] Hudson commented on YARN-3745: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #970 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/970/]) YARN-3745. SerializedException should also try to instantiate internal (devaraj: rev b381f88c71d18497deb35039372b1e9715d2c038) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/SerializedExceptionPBImpl.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/records/impl/pb/TestSerializedExceptionPBImpl.java SerializedException should also try to instantiate internal exception with the default constructor -- Key: YARN-3745 URL: https://issues.apache.org/jira/browse/YARN-3745 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Lavkesh Lahngir Assignee: Lavkesh Lahngir Fix For: 2.8.0 Attachments: YARN-3745.1.patch, YARN-3745.2.patch, YARN-3745.3.patch, YARN-3745.patch While deserialising a SerializedException, it tries to create the internal exception in instantiateException() with cn = cls.getConstructor(String.class). If cls does not have a constructor with a String parameter, it throws NoSuchMethodException (for example, the ClosedChannelException class). We should also try to instantiate the exception with the default constructor so that the inner exception can be propagated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3826) Race condition in ResourceTrackerService leads to wrong diagnostics messages
[ https://issues.apache.org/jira/browse/YARN-3826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602762#comment-14602762 ] Hudson commented on YARN-3826: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #970 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/970/]) YARN-3826. Race condition in ResourceTrackerService leads to wrong (devaraj: rev 57f1a01eda80f44d3ffcbcb93c4ee290e274946a) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/utils/YarnServerBuilderUtils.java Race condition in ResourceTrackerService leads to wrong diagnostics messages Key: YARN-3826 URL: https://issues.apache.org/jira/browse/YARN-3826 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.8.0 Attachments: YARN-3826.01.patch, YARN-3826.02.patch, YARN-3826.03.patch Since we are calling {{setDiagnosticsMessage}} in {{nodeHeartbeat}}, which can be called concurrently, the static {{resync}} and {{shutdown}} may have wrong diagnostics messages in some cases. On the other hand, these static members can hardly save any memory, since the normal heartbeat responses are created for each heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)