[jira] [Assigned] (YARN-56) Handle container requests that request more resources than currently available in the cluster
[ https://issues.apache.org/jira/browse/YARN-56?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-56: Assignee: (was: Karthik Kambatla) Handle container requests that request more resources than currently available in the cluster - Key: YARN-56 URL: https://issues.apache.org/jira/browse/YARN-56 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.0.2-alpha, 0.23.3 Reporter: Hitesh Shah In heterogeneous clusters, a simple scheduler check that the allocation request is within the maximum allocatable range is not enough. If the large nodes in the cluster are unavailable, some allocation requests may never be fulfilled. We need an approach to decide when to invalidate such requests. For application submissions, there will need to be a feedback loop for applications that could not be launched. For running AMs, AllocationResponse may need to be augmented with information about invalidated/cancelled container requests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2984: --- Issue Type: Sub-task (was: Improvement) Parent: YARN-2141 Metrics for container's actual memory usage --- Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-prelim.patch It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2532) Track pending resources at the application level
[ https://issues.apache.org/jira/browse/YARN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-2532: -- Assignee: (was: Karthik Kambatla) Track pending resources at the application level - Key: YARN-2532 URL: https://issues.apache.org/jira/browse/YARN-2532 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.5.1 Reporter: Karthik Kambatla SchedulerApplicationAttempt keeps track of current consumption of an app. It would be nice to have a similar value tracked for pending requests. The immediate uses I see are: (1) Showing this on the Web UI (YARN-2333) and (2) updating demand in FS in an event-driven style (YARN-2353) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259695#comment-14259695 ] Karthik Kambatla commented on YARN-1856: I haven't had a chance to work on this further. [~beckham007] - how did your testing go? Please feel free to take this JIRA over if you want to contribute what you guys have done. cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-1856) cgroups based memory monitoring for containers
[ https://issues.apache.org/jira/browse/YARN-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-1856: -- Assignee: (was: Karthik Kambatla) cgroups based memory monitoring for containers -- Key: YARN-1856 URL: https://issues.apache.org/jira/browse/YARN-1856 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-1535) Add an option to yarn rmadmin to clear the znode used by embedded elector
[ https://issues.apache.org/jira/browse/YARN-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-1535: -- Assignee: (was: Karthik Kambatla) Add an option to yarn rmadmin to clear the znode used by embedded elector - Key: YARN-1535 URL: https://issues.apache.org/jira/browse/YARN-1535 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla YARN-1029 implements EmbeddedElectorService. Admins should have a way to clear the znode that this elector uses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2965) Enhance Node Managers to monitor and report the resource usage on machines
[ https://issues.apache.org/jira/browse/YARN-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259697#comment-14259697 ] Karthik Kambatla commented on YARN-2965: [~srikanthkandula], [~rgrandl] - any updates here? I am particularly keen to see how you plan to capture per-container usage, at least memory and CPU. I filed YARN-2984 and posted a preliminary patch there that captures memory consumption. Enhance Node Managers to monitor and report the resource usage on machines -- Key: YARN-2965 URL: https://issues.apache.org/jira/browse/YARN-2965 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Robert Grandl Assignee: Robert Grandl Attachments: ddoc_RT.docx This JIRA is about augmenting Node Managers to monitor the resource usage on the machine, aggregate these reports, and expose them to the RM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2141) [Umbrella] Capture container and node resource consumption
[ https://issues.apache.org/jira/browse/YARN-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2141: --- Summary: [Umbrella] Capture container and node resource consumption (was: Capture container and node resource consumption) [Umbrella] Capture container and node resource consumption -- Key: YARN-2141 URL: https://issues.apache.org/jira/browse/YARN-2141 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Carlo Curino Assignee: Karthik Kambatla Priority: Minor Collecting per-container and per-node resource consumption statistics in a fairly granular manner, and making them available to both infrastructure code (e.g., schedulers) and users (e.g., AMs, or users directly via webapps), can facilitate several kinds of performance work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2141) [Umbrella] Capture container and node resource consumption
[ https://issues.apache.org/jira/browse/YARN-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-2141: -- Assignee: (was: Karthik Kambatla) Filed YARN-2984 to capture container's actual memory consumption. Will file another sub-task for CPU (Carlo has emailed me his implementation offline). YARN-2965 covers capturing node resource consumption. [Umbrella] Capture container and node resource consumption -- Key: YARN-2141 URL: https://issues.apache.org/jira/browse/YARN-2141 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Carlo Curino Priority: Minor Collecting per-container and per-node resource consumption statistics in a fairly granular manner, and making them available to both infrastructure code (e.g., schedulers) and users (e.g., AMs, or users directly via webapps), can facilitate several kinds of performance work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2716) Refactor ZKRMStateStore retry code with Apache Curator
[ https://issues.apache.org/jira/browse/YARN-2716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259699#comment-14259699 ] Karthik Kambatla commented on YARN-2716: [~rkanter] - is it okay for me to take this over? We have recently seen more issues with the current implementation, and this rewrite could greatly help. Refactor ZKRMStateStore retry code with Apache Curator -- Key: YARN-2716 URL: https://issues.apache.org/jira/browse/YARN-2716 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Robert Kanter Per the suggestion by [~kasha] in YARN-2131, it would be nice to use Curator to simplify the retry logic in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2063) ZKRMStateStore: Better handling of operation failures
[ https://issues.apache.org/jira/browse/YARN-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla resolved YARN-2063. Resolution: Duplicate ZKRMStateStore: Better handling of operation failures - Key: YARN-2063 URL: https://issues.apache.org/jira/browse/YARN-2063 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Today, when a ZK operation fails, we handle connection-loss and operation-timeout the same way. This could definitely use some improvements: # Add special handling for other error codes # Connection-loss: Nullify zkClient, so a new connection is established # Operation-timeout: Retry a few times with exponential delay? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
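The "retry a few times with exponential delay" idea from the YARN-2063 description can be sketched as follows. This is only an illustration under stated assumptions, not ZKRMStateStore code: the class BackoffRetry, its methods, and the parameter names are all hypothetical, and the sketch retries on every exception rather than distinguishing connection loss from operation timeout.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Hadoop): retries an operation with exponentially growing delays. */
final class BackoffRetry {
    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxDelayMs. */
    static long delayMs(int attempt, long baseDelayMs, long maxDelayMs) {
        // Bound the shift so the multiplication stays far from overflow for sane inputs.
        long factor = 1L << Math.min(attempt, 20);
        return Math.min(maxDelayMs, baseDelayMs * factor);
    }

    /** Runs op, retrying up to maxRetries times; rethrows the last failure once retries run out. */
    static <T> T call(Callable<T> op, int maxRetries, long baseDelayMs, long maxDelayMs)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (attempt >= maxRetries) {
                    throw e; // out of retries: surface the failure to the caller
                }
                Thread.sleep(delayMs(attempt, baseDelayMs, maxDelayMs));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds on the third call.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated timeout");
            }
            return "stored";
        }, 5, 10, 1000);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A real state store would presumably decide per error code whether to retry, reconnect, or give up, which is the first improvement the description lists.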
[jira] [Updated] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
[ https://issues.apache.org/jira/browse/YARN-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2062: --- Target Version/s: 2.7.0 (was: 2.6.0) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover --- Key: YARN-2062 URL: https://issues.apache.org/jira/browse/YARN-2062 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla On busy clusters, we see several {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events invoked against NEW nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
[ https://issues.apache.org/jira/browse/YARN-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2062: --- Attachment: yarn-2062-1.patch Straight-forward patch that adds a dummy transition to not log invalid transitions. [~jianhe], [~adhoot] - does this patch make any sense? Should we be handling all these transitions to better handle work-preserving RM restart? Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover --- Key: YARN-2062 URL: https://issues.apache.org/jira/browse/YARN-2062 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2062-1.patch On busy clusters, we see several {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events invoked against NEW nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
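For readers unfamiliar with the idea of a "dummy transition", here is a minimal table-driven state machine in which an event that would otherwise be invalid in the NEW state is registered as an explicit no-op. It is a sketch under assumptions: NodeLifecycle, its states, and its events are hypothetical stand-ins, not the YARN state machine classes and not the contents of yarn-2062-1.patch.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical miniature state machine; names do not come from the YARN code base. */
final class NodeLifecycle {
    enum State { NEW, RUNNING }
    enum Event { STARTED, STATUS_UPDATE }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.NEW;

    NodeLifecycle() {
        addTransition(State.NEW, Event.STARTED, State.RUNNING);
        // "Dummy" transition: the event is accepted in NEW but leaves the state unchanged,
        // so it is no longer reported as an invalid transition.
        addTransition(State.NEW, Event.STATUS_UPDATE, State.NEW);
        addTransition(State.RUNNING, Event.STATUS_UPDATE, State.RUNNING);
    }

    private void addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    /** Applies the event, or throws if no transition is registered for (current, event). */
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " in state " + current);
        }
        current = next;
        return current;
    }

    public static void main(String[] args) {
        NodeLifecycle node = new NodeLifecycle();
        System.out.println(node.handle(Event.STATUS_UPDATE)); // no-op transition, stays NEW
        System.out.println(node.handle(Event.STARTED));
    }
}
```

The trade-off raised in the comment applies here too: a no-op silences the exception, whereas handling the event properly would require real logic for nodes in the NEW state.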
[jira] [Updated] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService
[ https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Liu updated YARN-2993: - Fix Version/s: 2.7.0 Several fixes (missing acl check, error log msg ...) and some refinement in AdminService Key: YARN-2993 URL: https://issues.apache.org/jira/browse/YARN-2993 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Fix For: 2.7.0 Attachments: YARN-2993.001.patch This JIRA is to resolve the following issues in {{org.apache.hadoop.yarn.server.resourcemanager.AdminService}}: *1.* There is no ACL check for {{refreshServiceAcls}}. *2.* The log message in {{refreshAdminAcls}} is incorrect; it should be "... Can not refresh Admin ACLs." instead of "... Can not refresh user-groups." *3.* Some unnecessary imports. *4.* {code} if (!isRMActive()) { RMAuditLogger.logFailure(user.getShortUserName(), argName, adminAcl.toString(), "AdminService", "ResourceManager is not active. Can not remove labels."); throwStandbyException(); } {code} is common to many methods; only the message differs, so we should factor it into one common method. *5.* {code} LOG.info("Exception remove labels", ioe); RMAuditLogger.logFailure(user.getShortUserName(), argName, adminAcl.toString(), "AdminService", "Exception remove label"); throw RPCUtil.getRemoteException(ioe); {code} is common to many methods; only the message differs, so we should factor it into one common method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
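The refactoring proposed in item 4 of the YARN-2993 description, one shared guard with only the message varying, might look roughly like this. It is a self-contained sketch under assumptions: AdminChecks, checkRMStatus, the nested StandbyException, and the in-memory auditLog are hypothetical stand-ins for the real AdminService, StandbyException, and RMAuditLogger, and the attached patch may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an admin service; not the real AdminService. */
final class AdminChecks {
    /** Stand-in for the real standby exception type. */
    static final class StandbyException extends Exception {
        StandbyException(String message) {
            super(message);
        }
    }

    /** Stand-in for the audit logger: failures are simply collected in memory. */
    final List<String> auditLog = new ArrayList<>();
    private final boolean active;

    AdminChecks(boolean active) {
        this.active = active;
    }

    /**
     * Shared guard: every admin operation calls this instead of repeating the
     * "is the RM active?" block; only the action text differs per caller.
     */
    void checkRMStatus(String user, String operation, String action) throws StandbyException {
        if (!active) {
            String message = "ResourceManager is not active. Can not " + action + ".";
            auditLog.add(user + " " + operation + " FAILED: " + message);
            throw new StandbyException(message);
        }
    }

    public static void main(String[] args) {
        AdminChecks standby = new AdminChecks(false);
        try {
            standby.checkRMStatus("alice", "removeLabels", "remove labels");
        } catch (StandbyException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Item 5 could be handled the same way, with a second helper that logs, records the audit failure, and returns the wrapped exception for the caller to throw.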
[jira] [Commented] (YARN-2993) Several fixes (missing acl check, error log msg ...) and some refinement in AdminService
[ https://issues.apache.org/jira/browse/YARN-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259776#comment-14259776 ] Yi Liu commented on YARN-2993: -- Thanks [~djp] for the review and commit. Several fixes (missing acl check, error log msg ...) and some refinement in AdminService Key: YARN-2993 URL: https://issues.apache.org/jira/browse/YARN-2993 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Fix For: 2.7.0 Attachments: YARN-2993.001.patch This JIRA is to resolve the following issues in {{org.apache.hadoop.yarn.server.resourcemanager.AdminService}}: *1.* There is no ACL check for {{refreshServiceAcls}}. *2.* The log message in {{refreshAdminAcls}} is incorrect; it should be "... Can not refresh Admin ACLs." instead of "... Can not refresh user-groups." *3.* Some unnecessary imports. *4.* {code} if (!isRMActive()) { RMAuditLogger.logFailure(user.getShortUserName(), argName, adminAcl.toString(), "AdminService", "ResourceManager is not active. Can not remove labels."); throwStandbyException(); } {code} is common to many methods; only the message differs, so we should factor it into one common method. *5.* {code} LOG.info("Exception remove labels", ioe); RMAuditLogger.logFailure(user.getShortUserName(), argName, adminAcl.toString(), "AdminService", "Exception remove label"); throw RPCUtil.getRemoteException(ioe); {code} is common to many methods; only the message differs, so we should factor it into one common method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2797: --- Attachment: yarn-2797-1.patch Straight-forward patch. TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2797-1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2062) Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover
[ https://issues.apache.org/jira/browse/YARN-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259793#comment-14259793 ] Hadoop QA commented on YARN-2062: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12689271/yarn-2062-1.patch against trunk revision 1454efe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 15 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6200//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6200//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6200//console This message is automatically generated. 
Too many InvalidStateTransitionExceptions from NodeState.NEW on RM failover --- Key: YARN-2062 URL: https://issues.apache.org/jira/browse/YARN-2062 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2062-1.patch On busy clusters, we see several {{org.apache.hadoop.yarn.state.InvalidStateTransitonException}} for events invoked against NEW nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259814#comment-14259814 ] Hadoop QA commented on YARN-2797: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12689275/yarn-2797-1.patch against trunk revision 1454efe. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 15 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6201//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6201//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6201//console This message is automatically generated. TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2797-1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2994) Document work-preserving RM restart
[ https://issues.apache.org/jira/browse/YARN-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259853#comment-14259853 ] Rohith commented on YARN-2994: -- Thanks [~jianhe] for working on documenting the work-preserving restart feature. I quickly read the patch; the changes are fine. I have one basic doubt: does work-preserving restart work only with ZKRMStateStore? It can also be used with FileSystemStore, right? Document work-preserving RM restart --- Key: YARN-2994 URL: https://issues.apache.org/jira/browse/YARN-2994 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2994.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2797) TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase
[ https://issues.apache.org/jira/browse/YARN-2797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259863#comment-14259863 ] Rohith commented on YARN-2797: -- Thanks Karthik for working on this. One quick comment: ParameterizedSchedulerTestBase does not have FIFO scheduler configurations, but TestWorkPreservingRMRestart runs for the FIFO scheduler too. I think FIFO should also be included. TestWorkPreservingRMRestart should use ParametrizedSchedulerTestBase Key: YARN-2797 URL: https://issues.apache.org/jira/browse/YARN-2797 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Minor Attachments: yarn-2797-1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2994) Document work-preserving RM restart
[ https://issues.apache.org/jira/browse/YARN-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259864#comment-14259864 ] Jian He commented on YARN-2994: --- yes, that's correct. Document work-preserving RM restart --- Key: YARN-2994 URL: https://issues.apache.org/jira/browse/YARN-2994 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2994.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2992) ZKRMStateStore crashes due to session expiry
[ https://issues.apache.org/jira/browse/YARN-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259887#comment-14259887 ] Rohith commented on YARN-2992: -- I see, yes. In my cluster, the configured retry count was very low, so the RM was exiting very quickly. ZKRMStateStore crashes due to session expiry Key: YARN-2992 URL: https://issues.apache.org/jira/browse/YARN-2992 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Blocker Fix For: 2.7.0 Attachments: yarn-2992-1.patch We recently saw the RM crash with the following stacktrace. On session expiry, we should gracefully transition to standby. {noformat} 2014-12-18 06:28:42,689 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.zookeeper.KeeperException.create(KeeperException.java:127) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:930) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:927) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1069) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1088) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:927) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:941) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.setDataWithRetries(ZKRMStateStore.java:958) at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationAttemptStateInternal(ZKRMStateStore.java:687) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2922) Concurrent Modification Exception in LeafQueue when collecting applications
[ https://issues.apache.org/jira/browse/YARN-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2922: - Target Version/s: 2.7.0 Concurrent Modification Exception in LeafQueue when collecting applications --- Key: YARN-2922 URL: https://issues.apache.org/jira/browse/YARN-2922 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.5.1 Reporter: Jason Tufo Assignee: Rohith Attachments: 0001-YARN-2922.patch java.util.ConcurrentModificationException at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1115) at java.util.TreeMap$KeyIterator.next(TreeMap.java:1169) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.collectSchedulerApplications(LeafQueue.java:1618) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getAppsInQueue(CapacityScheduler.java:1119) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:798) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:234) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:333) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
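The trace above shows a TreeMap key iterator failing inside LeafQueue.collectSchedulerApplications, which is the usual symptom of one thread iterating a collection while another modifies it. One common remedy, sketched here with hypothetical names (AppRegistry and its methods are not Hadoop classes, and this is not necessarily what 0001-YARN-2922.patch does), is to guard both mutation and collection with the same lock and copy the elements out while holding it.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical registry of application ids; stands in for a queue's set of active apps. */
final class AppRegistry {
    private final TreeSet<String> activeApps = new TreeSet<>();

    synchronized void add(String appId) {
        activeApps.add(appId);
    }

    synchronized void remove(String appId) {
        activeApps.remove(appId);
    }

    /**
     * Copies the current ids into the caller's collection. Because this method holds the
     * same monitor as add/remove, the set cannot change while it is being iterated.
     */
    synchronized void collect(Collection<String> out) {
        out.addAll(activeApps);
    }

    public static void main(String[] args) throws InterruptedException {
        AppRegistry registry = new AppRegistry();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) {
                registry.add("app_" + i);
                registry.remove("app_" + i);
            }
        });
        writer.start();
        // Collect repeatedly while the writer mutates the set; no exception is expected.
        for (int i = 0; i < 1_000; i++) {
            List<String> snapshot = new ArrayList<>();
            registry.collect(snapshot);
        }
        writer.join();
        System.out.println("done");
    }
}
```

The cost of this design is that readers block writers for the duration of the copy; a concurrent collection would avoid that at the price of weaker snapshot guarantees.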
[jira] [Assigned] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith reassigned YARN-2991:
    Assignee: Rohith

TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

    Key: YARN-2991
    URL: https://issues.apache.org/jira/browse/YARN-2991
    Project: Hadoop YARN
    Issue Type: Test
    Reporter: Zhijie Shen
    Assignee: Rohith

{code}
Error Message

test timed out after 6 milliseconds

Stacktrace

java.lang.Exception: test timed out after 6 milliseconds
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
    at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
{code}

It happened twice this month:
https://builds.apache.org/job/PreCommit-YARN-Build/6096/
https://builds.apache.org/job/PreCommit-YARN-Build/6182/
[jira] [Updated] (YARN-2991) TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk
[ https://issues.apache.org/jira/browse/YARN-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith updated YARN-2991:
    Priority: Blocker (was: Major)

TestRMRestart.testDecomissionedNMsMetricsOnRMRestart intermittently fails on trunk

    Key: YARN-2991
    URL: https://issues.apache.org/jira/browse/YARN-2991
    Project: Hadoop YARN
    Issue Type: Test
    Reporter: Zhijie Shen
    Assignee: Rohith
    Priority: Blocker

{code}
Error Message

test timed out after 6 milliseconds

Stacktrace

java.lang.Exception: test timed out after 6 milliseconds
    at java.lang.Object.wait(Native Method)
    at java.lang.Thread.join(Thread.java:1281)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
    at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
    at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
    at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1106)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testDecomissionedNMsMetricsOnRMRestart(TestRMRestart.java:1873)
{code}

It happened twice this month:
https://builds.apache.org/job/PreCommit-YARN-Build/6096/
https://builds.apache.org/job/PreCommit-YARN-Build/6182/