[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075257#comment-14075257 ]

Zhijie Shen commented on YARN-2247:
-----------------------------------

+1 for the latest patch. [~vinodkv], do you have more comments about this issue?

> Allow RM web services users to authenticate using delegation tokens
> -------------------------------------------------------------------
>
> Key: YARN-2247
> URL: https://issues.apache.org/jira/browse/YARN-2247
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Varun Vasudev
> Assignee: Varun Vasudev
> Priority: Blocker
> Attachments: apache-yarn-2247.0.patch, apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, apache-yarn-2247.4.patch, apache-yarn-2247.5.patch
>
> The RM webapp should allow users to authenticate using delegation tokens to maintain parity with RPC.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Updated] (YARN-2346) Add a 'status' command to yarn-daemon.sh
[ https://issues.apache.org/jira/browse/YARN-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated YARN-2346:
-----------------------------------
    Target Version/s:   (was: 2.5.0)

> Add a 'status' command to yarn-daemon.sh
> ----------------------------------------
>
> Key: YARN-2346
> URL: https://issues.apache.org/jira/browse/YARN-2346
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1
> Reporter: Nikunj Bansal
> Assignee: Allen Wittenauer
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Adding a 'status' command to yarn-daemon.sh would be useful for finding out the status of the YARN daemons.
> Running the 'status' command should exit with a 0 exit code if the target daemon is running and a non-zero code if it's not.
[jira] [Assigned] (YARN-2346) Add a 'status' command to yarn-daemon.sh
[ https://issues.apache.org/jira/browse/YARN-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer reassigned YARN-2346:
--------------------------------------
    Assignee: Allen Wittenauer

> Add a 'status' command to yarn-daemon.sh
> ----------------------------------------
>
> Key: YARN-2346
> URL: https://issues.apache.org/jira/browse/YARN-2346
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1
> Reporter: Nikunj Bansal
> Assignee: Allen Wittenauer
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Adding a 'status' command to yarn-daemon.sh would be useful for finding out the status of the YARN daemons.
> Running the 'status' command should exit with a 0 exit code if the target daemon is running and a non-zero code if it's not.
[jira] [Resolved] (YARN-2346) Add a 'status' command to yarn-daemon.sh
[ https://issues.apache.org/jira/browse/YARN-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer resolved YARN-2346.
------------------------------------
    Resolution: Duplicate

This is now part of HADOOP-9902. Resolving.

> Add a 'status' command to yarn-daemon.sh
> ----------------------------------------
>
> Key: YARN-2346
> URL: https://issues.apache.org/jira/browse/YARN-2346
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1
> Reporter: Nikunj Bansal
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Adding a 'status' command to yarn-daemon.sh would be useful for finding out the status of the YARN daemons.
> Running the 'status' command should exit with a 0 exit code if the target daemon is running and a non-zero code if it's not.
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075228#comment-14075228 ]

Hudson commented on YARN-1726:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #5975 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5975/])
YARN-1726. Add missing files. ResourceSchedulerWrapper broken due to AbstractYarnScheduler. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613552)
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/appmaster
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/appmaster/TestAMSimulator.java
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/nodemanager
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/nodemanager/TestNMSimulator.java

> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> ------------------------------------------------------------
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.1
> Reporter: Wei Yan
> Assignee: Wei Yan
> Priority: Blocker
> Fix For: 2.5.0
>
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, YARN-1726-6.patch, YARN-1726-7-branch2.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
> The YARN scheduler simulator failed when running the Fair Scheduler, due to the AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit from AbstractYarnScheduler instead of implementing the ResourceScheduler interface directly.
[jira] [Commented] (YARN-1796) container-executor shouldn't require o-r permissions
[ https://issues.apache.org/jira/browse/YARN-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075222#comment-14075222 ]

Hudson commented on YARN-1796:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #5974 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5974/])
YARN-1796. container-executor shouldn't require o-r permissions. Contributed by Aaron T. Myers. (atm: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613548)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c

> container-executor shouldn't require o-r permissions
> ----------------------------------------------------
>
> Key: YARN-1796
> URL: https://issues.apache.org/jira/browse/YARN-1796
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.4.0
> Reporter: Aaron T. Myers
> Assignee: Aaron T. Myers
> Priority: Minor
> Fix For: 2.6.0
>
> Attachments: YARN-1796.patch
>
> The container-executor currently checks that "other" users don't have read permissions. This is unnecessary and runs contrary to the Debian packaging policy manual.
> This is the analogous fix for YARN that was done for MR1 in MAPREDUCE-2103.
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075224#comment-14075224 ]

Hadoop QA commented on YARN-2354:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12657943/YARN-2354-072514.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell:
  org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build///testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build///console

This message is automatically generated.

> DistributedShell may allocate more containers than client specified after it restarts
> -------------------------------------------------------------------------------------
>
> Key: YARN-2354
> URL: https://issues.apache.org/jira/browse/YARN-2354
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Jian He
> Assignee: Li Lu
> Attachments: YARN-2354-072514.patch
>
> To reproduce, run distributed shell with the -num_containers option.
> In ApplicationMaster.java, the following code has an issue:
> {code}
> int numTotalContainersToRequest =
>     numTotalContainers - previousAMRunningContainers.size();
> for (int i = 0; i < numTotalContainersToRequest; ++i) {
>   ContainerRequest containerAsk = setupContainerAskForRM();
>   amRMClient.addContainerRequest(containerAsk);
> }
> numRequestedContainers.set(numTotalContainersToRequest);
> {code}
> numRequestedContainers doesn't account for the previous AM's requested containers, so numRequestedContainers should be set to numTotalContainers.
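The bookkeeping fix described above can be sketched in isolation. The class below is a hypothetical standalone model, not the actual patch; the real fields live in the DistributedShell ApplicationMaster.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical standalone model of the AM's container bookkeeping
// (the real fields live in the DistributedShell ApplicationMaster).
public class ContainerBookkeeping {
    final int numTotalContainers;  // value of the -num_containers option
    final AtomicInteger numRequestedContainers = new AtomicInteger();

    ContainerBookkeeping(int numTotalContainers) {
        this.numTotalContainers = numTotalContainers;
    }

    // After an AM restart, only ask the RM for the containers still missing.
    int containersToRequest(int previousAMRunningContainers) {
        return numTotalContainers - previousAMRunningContainers;
    }

    // The fix: record the full total, so that containers requested by the
    // previous AM attempt are counted and no extra ones are allocated.
    void recordRequested() {
        numRequestedContainers.set(numTotalContainers);
    }
}
```

With 10 total containers and 4 surviving the restart, only 6 new containers are asked for, but the requested counter reflects all 10.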
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075220#comment-14075220 ]

Wangda Tan commented on YARN-2069:
----------------------------------

Hi Mayank,
Thanks for your detailed explanation; I think I understood your approach. However, I think the current way to compute the target user limit is not correct. Let me explain:
Basically, the {{computeTargetedUserLimit}} you created is modified from {{computeUserLimit}}; it calculates as follows:
{code}
target_capacity = used_capacity - resToObtain
min(max(target_capacity / #active_user, target_capacity * user_limit_percent),
    target_capacity * user_limit_factor)
{code}
So when user_limit_percent is set to the default (100%), it is possible that target_user_limit * #active_user > queue_max_capacity. In that case, every user's usage can be below target_user_limit while the queue's usage is larger than its guaranteed resource.
Let me give you an example:
{code}
Assume queue capacity = 50, used_resource = 70, resToObtain = 20
So target_capacity = 50, and there are 5 users in the queue
user_limit_percent = 100%, user_limit_factor = 1 (both are defaults)
So target_user_capacity = min(max(50 / 5, 50 * 100%), 50) = 50
User1 used 20
User2 used 10
User3 used 10
User4 used 20
User5 used 10
So every user's used capacity is < target_user_capacity
{code}
In the existing logic of {{balanceUserLimitsinQueueForPreemption}}:
{code}
if (Resources.lessThan(rc, clusterResource, userLimitforQueue,
    userConsumedResource)) {
  // do preemption
} else {
  continue;
}
{code}
If a user's used resource is < target_user_capacity, it will not be preempted, so in this example nothing is preempted even though the queue is 20 over its guarantee.
Mayank, is that correct, or did I misunderstand your logic?
Please let me know your comments.
Thanks,
Wangda

> CS queue level preemption should respect user-limits
> ----------------------------------------------------
>
> Key: YARN-2069
> URL: https://issues.apache.org/jira/browse/YARN-2069
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacityscheduler
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Mayank Bansal
> Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch
>
> This is different from (even if related to, and likely shares code with) YARN-2113.
> YARN-2113 focuses on making sure that even if a queue has its guaranteed capacity, its individual users are treated in line with their limits irrespective of when they join in.
> This JIRA is about respecting user-limits while preempting containers to balance queue capacities.
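The formula in the comment above can be checked numerically. The sketch below is an illustrative reimplementation of the quoted pseudo-code, not the real CapacityScheduler code; the method name is made up, and percents/factors are taken as fractions (1.0 == 100%).

```java
// Illustrative reimplementation of the target-user-limit formula quoted
// above (hypothetical names, not the actual CapacityScheduler API).
public class TargetUserLimit {
    static double compute(double usedCapacity, double resToObtain,
                          int activeUsers, double userLimitPercent,
                          double userLimitFactor) {
        double targetCapacity = usedCapacity - resToObtain;
        return Math.min(
            Math.max(targetCapacity / activeUsers,
                     targetCapacity * userLimitPercent),
            targetCapacity * userLimitFactor);
    }

    public static void main(String[] args) {
        // Wangda's example: used 70, resToObtain 20, 5 active users,
        // default user_limit_percent (100%) and user_limit_factor (1).
        double limit = compute(70, 20, 5, 1.0, 1.0);
        System.out.println(limit); // 50.0 -- above every user's usage
                                   // (max 20), so nothing gets preempted
    }
}
```

This reproduces the problem in the example: the per-user limit (50) exceeds every individual usage even though the queue as a whole is 20 over its guarantee.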
[jira] [Commented] (YARN-1796) container-executor shouldn't require o-r permissions
[ https://issues.apache.org/jira/browse/YARN-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075212#comment-14075212 ]

Hadoop QA commented on YARN-1796:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12633282/YARN-1796.patch
against trunk revision .

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4445//console

This message is automatically generated.

> container-executor shouldn't require o-r permissions
> ----------------------------------------------------
>
> Key: YARN-1796
> URL: https://issues.apache.org/jira/browse/YARN-1796
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.4.0
> Reporter: Aaron T. Myers
> Assignee: Aaron T. Myers
> Priority: Minor
> Attachments: YARN-1796.patch
>
> The container-executor currently checks that "other" users don't have read permissions. This is unnecessary and runs contrary to the Debian packaging policy manual.
> This is the analogous fix for YARN that was done for MR1 in MAPREDUCE-2103.
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075213#comment-14075213 ]

Hadoop QA commented on YARN-1726:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12657962/YARN-1726-7-branch2.patch
against trunk revision .

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4446//console

This message is automatically generated.

> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> ------------------------------------------------------------
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.1
> Reporter: Wei Yan
> Assignee: Wei Yan
> Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, YARN-1726-6.patch, YARN-1726-7-branch2.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
> The YARN scheduler simulator failed when running the Fair Scheduler, due to the AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit from AbstractYarnScheduler instead of implementing the ResourceScheduler interface directly.
[jira] [Updated] (YARN-1796) container-executor shouldn't require o-r permissions
[ https://issues.apache.org/jira/browse/YARN-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron T. Myers updated YARN-1796:
---------------------------------
    Target Version/s: 2.6.0  (was: 2.4.0)

> container-executor shouldn't require o-r permissions
> ----------------------------------------------------
>
> Key: YARN-1796
> URL: https://issues.apache.org/jira/browse/YARN-1796
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.4.0
> Reporter: Aaron T. Myers
> Assignee: Aaron T. Myers
> Priority: Minor
> Attachments: YARN-1796.patch
>
> The container-executor currently checks that "other" users don't have read permissions. This is unnecessary and runs contrary to the Debian packaging policy manual.
> This is the analogous fix for YARN that was done for MR1 in MAPREDUCE-2103.
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075203#comment-14075203 ]

Hudson commented on YARN-1726:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #5973 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5973/])
YARN-1726. ResourceSchedulerWrapper broken due to AbstractYarnScheduler. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613547)
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/nodemanager/NMSimulator.java
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java
* /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/TestSLSRunner.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt

> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> ------------------------------------------------------------
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.1
> Reporter: Wei Yan
> Assignee: Wei Yan
> Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, YARN-1726-6.patch, YARN-1726-7-branch2.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
> The YARN scheduler simulator failed when running the Fair Scheduler, due to the AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit from AbstractYarnScheduler instead of implementing the ResourceScheduler interface directly.
[jira] [Commented] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine
[ https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075200#comment-14075200 ]

Hadoop QA commented on YARN-2361:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12657941/YARN-2361.000.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4442//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4442//console

This message is automatically generated.

> remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine
> --------------------------------------------------------------------------------------------------
>
> Key: YARN-2361
> URL: https://issues.apache.org/jira/browse/YARN-2361
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: resourcemanager
> Reporter: zhihai xu
> Priority: Minor
> Attachments: YARN-2361.000.patch
>
> Remove duplicate entries in the EnumSet of event type in the RMAppAttempt state machine. The event RMAppAttemptEventType.EXPIRE is duplicated in the following code:
> {code}
> EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED,
>     RMAppAttemptEventType.EXPIRE,
>     RMAppAttemptEventType.LAUNCHED,
>     RMAppAttemptEventType.LAUNCH_FAILED,
>     RMAppAttemptEventType.EXPIRE,
>     RMAppAttemptEventType.REGISTERED,
>     RMAppAttemptEventType.CONTAINER_ALLOCATED,
>     RMAppAttemptEventType.UNREGISTERED,
>     RMAppAttemptEventType.KILL,
>     RMAppAttemptEventType.STATUS_UPDATE))
> {code}
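Worth noting: since {{EnumSet}} has set semantics, the duplicate entry is harmless at runtime and removing it is purely a readability cleanup. A minimal demonstration with a stand-in enum (not the real RMAppAttemptEventType):

```java
import java.util.EnumSet;

public class DuplicateEnumEntry {
    // Stand-in for RMAppAttemptEventType, just to show the set semantics.
    enum EventType { ATTEMPT_ADDED, EXPIRE, LAUNCHED, REGISTERED }

    // Listing EXPIRE twice does not change the resulting set, so the
    // deduplicated declaration is behaviorally identical.
    static boolean sameSet() {
        EnumSet<EventType> withDuplicate = EnumSet.of(
            EventType.ATTEMPT_ADDED, EventType.EXPIRE,
            EventType.LAUNCHED, EventType.EXPIRE);
        EnumSet<EventType> deduplicated = EnumSet.of(
            EventType.ATTEMPT_ADDED, EventType.EXPIRE, EventType.LAUNCHED);
        return withDuplicate.equals(deduplicated)
            && withDuplicate.size() == 3;
    }

    public static void main(String[] args) {
        System.out.println(sameSet()); // true
    }
}
```

This is why the patch needs no new tests: the state machine's transition table is identical before and after the change.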
[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Yan updated YARN-1726:
--------------------------
    Attachment: YARN-1726-7-branch2.patch

Updated the patch for branch-2.

> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> ------------------------------------------------------------
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.1
> Reporter: Wei Yan
> Assignee: Wei Yan
> Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, YARN-1726-6.patch, YARN-1726-7-branch2.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
> The YARN scheduler simulator failed when running the Fair Scheduler, due to the AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit from AbstractYarnScheduler instead of implementing the ResourceScheduler interface directly.
[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper broken due to AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated YARN-1726:
-----------------------------------
    Summary: ResourceSchedulerWrapper broken due to AbstractYarnScheduler  (was: ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041)

> ResourceSchedulerWrapper broken due to AbstractYarnScheduler
> ------------------------------------------------------------
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.1
> Reporter: Wei Yan
> Assignee: Wei Yan
> Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
> The YARN scheduler simulator failed when running the Fair Scheduler, due to the AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit from AbstractYarnScheduler instead of implementing the ResourceScheduler interface directly.
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075187#comment-14075187 ]

Karthik Kambatla commented on YARN-1726:
----------------------------------------

Checking this in.

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
> ----------------------------------------------------------------------------------------
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.1
> Reporter: Wei Yan
> Assignee: Wei Yan
> Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
> The YARN scheduler simulator failed when running the Fair Scheduler, due to the AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit from AbstractYarnScheduler instead of implementing the ResourceScheduler interface directly.
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075185#comment-14075185 ]

Hadoop QA commented on YARN-1726:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12657940/YARN-1726-7.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4443//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4443//console

This message is automatically generated.

> ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
> ----------------------------------------------------------------------------------------
>
> Key: YARN-1726
> URL: https://issues.apache.org/jira/browse/YARN-1726
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.4.1
> Reporter: Wei Yan
> Assignee: Wei Yan
> Priority: Blocker
> Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch
>
> The YARN scheduler simulator failed when running the Fair Scheduler, due to the AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper should inherit from AbstractYarnScheduler instead of implementing the ResourceScheduler interface directly.
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075175#comment-14075175 ]

Xuan Gong commented on YARN-2212:
---------------------------------

The test case failure is not related.

> ApplicationMaster needs to find a way to update the AMRMToken periodically
> --------------------------------------------------------------------------
>
> Key: YARN-2212
> URL: https://issues.apache.org/jira/browse/YARN-2212
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075173#comment-14075173 ]

Hadoop QA commented on YARN-2212:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12657937/YARN-2212.3.1.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
  org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4441//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4441//console

This message is automatically generated.

> ApplicationMaster needs to find a way to update the AMRMToken periodically
> --------------------------------------------------------------------------
>
> Key: YARN-2212
> URL: https://issues.apache.org/jira/browse/YARN-2212
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Xuan Gong
> Assignee: Xuan Gong
> Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.1.patch, YARN-2212.3.patch
[jira] [Commented] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075172#comment-14075172 ]

Hadoop QA commented on YARN-2209:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12657938/YARN-2209.4.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4440//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4440//console

This message is automatically generated.

> Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-2209
> URL: https://issues.apache.org/jira/browse/YARN-2209
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Jian He
> Assignee: Jian He
> Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, YARN-2209.4.patch
>
> YARN-1365 introduced an ApplicationMasterNotRegisteredException to tell the application to re-register on RM restart. We should do the same for the AMS#allocate call.
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075166#comment-14075166 ] Karthik Kambatla commented on YARN-1726: +1 pending Jenkins. > ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced > in YARN-1041 > > > Key: YARN-1726 > URL: https://issues.apache.org/jira/browse/YARN-1726 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Blocker > Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, > YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, > YARN-1726.patch, YARN-1726.patch > > > The YARN scheduler simulator failed when running Fair Scheduler, due to > AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper > should inherit AbstractYarnScheduler, instead of implementing > ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075165#comment-14075165 ] Jian He commented on YARN-2354: --- looks good. thanks for working on the patch Li ! > DistributedShell may allocate more containers than client specified after it > restarts > - > > Key: YARN-2354 > URL: https://issues.apache.org/jira/browse/YARN-2354 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jian He >Assignee: Li Lu > Attachments: YARN-2354-072514.patch > > > To reproduce, run distributed shell with -num_containers option, > In ApplicationMaster.java, the following code has some issue. > {code} > int numTotalContainersToRequest = > numTotalContainers - previousAMRunningContainers.size(); > for (int i = 0; i < numTotalContainersToRequest; ++i) { > ContainerRequest containerAsk = setupContainerAskForRM(); > amRMClient.addContainerRequest(containerAsk); > } > numRequestedContainers.set(numTotalContainersToRequest); > {code} > numRequestedContainers doesn't account for previous AM's requested > containers. so numRequestedContainers should be set to numTotalContainers -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2354: Attachment: YARN-2354-072514.patch The problem was with numRequestedContainers. In the previous version, initially, it was set to numTotalContainers - previousAMRunningContainers.size(). Then, on container completion, the number of containers that need to be relaunched is calculated as numTotalContainers - numRequestedContainers, and normally this equals previousAMRunningContainers.size(). If the containers are not reused (no -keep_containers_across_application_attempts), there should be no previousAMRunningContainers, so this problem only occurs when -keep_containers_across_application_attempts is set. I'm also fixing the testDSRestartWithPreviousRunningContainers UT associated with this issue. > DistributedShell may allocate more containers than client specified after it > restarts > - > > Key: YARN-2354 > URL: https://issues.apache.org/jira/browse/YARN-2354 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jian He >Assignee: Li Lu > Attachments: YARN-2354-072514.patch > > > To reproduce, run distributed shell with -num_containers option, > In ApplicationMaster.java, the following code has some issue. > {code} > int numTotalContainersToRequest = > numTotalContainers - previousAMRunningContainers.size(); > for (int i = 0; i < numTotalContainersToRequest; ++i) { > ContainerRequest containerAsk = setupContainerAskForRM(); > amRMClient.addContainerRequest(containerAsk); > } > numRequestedContainers.set(numTotalContainersToRequest); > {code} > numRequestedContainers doesn't account for previous AM's requested > containers. so numRequestedContainers should be set to numTotalContainers -- This message was sent by Atlassian JIRA (v6.2#6252)
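The accounting fix can be sketched in a few lines. This is a minimal, hypothetical model, not the actual patch: the helper methods and hard-coded counts are illustrative, and only the idea that the counter should be seeded with numTotalContainers comes from the discussion above.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ContainerAccounting {
    // New containers to ask the RM for on a restart: only the ones not already
    // running from the previous attempt.
    static int newAsks(int numTotalContainers, int previousRunning) {
        return numTotalContainers - previousRunning;
    }

    // The fix: the previous attempt's containers were already requested once,
    // so the "requested" counter is seeded with the full total, not newAsks.
    static int initialRequested(int numTotalContainers) {
        return numTotalContainers;
    }

    public static void main(String[] args) {
        int numTotalContainers = 10;
        int previousRunning = 4; // containers kept alive across the AM restart

        AtomicInteger numRequestedContainers = new AtomicInteger();
        int toAsk = newAsks(numTotalContainers, previousRunning);         // 6
        numRequestedContainers.set(initialRequested(numTotalContainers)); // 10, not 6

        // On container completion, the shell re-requests
        // numTotalContainers - numRequestedContainers containers; with the fix
        // that is 0, instead of spuriously re-asking for 4 extra containers.
        int extraAsks = numTotalContainers - numRequestedContainers.get();
        System.out.println(toAsk + " " + extraAsks); // prints "6 0"
    }
}
```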
[jira] [Updated] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine
[ https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2361: Attachment: YARN-2361.000.patch > remove duplicate entries (EXPIRE event) in the EnumSet of event type in > RMAppAttempt state machine > -- > > Key: YARN-2361 > URL: https://issues.apache.org/jira/browse/YARN-2361 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: zhihai xu >Priority: Minor > Attachments: YARN-2361.000.patch > > > remove duplicate entries in the EnumSet of event type in RMAppAttempt state > machine. The event RMAppAttemptEventType.EXPIRE is duplicated in the > following code. > {code} > EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED, > RMAppAttemptEventType.EXPIRE, > RMAppAttemptEventType.LAUNCHED, > RMAppAttemptEventType.LAUNCH_FAILED, > RMAppAttemptEventType.EXPIRE, > RMAppAttemptEventType.REGISTERED, > RMAppAttemptEventType.CONTAINER_ALLOCATED, > RMAppAttemptEventType.UNREGISTERED, > RMAppAttemptEventType.KILL, > RMAppAttemptEventType.STATUS_UPDATE)) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine
[ https://issues.apache.org/jira/browse/YARN-2361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2361: Component/s: resourcemanager > remove duplicate entries (EXPIRE event) in the EnumSet of event type in > RMAppAttempt state machine > -- > > Key: YARN-2361 > URL: https://issues.apache.org/jira/browse/YARN-2361 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Reporter: zhihai xu >Priority: Minor > Attachments: YARN-2361.000.patch > > > remove duplicate entries in the EnumSet of event type in RMAppAttempt state > machine. The event RMAppAttemptEventType.EXPIRE is duplicated in the > following code. > {code} > EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED, > RMAppAttemptEventType.EXPIRE, > RMAppAttemptEventType.LAUNCHED, > RMAppAttemptEventType.LAUNCH_FAILED, > RMAppAttemptEventType.EXPIRE, > RMAppAttemptEventType.REGISTERED, > RMAppAttemptEventType.CONTAINER_ALLOCATED, > RMAppAttemptEventType.UNREGISTERED, > RMAppAttemptEventType.KILL, > RMAppAttemptEventType.STATUS_UPDATE)) > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2361) remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine
zhihai xu created YARN-2361: --- Summary: remove duplicate entries (EXPIRE event) in the EnumSet of event type in RMAppAttempt state machine Key: YARN-2361 URL: https://issues.apache.org/jira/browse/YARN-2361 Project: Hadoop YARN Issue Type: Improvement Reporter: zhihai xu Priority: Minor Attachments: YARN-2361.000.patch remove duplicate entries in the EnumSet of event type in RMAppAttempt state machine. The event RMAppAttemptEventType.EXPIRE is duplicated in the following code. {code} EnumSet.of(RMAppAttemptEventType.ATTEMPT_ADDED, RMAppAttemptEventType.EXPIRE, RMAppAttemptEventType.LAUNCHED, RMAppAttemptEventType.LAUNCH_FAILED, RMAppAttemptEventType.EXPIRE, RMAppAttemptEventType.REGISTERED, RMAppAttemptEventType.CONTAINER_ALLOCATED, RMAppAttemptEventType.UNREGISTERED, RMAppAttemptEventType.KILL, RMAppAttemptEventType.STATUS_UPDATE)) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
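Worth noting: the duplicate entry is harmless at runtime, because EnumSet.of collapses duplicate arguments into a set, so this patch is pure cleanup rather than a behavioral fix. A minimal sketch (the enum here is a stand-in, not Hadoop's actual RMAppAttemptEventType):

```java
import java.util.EnumSet;

public class EnumSetDedup {
    // Stand-in for RMAppAttemptEventType.
    enum EventType { ATTEMPT_ADDED, EXPIRE, LAUNCHED, LAUNCH_FAILED, KILL }

    // Builds the set the way the quoted code does, with EXPIRE listed twice.
    static EnumSet<EventType> eventsWithDuplicateExpire() {
        return EnumSet.of(
            EventType.ATTEMPT_ADDED,
            EventType.EXPIRE,
            EventType.LAUNCHED,
            EventType.EXPIRE, // duplicate, silently collapsed by EnumSet.of
            EventType.KILL);
    }

    public static void main(String[] args) {
        // EnumSet is a Set: the duplicated EXPIRE never caused a bug, only a
        // redundant argument in the state-machine table.
        System.out.println(eventsWithDuplicateExpire().size()); // prints 4
    }
}
```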
[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1726: -- Attachment: YARN-1726-7.patch Update a patch to fix the comments. > ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced > in YARN-1041 > > > Key: YARN-1726 > URL: https://issues.apache.org/jira/browse/YARN-1726 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Blocker > Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, > YARN-1726-6.patch, YARN-1726-7.patch, YARN-1726.patch, YARN-1726.patch, > YARN-1726.patch, YARN-1726.patch > > > The YARN scheduler simulator failed when running Fair Scheduler, due to > AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper > should inherit AbstractYarnScheduler, instead of implementing > ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075125#comment-14075125 ] Jian He commented on YARN-2209: --- Thanks for the review, Rohith. Uploaded a new patch which fixed the above comments. > Replace allocate#resync command with ApplicationMasterNotRegisteredException > to indicate AM to re-register on RM restart > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2212: Attachment: YARN-2212.3.1.patch > ApplicationMaster needs to find a way to update the AMRMToken periodically > -- > > Key: YARN-2212 > URL: https://issues.apache.org/jira/browse/YARN-2212 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2212.1.patch, YARN-2212.2.patch, > YARN-2212.3.1.patch, YARN-2212.3.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2209: -- Attachment: YARN-2209.4.patch > Replace allocate#resync command with ApplicationMasterNotRegisteredException > to indicate AM to re-register on RM restart > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075097#comment-14075097 ] Hadoop QA commented on YARN-1707: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657926/YARN-1707.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4439//console This message is automatically generated. > Making the CapacityScheduler more dynamic > - > > Key: YARN-1707 > URL: https://issues.apache.org/jira/browse/YARN-1707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Labels: capacity-scheduler > Attachments: YARN-1707.patch > > > The CapacityScheduler is a rather static at the moment, and refreshqueue > provides a rather heavy-handed way to reconfigure it. Moving towards > long-running services (tracked in YARN-896) and to enable more advanced > admission control and resource parcelling we need to make the > CapacityScheduler more dynamic. This is instrumental to the umbrella jira > YARN-1051. > Concretely this require the following changes: > * create queues dynamically > * destroy queues dynamically > * dynamically change queue parameters (e.g., capacity) > * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% > instead of ==100% > We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075084#comment-14075084 ] Hadoop QA commented on YARN-2026: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657909/YARN-2026-v3.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4438//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4438//console This message is automatically generated. 
> Fair scheduler : Fair share for inactive queues causes unfair allocation in > some scenarios > -- > > Key: YARN-2026 > URL: https://issues.apache.org/jira/browse/YARN-2026 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt > > > Problem1- While using hierarchical queues in fair scheduler,there are few > scenarios where we have seen a leaf queue with least fair share can take > majority of the cluster and starve a sibling parent queue which has greater > weight/fair share and preemption doesn’t kick in to reclaim resources. > The root cause seems to be that fair share of a parent queue is distributed > to all its children irrespective of whether its an active or an inactive(no > apps running) queue. Preemption based on fair share kicks in only if the > usage of a queue is less than 50% of its fair share and if it has demands > greater than that. When there are many queues under a parent queue(with high > fair share),the child queue’s fair share becomes really low. As a result when > only few of these child queues have apps running,they reach their *tiny* fair > share quickly and preemption doesn’t happen even if other leaf > queues(non-sibling) are hogging the cluster. > This can be solved by dividing fair share of parent queue only to active > child queues. > Here is an example describing the problem and proposed solution: > root.lowPriorityQueue is a leaf queue with weight 2 > root.HighPriorityQueue is parent queue with weight 8 > root.HighPriorityQueue has 10 child leaf queues : > root.HighPriorityQueue.childQ(1..10) > Above config,results in root.HighPriorityQueue having 80% fair share > and each of its ten child queue would have 8% fair share. Preemption would > happen only if the child queue is <4% (0.5*8=4). 
> Lets say at the moment no apps are running in any of the > root.HighPriorityQueue.childQ(1..10) and few apps are running in > root.lowPriorityQueue which is taking up 95% of the cluster. > Up till this point,the behavior of FS is correct. > Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% > of the cluster. It would get only the available 5% in the cluster and > preemption wouldn't kick in since its above 4%(half fair share).This is bad > considering childQ1 is under a highPriority parent queue which has *80% fair > share*. > Until root.lowPriorityQueue starts relinquishing containers,we would see the > following allocation on the scheduler page: > *root.lowPriorityQueue = 95%* > *root.HighPriorityQueue.childQ1=5%* > This can be solved by distributing a parent’s fair share only to active > queues. > So in the example above,since childQ1 is the only active queue > under root.HighPriorityQueue, it would get all its parent’s fair share i.e. > 80%. > This would cause preemption to reclaim the 30% needed by childQ1 from > root.lowPriorityQueue after fairSharePreemptionTimeout seconds. > Problem2 - Also note that similar situation can happen between > root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2,if childQ2 > hogs the
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075076#comment-14075076 ] Carlo Curino commented on YARN-1707: The attached patch is part of the YARN-1051 effort; like the other patches in this series, it does not work on its own but has been cut out for ease of reviewing. Given previous discussions, we introduced subclasses of ParentQueue and LeafQueue that are dynamically addable/removable/resizeable, as well as changes in the CapacityScheduler to support the "move" of applications across queues. These are core features; we tested them on a cluster running lots of gridmix and manual jobs, and they seem to work fine, but I am sure there are corner cases and possibly metrics that are not updated correctly in all cases. We should also create a new set of tests for the dynamic behavior of the CapacityScheduler. > Making the CapacityScheduler more dynamic > - > > Key: YARN-1707 > URL: https://issues.apache.org/jira/browse/YARN-1707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-1707.patch > > > The CapacityScheduler is a rather static at the moment, and refreshqueue > provides a rather heavy-handed way to reconfigure it. Moving towards > long-running services (tracked in YARN-896) and to enable more advanced > admission control and resource parcelling we need to make the > CapacityScheduler more dynamic. This is instrumental to the umbrella jira > YARN-1051. > Concretely this require the following changes: > * create queues dynamically > * destroy queues dynamically > * dynamically change queue parameters (e.g., capacity) > * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% > instead of ==100% > We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075073#comment-14075073 ] Hadoop QA commented on YARN-2212: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657905/YARN-2212.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4437//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4437//console This message is automatically generated. 
> ApplicationMaster needs to find a way to update the AMRMToken periodically > -- > > Key: YARN-2212 > URL: https://issues.apache.org/jira/browse/YARN-2212 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlo Curino updated YARN-1707: --- Attachment: YARN-1707.patch > Making the CapacityScheduler more dynamic > - > > Key: YARN-1707 > URL: https://issues.apache.org/jira/browse/YARN-1707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-1707.patch > > > The CapacityScheduler is a rather static at the moment, and refreshqueue > provides a rather heavy-handed way to reconfigure it. Moving towards > long-running services (tracked in YARN-896) and to enable more advanced > admission control and resource parcelling we need to make the > CapacityScheduler more dynamic. This is instrumental to the umbrella jira > YARN-1051. > Concretely this require the following changes: > * create queues dynamically > * destroy queues dynamically > * dynamically change queue parameters (e.g., capacity) > * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% > instead of ==100% > We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075011#comment-14075011 ] Ashwin Shankar commented on YARN-2026: -- Incorporated [~kasha]'s suggestion of having two notions of fairness. Also incorporated [~sandyr]'s unit test comments. Please let me know if you have any other comments. Created YARN-2360 to deal with the UI changes to display dynamic fair share on the scheduler page. I've not added dynamic fair share to FSQueueMetrics. Could you please let me know how these metrics are used and whether we want to add dynamic fair share to them? > Fair scheduler : Fair share for inactive queues causes unfair allocation in > some scenarios > -- > > Key: YARN-2026 > URL: https://issues.apache.org/jira/browse/YARN-2026 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt > > > Problem1- While using hierarchical queues in fair scheduler,there are few > scenarios where we have seen a leaf queue with least fair share can take > majority of the cluster and starve a sibling parent queue which has greater > weight/fair share and preemption doesn’t kick in to reclaim resources. > The root cause seems to be that fair share of a parent queue is distributed > to all its children irrespective of whether its an active or an inactive(no > apps running) queue. Preemption based on fair share kicks in only if the > usage of a queue is less than 50% of its fair share and if it has demands > greater than that. When there are many queues under a parent queue(with high > fair share),the child queue’s fair share becomes really low. As a result when > only few of these child queues have apps running,they reach their *tiny* fair > share quickly and preemption doesn’t happen even if other leaf > queues(non-sibling) are hogging the cluster. 
> This can be solved by dividing fair share of parent queue only to active > child queues. > Here is an example describing the problem and proposed solution: > root.lowPriorityQueue is a leaf queue with weight 2 > root.HighPriorityQueue is parent queue with weight 8 > root.HighPriorityQueue has 10 child leaf queues : > root.HighPriorityQueue.childQ(1..10) > Above config,results in root.HighPriorityQueue having 80% fair share > and each of its ten child queue would have 8% fair share. Preemption would > happen only if the child queue is <4% (0.5*8=4). > Lets say at the moment no apps are running in any of the > root.HighPriorityQueue.childQ(1..10) and few apps are running in > root.lowPriorityQueue which is taking up 95% of the cluster. > Up till this point,the behavior of FS is correct. > Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% > of the cluster. It would get only the available 5% in the cluster and > preemption wouldn't kick in since its above 4%(half fair share).This is bad > considering childQ1 is under a highPriority parent queue which has *80% fair > share*. > Until root.lowPriorityQueue starts relinquishing containers,we would see the > following allocation on the scheduler page: > *root.lowPriorityQueue = 95%* > *root.HighPriorityQueue.childQ1=5%* > This can be solved by distributing a parent’s fair share only to active > queues. > So in the example above,since childQ1 is the only active queue > under root.HighPriorityQueue, it would get all its parent’s fair share i.e. > 80%. > This would cause preemption to reclaim the 30% needed by childQ1 from > root.lowPriorityQueue after fairSharePreemptionTimeout seconds. > Problem2 - Also note that similar situation can happen between > root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2,if childQ2 > hogs the cluster. childQ2 can take up 95% cluster and childQ1 would be stuck > at 5%,until childQ2 starts relinquishing containers. 
We would like each of > childQ1 and childQ2 to get half of root.HighPriorityQueue fair share ie > 40%,which would ensure childQ1 gets upto 40% resource if needed through > preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
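The percentages in the example above follow from plain weight-proportional division. A small sketch of the arithmetic (the share helper is illustrative, not the Fair Scheduler's actual code):

```java
public class FairShareExample {
    // Fair share of a queue = parent's share * (weight / sum of sibling weights).
    static double share(double parentShare, double weight, double siblingWeightSum) {
        return parentShare * weight / siblingWeightSum;
    }

    public static void main(String[] args) {
        // root.lowPriorityQueue has weight 2, root.HighPriorityQueue weight 8.
        double high = share(100.0, 8, 10); // 80% of the cluster
        double low = share(100.0, 2, 10);  // 20%

        // Current behavior: 10 equally weighted children each get 8% of the
        // cluster, and fair-share preemption only fires below half of that (4%).
        double child = share(high, 1, 10);
        double preemptBelow = 0.5 * child;

        // Proposed behavior: with only childQ1 active, it inherits the full 80%.
        double activeChild = share(high, 1, 1);

        System.out.printf("%.0f %.0f %.0f %.0f %.0f%n",
            high, low, child, preemptBelow, activeChild); // prints "80 20 8 4 80"
    }
}
```

This makes the failure mode concrete: a queue sitting at 5% usage is above the 4% preemption threshold under the current division, but far below the 80% it would be entitled to if only active children shared the parent's fair share.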
[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2359: Description: The application hangs without timeout or retry after the DNS/network goes down. This happens when, right after the container is allocated for the AM, the DNS/network goes down for the node which has the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, and because an IllegalArgumentException (due to the DNS error) occurred, it stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code doesn't handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated when the node and container time out. So even after the node is removed, the application is still stuck in state RMAppAttemptState.SCHEDULED. The only way to make the application leave this state is to send the RMAppAttemptEventType.KILL event, which is generated only when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED by adding the following code in StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} was: Application is hung without timeout and retry after DNS/network is down. It is because right after the container is allocated for the AM, the DNS/network is down for the node which has the AM container. 
The application attempt is at state RMAppAttemptState.SCHEDULED, it receive RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the IllegalArgumentException(due to DNS error) happened, it stay at state RMAppAttemptState.SCHEDULED. In the state machine, only two events will be processed at this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) which will be generated by the node and container timeout. So even the node is removed, the Application is still hung in this state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send RMAppAttemptEventType.KILL event which will only be generated when you manually kill the application from Job Client by forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle RMAppAttemptEventType.CONTAINER_FINISHED event at state RMAppAttemptState.SCHEDULED add the following code in StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} > Application is hung without timeout and retry after DNS/network is down. > - > > Key: YARN-2359 > URL: https://issues.apache.org/jira/browse/YARN-2359 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-2359.000.patch > > > Application is hung without timeout and retry after DNS/network is down. > It is because right after the container is allocated for the AM, the > DNS/network is down for the node which has the AM container. 
> The application attempt is at state RMAppAttemptState.SCHEDULED; it receives > the RMAppAttemptEventType.CONTAINER_ALLOCATED event but, because an > IllegalArgumentException (due to the DNS error) happened, it stays at state > RMAppAttemptState.SCHEDULED. In the state machine, only two events will be > processed at this state: > RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. > The code didn't handle the event (RMAppAttemptEventType.CONTAINER_FINISHED) > which will be generated when the node and container time out. So even after the > node is removed, the application is still hung in the state > RMAppAttemptState.SCHEDULED. > The only way to make the application exit this state is to send the > RMAppAttemptEventType.KILL event, which will only be generated when you > manually kill the application from the Job Client via forceKillApplication. > To fix the issue, we should add an entry in the state machine table to handle > the RMAppAttemptEventType.CONTAINER_FINISHED event at
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074989#comment-14074989 ] Karthik Kambatla commented on YARN-1726: Comments on the trunk patch: # NMSimulator: Maybe change this signature to match AMSimulator and {{throws Exception}} {code} public void middleStep() throws YarnException, InterruptedException, IOException { {code} # Would the following affect performance? Is there a better alternative, maybe wait-notify? {code} while (rmAppAttempt.getAppAttemptState() != RMAppAttemptState.LAUNCHED) { Thread.sleep(50); {code} # In the test, remove the space in {{count --}}. Also, is there a reason we have to wait for 45 seconds? Can we use a MockClock to speed this test up? > ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced > in YARN-1041 > > > Key: YARN-1726 > URL: https://issues.apache.org/jira/browse/YARN-1726 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Blocker > Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, > YARN-1726-6.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, > YARN-1726.patch > > > The YARN scheduler simulator failed when running Fair Scheduler, due to > AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper > should inherit from AbstractYarnScheduler instead of implementing the > ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
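Karthik's second point — replacing the polling loop with wait-notify — could be sketched roughly as below. The class and method names here are illustrative, not from the patch; the idea is that whichever component observes the attempt reaching LAUNCHED notifies a waiter, so the simulator blocks without burning CPU in a sleep loop:

```java
// Hypothetical sketch of a wait-notify replacement for
// "while (state != LAUNCHED) { Thread.sleep(50); }".
class LaunchLatch {
    private final Object lock = new Object();
    private boolean launched = false;

    // Called by whichever component observes the attempt reaching LAUNCHED.
    public void markLaunched() {
        synchronized (lock) {
            launched = true;
            lock.notifyAll();
        }
    }

    // Blocks until markLaunched() runs or the timeout elapses; wakes up
    // immediately on notify instead of polling every 50 ms.
    public boolean awaitLaunched(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (lock) {
            while (!launched) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // timed out
                }
                lock.wait(remaining);
            }
            return true;
        }
    }
}
```

The guarded `while (!launched)` loop is the standard defense against spurious wakeups; a `java.util.concurrent.CountDownLatch` would be an equally idiomatic choice.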
[jira] [Updated] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2026: - Attachment: YARN-2026-v3.txt > Fair scheduler : Fair share for inactive queues causes unfair allocation in > some scenarios > -- > > Key: YARN-2026 > URL: https://issues.apache.org/jira/browse/YARN-2026 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt > > > Problem 1: While using hierarchical queues in the fair scheduler, there are a few > scenarios where we have seen a leaf queue with the least fair share take the > majority of the cluster and starve a sibling parent queue which has a greater > weight/fair share, and preemption doesn't kick in to reclaim resources. > The root cause seems to be that the fair share of a parent queue is distributed > to all its children irrespective of whether each is an active or an inactive (no > apps running) queue. Preemption based on fair share kicks in only if the > usage of a queue is less than 50% of its fair share and it has demands > greater than that. When there are many queues under a parent queue (with a high > fair share), each child queue's fair share becomes really low. As a result, when > only a few of these child queues have apps running, they reach their *tiny* fair > share quickly and preemption doesn't happen even if other leaf > queues (non-sibling) are hogging the cluster. > This can be solved by dividing the fair share of a parent queue only among its active > child queues. 
> Here is an example describing the problem and the proposed solution: > root.lowPriorityQueue is a leaf queue with weight 2 > root.HighPriorityQueue is a parent queue with weight 8 > root.HighPriorityQueue has 10 child leaf queues: > root.HighPriorityQueue.childQ(1..10) > The above config results in root.HighPriorityQueue having an 80% fair share, > and each of its ten child queues would have an 8% fair share. Preemption would > happen only if a child queue is below 4% (0.5*8=4). > Let's say at the moment no apps are running in any of > root.HighPriorityQueue.childQ(1..10) and a few apps are running in > root.lowPriorityQueue, which is taking up 95% of the cluster. > Up to this point, the behavior of FS is correct. > Now, let's say root.HighPriorityQueue.childQ1 gets a big job which requires 30% > of the cluster. It would get only the available 5% of the cluster, and > preemption wouldn't kick in since it's above 4% (half its fair share). This is bad > considering childQ1 is under a high-priority parent queue which has an *80% fair > share*. > Until root.lowPriorityQueue starts relinquishing containers, we would see the > following allocation on the scheduler page: > *root.lowPriorityQueue = 95%* > *root.HighPriorityQueue.childQ1=5%* > This can be solved by distributing a parent's fair share only to active > queues. > So in the example above, since childQ1 is the only active queue > under root.HighPriorityQueue, it would get all of its parent's fair share, i.e. > 80%. > This would cause preemption to reclaim the 30% needed by childQ1 from > root.lowPriorityQueue after fairSharePreemptionTimeout seconds. > Problem 2: Also note that a similar situation can happen between > root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2, if childQ2 > hogs the cluster. childQ2 can take up 95% of the cluster and childQ1 would be stuck > at 5%, until childQ2 starts relinquishing containers. 
We would like each of > childQ1 and childQ2 to get half of root.HighPriorityQueue's fair share, i.e. > 40%, which would ensure childQ1 gets up to 40% of resources if needed through > preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
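The arithmetic behind the proposal is simple enough to sketch. This is an illustration of the idea only (equal child weights assumed, numbers taken from the example above), not the FairScheduler implementation, which weighs children individually:

```java
// Illustrates "distribute a parent's fair share only to active child queues"
// with the YARN-2026 example numbers. Equal child weights assumed.
class FairShareSketch {
    // Static share: parent's share split across all children, active or not.
    static double staticChildShare(double parentShare, int totalChildren) {
        return parentShare / totalChildren;
    }

    // Dynamic share: parent's share split across active children only.
    static double dynamicChildShare(double parentShare, int activeChildren) {
        return parentShare / activeChildren;
    }

    // Fair-share preemption kicks in below half of the queue's fair share.
    static double preemptionThreshold(double fairShare) {
        return 0.5 * fairShare;
    }
}
```

With the example numbers: statically, each of the 10 children of an 80%-share parent gets 8% and preempts only below 4%; dynamically, a lone active child gets the full 80% and preempts below 40%, so it can reclaim the 30% it needs, and in the Problem 2 case two active children would get 40% each.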
[jira] [Created] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
Ashwin Shankar created YARN-2360: Summary: Fair Scheduler : Display dynamic fair share for queues on the scheduler page Key: YARN-2360 URL: https://issues.apache.org/jira/browse/YARN-2360 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Based on the discussion in YARN-2026, we'd like to display dynamic fair share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074940#comment-14074940 ] Mayank Bansal commented on YARN-2069: - Hi [~wangda], Thanks for the review. Let me explain what this algorithm is doing. Let's say you have queueA in your cluster with 30% capacity allocated to it, and queueA is currently using 50% of the resources. QueueA has 5 users with a 20% user limit, which means each user is using 10% of the cluster capacity. Another queueB exists with 70% allocated capacity, of which 50% is used. Now an application which needs 10% capacity gets submitted to queueB, so 10% of capacity has to be claimed back from queueA: resToObtain = 10%. The targeted user limit will be 8% (this is always calculated based on how much we need to claim back from the users). So, based on the current algorithm, it will take out 2% of resources from every user and leave behind the balance for each user. This also holds when the users are not all using the same amount of resources: the algorithm takes out more from the users which are using more, balancing each down to the targeted user limit. Another thing this algorithm does is preempt the application which was submitted last; that means if user1 has 2 applications, it will try to take the maximum containers from the last application submitted, leaving behind the AM container, while the user limit is honored across all of the user's applications in the queue combined. This algorithm does not remove an AM container if it is not absolutely needed; it takes all the task containers first and only then considers AM containers for preemption. 
Thanks, Mayank > CS queue level preemption should respect user-limits > > > Key: YARN-2069 > URL: https://issues.apache.org/jira/browse/YARN-2069 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Vinod Kumar Vavilapalli >Assignee: Mayank Bansal > Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, > YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, > YARN-2069-trunk-6.patch, YARN-2069-trunk-7.patch > > > This is different from (even if related to, and likely share code with) > YARN-2113. > YARN-2113 focuses on making sure that even if queue has its guaranteed > capacity, it's individual users are treated in-line with their limits > irrespective of when they join in. > This JIRA is about respecting user-limits while preempting containers to > balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
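The balancing step Mayank describes — lower a cap from the heaviest user downward until clipping everyone to that cap frees resToObtain — can be sketched as follows. This is illustrative only: the names are hypothetical and the real patch works on concrete containers and applications (preempting last-submitted apps first and sparing AM containers), not on percentages:

```java
import java.util.Arrays;

// Sketch of "balance users down to a targeted user limit". Given each
// user's current usage and the amount to reclaim (resToObtain), compute a
// target limit T with sum(max(usage[i] - T, 0)) == toReclaim, and return
// how much to preempt from each user.
class UserBalanceSketch {
    static double[] amountsToPreempt(double[] usage, double toReclaim) {
        int n = usage.length;
        double[] sorted = usage.clone();
        Arrays.sort(sorted); // ascending

        // Lower the cap from the top user downward until enough is freed.
        double target = sorted[n - 1];
        double freed = 0;
        int above = 1; // users currently clipped by the cap
        for (int i = n - 2;
             i >= 0 && freed + above * (target - sorted[i]) < toReclaim;
             i--) {
            freed += above * (target - sorted[i]);
            target = sorted[i];
            above++;
        }
        // Spread the remainder evenly over the users above the cap.
        target -= (toReclaim - freed) / above; // the targeted user limit

        double[] preempt = new double[n];
        for (int i = 0; i < n; i++) {
            preempt[i] = Math.max(0, usage[i] - target);
        }
        return preempt;
    }
}
```

For the example above (5 users at 10% each, resToObtain = 10%) the target comes out to 8% and each user gives up 2%; with unequal usage, only the users above the target contribute, taking more from whoever holds more.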
[jira] [Updated] (YARN-2212) ApplicationMaster needs to find a way to update the AMRMToken periodically
[ https://issues.apache.org/jira/browse/YARN-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2212: Attachment: YARN-2212.3.patch Adding more testcases > ApplicationMaster needs to find a way to update the AMRMToken periodically > -- > > Key: YARN-2212 > URL: https://issues.apache.org/jira/browse/YARN-2212 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2212.1.patch, YARN-2212.2.patch, YARN-2212.3.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2359: Description: Application is hung without timeout and retry after DNS/network is down. It is because right after the container is allocated for the AM, the DNS/network is down for the node which has the AM container. The application attempt is at state RMAppAttemptState.SCHEDULED, it receive RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the IllegalArgumentException(due to DNS error) happened, it stay at state RMAppAttemptState.SCHEDULED. In the state machine, only two events will be processed at this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) which will be generated by the node and container timeout. So even the node is removed, the Application is still hung in this state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send RMAppAttemptEventType.KILL event which will only be generated when you manually kill the application from Job Client by forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle RMAppAttemptEventType.CONTAINER_FINISHED event at state RMAppAttemptState.SCHEDULED add the following code in StateMachineFactory: {{ .addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED))}} was: Application is hung without timeout and retry after DNS/network is down. It is because right after the container is allocated for the AM, the DNS/network is down for the node which has the AM container. 
The application attempt is at state RMAppAttemptState.SCHEDULED, it receive RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the IllegalArgumentException(due to DNS error) happened, it stay at state RMAppAttemptState.SCHEDULED. In the state machine, only two events will be processed at this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) which will be generated by the node and container timeout. So even the node is removed, the Application is still hung in this state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send RMAppAttemptEventType.KILL event which will only be generated when you manually kill the application from Job Client by forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle RMAppAttemptEventType.CONTAINER_FINISHED event at state RMAppAttemptState.SCHEDULED add the following code in StateMachineFactory: .addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)) > Application is hung without timeout and retry after DNS/network is down. > - > > Key: YARN-2359 > URL: https://issues.apache.org/jira/browse/YARN-2359 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-2359.000.patch > > > Application is hung without timeout and retry after DNS/network is down. > It is because right after the container is allocated for the AM, the > DNS/network is down for the node which has the AM container. 
> The application attempt is at state RMAppAttemptState.SCHEDULED, it receive > RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the > IllegalArgumentException(due to DNS error) happened, it stay at state > RMAppAttemptState.SCHEDULED. In the state machine, only two events will be > processed at this state: > RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. > The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) > which will be generated by the node and container timeout. So even the node > is removed, the Application is still hung in this state > RMAppAttemptState.SCHEDULED. > The only way to make the application exit this state is to send > RMAppAttemptEventType.KILL event which will only be generated when you > manually kill the application from Job Client by forceKillApplication. > To fix the issue, we should add an entry in the state machine table to handle > RMAppAttemptEventType.CONTAINER_FINISHED event at state > RMAppAttempt
[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2359: Description: Application is hung without timeout and retry after DNS/network is down. It is because right after the container is allocated for the AM, the DNS/network is down for the node which has the AM container. The application attempt is at state RMAppAttemptState.SCHEDULED, it receive RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the IllegalArgumentException(due to DNS error) happened, it stay at state RMAppAttemptState.SCHEDULED. In the state machine, only two events will be processed at this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) which will be generated by the node and container timeout. So even the node is removed, the Application is still hung in this state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send RMAppAttemptEventType.KILL event which will only be generated when you manually kill the application from Job Client by forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle RMAppAttemptEventType.CONTAINER_FINISHED event at state RMAppAttemptState.SCHEDULED add the following code in StateMachineFactory: {code}.addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)){code} was: Application is hung without timeout and retry after DNS/network is down. It is because right after the container is allocated for the AM, the DNS/network is down for the node which has the AM container. 
The application attempt is at state RMAppAttemptState.SCHEDULED, it receive RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the IllegalArgumentException(due to DNS error) happened, it stay at state RMAppAttemptState.SCHEDULED. In the state machine, only two events will be processed at this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) which will be generated by the node and container timeout. So even the node is removed, the Application is still hung in this state RMAppAttemptState.SCHEDULED. The only way to make the application exit this state is to send RMAppAttemptEventType.KILL event which will only be generated when you manually kill the application from Job Client by forceKillApplication. To fix the issue, we should add an entry in the state machine table to handle RMAppAttemptEventType.CONTAINER_FINISHED event at state RMAppAttemptState.SCHEDULED add the following code in StateMachineFactory: {{ .addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED))}} > Application is hung without timeout and retry after DNS/network is down. > - > > Key: YARN-2359 > URL: https://issues.apache.org/jira/browse/YARN-2359 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-2359.000.patch > > > Application is hung without timeout and retry after DNS/network is down. > It is because right after the container is allocated for the AM, the > DNS/network is down for the node which has the AM container. 
> The application attempt is at state RMAppAttemptState.SCHEDULED, it receive > RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the > IllegalArgumentException(due to DNS error) happened, it stay at state > RMAppAttemptState.SCHEDULED. In the state machine, only two events will be > processed at this state: > RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. > The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) > which will be generated by the node and container timeout. So even the node > is removed, the Application is still hung in this state > RMAppAttemptState.SCHEDULED. > The only way to make the application exit this state is to send > RMAppAttemptEventType.KILL event which will only be generated when you > manually kill the application from Job Client by forceKillApplication. > To fix the issue, we should add an entry in the state machine table to handle > RMAppAttemptEventType.CONTAINER_FINISHED event at state > R
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074917#comment-14074917 ] zhihai xu commented on YARN-2359: - I can pass the test TestAMRestart in my local build. --- T E S T S --- Running org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 90.076 sec - in org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Results : Tests run: 5, Failures: 0, Errors: 0, Skipped: 0 > Application is hung without timeout and retry after DNS/network is down. > - > > Key: YARN-2359 > URL: https://issues.apache.org/jira/browse/YARN-2359 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-2359.000.patch > > > Application is hung without timeout and retry after DNS/network is down. > It is because right after the container is allocated for the AM, the > DNS/network is down for the node which has the AM container. > The application attempt is at state RMAppAttemptState.SCHEDULED, it receive > RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the > IllegalArgumentException(due to DNS error) happened, it stay at state > RMAppAttemptState.SCHEDULED. In the state machine, only two events will be > processed at this state: > RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. > The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) > which will be generated by the node and container timeout. So even the node > is removed, the Application is still hung in this state > RMAppAttemptState.SCHEDULED. > The only way to make the application exit this state is to send > RMAppAttemptEventType.KILL event which will only be generated when you > manually kill the application from Job Client by forceKillApplication. 
> To fix the issue, we should add an entry in the state machine table to handle > RMAppAttemptEventType.CONTAINER_FINISHED event at state > RMAppAttemptState.SCHEDULED > add the following code in StateMachineFactory: > .addTransition(RMAppAttemptState.SCHEDULED, > RMAppAttemptState.FINAL_SAVING, > RMAppAttemptEventType.CONTAINER_FINISHED, > new FinalSavingTransition( > new AMContainerCrashedBeforeRunningTransition(), > RMAppAttemptState.FAILED)) -- This message was sent by Atlassian JIRA (v6.2#6252)
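Why the missing table entry hangs the attempt can be shown with a toy transition table. This is a simplified stand-in for Hadoop's StateMachineFactory, not the real API: with no entry for (SCHEDULED, CONTAINER_FINISHED), the event is dropped and the attempt stays in SCHEDULED forever; adding the entry routes it onward, mirroring the proposed addTransition:

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model of the YARN-2359 hang: a transition table keyed by
// (state, event). Events with no entry for the current state are ignored.
class ToyAttemptStateMachine {
    enum State { SCHEDULED, FINAL_SAVING, FAILED, KILLED }
    enum Event { CONTAINER_ALLOCATED, CONTAINER_FINISHED, KILL }

    private final Map<State, Map<Event, State>> table =
            new EnumMap<>(State.class);
    private State current = State.SCHEDULED;

    ToyAttemptStateMachine addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<>(Event.class))
             .put(on, to);
        return this;
    }

    // Returns false when the event has no transition and is dropped.
    boolean handle(Event e) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(e);
        if (next == null) {
            return false; // invalid event for this state: attempt stays put
        }
        current = next;
        return true;
    }

    State state() { return current; }
}
```

Without a (SCHEDULED, CONTAINER_FINISHED) entry, `handle` drops the event and `state()` never leaves SCHEDULED — the hang; after registering the transition, the same event moves the attempt to FINAL_SAVING, as the proposed fix does in the real state machine.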
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074890#comment-14074890 ] Hudson commented on YARN-2211: -- FAILURE: Integrated in Hadoop-trunk-Commit #5970 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5970/]) YARN-2211. Persist AMRMToken master key in RMStateStore for RM recovery. Contributed by Xuan Gong (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613515) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/ProtocolHATestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestApplicationMasterServiceOnHA.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMSecretManagerService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/NullRMStateStore.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/records/AMRMTokenSecretManagerState.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/records/impl/pb/AMRMTokenSecretManagerStatePBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/AMRMTokenSecretManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/proto/yarn_server_resourcemanager_recovery.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestAMRMTokens.java > RMStateStore need
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074876#comment-14074876 ] Jian He commented on YARN-2211: --- looks good, +1 > RMStateStore needs to save AMRMToken master key for recovery when RM > restart/failover happens > -- > > Key: YARN-2211 > URL: https://issues.apache.org/jira/browse/YARN-2211 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, > YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, > YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, > YARN-2211.7.patch, YARN-2211.8.1.patch, YARN-2211.8.patch > > > After YARN-2208, AMRMToken can be rolled over periodically. We need to save > related Master Keys and use them to recover the AMRMToken when RM > restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074873#comment-14074873 ] Hadoop QA commented on YARN-2359: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657887/YARN-2359.000.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4436//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4436//console This message is automatically generated. > Application is hung without timeout and retry after DNS/network is down. 
> - > > Key: YARN-2359 > URL: https://issues.apache.org/jira/browse/YARN-2359 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-2359.000.patch > > > Application is hung without timeout and retry after DNS/network is down. > It is because right after the container is allocated for the AM, the > DNS/network is down for the node which has the AM container. > The application attempt is at state RMAppAttemptState.SCHEDULED, it receive > RMAppAttemptEventType.CONTAINER_ALLOCATED event, because the > IllegalArgumentException(due to DNS error) happened, it stay at state > RMAppAttemptState.SCHEDULED. In the state machine, only two events will be > processed at this state: > RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. > The code didn't handle any event(RMAppAttemptEventType.CONTAINER_FINISHED) > which will be generated by the node and container timeout. So even the node > is removed, the Application is still hung in this state > RMAppAttemptState.SCHEDULED. > The only way to make the application exit this state is to send > RMAppAttemptEventType.KILL event which will only be generated when you > manually kill the application from Job Client by forceKillApplication. > To fix the issue, we should add an entry in the state machine table to handle > RMAppAttemptEventType.CONTAINER_FINISHED event at state > RMAppAttemptState.SCHEDULED > add the following code in StateMachineFactory: > .addTransition(RMAppAttemptState.SCHEDULED, > RMAppAttemptState.FINAL_SAVING, > RMAppAttemptEventType.CONTAINER_FINISHED, > new FinalSavingTransition( > new AMContainerCrashedBeforeRunningTransition(), > RMAppAttemptState.FAILED)) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074819#comment-14074819 ] Hadoop QA commented on YARN-1726: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657881/YARN-1726-6-branch2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4435//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4435//console This message is automatically generated. > ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced > in YARN-1041 > > > Key: YARN-1726 > URL: https://issues.apache.org/jira/browse/YARN-1726 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Blocker > Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, > YARN-1726-6.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, > YARN-1726.patch > > > The YARN scheduler simulator failed when running Fair Scheduler, due to > AbstractYarnScheduler introduced in YARN-1041. 
The ResourceSchedulerWrapper > should inherit AbstractYarnScheduler, instead of implementing > ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074815#comment-14074815 ] Hadoop QA commented on YARN-2211: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657877/YARN-2211.8.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 10 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4434//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4434//console This message is automatically generated. 
> RMStateStore needs to save AMRMToken master key for recovery when RM > restart/failover happens > -- > > Key: YARN-2211 > URL: https://issues.apache.org/jira/browse/YARN-2211 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, > YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, > YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, > YARN-2211.7.patch, YARN-2211.8.1.patch, YARN-2211.8.patch > > > After YARN-2208, AMRMToken can be rolled over periodically. We need to save > related Master Keys and use them to recover the AMRMToken when RM > restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2359: Attachment: YARN-2359.000.patch > Application is hung without timeout and retry after DNS/network is down. > - > > Key: YARN-2359 > URL: https://issues.apache.org/jira/browse/YARN-2359 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > Attachments: YARN-2359.000.patch > > > Application is hung without timeout and retry after DNS/network is down. > This happens when, right after the container is allocated for the AM, the DNS/network goes down for the node that hosts the AM container. > The application attempt is in state RMAppAttemptState.SCHEDULED; it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, but because an IllegalArgumentException (due to the DNS error) is thrown, it stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. > The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event, which is generated on node and container timeout. So even after the node is removed, the application remains stuck in state RMAppAttemptState.SCHEDULED. > The only way to make the application leave this state is to send an RMAppAttemptEventType.KILL event, which is generated only when the application is killed manually from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table that handles the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED, by adding the following code in StateMachineFactory: > .addTransition(RMAppAttemptState.SCHEDULED, > RMAppAttemptState.FINAL_SAVING, > RMAppAttemptEventType.CONTAINER_FINISHED, > new FinalSavingTransition( > new AMContainerCrashedBeforeRunningTransition(), > RMAppAttemptState.FAILED)) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1726: -- Attachment: YARN-1726-6-branch2.patch update a patch for branch2 > ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced > in YARN-1041 > > > Key: YARN-1726 > URL: https://issues.apache.org/jira/browse/YARN-1726 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Blocker > Attachments: YARN-1726-5.patch, YARN-1726-6-branch2.patch, > YARN-1726-6.patch, YARN-1726.patch, YARN-1726.patch, YARN-1726.patch, > YARN-1726.patch > > > The YARN scheduler simulator failed when running Fair Scheduler, due to > AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper > should inherit AbstractYarnScheduler, instead of implementing > ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074753#comment-14074753 ] Xuan Gong commented on YARN-2211: - Same patch, with a fix to the assertion in TestAMRMTokens that resolves the test-case failure. We changed the exception message in AMRMTokenSecretManager#retrievePWD but did not update the message that the test case asserts on, which caused the failure. The new patch fixes that. > RMStateStore needs to save AMRMToken master key for recovery when RM > restart/failover happens > -- > > Key: YARN-2211 > URL: https://issues.apache.org/jira/browse/YARN-2211 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, > YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, > YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, > YARN-2211.7.patch, YARN-2211.8.1.patch, YARN-2211.8.patch > > > After YARN-2208, AMRMToken can be rolled over periodically. We need to save > related Master Keys and use them to recover the AMRMToken when RM > restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074749#comment-14074749 ] Hadoop QA commented on YARN-1726: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657870/YARN-1726-6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4433//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4433//console This message is automatically generated. > ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced > in YARN-1041 > > > Key: YARN-1726 > URL: https://issues.apache.org/jira/browse/YARN-1726 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Blocker > Attachments: YARN-1726-5.patch, YARN-1726-6.patch, YARN-1726.patch, > YARN-1726.patch, YARN-1726.patch, YARN-1726.patch > > > The YARN scheduler simulator failed when running Fair Scheduler, due to > AbstractYarnScheduler introduced in YARN-1041. 
The ResourceSchedulerWrapper > should inherit AbstractYarnScheduler, instead of implementing > ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2211: Attachment: YARN-2211.8.1.patch > RMStateStore needs to save AMRMToken master key for recovery when RM > restart/failover happens > -- > > Key: YARN-2211 > URL: https://issues.apache.org/jira/browse/YARN-2211 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, > YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, > YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, > YARN-2211.7.patch, YARN-2211.8.1.patch, YARN-2211.8.patch > > > After YARN-2208, AMRMToken can be rolled over periodically. We need to save > related Master Keys and use them to recover the AMRMToken when RM > restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074731#comment-14074731 ] Hadoop QA commented on YARN-2211: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657856/YARN-2211.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.security.TestAMRMTokens {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4430//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4430//console This message is automatically generated. 
> RMStateStore needs to save AMRMToken master key for recovery when RM > restart/failover happens > -- > > Key: YARN-2211 > URL: https://issues.apache.org/jira/browse/YARN-2211 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, > YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, > YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, > YARN-2211.7.patch, YARN-2211.8.patch > > > After YARN-2208, AMRMToken can be rolled over periodically. We need to save > related Master Keys and use them to recover the AMRMToken when RM > restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074728#comment-14074728 ] Hadoop QA commented on YARN-1354: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657864/YARN-1354-v5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4432//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4432//console This message is automatically generated. 
> Recover applications upon nodemanager restart > - > > Key: YARN-1354 > URL: https://issues.apache.org/jira/browse/YARN-1354 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1354-v1.patch, > YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, > YARN-1354-v4.patch, YARN-1354-v5.patch > > > The set of active applications in the nodemanager context need to be > recovered for work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1726) ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced in YARN-1041
[ https://issues.apache.org/jira/browse/YARN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-1726: -- Attachment: YARN-1726-6.patch rebase the patch after YARN-2335. May not work with branch-2, will update a patch for branch-2. > ResourceSchedulerWrapper failed due to the AbstractYarnScheduler introduced > in YARN-1041 > > > Key: YARN-1726 > URL: https://issues.apache.org/jira/browse/YARN-1726 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Blocker > Attachments: YARN-1726-5.patch, YARN-1726-6.patch, YARN-1726.patch, > YARN-1726.patch, YARN-1726.patch, YARN-1726.patch > > > The YARN scheduler simulator failed when running Fair Scheduler, due to > AbstractYarnScheduler introduced in YARN-1041. The ResourceSchedulerWrapper > should inherit AbstractYarnScheduler, instead of implementing > ResourceScheduler interface directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074677#comment-14074677 ] Zhijie Shen commented on YARN-2262: --- [~nishan] and [~Naganarasimha], I have a general suggestion on this issue. According to the logs, the problem is likely related to the FS history store. On the other hand, we are working toward rebasing on the timeline store to persist the generic history data (see YARN-2033 for the motivation and more details). Once that is done, we may deprecate the current FS history store, because it has limitations and it is expensive to maintain two store interfaces. Help fixing the bug is always welcome, but I want you to be aware of this plan, since your effort may not end up being leveraged. If you have the bandwidth, I would appreciate help with other issues such as YARN-2033, where I have a patch available for the timeline-store-based generic history service, but I have not yet had a chance to test it with RM restart. Anyway, thanks for your interest in the timeline server. Please feel free to share your thoughts. > Few fields displaying wrong values in Timeline server after RM restart > -- > > Key: YARN-2262 > URL: https://issues.apache.org/jira/browse/YARN-2262 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.4.0 >Reporter: Nishan Shetty >Assignee: Naganarasimha G R > Attachments: Capture.PNG, Capture1.PNG, > yarn-testos-historyserver-HOST-10-18-40-95.log, > yarn-testos-resourcemanager-HOST-10-18-40-84.log, > yarn-testos-resourcemanager-HOST-10-18-40-95.log > > > Few fields displaying wrong values in Timeline server after RM restart > State:null > FinalStatus: UNDEFINED > Started: 8-Jul-2014 14:58:08 > Elapsed: 2562047397789hrs, 44mins, 47sec -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1354: - Attachment: YARN-1354-v5.patch Updating patch to fix the warning. > Recover applications upon nodemanager restart > - > > Key: YARN-1354 > URL: https://issues.apache.org/jira/browse/YARN-1354 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1354-v1.patch, > YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, > YARN-1354-v4.patch, YARN-1354-v5.patch > > > The set of active applications in the nodemanager context need to be > recovered for work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074635#comment-14074635 ] Anubhav Dhoot commented on YARN-2229: - We cannot simply add a field and have old code not know about it; that would cause it to silently work with a wrong id (missing field). And because of the way we construct containerIds, we need to add the new field (details in YARN-2052). The only way I see it working (without a cluster shutdown) is if we support deserializing both the older and the newer format. When serializing, we can choose to emit the new field based on a condition (a flag or the daemon's version number). The first rolling upgrade would leave the condition off but ensure all the code supports deserializing the newer field if it exists. In the next rolling upgrade we can turn the condition on and serialize the new field. The RM can ensure that NMs are upgraded to a specific version (i.e., support deserializing the new field) before allowing the flag to be turned on; that covers the case where someone does not follow the approach above. Any problems with this approach? > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > In YARN-2052, we changed the containerId format: the upper 10 bits are for the epoch, and the lower 22 bits are for the sequence number of the ids. This preserves the semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and {{ConverterUtils#toContainerId}}.
One concern is that the epoch can overflow after the RM > restarts 1024 times. > To avoid the problem, it's better to make containerId a long. We need to define > the new format of container id while preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.2#6252)
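The 10-bit-epoch/22-bit-sequence packing and the 1024-restart overflow described above can be sketched with plain bit arithmetic. This is an illustrative model only; the helper names are hypothetical and this is not the actual ContainerId implementation:

```java
// Sketch of the 32-bit packing: upper 10 bits hold the epoch,
// lower 22 bits hold the container sequence number.
public class ContainerIdPacking {
    static final int EPOCH_BITS = 10;
    static final int SEQ_BITS = 22;
    static final int SEQ_MASK = (1 << SEQ_BITS) - 1; // 0x3FFFFF

    static int pack(int epoch, int seq) {
        return (epoch << SEQ_BITS) | (seq & SEQ_MASK);
    }

    static int epochOf(int id) { return id >>> SEQ_BITS; }
    static int seqOf(int id)   { return id & SEQ_MASK; }

    public static void main(String[] args) {
        int id = pack(3, 42);
        System.out.println(epochOf(id)); // 3
        System.out.println(seqOf(id));   // 42
        // Only 2^10 = 1024 epochs fit in 10 bits, hence the overflow
        // after 1024 RM restarts: epoch 1024 wraps around and its ids
        // collide with epoch 0.
        System.out.println(epochOf(pack(1024, 42))); // 0
    }
}
```

Widening containerId to a long, as the description proposes, would leave far more headroom for the epoch while keeping the sequence bits intact; the compatibility question discussed in the comment is about how old daemons deserialize an id that carries the extra bits.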
[jira] [Commented] (YARN-2262) Few fields displaying wrong values in Timeline server after RM restart
[ https://issues.apache.org/jira/browse/YARN-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074636#comment-14074636 ] Zhijie Shen commented on YARN-2262: --- [~nishan], thanks for sharing the logs. I've done a preliminary investigation into the RM logs. It seems that the FS history store messed up after RM failed over. There're two types of exception: 1. The history file of application_1406035038624_0005 shouldn't occur, because before failover, I didn't see application_1406035038624_0005 was already started according to the log (or the log is not complete?). However, FS store found the history file on HDFS, and wanted to append more information into the file, but failed to open the file in append mode. {code} 2014-07-23 17:00:03,066 ERROR org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore: Error when openning history file of application application_1406035038624_0005 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): Failed to create file [/home/testos/timelinedata/generic-history/ApplicationHistoryDataRoot/application_1406035038624_0005] for [DFSClient_NONMAPREDUCE_-903472038_1] for client [10.18.40.84], because this file is already being created by [DFSClient_NONMAPREDUCE_1878412866_1] on [10.18.40.84] at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2549) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2378) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2613) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2576) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:537) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:373) at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at org.apache.hadoop.ipc.Client.call(Client.java:1410) at org.apache.hadoop.ipc.Client.call(Client.java:1363) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at $Proxy14.append(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:276) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) at $Proxy15.append(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1569) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1609) at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1597) at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:320) at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:316) at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:316) at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1161) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileWriter.(FileSystemApplicationHistoryStore.java:723) at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.applicationStarted(FileSystemApplicationHistoryStore.java:418) at org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.
[jira] [Commented] (YARN-2335) Annotate all hadoop-sls APIs as @Private
[ https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074634#comment-14074634 ] Hadoop QA commented on YARN-2335: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657858/YARN-2335-1.branch2.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4431//console This message is automatically generated. > Annotate all hadoop-sls APIs as @Private > > > Key: YARN-2335 > URL: https://issues.apache.org/jira/browse/YARN-2335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Minor > Attachments: YARN-2335-1.branch2.patch, YARN-2335-1.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2335) Annotate all hadoop-sls APIs as @Private
[ https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2335: -- Attachment: YARN-2335-1.branch2.patch update patch for branch-2 > Annotate all hadoop-sls APIs as @Private > > > Key: YARN-2335 > URL: https://issues.apache.org/jira/browse/YARN-2335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Minor > Attachments: YARN-2335-1.branch2.patch, YARN-2335-1.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2335) Annotate all hadoop-sls APIs as @Private
[ https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074622#comment-14074622 ] Hudson commented on YARN-2335: -- FAILURE: Integrated in Hadoop-trunk-Commit #5967 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5967/]) YARN-2335. Annotate all hadoop-sls APIs as @Private. (Wei Yan via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613478) * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/RumenToSLSConverter.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/SLSRunner.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/AMSimulator.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/conf/SLSConfiguration.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/nodemanager/NMSimulator.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/nodemanager/NodeInfo.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/CapacitySchedulerMetrics.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ContainerSimulator.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/FairSchedulerMetrics.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/FifoSchedulerMetrics.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/NodeUpdateSchedulerEventWrapper.java * 
/hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/RMNodeWrapper.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/ResourceSchedulerWrapper.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SchedulerMetrics.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SchedulerWrapper.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/TaskRunner.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/utils/SLSUtils.java * /hadoop/common/trunk/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/web/SLSWebApp.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt > Annotate all hadoop-sls APIs as @Private > > > Key: YARN-2335 > URL: https://issues.apache.org/jira/browse/YARN-2335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Minor > Attachments: YARN-2335-1.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2211) RMStateStore needs to save AMRMToken master key for recovery when RM restart/failover happens
[ https://issues.apache.org/jira/browse/YARN-2211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2211: -- Attachment: YARN-2211.8.patch Looks good overall, fixed some log msgs myself. re-submit the patch. > RMStateStore needs to save AMRMToken master key for recovery when RM > restart/failover happens > -- > > Key: YARN-2211 > URL: https://issues.apache.org/jira/browse/YARN-2211 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2211.1.patch, YARN-2211.2.patch, YARN-2211.3.patch, > YARN-2211.4.patch, YARN-2211.5.1.patch, YARN-2211.5.patch, > YARN-2211.6.1.patch, YARN-2211.6.patch, YARN-2211.7.1.patch, > YARN-2211.7.patch, YARN-2211.8.patch > > > After YARN-2208, AMRMToken can be rolled over periodically. We need to save > related Master Keys and use them to recover the AMRMToken when RM > restart/failover happens -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2335) Annotate all hadoop-sls APIs as @Private
[ https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074600#comment-14074600 ] Karthik Kambatla commented on YARN-2335: Committed to trunk. branch-2 had conflicts. Mind updating the patch for branch-2? > Annotate all hadoop-sls APIs as @Private > > > Key: YARN-2335 > URL: https://issues.apache.org/jira/browse/YARN-2335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Minor > Attachments: YARN-2335-1.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2335) Annotate all hadoop-sls APIs as @Private
[ https://issues.apache.org/jira/browse/YARN-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074593#comment-14074593 ] Karthik Kambatla commented on YARN-2335: +1 > Annotate all hadoop-sls APIs as @Private > > > Key: YARN-2335 > URL: https://issues.apache.org/jira/browse/YARN-2335 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Minor > Attachments: YARN-2335-1.patch > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2359: Priority: Critical (was: Major) > Application is hung without timeout and retry after DNS/network is down. > - > > Key: YARN-2359 > URL: https://issues.apache.org/jira/browse/YARN-2359 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu >Priority: Critical > > The application hangs without timeout or retry after DNS/the network goes down. > This happens because, right after the container is allocated for the AM, the > DNS/network goes down on the node that hosts the AM container. > The application attempt is in state RMAppAttemptState.SCHEDULED; when it receives the > RMAppAttemptEventType.CONTAINER_ALLOCATED event, an > IllegalArgumentException (due to the DNS error) is thrown, so it stays in state > RMAppAttemptState.SCHEDULED. In the state machine, only two events are > processed in this state: > RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. > The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event > that the node and container timeouts generate. So even after the node > is removed, the application remains stuck in state > RMAppAttemptState.SCHEDULED. > The only way to make the application leave this state is to send an > RMAppAttemptEventType.KILL event, which is only generated when you > manually kill the application from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table that handles the > RMAppAttemptEventType.CONTAINER_FINISHED event in state > RMAppAttemptState.SCHEDULED, > i.e. add the following code in StateMachineFactory: > .addTransition(RMAppAttemptState.SCHEDULED, > RMAppAttemptState.FINAL_SAVING, > RMAppAttemptEventType.CONTAINER_FINISHED, > new FinalSavingTransition( > new AMContainerCrashedBeforeRunningTransition(), > RMAppAttemptState.FAILED)) -- This message was sent by Atlassian JIRA (v6.2#6252)
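The effect of the missing table entry can be illustrated with a minimal, self-contained sketch. The enum values below mirror RMAppAttemptState and RMAppAttemptEventType, but the class, table, and method names are illustrative stand-ins, not Hadoop's actual StateMachineFactory:

```java
import java.util.EnumMap;
import java.util.Map;

// Toy transition table for the SCHEDULED state discussed above.
public class SchedTransitionSketch {
    enum State { SCHEDULED, FINAL_SAVING, FAILED }
    enum Event { CONTAINER_ALLOCATED, KILL, CONTAINER_FINISHED }

    // transition table: state -> (event -> next state)
    static final Map<State, Map<Event, State>> TABLE = new EnumMap<>(State.class);
    static {
        Map<Event, State> scheduled = new EnumMap<>(Event.class);
        // allocation that fails (e.g. the DNS error) leaves the attempt in SCHEDULED
        scheduled.put(Event.CONTAINER_ALLOCATED, State.SCHEDULED);
        scheduled.put(Event.KILL, State.FINAL_SAVING);
        // the proposed fix: without this entry, the CONTAINER_FINISHED event
        // fired by node/container expiry is rejected and the attempt hangs
        scheduled.put(Event.CONTAINER_FINISHED, State.FINAL_SAVING);
        TABLE.put(State.SCHEDULED, scheduled);
    }

    static State transition(State state, Event event) {
        Map<Event, State> row = TABLE.get(state);
        if (row == null || !row.containsKey(event)) {
            // models the state machine's InvalidStateTransitionException
            throw new IllegalStateException("Invalid event " + event + " at " + state);
        }
        return row.get(event);
    }

    public static void main(String[] args) {
        // with the new entry, losing the AM container moves the attempt
        // toward FAILED via FINAL_SAVING instead of leaving it stuck
        System.out.println(transition(State.SCHEDULED, Event.CONTAINER_FINISHED));
    }
}
```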
[jira] [Assigned] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
[ https://issues.apache.org/jira/browse/YARN-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu reassigned YARN-2359: --- Assignee: zhihai xu > Application is hung without timeout and retry after DNS/network is down. > - > > Key: YARN-2359 > URL: https://issues.apache.org/jira/browse/YARN-2359 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: zhihai xu >Assignee: zhihai xu > > The application hangs without timeout or retry after DNS/the network goes down. > This happens because, right after the container is allocated for the AM, the > DNS/network goes down on the node that hosts the AM container. > The application attempt is in state RMAppAttemptState.SCHEDULED; when it receives the > RMAppAttemptEventType.CONTAINER_ALLOCATED event, an > IllegalArgumentException (due to the DNS error) is thrown, so it stays in state > RMAppAttemptState.SCHEDULED. In the state machine, only two events are > processed in this state: > RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. > The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event > that the node and container timeouts generate. So even after the node > is removed, the application remains stuck in state > RMAppAttemptState.SCHEDULED. > The only way to make the application leave this state is to send an > RMAppAttemptEventType.KILL event, which is only generated when you > manually kill the application from the Job Client via forceKillApplication.
> To fix the issue, we should add an entry to the state machine table that handles the > RMAppAttemptEventType.CONTAINER_FINISHED event in state > RMAppAttemptState.SCHEDULED, > i.e. add the following code in StateMachineFactory: > .addTransition(RMAppAttemptState.SCHEDULED, > RMAppAttemptState.FINAL_SAVING, > RMAppAttemptEventType.CONTAINER_FINISHED, > new FinalSavingTransition( > new AMContainerCrashedBeforeRunningTransition(), > RMAppAttemptState.FAILED)) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2359) Application is hung without timeout and retry after DNS/network is down.
zhihai xu created YARN-2359: --- Summary: Application is hung without timeout and retry after DNS/network is down. Key: YARN-2359 URL: https://issues.apache.org/jira/browse/YARN-2359 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: zhihai xu The application hangs without timeout or retry after DNS/the network goes down. This happens because, right after the container is allocated for the AM, the DNS/network goes down on the node that hosts the AM container. The application attempt is in state RMAppAttemptState.SCHEDULED; when it receives the RMAppAttemptEventType.CONTAINER_ALLOCATED event, an IllegalArgumentException (due to the DNS error) is thrown, so it stays in state RMAppAttemptState.SCHEDULED. In the state machine, only two events are processed in this state: RMAppAttemptEventType.CONTAINER_ALLOCATED and RMAppAttemptEventType.KILL. The code does not handle the RMAppAttemptEventType.CONTAINER_FINISHED event that the node and container timeouts generate. So even after the node is removed, the application remains stuck in state RMAppAttemptState.SCHEDULED. The only way to make the application leave this state is to send an RMAppAttemptEventType.KILL event, which is only generated when you manually kill the application from the Job Client via forceKillApplication. To fix the issue, we should add an entry to the state machine table that handles the RMAppAttemptEventType.CONTAINER_FINISHED event in state RMAppAttemptState.SCHEDULED, i.e. add the following code in StateMachineFactory: .addTransition(RMAppAttemptState.SCHEDULED, RMAppAttemptState.FINAL_SAVING, RMAppAttemptEventType.CONTAINER_FINISHED, new FinalSavingTransition( new AMContainerCrashedBeforeRunningTransition(), RMAppAttemptState.FAILED)) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2214) FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness
[ https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074548#comment-14074548 ] Ashwin Shankar commented on YARN-2214: -- Thanks Karthik ! > FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence > towards fairness > -- > > Key: YARN-2214 > URL: https://issues.apache.org/jira/browse/YARN-2214 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Fix For: 2.6.0 > > Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt > > > preemptContainerPreCheck() in FSParentQueue rejects preemption requests if > the parent queue is below fair share. This can cause a delay in converging > towards fairness when the starved leaf queue and the queue above fairshare > belong under a non-root parent queue(ie their least common ancestor is a > parent queue which is not root). > Here is an example : > root.parent has fair share = 80% and usage = 80% > root.parent.child1 has fair share =40% usage = 80% > root.parent.child2 has fair share=40% usage=0% > Now a job is submitted to child2 and the demand is 40%. > Preemption will kick in and try to reclaim all the 40% from child1. > When it preempts the first container from child1,the usage of root.parent > will become <80%, which is less than root.parent's fair share,causing > preemption to stop.So only one container gets preempted in this round > although the need is a lot more. child2 would eventually get to half its fair > share but only after multiple rounds of preemption. > Solution is to remove preemptContainerPreCheck() in FSParentQueue and keep it > only in FSLeafQueue(which is already there). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2214) FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness
[ https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074527#comment-14074527 ] Hudson commented on YARN-2214: -- FAILURE: Integrated in Hadoop-trunk-Commit #5966 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5966/]) YARN-2214. FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness. (Ashwin Shankar via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613459) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSParentQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java > FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence > towards fairness > -- > > Key: YARN-2214 > URL: https://issues.apache.org/jira/browse/YARN-2214 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt > > > preemptContainerPreCheck() in FSParentQueue rejects preemption requests if > the parent queue is below fair share. 
This can cause a delay in converging > towards fairness when the starved leaf queue and the queue above its fair share > belong under a non-root parent queue (i.e. their least common ancestor is a > parent queue that is not root). > Here is an example: > root.parent has fair share = 80% and usage = 80% > root.parent.child1 has fair share = 40% and usage = 80% > root.parent.child2 has fair share = 40% and usage = 0% > Now a job is submitted to child2 and the demand is 40%. > Preemption will kick in and try to reclaim all the 40% from child1. > When it preempts the first container from child1, the usage of root.parent > drops below 80%, which is less than root.parent's fair share, causing > preemption to stop. So only one container gets preempted in this round, > although the need is a lot more. child2 would eventually get to half its fair > share, but only after multiple rounds of preemption. > The solution is to remove preemptContainerPreCheck() from FSParentQueue and keep it > only in FSLeafQueue (where it already exists). -- This message was sent by Atlassian JIRA (v6.2#6252)
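The round-by-round convergence described above can be sketched with a toy simulation. The numbers (fair share 80, demand 40) come from the example in the issue; the one-unit containers and the "freed capacity goes back to child2 between rounds" model are simplifying assumptions, not FairScheduler's actual accounting:

```java
// Toy model of the preemptContainerPreCheck() behavior discussed above.
public class PreemptPrecheckSketch {
    // Returns the number of preemption rounds needed to reclaim `demand`
    // capacity units for the starved child queue.
    static int roundsToReclaim(int demand, boolean parentPrecheck) {
        final int parentFair = 80;
        int parentUsage = 80;        // initially all of it is child1's usage
        int reclaimed = 0, rounds = 0;
        while (reclaimed < demand) {
            rounds++;
            int preemptedThisRound = 0;
            while (reclaimed < demand) {
                // the parent-level precheck rejects further preemption as
                // soon as the parent queue dips below its own fair share
                if (parentPrecheck && parentUsage < parentFair) {
                    break;
                }
                parentUsage--;       // one container preempted from child1
                reclaimed++;
                preemptedThisRound++;
            }
            // between rounds the freed capacity is handed to child2, which
            // lives under the same parent, so parent usage climbs back up
            parentUsage += preemptedThisRound;
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(roundsToReclaim(40, true));   // many rounds: one container each
        System.out.println(roundsToReclaim(40, false));  // a single round reclaims everything
    }
}
```

With the precheck in FSParentQueue, the model needs one round per container; with the check only at the leaf (the proposed fix), one round suffices.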
[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074523#comment-14074523 ] Hadoop QA commented on YARN-1354: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12657689/YARN-1354-v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1259 javac compiler warnings (more than the trunk's current 1258 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4429//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4429//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4429//console This message is automatically generated. 
> Recover applications upon nodemanager restart > - > > Key: YARN-1354 > URL: https://issues.apache.org/jira/browse/YARN-1354 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1354-v1.patch, > YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, > YARN-1354-v4.patch > > > The set of active applications in the nodemanager context need to be > recovered for work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2358) TestNamenodeCapacityReport.testXceiverCount may sometimes fail due to lack of retry
Mit Desai created YARN-2358: --- Summary: TestNamenodeCapacityReport.testXceiverCount may sometimes fail due to lack of retry Key: YARN-2358 URL: https://issues.apache.org/jira/browse/YARN-2358 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Mit Desai Assignee: Mit Desai I have seen TestNamenodeCapacityReport.testXceiverCount fail intermittently in our nightly builds with the following error: {noformat} java.io.IOException: Unable to close file because the last block does not have enough number of replicas. at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2151) at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2119) at org.apache.hadoop.hdfs.server.namenode.TestNamenodeCapacityReport.testXceiverCount(TestNamenodeCapacityReport.java:281) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2214) FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness
[ https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2214: --- Summary: FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness (was: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness) > FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence > towards fairness > -- > > Key: YARN-2214 > URL: https://issues.apache.org/jira/browse/YARN-2214 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt > > > preemptContainerPreCheck() in FSParentQueue rejects preemption requests if > the parent queue is below fair share. This can cause a delay in converging > towards fairness when the starved leaf queue and the queue above fairshare > belong under a non-root parent queue(ie their least common ancestor is a > parent queue which is not root). > Here is an example : > root.parent has fair share = 80% and usage = 80% > root.parent.child1 has fair share =40% usage = 80% > root.parent.child2 has fair share=40% usage=0% > Now a job is submitted to child2 and the demand is 40%. > Preemption will kick in and try to reclaim all the 40% from child1. > When it preempts the first container from child1,the usage of root.parent > will become <80%, which is less than root.parent's fair share,causing > preemption to stop.So only one container gets preempted in this round > although the need is a lot more. child2 would eventually get to half its fair > share but only after multiple rounds of preemption. > Solution is to remove preemptContainerPreCheck() in FSParentQueue and keep it > only in FSLeafQueue(which is already there). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2214) FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness
[ https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2214: --- Issue Type: Improvement (was: Bug) > FairScheduler: preemptContainerPreCheck() in FSParentQueue delays convergence > towards fairness > -- > > Key: YARN-2214 > URL: https://issues.apache.org/jira/browse/YARN-2214 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt > > > preemptContainerPreCheck() in FSParentQueue rejects preemption requests if > the parent queue is below fair share. This can cause a delay in converging > towards fairness when the starved leaf queue and the queue above fairshare > belong under a non-root parent queue(ie their least common ancestor is a > parent queue which is not root). > Here is an example : > root.parent has fair share = 80% and usage = 80% > root.parent.child1 has fair share =40% usage = 80% > root.parent.child2 has fair share=40% usage=0% > Now a job is submitted to child2 and the demand is 40%. > Preemption will kick in and try to reclaim all the 40% from child1. > When it preempts the first container from child1,the usage of root.parent > will become <80%, which is less than root.parent's fair share,causing > preemption to stop.So only one container gets preempted in this round > although the need is a lot more. child2 would eventually get to half its fair > share but only after multiple rounds of preemption. > Solution is to remove preemptContainerPreCheck() in FSParentQueue and keep it > only in FSLeafQueue(which is already there). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2214) preemptContainerPreCheck() in FSParentQueue delays convergence towards fairness
[ https://issues.apache.org/jira/browse/YARN-2214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074509#comment-14074509 ] Karthik Kambatla commented on YARN-2214: +1. Checking this in. > preemptContainerPreCheck() in FSParentQueue delays convergence towards > fairness > --- > > Key: YARN-2214 > URL: https://issues.apache.org/jira/browse/YARN-2214 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.5.0 >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: YARN-2214-v1.txt, YARN-2214-v2.txt > > > preemptContainerPreCheck() in FSParentQueue rejects preemption requests if > the parent queue is below fair share. This can cause a delay in converging > towards fairness when the starved leaf queue and the queue above fairshare > belong under a non-root parent queue(ie their least common ancestor is a > parent queue which is not root). > Here is an example : > root.parent has fair share = 80% and usage = 80% > root.parent.child1 has fair share =40% usage = 80% > root.parent.child2 has fair share=40% usage=0% > Now a job is submitted to child2 and the demand is 40%. > Preemption will kick in and try to reclaim all the 40% from child1. > When it preempts the first container from child1,the usage of root.parent > will become <80%, which is less than root.parent's fair share,causing > preemption to stop.So only one container gets preempted in this round > although the need is a lot more. child2 would eventually get to half its fair > share but only after multiple rounds of preemption. > Solution is to remove preemptContainerPreCheck() in FSParentQueue and keep it > only in FSLeafQueue(which is already there). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace allocate#resync command with ApplicationMasterNotRegisteredException to indicate AM to re-register on RM restart
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074366#comment-14074366 ] Rohith commented on YARN-2209: -- Hi [~jianhe], I reviewed the patch and have some comments: 1. lastResponseID=0 is missing in RMContainerAllocator#getResources(). {code} catch (ApplicationMasterNotRegisteredException e) { LOG.info("ApplicationMaster is out of sync with ResourceManager," + " hence resync and send outstanding requests."); // RM may have restarted, re-register with RM. register(); addOutstandingRequestOnResync(); return null; } {code} 2. In AMRMClientAsyncImpl, the code below may lose a response, since the response is not added back to the responseQueue when an InterruptedException occurs. This may be a worst case, but it can still happen because the JVM or the OS may interrupt the thread. Can we add the response back to the responseQueue on InterruptedException? {code} if (response != null) { try { responseQueue.put(response); break; } catch (InterruptedException ex) { LOG.debug("Interrupted while waiting to put on response queue", ex); } {code} > Replace allocate#resync command with ApplicationMasterNotRegisteredException > to indicate AM to re-register on RM restart > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > the application to re-register on RM restart. We should do the same for > the AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
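One common shape for the fix suggested in comment 2 above is an uninterruptible put: keep retrying until the response is delivered, then restore the thread's interrupt status. This is a self-contained sketch (the queue contents and helper name are placeholders, not AMRMClientAsyncImpl's actual code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RetryPutSketch {
    // Deliver `response` even if the thread is interrupted mid-put, so no
    // allocate response is ever silently dropped.
    static <T> void putUninterruptibly(BlockingQueue<T> queue, T response) {
        boolean interrupted = false;
        while (true) {
            try {
                queue.put(response);
                break;                              // delivered; stop retrying
            } catch (InterruptedException ex) {
                interrupted = true;                 // remember, then retry the put
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt();     // restore interrupt status
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        putUninterruptibly(q, "allocate-response");
        System.out.println(q.size());               // nothing was lost
    }
}
```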
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: (was: YARN-2357.1.patch) > Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 > changes to branch-2 > -- > > Key: YARN-2357 > URL: https://issues.apache.org/jira/browse/YARN-2357 > Project: Hadoop YARN > Issue Type: Task > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Remus Rusanu >Assignee: Remus Rusanu >Priority: Critical > Labels: security, windows > Attachments: YARN-2357.1.patch > > > As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to > trunk, they need to be backported to branch-2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: YARN-2357.1.patch Now with compile fix! > Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 > changes to branch-2 > -- > > Key: YARN-2357 > URL: https://issues.apache.org/jira/browse/YARN-2357 > Project: Hadoop YARN > Issue Type: Task > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Remus Rusanu >Assignee: Remus Rusanu >Priority: Critical > Labels: security, windows > Attachments: YARN-2357.1.patch > > > As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to > trunk, they need to be backported to branch-2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2357: --- Attachment: YARN-2357.1.patch Patch .1 is port of currently uploaded YARN-1063 .6, YARN-1972 .3 and YARN-2198 .2 patches. > Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 > changes to branch-2 > -- > > Key: YARN-2357 > URL: https://issues.apache.org/jira/browse/YARN-2357 > Project: Hadoop YARN > Issue Type: Task > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Remus Rusanu >Assignee: Remus Rusanu >Priority: Critical > Labels: security, windows > Attachments: YARN-2357.1.patch > > > As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to > trunk, they need to be backported to branch-2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
Remus Rusanu created YARN-2357: -- Summary: Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2 Key: YARN-2357 URL: https://issues.apache.org/jira/browse/YARN-2357 Project: Hadoop YARN Issue Type: Task Components: nodemanager Affects Versions: 2.4.0 Reporter: Remus Rusanu Assignee: Remus Rusanu Priority: Critical As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to trunk, they need to be backported to branch-2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2356) yarn status command for non-existent application/application attempt/container is too verbose
[ https://issues.apache.org/jira/browse/YARN-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-2356: -- Attachment: Yarn-2356.1.patch Fixed to handle ApplicationNotFoundException, ApplicationAttemptNotFoundException, and ContainerNotFoundException generally for "-status" commands on non-existent entries. > yarn status command for non-existent application/application > attempt/container is too verbose > -- > > Key: YARN-2356 > URL: https://issues.apache.org/jira/browse/YARN-2356 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Reporter: Sunil G >Assignee: Sunil G >Priority: Minor > Attachments: Yarn-2356.1.patch > > > The *yarn application -status*, *applicationattempt -status*, and *container > -status* commands should suppress exceptions such as ApplicationNotFoundException, > ApplicationAttemptNotFoundException, and ContainerNotFoundException for non-existent entries in > the RM or History Server. > For example, the exception below could be reported more concisely: > sunildev@host-a:~/hadoop/hadoop/bin> ./yarn application -status > application_1402668848165_0015 > No GC_PROFILE is given. Defaults to medium. > 14/07/25 16:21:45 INFO client.RMProxy: Connecting to ResourceManager at > /10.18.40.77:45022 > Exception in thread "main" > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1402668848165_0015' doesn't exist in RM. 
> at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:285) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:607) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2099) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2095) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2093) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) > at java.lang.reflect.Constructor.newInstance(Constructor.java:513) > at > org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) > at > org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:166) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) > at $Proxy12.getApplicationReport(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.printApplicationReport(ApplicationCLI.java:428) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:153) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) > at > org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:76) > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException): > Application with id 'application_1402668848165_0015' doesn't exist in RM. -- This message was sent by Atlassian JIRA (v6.2#6252)
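A minimal sketch of the kind of handling the patch describes — not the actual YARN-2356 patch. The exception class here is a local stand-in for org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException, and the method names are illustrative only:

```java
// Hypothetical sketch: catch the NotFound exception in the CLI and print a
// single concise line plus a non-zero exit code, instead of a stack trace.
public class StatusCommandSketch {

    // Local stand-in for the real YARN ApplicationNotFoundException.
    static class ApplicationNotFoundException extends Exception {
        ApplicationNotFoundException(String msg) { super(msg); }
    }

    // Simulates fetching a report for an id the RM does not know about.
    static String getApplicationReport(String appId)
            throws ApplicationNotFoundException {
        throw new ApplicationNotFoundException(
                "Application with id '" + appId + "' doesn't exist in RM.");
    }

    // Returns 0 on success; -1 with one line on stderr when the entry
    // does not exist, so scripts can still detect the failure.
    static int printApplicationReport(String appId) {
        try {
            System.out.println(getApplicationReport(appId));
            return 0;
        } catch (ApplicationNotFoundException e) {
            System.err.println(e.getMessage());
            return -1;
        }
    }

    public static void main(String[] args) {
        int rc = printApplicationReport("application_1402668848165_0015");
        System.out.println("exit code: " + rc);
    }
}
```

The point of the fix is exactly this shape: the user sees the one-line message, while the full trace stays out of the terminal.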
[jira] [Created] (YARN-2356) yarn status command for non-existent application/application attempt/container is too verbose
Sunil G created YARN-2356: - Summary: yarn status command for non-existent application/application attempt/container is too verbose Key: YARN-2356 URL: https://issues.apache.org/jira/browse/YARN-2356 Project: Hadoop YARN Issue Type: Bug Components: client Reporter: Sunil G Assignee: Sunil G Priority: Minor *yarn application -status*, *applicationattempt -status* and *container -status* commands can suppress exceptions such as ApplicationNotFound, ApplicationAttemptNotFound and ContainerNotFound for non-existent entries in the RM or History Server. For example, the exception below can be suppressed better: sunildev@host-a:~/hadoop/hadoop/bin> ./yarn application -status application_1402668848165_0015 No GC_PROFILE is given. Defaults to medium. 14/07/25 16:21:45 INFO client.RMProxy: Connecting to ResourceManager at /10.18.40.77:45022 Exception in thread "main" org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1402668848165_0015' doesn't exist in RM. at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:285) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:607) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2099) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2095) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1626) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2093) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:166) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) at $Proxy12.getApplicationReport(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291) at org.apache.hadoop.yarn.client.cli.ApplicationCLI.printApplicationReport(ApplicationCLI.java:428) at org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:153) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84) at org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:76) Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException): Application with id 'application_1402668848165_0015' doesn't exist in RM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2336) Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree
[ https://issues.apache.org/jira/browse/YARN-2336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074268#comment-14074268 ] Akira AJISAKA commented on YARN-2336: - Thanks [~kj-ki] for the update. +1 (non-binding). > Fair scheduler REST api returns a missing '[' bracket JSON for deep queue tree > -- > > Key: YARN-2336 > URL: https://issues.apache.org/jira/browse/YARN-2336 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 2.4.1 >Reporter: Kenji Kikushima >Assignee: Kenji Kikushima > Attachments: YARN-2336-2.patch, YARN-2336-3.patch, YARN-2336.patch > > > When we have sub queues in Fair Scheduler, the REST api returns JSON with a missing '[' > bracket for childQueues. > This issue was found by [~ajisakaa] at YARN-1050. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074249#comment-14074249 ] Wangda Tan commented on YARN-2069: -- Hi [~mayank_bansal], Thanks for working on this again. I've taken a brief look at your patch; I think the general approach in your patch is:
- Compute a target-user-limit for a given queue,
- Preempt containers according to a user's current consumption and the target-user-limit,
- If more resources need to be preempted, consider preempting AM containers.
I think there are a couple of rules we need to respect (please let me know if you don't agree with any of them):
# Used resources of the users in a queue after preemption should be as even as possible.
# Before we start preempting AM containers, all task containers should be preempted (according to YARN-2022, preempting AM containers should have the least priority).
# If we do have to preempt AM containers, we should respect #1 too.
For #1, if we want to quantify the result, it should be:
{code}
Let rp_i = used-resource-after-preemption of user_i, for i ∈ {user}

Minimize  sqrt( Σ_i (rp_i - (Σ_i rp_i) / #{user})^2 )
{code}
In other words, we should minimize the standard deviation of used-resource-after-preemption. Since not all containers are equal in size, it is possible that the used-resource-after-preemption of a given user cannot precisely equal the target-user-limit. In our current logic, we make used-resource-after-preemption <= target-user-limit.
Consider the following example:
{code}
qA: has users {V, W, X, Y, Z}; each user has one application
V: app5: {4, 4, 4, 4},  // means V has 4 containers, each with memory=4G, minimum_allocation=1G
W: app4: {4, 4, 4, 4},
X: app3: {4, 4, 4, 4},
Y: app2: {4, 4, 4, 4, 4, 4},
Z: app1: {4}
target-user-limit=11, resource-to-obtain=23

After preemption:
V: {4, 4}
W: {4, 4}
X: {4, 4}
Y: {4, 4, 4, 4, 4, 4}
Z: {4}
{code}
This imbalance happens because, for every application we preempt from, we may overshoot the user-limit (a bias); the more users we process, the more accumulated bias we may have. In other words, the imbalance is linearly correlated with number-of-users-in-a-queue multiplied by average-container-size. And we cannot solve this problem by preempting from the user with the most usage; again the example:
{code}
qA: has users {V, W, X, Y, Z}; each user has one application
V: app5: {4, 4, 4, 4},  // means V has 4 containers, each with memory=4G, minimum_allocation=1G
W: app4: {4, 4, 4, 4},
X: app3: {4, 4, 4, 4},
Y: app2: {4, 4, 4, 4, 4, 4},
Z: app1: {4}
target-user-limit=11, resource-to-obtain=23

After preemption (from the user with the most usage, the sequence is Y->V->W->X->Z):
V: {4, 4}
W: {4, 4, 4, 4}
X: {4, 4, 4, 4}
Y: {4, 4}
Z: {4}
{code}
Still not very balanced; the ideal result would be:
{code}
V: {4, 4, 4}
W: {4, 4, 4}
X: {4, 4, 4}
Y: {4, 4, 4}
Z: {4}
{code}
In addition, this approach cannot satisfy rules #2/#3 if the target-user-limit is not computed appropriately. So I propose to do it another way: we should recompute (used-resource - marked-preempted-resource) for a user after each per-container preemption decision. Maybe we can use a priority queue to store (used-resource - marked-preempted-resource) per user. And we don't need to compute a target user limit at all.
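To quantify rule #1 on the example outcomes above, here is a small self-contained sketch (container sizes in GB as in the example; class and method names are illustrative, not from any patch) that computes the standard deviation of per-user used-resource-after-preemption:

```java
import java.util.Arrays;

// Measures the imbalance of a preemption outcome as the population
// standard deviation of per-user used-resource-after-preemption.
public class ImbalanceSketch {

    static double stddev(int[] rp) {
        double mean = Arrays.stream(rp).average().orElse(0);
        double var = Arrays.stream(rp)
                .mapToDouble(r -> (r - mean) * (r - mean))
                .average().orElse(0);
        return Math.sqrt(var);
    }

    public static void main(String[] args) {
        // per-user usage after preemption, for users V, W, X, Y, Z
        System.out.println(stddev(new int[] {8, 8, 8, 24, 4}));    // in-order pass, ≈ 6.97
        System.out.println(stddev(new int[] {8, 16, 16, 8, 4}));   // most-usage-first, ≈ 4.8
        System.out.println(stddev(new int[] {12, 12, 12, 12, 4})); // ideal result, ≈ 3.2
    }
}
```

Lower is more balanced: the ideal outcome scores about 3.2, versus about 4.8 for most-usage-first and about 7 for the in-order pass, which matches the ranking argued in the comment.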
The pseudo code for preempting resources of a queue might look like:
{code}
compute resToObtain first;

// first preempt task containers
while (resToObtain > 0) {
  pick the user-x which has the most (used-resource - marked-preempted-resource)
  pick one container-y from user-x to preempt
  resToObtain -= container-y.resource
}
if (resToObtain <= 0) {
  return;
}

// if more resources need to be preempted, we should preempt AM containers
while (resToObtain > 0 &&
       total-am-resource - marked-preempted-am-resource > max-am-percentage) {
  // do the same thing again:
  pick the user-x which has the most (used-resource - marked-preempted-resource)
  pick one container-y from user-x to preempt
  resToObtain -= container-y.resource
}
{code}
With this, the imbalance is linearly correlated with average-container-size only, and the rules #2/#3 I mentioned before are satisfied as well. Mayank, do you think this looks like a reasonable suggestion? Any other thoughts? [~vinodkv], [~curino], [~sunilg]. Thanks, Wangda > CS queue level preemption should respect user-limits > > > Key: YARN-2069 > URL: https://issues.apache.org/jira/browse/YARN-2069 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Vinod Kumar Vavilapalli >Assignee: Mayank Bansal > Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, > YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch, > YARN-2069-trunk-6.pat
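The first phase of the proposed loop can be sketched concretely as below. This is an illustration only (names are hypothetical, container sizes are plain integers in GB, and the AM-container second phase and max-am-percentage check are not modeled):

```java
import java.util.*;

// Sketch of the proposed preemption loop: repeatedly recompute each user's
// remaining (not-yet-marked) usage and preempt one container from the user
// with the most, instead of precomputing a target-user-limit.
public class PreemptionSketch {

    // Marks containers for preemption until resToObtain is met;
    // returns user -> resource marked for preemption.
    static Map<String, Integer> preempt(Map<String, Deque<Integer>> users, int resToObtain) {
        Map<String, Integer> marked = new LinkedHashMap<>();
        while (resToObtain > 0) {
            // pick the user-x with the most remaining resource
            String x = Collections.max(users.keySet(), Comparator.comparingInt(
                    (String u) -> users.get(u).stream().mapToInt(Integer::intValue).sum()));
            Deque<Integer> cs = users.get(x);
            if (cs.isEmpty()) break;          // nothing left to preempt
            int c = cs.removeLast();          // mark one container-y of user-x
            marked.merge(x, c, Integer::sum);
            resToObtain -= c;
        }
        return marked;
    }

    public static void main(String[] args) {
        // The qA example: V/W/X have 4x4G, Y has 6x4G, Z has 1x4G; obtain 23G
        Map<String, Deque<Integer>> users = new LinkedHashMap<>();
        users.put("V", new ArrayDeque<>(Arrays.asList(4, 4, 4, 4)));
        users.put("W", new ArrayDeque<>(Arrays.asList(4, 4, 4, 4)));
        users.put("X", new ArrayDeque<>(Arrays.asList(4, 4, 4, 4)));
        users.put("Y", new ArrayDeque<>(Arrays.asList(4, 4, 4, 4, 4, 4)));
        users.put("Z", new ArrayDeque<>(Arrays.asList(4)));
        preempt(users, 23);
        users.forEach((u, cs) -> System.out.println(u + ": " + cs));
    }
}
```

On this input the loop marks three of Y's containers and one each from V, W and X, landing on the balanced outcome the comment calls ideal (V/W/X/Y each keep {4, 4, 4}, Z keeps {4}).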
[jira] [Commented] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common
[ https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074204#comment-14074204 ] Zhijie Shen commented on YARN-2347: --- Please ignore previous comment 1. I posted the wrong one. The right comment 1 I'd like to post: 1. The javadoc seems not to be correct after refactoring. {code} +/** + * The version information of RM state. + */ +@Private +@Unstable +public abstract class StateVersion { {code} > Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in > yarn-server-common > > > Key: YARN-2347 > URL: https://issues.apache.org/jira/browse/YARN-2347 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-2347-v2.patch, YARN-2347-v3.patch, YARN-2347.patch > > > We have similar things for version state for RM, NM, TS (TimelineServer), > etc. I think we should consolidate them into a common object. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common
[ https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074198#comment-14074198 ] Zhijie Shen commented on YARN-2347: --- [~djp], it's a good idea to refactor the code to make the common classes. The changes are straightforward, and almost look good to me. Just some minor comments. 1. Mark the class \@Private and \@Unstable? {code} +public class StateVersionPBImpl extends StateVersion { {code} 2. I'm not sure StateVersion is the best name in this case. For example, StateVersion for a db schema sounds weird to me. Why not YarnVersion or even Version? > Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in > yarn-server-common > > > Key: YARN-2347 > URL: https://issues.apache.org/jira/browse/YARN-2347 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-2347-v2.patch, YARN-2347-v3.patch, YARN-2347.patch > > > We have similar things for version state for RM, NM, TS (TimelineServer), > etc. I think we should consolidate them into a common object. -- This message was sent by Atlassian JIRA (v6.2#6252)
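For illustration of the consolidation being discussed — this is not the actual YARN-2347 patch, and the class and method names are hypothetical — a common version object along the lines of the "even Version" suggestion might look like:

```java
// Hypothetical sketch of a consolidated version class that RM, NM and
// TimelineServer state stores could share; methods are illustrative only.
public class VersionSketch {
    private final int major;
    private final int minor;

    public VersionSketch(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    public int getMajorVersion() { return major; }
    public int getMinorVersion() { return minor; }

    // One plausible compatibility rule: state written under the same major
    // version can be loaded, regardless of minor version.
    public boolean isCompatibleTo(VersionSketch loaded) {
        return major == loaded.major;
    }

    public static void main(String[] args) {
        VersionSketch current = new VersionSketch(1, 2);
        System.out.println(current.isCompatibleTo(new VersionSketch(1, 0))); // true
        System.out.println(current.isCompatibleTo(new VersionSketch(2, 0))); // false
    }
}
```

A single class like this would replace the parallel RMStateVersion and NMDBSchemaVersion types, with each store keeping only its own compatibility policy.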
[jira] [Commented] (YARN-641) Make AMLauncher in RM Use NMClient
[ https://issues.apache.org/jira/browse/YARN-641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074147#comment-14074147 ] Hadoop QA commented on YARN-641: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12587395/YARN-641.3.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4428//console This message is automatically generated. > Make AMLauncher in RM Use NMClient > -- > > Key: YARN-641 > URL: https://issues.apache.org/jira/browse/YARN-641 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-641.1.patch, YARN-641.2.patch, YARN-641.3.patch > > > YARN-422 adds NMClient. RM's AMLauncher is responsible for the interactions > with an application's AM container. AMLauncher should also replace the raw > ContainerManager proxy with NMClient. -- This message was sent by Atlassian JIRA (v6.2#6252)