[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057170#comment-14057170 ] Mayank Bansal commented on YARN-2069: - I just verified: I rebased the patch, compiled, and tested it. The patch doesn't seem to be the problem. Thanks, Mayank CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch This is different from (even if related to, and likely sharing code with) YARN-2113. YARN-2113 focuses on making sure that even if a queue has its guaranteed capacity, its individual users are treated in line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057171#comment-14057171 ] Hadoop QA commented on YARN-2088: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12646030/YARN-2088.v1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common: org.apache.hadoop.yarn.util.TestFSDownload {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4253//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4253//console This message is automatically generated. Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder Key: YARN-2088 URL: https://issues.apache.org/jira/browse/YARN-2088 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: YARN-2088.v1.patch Some fields (set, list) are added to proto builders multiple times; we need to clear those fields before adding, otherwise the resulting proto contains extra contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
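To illustrate the clear-before-add pattern this issue describes, below is a minimal, self-contained Java sketch. The class and field names are hypothetical stand-ins rather than the actual GetApplicationsRequestPBImpl code; the point is only that a repeated builder field must be cleared before the local values are re-added, otherwise every merge appends duplicates.
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a PBImpl keeping a local Java-side view of a repeated
// proto field and merging it into a builder-like list.
public class MergeLocalToBuilderSketch {

  // Stand-in for a generated protobuf builder's repeated field.
  private final List<String> builderApplicationTypes = new ArrayList<>();
  // Local (Java-side) copy of the same field.
  private final Set<String> localApplicationTypes = new HashSet<>();

  void mergeLocalToBuilder() {
    if (!localApplicationTypes.isEmpty()) {
      // The fix: clear the repeated field first; without this, calling
      // mergeLocalToBuilder() more than once appends the same values again.
      builderApplicationTypes.clear();
      builderApplicationTypes.addAll(localApplicationTypes);
    }
  }

  public static void main(String[] args) {
    MergeLocalToBuilderSketch impl = new MergeLocalToBuilderSketch();
    impl.localApplicationTypes.add("MAPREDUCE");
    impl.mergeLocalToBuilder();
    impl.mergeLocalToBuilder(); // without clear(), this call would duplicate entries
    System.out.println(impl.builderApplicationTypes); // [MAPREDUCE]
  }
}
{code}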
[jira] [Updated] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2181: - Attachment: YARN-2181.patch Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
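The logging requirements listed above can be made concrete with a small, hedged sketch: the class and message format below are illustrative only, not what the patch actually adds, but they show an INFO-level message that distinguishes AM from task containers and carries the container ID.
{code}
import java.util.logging.Logger;

// Illustrative only: an INFO-level preemption log per the requirements above.
public class PreemptionLogSketch {
  private static final Logger LOG = Logger.getLogger(PreemptionLogSketch.class.getName());

  static void logPreemption(String containerId, boolean isAMContainer) {
    // Distinguish AM container preemption from task container preemption and
    // include the container ID so the log is useful while the app is running.
    LOG.info(String.format("Preempting %s container %s",
        isAMContainer ? "AM" : "task", containerId));
  }

  public static void main(String[] args) {
    logPreemption("container_1380289782418_0003_01_000001", true);
    logPreemption("container_1380289782418_0003_01_000042", false);
  }
}
{code}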
[jira] [Commented] (YARN-2069) CS queue level preemption should respect user-limits
[ https://issues.apache.org/jira/browse/YARN-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057175#comment-14057175 ] Wangda Tan commented on YARN-2069: -- Hi Mayank, Can you re-kick Jenkins manually? Thanks, Wangda CS queue level preemption should respect user-limits Key: YARN-2069 URL: https://issues.apache.org/jira/browse/YARN-2069 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Vinod Kumar Vavilapalli Assignee: Mayank Bansal Attachments: YARN-2069-trunk-1.patch, YARN-2069-trunk-2.patch, YARN-2069-trunk-3.patch, YARN-2069-trunk-4.patch, YARN-2069-trunk-5.patch This is different from (even if related to, and likely sharing code with) YARN-2113. YARN-2113 focuses on making sure that even if a queue has its guaranteed capacity, its individual users are treated in line with their limits irrespective of when they join in. This JIRA is about respecting user-limits while preempting containers to balance queue capacities. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057200#comment-14057200 ] Mayank Bansal commented on YARN-1408: - Thanks [~sunilg] for the patch. The patch looks good. There are some minor comments: 1. Your current patch does not apply on trunk; please rebase it on trunk. 2. There are a lot of unwanted formatting changes; can you please revert them? Some examples are as follows {code} - .currentTimeMillis()); +.currentTimeMillis()); {code} {code} -RMContainer rmContainer = -new RMContainerImpl(container, attemptId, node.getNodeID(), - applications.get(attemptId.getApplicationId()).getUser(), rmContext, - status.getCreationTime()); +RMContainer rmContainer = new RMContainerImpl(container, attemptId, +node.getNodeID(), applications.get(attemptId.getApplicationId()) +.getUser(), rmContext, status.getCreationTime()); {code} Please check this throughout the patch. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capacity has been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057205#comment-14057205 ] Hadoop QA commented on YARN-2208: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654946/YARN-2208.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4254//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4254//console This message is automatically generated. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057217#comment-14057217 ] Hadoop QA commented on YARN-2181: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654947/YARN-2181.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4255//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4255//console This message is automatically generated. Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057219#comment-14057219 ] Xuan Gong commented on YARN-2208: - bq. org.apache.hadoop.yarn.util.TestFSDownload This is an unrelated test-case failure. bq. org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart This test passes successfully locally. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057302#comment-14057302 ] Binglin Chang commented on YARN-2088: - Hi [~jianhe] or [~djp], it looks like there are no more comments. Would you help get this committed? Thanks. Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder Key: YARN-2088 URL: https://issues.apache.org/jira/browse/YARN-2088 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: YARN-2088.v1.patch Some fields (set, list) are added to proto builders multiple times; we need to clear those fields before adding, otherwise the resulting proto contains extra contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches
[ https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057332#comment-14057332 ] Wangda Tan commented on YARN-2258: -- Hi [~nishan], According to the log you provided, I think this is a duplicate of YARN-1885. YARN-1885 is targeted for release in 2.5.0 (coming soon). I'll close this issue. Please reopen it if you encounter such problems after you upgrade to 2.5.0. Thanks, Wangda Aggregation of MR job logs failing when Resourcemanager switches Key: YARN-2258 URL: https://issues.apache.org/jira/browse/YARN-2258 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Affects Versions: 2.4.0 Reporter: Nishan Shetty 1.Install RM in HA mode 2.Run a job with more tasks 3.Induce RM switchover while job is in progress Observe that log aggregation fails for the job which is running when Resourcemanager switchover is induced. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches
[ https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan resolved YARN-2258. -- Resolution: Duplicate Assignee: Wangda Tan Aggregation of MR job logs failing when Resourcemanager switches Key: YARN-2258 URL: https://issues.apache.org/jira/browse/YARN-2258 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Wangda Tan 1.Install RM in HA mode 2.Run a job with more tasks 3.Induce RM switchover while job is in progress Observe that log aggregation fails for the job which is running when Resourcemanager switchover is induced. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2257) Add user to queue mappings to automatically place users' apps into specific queues
[ https://issues.apache.org/jira/browse/YARN-2257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057339#comment-14057339 ] Patrick Liu commented on YARN-2257: --- Hi, Vinod, I think we could inject the user-queue mapping judgement into the 'RMAppManager' method 'protected synchronized void submitApplication': // Sanity checks if (submissionContext.getQueue() == null) { submissionContext.setQueue(YarnConfiguration.DEFAULT_QUEUE_NAME); } if (submissionContext.getApplicationName() == null) { submissionContext.setApplicationName( YarnConfiguration.DEFAULT_APPLICATION_NAME); } All applications submitted to yarn will be launched by 'RMAppManager'. 'RMAppManager' will do sanity checks, create a 'RMAppImpl' instance, and finally send the 'new RMAppEvent(applicationId, RMAppEventType.START)' event. When 'RMAppImpl' receives the event, it will change the state machine and do the transition. The transition will launch the 'RMAppAttemptImpl', and start the 'RMAppAttemptImpl'. Then the app will be scheduled by the specific scheduler. The only thing we need to inject is the QUEUE in the submissionContext. Like this: // Precondition: set user-as-default-queue to false in yarn-site.xml if(QueuePlacementRule.hasMappingForUser(user)) { submissionContext.setQueue(QueuePlacementRule.getQueue(user)); } else { submissionContext.setQueue(Default); } Add user to queue mappings to automatically place users' apps into specific queues -- Key: YARN-2257 URL: https://issues.apache.org/jira/browse/YARN-2257 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Patrick Liu Assignee: Vinod Kumar Vavilapalli Labels: features Currently, the fair-scheduler supports two modes, default queue or individual queue for each user. Apparently, the default queue is not a good option, because the resources cannot be managed for each user or group. However, an individual queue for each user is not good enough either, especially when connecting yarn with hive. There will be an increasing number of Hive users in a corporate environment. If we create a queue for each user, the resource management will be hard to maintain. I think the problem can be solved like this: 1. Define user-queue mapping in Fair-Scheduler.xml. Inside each queue, use aclSubmitApps to control user's ability. 2. Each time a user submits an app to yarn, if the user is mapped to a queue, the app will be scheduled to that queue; otherwise, the app will be submitted to the default queue. 3. If the user cannot pass the aclSubmitApps limits, the app will not be accepted. -- This message was sent by Atlassian JIRA (v6.2#6252)
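Building on the pseudocode in the comment above, here is a self-contained, hedged sketch of the queue-resolution step; QueuePlacementSketch and its method names are illustrative assumptions, not the real RMAppManager or QueuePlacementRule APIs.
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: resolve the target queue for a submission before it
// reaches the scheduler, mirroring the injection point described in the comment.
public class QueuePlacementSketch {

  static final String DEFAULT_QUEUE = "default";

  // user -> queue mapping, e.g. derived from fair-scheduler.xml placement rules.
  private final Map<String, String> userQueueMapping = new HashMap<>();

  // Returns the queue the app should land in, assuming user-as-default-queue is off.
  String resolveQueue(String user, String requestedQueue) {
    String mapped = userQueueMapping.get(user);
    if (mapped != null) {
      return mapped;                // user has an explicit mapping
    }
    return (requestedQueue == null || requestedQueue.isEmpty())
        ? DEFAULT_QUEUE             // fall back to the shared default queue
        : requestedQueue;
  }

  public static void main(String[] args) {
    QueuePlacementSketch placement = new QueuePlacementSketch();
    placement.userQueueMapping.put("alice", "analytics");
    System.out.println(placement.resolveQueue("alice", null)); // analytics
    System.out.println(placement.resolveQueue("bob", null));   // default
  }
}
{code}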
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057352#comment-14057352 ] Hudson commented on YARN-2131: -- FAILURE: Integrated in Hadoop-Yarn-trunk #609 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/609/]) YARN-2131. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609278) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/NullRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/YarnCommands.apt.vm Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057354#comment-14057354 ] Hudson commented on YARN-1366: -- FAILURE: Integrated in Hadoop-Yarn-trunk #609 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/609/]) YARN-1366. Changed AMRMClient to re-register with RM and send outstanding requests back to RM on work-preserving RM restart. Contributed by Rohith (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609254) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/AMRMClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/async/impl/AMRMClientAsyncImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/AMRMClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/async/impl/TestAMRMClientAsync.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClientOnRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Fix For: 2.5.0 Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.12.patch, YARN-1366.13.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
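As a rough, hedged model of the resync protocol described in this issue (not the actual AMRMClientImpl code; all names below are illustrative), the sketch resets the allocate sequence number to 0 and returns the outstanding asks that must be re-sent after re-registering.
{code}
import java.util.ArrayList;
import java.util.List;

// Toy model of the resync behaviour; not real YARN client code.
public class ResyncSketch {

  private int responseId = 1;                       // allocate RPC sequence number
  private final List<String> outstandingAsks = new ArrayList<>();

  void addAsk(String ask) {
    outstandingAsks.add(ask);
  }

  // Called when the RM signals a resync (e.g. after a work-preserving restart):
  // reset the sequence number and return everything that must be re-sent. In the
  // real protocol the AM re-registers first; completed containers may be
  // reported more than once afterwards.
  List<String> onResync() {
    responseId = 0;
    return new ArrayList<>(outstandingAsks);
  }

  public static void main(String[] args) {
    ResyncSketch client = new ResyncSketch();
    client.addAsk("container: 2GB x 3 on rack1");
    System.out.println(client.onResync()); // [container: 2GB x 3 on rack1]
    System.out.println(client.responseId); // 0
  }
}
{code}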
[jira] [Commented] (YARN-2267) Auxiliary Service support in RM
[ https://issues.apache.org/jira/browse/YARN-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057415#comment-14057415 ] Naganarasimha G R commented on YARN-2267: - Some scenarios for supporting an Auxiliary Service in RM: Scenario 1: [Overload Control type] a. A monitor plugin inside RM can open an RPC port and receive feedback from other components in the cluster (NM, HBase etc.) b. Based on the feedback, the monitor plugin can take action such as removing a particular NM or changing the capacity of an NM etc. Scenario 2: [Alarming module] a. Any state changes such as RM moved to Standby/Active, NM added, NM removed/decommissioned etc. can easily be informed/reported to a central monitoring service [instead of the existing pull type through REST APIs]. b. This plugin should also be able to register with RM for critical state changes as mentioned above, so that these can be reported. Auxiliary Service support in RM --- Key: YARN-2267 URL: https://issues.apache.org/jira/browse/YARN-2267 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Naganarasimha G R Currently RM does not have a provision to run any Auxiliary services. For health/monitoring in RM, it's better to make a plugin mechanism in RM itself, similar to NM. -- This message was sent by Atlassian JIRA (v6.2#6252)
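To make the two scenarios above concrete, a hypothetical RM-side auxiliary-service hook might look like the interface below; none of these names exist in YARN, they are assumptions for illustration only.
{code}
// Hypothetical interface only; YARN has no such RM-side API in the versions discussed.
public interface RMAuxiliaryService {

  // Lifecycle: open RPC ports, start monitor threads, etc. (scenario 1a).
  void start();
  void stop();

  // Scenario 1b: act on feedback from the cluster, e.g. remove or resize a node.
  void onClusterFeedback(String nodeId, String feedback);

  // Scenario 2: critical RM/NM state changes pushed to a central monitoring service.
  void onHAStateChanged(boolean active);
  void onNodeAdded(String nodeId);
  void onNodeRemoved(String nodeId);
}
{code}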
[jira] [Created] (YARN-2272) UI issues in timeline server
Nishan Shetty created YARN-2272: --- Summary: UI issues in timeline server Key: YARN-2272 URL: https://issues.apache.org/jira/browse/YARN-2272 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Links to nodemanager is not working in timeline server -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2272) UI issues in timeline server
[ https://issues.apache.org/jira/browse/YARN-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty updated YARN-2272: Priority: Minor (was: Major) UI issues in timeline server Key: YARN-2272 URL: https://issues.apache.org/jira/browse/YARN-2272 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Priority: Minor Links to nodemanager is not working in timeline server -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2272) UI issues in timeline server
[ https://issues.apache.org/jira/browse/YARN-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057440#comment-14057440 ] Zhijie Shen commented on YARN-2272: --- [~nishan], thanks for reporting the issue. It has been documented before in YARN-1884. Please refer to that jira for why the link doesn't work. I agree we can close the current jira as a duplicate of YARN-1884. UI issues in timeline server Key: YARN-2272 URL: https://issues.apache.org/jira/browse/YARN-2272 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Priority: Minor Links to nodemanager is not working in timeline server -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2272) UI issues in timeline server
[ https://issues.apache.org/jira/browse/YARN-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057454#comment-14057454 ] Nishan Shetty commented on YARN-2272: - Thanks [~zjshen] for looking into the issue. I will close this as a duplicate of YARN-1884. UI issues in timeline server Key: YARN-2272 URL: https://issues.apache.org/jira/browse/YARN-2272 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Priority: Minor Links to nodemanager is not working in timeline server -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2272) UI issues in timeline server
[ https://issues.apache.org/jira/browse/YARN-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishan Shetty resolved YARN-2272. - Resolution: Duplicate UI issues in timeline server Key: YARN-2272 URL: https://issues.apache.org/jira/browse/YARN-2272 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.4.0 Reporter: Nishan Shetty Priority: Minor Links to nodemanager is not working in timeline server -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2258) Aggregation of MR job logs failing when Resourcemanager switches
[ https://issues.apache.org/jira/browse/YARN-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057460#comment-14057460 ] Nishan Shetty commented on YARN-2258: - Thanks [~vinodkv] and [~leftnoteasy] for looking into the issue Aggregation of MR job logs failing when Resourcemanager switches Key: YARN-2258 URL: https://issues.apache.org/jira/browse/YARN-2258 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager Affects Versions: 2.4.0 Reporter: Nishan Shetty Assignee: Wangda Tan 1.Install RM in HA mode 2.Run a job with more tasks 3.Induce RM switchover while job is in progress Observe that log aggregation fails for the job which is running when Resourcemanager switchover is induced. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057472#comment-14057472 ] Hudson commented on YARN-1366: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1800 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1800/]) YARN-1366. Changed AMRMClient to re-register with RM and send outstanding requests back to RM on work-preserving RM restart. Contributed by Rohith (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609254) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/AMRMClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/async/impl/AMRMClientAsyncImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/AMRMClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/async/impl/TestAMRMClientAsync.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClientOnRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Fix For: 2.5.0 Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.12.patch, YARN-1366.13.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple
[ https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2228: -- Attachment: YARN-2228.3.patch TimelineServer should load pseudo authentication filter when authentication = simple Key: YARN-2228 URL: https://issues.apache.org/jira/browse/YARN-2228 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2228.1.patch, YARN-2228.2.patch, YARN-2228.3.patch When kerberos authentication is not enabled, we should let the timeline server to work with pseudo authentication filter. In this way, the sever is able to detect the request user by checking user.name. On the other hand, timeline client should append user.name in un-secure case as well, such that ACLs can keep working in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple
[ https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057504#comment-14057504 ] Zhijie Shen commented on YARN-2228: --- Vinod, thanks for your review. Please check my responses below. bq. Not sure if we can rename this to be better, but if possible we should. yarn.timeline-service indicates which component the configurations are related to, and http.authentication. is meant to be as close as possible to the original hadoop.http.authentication.. Does that make sense? bq. After this patch, owner should never be empty, right? We can reject requests when we cannot figure out the submission user. Via TimelineClient, the owner is always set, no matter whether it is pseudo or kerberos authentication. However, users can choose to work around TimelineClient and post entities to the timeline server on top of the REST API directly. Personally, I prefer to accept the anonymous user, in case some users want to ignore security altogether. For example, when testing functionality, users may not want to append user.name= every time they compose a URL. bq. I am not able find the magic that is automatically putting the PseudoAuthFilter into the configuration. It also seems like TimelineAuthenticationFilterInitializer is always added irrespective of security. It is based on the agreement that ACLs need to work in insecure mode (i.e. type = simple) as well. Given this agreement, I always need to use TimelineAuthenticationFilterInitializer to load TimelineAuthenticationFilter, which will extract the user information from the request. When type = simple, the user information comes from the URL param. On the other hand, if we don't load the authentication filter in insecure mode, the timeline server is unable to know the user of a request. By default, the authentication type is simple, so the parent class of TimelineAuthenticationFilter (i.e., AuthenticationFilter) is going to load PseudoAuthenticationFilter. The magic is within AuthenticationFilter#init. bq. It doesn't seem like we had tests to validate delegationtoken based access to TimelineServer? The whole authentication part is lacking test cases. Given the work of HADOOP-10799, we may take advantage of the DT authentication stack in common, which will mitigate the problem, because the relevant test cases are promoted to common together with the DT authentication stack. After that, we can evaluate which UTs are missing for the timeline server scenario. For now, let's just file a ticket to track the UT work. What do you think? Please advise further on these points. For the remaining comments, I've addressed them in the newly uploaded patch. TimelineServer should load pseudo authentication filter when authentication = simple Key: YARN-2228 URL: https://issues.apache.org/jira/browse/YARN-2228 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2228.1.patch, YARN-2228.2.patch, YARN-2228.3.patch When kerberos authentication is not enabled, we should let the timeline server to work with pseudo authentication filter. In this way, the sever is able to detect the request user by checking user.name. On the other hand, timeline client should append user.name in un-secure case as well, such that ACLs can keep working in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
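To make the type = simple discussion concrete, here is a small, self-contained model of the dispatch that happens at filter init time; it mirrors the behaviour only conceptually and uses stand-in names rather than the real hadoop-auth AuthenticationFilter and handler classes.
{code}
// Conceptual model only: how an authentication filter could pick a handler based
// on the configured type, and how each handler derives the request user.
public class AuthFilterInitSketch {

  interface AuthHandler {
    // Returns the authenticated user, or null for an anonymous request.
    String authenticate(String userNameParam, String kerberosPrincipal);
  }

  static AuthHandler init(String authType) {
    if ("simple".equals(authType)) {
      // Pseudo/simple auth: trust the user.name query parameter; may be anonymous.
      return (param, principal) -> param;
    }
    // Otherwise assume kerberos: derive the user from the negotiated principal.
    return (param, principal) -> principal;
  }

  public static void main(String[] args) {
    System.out.println(init("simple").authenticate("zshen", null));               // zshen
    System.out.println(init("kerberos").authenticate(null, "zshen@EXAMPLE.COM")); // zshen@EXAMPLE.COM
  }
}
{code}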
[jira] [Commented] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057539#comment-14057539 ] Hudson commented on YARN-1366: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1827 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1827/]) YARN-1366. Changed AMRMClient to re-register with RM and send outstanding requests back to RM on work-preserving RM restart. Contributed by Rohith (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609254) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/AMRMClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/async/impl/AMRMClientAsyncImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/AMRMClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/async/impl/TestAMRMClientAsync.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClientOnRMRestart.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/resources/core-site.xml AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Fix For: 2.5.0 Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.12.patch, YARN-1366.13.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057537#comment-14057537 ] Hudson commented on YARN-2131: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1827 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1827/]) YARN-2131. Add a way to format the RMStateStore. (Robert Kanter via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609278) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/bin/yarn * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/NullRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/RMStateStoreTestBase.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestFSRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/YarnCommands.apt.vm Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple
[ https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057545#comment-14057545 ] Hadoop QA commented on YARN-2228: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12654990/YARN-2228.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice: org.apache.hadoop.yarn.util.TestFSDownload org.apache.hadoop.yarn.server.applicationhistoryservice.TestMemoryApplicationHistoryStore {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4256//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4256//console This message is automatically generated. TimelineServer should load pseudo authentication filter when authentication = simple Key: YARN-2228 URL: https://issues.apache.org/jira/browse/YARN-2228 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2228.1.patch, YARN-2228.2.patch, YARN-2228.3.patch When kerberos authentication is not enabled, we should let the timeline server to work with pseudo authentication filter. In this way, the sever is able to detect the request user by checking user.name. On the other hand, timeline client should append user.name in un-secure case as well, such that ACLs can keep working in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Potocnik updated YARN-1994: - Attachment: YARN-1994.4.patch Hi guys, I have attached a slightly updated version of the patch which incorporates your changes to the test and configuration. The only differences should be: - Added logic for the Timeline service - Put all JHS-related bind options under MR_HISTORY_BIND_HOST, instead of having 4 options - Did some minor code cleanup Thanks for reviewing and pushing this! Expose YARN/MR endpoints on multiple interfaces --- Key: YARN-1994 URL: https://issues.apache.org/jira/browse/YARN-1994 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Arpit Agarwal Assignee: Craig Welch Attachments: YARN-1994.0.patch, YARN-1994.1.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch YARN and MapReduce daemons currently do not support specifying a wildcard address for the server endpoints. This prevents the endpoints from being accessible from all interfaces on a multihomed machine. Note that if we do specify INADDR_ANY for any of the options, it will break clients as they will attempt to connect to 0.0.0.0. We need a solution that allows specifying a hostname or IP-address for clients while requesting wildcard bind for the servers. (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
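As a hedged illustration of the bind-host idea this patch series pursues (the helper and names below are assumptions, not the actual YARN/MR code), the server can bind to a wildcard address while clients keep resolving the advertised hostname.
{code}
import java.net.InetSocketAddress;

// Illustrative helper only; not the actual YARN/MR configuration code.
public class BindHostSketch {

  // Pick the address the *server* should bind to. Clients continue to use the
  // advertised hostname, so 0.0.0.0 never leaks into client configuration.
  static InetSocketAddress serverBindAddress(String bindHost, String advertisedHost, int port) {
    String host = (bindHost != null && !bindHost.isEmpty()) ? bindHost : advertisedHost;
    return new InetSocketAddress(host, port);
  }

  public static void main(String[] args) {
    System.out.println(serverBindAddress("0.0.0.0", "rm1.example.com", 8032)); // listen on all interfaces
    System.out.println(serverBindAddress(null, "rm1.example.com", 8032));      // listen on the named host
  }
}
{code}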
[jira] [Updated] (YARN-2228) TimelineServer should load pseudo authentication filter when authentication = simple
[ https://issues.apache.org/jira/browse/YARN-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2228: -- Attachment: YARN-2228.4.patch Relaxed the criterion for TestMemoryApplicationHistoryStore; the other test failure seems to be transient and not related. TimelineServer should load pseudo authentication filter when authentication = simple Key: YARN-2228 URL: https://issues.apache.org/jira/browse/YARN-2228 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2228.1.patch, YARN-2228.2.patch, YARN-2228.3.patch, YARN-2228.4.patch When kerberos authentication is not enabled, we should let the timeline server to work with pseudo authentication filter. In this way, the sever is able to detect the request user by checking user.name. On the other hand, timeline client should append user.name in un-secure case as well, such that ACLs can keep working in this case. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-1408: -- Attachment: Yarn-1408.8.patch Thank you [~mayank_bansal] for the review. I have updated patch against trunk and fixed formatting problems. Kindly check. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057605#comment-14057605 ] Hadoop QA commented on YARN-1408: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655007/Yarn-1408.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4259//console This message is automatically generated. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-1408: -- Attachment: (was: Yarn-1408.8.patch) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G updated YARN-1408: -- Attachment: Yarn-1408.8.patch Reattaching patch again as there was a test case problem. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2181: -- Attachment: YARN-2181.patch patch looks good overall, removed unused RMAppAttempt#isPreempted method. and did few code refactor in RMContainerImpl#updatePreemptionMetrics. Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
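The RMContainerImpl#updatePreemptionMetrics refactor mentioned above tallies preemption per application attempt. A minimal, self-contained sketch of that idea — distinguishing AM-container preemption from task-container preemption, as the issue description asks — using illustrative class and field names rather than the actual Hadoop code:
{code}
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: per-attempt preemption counters similar in spirit
// to what the preemption-metrics change tracks for the RM web UI and logs.
public class PreemptionMetricsSketch {

  static class AttemptPreemptionMetrics {
    private final AtomicInteger preemptedAMContainers = new AtomicInteger();
    private final AtomicInteger preemptedTaskContainers = new AtomicInteger();

    void containerPreempted(boolean isAMContainer, String containerId) {
      if (isAMContainer) {
        preemptedAMContainers.incrementAndGet();
        System.out.println("AM container " + containerId + " was preempted");
      } else {
        preemptedTaskContainers.incrementAndGet();
        System.out.println("Task container " + containerId + " was preempted");
      }
    }

    int getPreemptedAMContainers() { return preemptedAMContainers.get(); }
    int getPreemptedTaskContainers() { return preemptedTaskContainers.get(); }
  }

  public static void main(String[] args) {
    AttemptPreemptionMetrics metrics = new AttemptPreemptionMetrics();
    metrics.containerPreempted(true, "container_0001");   // AM container
    metrics.containerPreempted(false, "container_0004");  // task container
    System.out.println("AM preempted: " + metrics.getPreemptedAMContainers()
        + ", task preempted: " + metrics.getPreemptedTaskContainers());
  }
}
{code}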
[jira] [Commented] (YARN-1408) Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins
[ https://issues.apache.org/jira/browse/YARN-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057764#comment-14057764 ] Hadoop QA commented on YARN-1408: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655031/Yarn-1408.8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4260//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4260//console This message is automatically generated. Preemption caused Invalid State Event: ACQUIRED at KILLED and caused a task timeout for 30mins -- Key: YARN-1408 URL: https://issues.apache.org/jira/browse/YARN-1408 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.2.0 Reporter: Sunil G Assignee: Sunil G Attachments: Yarn-1408.1.patch, Yarn-1408.2.patch, Yarn-1408.3.patch, Yarn-1408.4.patch, Yarn-1408.5.patch, Yarn-1408.6.patch, Yarn-1408.7.patch, Yarn-1408.8.patch, Yarn-1408.patch Capacity preemption is enabled as follows. * yarn.resourcemanager.scheduler.monitor.enable= true , * yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy Queue = a,b Capacity of Queue A = 80% Capacity of Queue B = 20% Step 1: Assign a big jobA on queue a which uses full cluster capacity Step 2: Submitted a jobB to queue b which would use less than 20% of cluster capacity JobA task which uses queue b capcity is been preempted and killed. This caused below problem: 1. New Container has got allocated for jobA in Queue A as per node update from an NM. 2. This container has been preempted immediately as per preemption. Here ACQUIRED at KILLED Invalid State exception came when the next AM heartbeat reached RM. ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: ACQUIRED at KILLED This also caused the Task to go for a timeout for 30minutes as this Container was already killed by preemption. attempt_1380289782418_0003_m_00_0 Timed out after 1800 secs -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057787#comment-14057787 ] Jian He commented on YARN-2208: --- looks good overall, 1. Maybe change passwords to use a ConcurrentHashMap and use a read/write lock to guard nextMasterKey/currentMasterKey for better concurrency, as this is a chatty class. 2. Put in the same line {code} + " ms and AMRMTokenKeyActivationDelay: " + this.activationDelay + " ms"); } else if (identifier.getKeyId() == this.currentMasterKey.getMasterKey() {code} 3. Info level for easier debugging, while stabilizing this feature. {code} if (LOG.isDebugEnabled()) { LOG.debug("Activating next master key with id: " + this.nextMasterKey.getMasterKey().getKeyId()); } {code} 4. createAndGetAMRMToken, add an info log here also. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
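Point 1 in the review above suggests a ConcurrentHashMap for the password map plus a read/write lock around the current and next master keys. A minimal sketch of that locking pattern, with hypothetical field names standing in for the real AMRMTokenSecretManager state:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the suggested pattern: a concurrent map for per-attempt passwords,
// plus a read/write lock guarding the current/next master key pair.
public class MasterKeyRollSketch {

  private final Map<String, byte[]> passwords = new ConcurrentHashMap<>();
  private final ReadWriteLock keyLock = new ReentrantReadWriteLock();

  private int currentKeyId = 1;      // stands in for currentMasterKey
  private Integer nextKeyId = null;  // stands in for nextMasterKey

  // Frequent read path: many AM heartbeats look up passwords concurrently.
  public byte[] retrievePassword(String attemptId) {
    keyLock.readLock().lock();
    try {
      return passwords.get(attemptId);
    } finally {
      keyLock.readLock().unlock();
    }
  }

  // Rare write path: roll in a new master key.
  public void rollMasterKey(int newKeyId) {
    keyLock.writeLock().lock();
    try {
      nextKeyId = newKeyId;
    } finally {
      keyLock.writeLock().unlock();
    }
  }

  // Activate the next key once the activation delay has elapsed.
  public void activateNextMasterKey() {
    keyLock.writeLock().lock();
    try {
      if (nextKeyId != null) {
        currentKeyId = nextKeyId;
        nextKeyId = null;
      }
    } finally {
      keyLock.writeLock().unlock();
    }
  }
}
{code}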
[jira] [Created] (YARN-2273) Flapping node caused NPE in FairScheduler
Andy Skelton created YARN-2273: -- Summary: Flapping node caused NPE in FairScheduler Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few minutes later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057816#comment-14057816 ] Hadoop QA commented on YARN-2181: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655043/YARN-2181.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4261//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4261//console This message is automatically generated. Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057840#comment-14057840 ] Jian He commented on YARN-2181: --- committing this Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Skelton updated YARN-2273: --- Description: One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. was: One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. 
java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few minutes later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. Summary: NPE in ContinuousScheduling Thread crippled RM after DN flap (was: Flapping node caused NPE in FairScheduler) NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0 Environment: cdh5.0.2
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057893#comment-14057893 ] Robert Kanter commented on YARN-2131: - Makes sense to me. I'll do an addendum patch to rename the command and use multi operation. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057890#comment-14057890 ] Hudson commented on YARN-2181: -- FAILURE: Integrated in Hadoop-trunk-Commit #5861 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5861/]) YARN-2181. Added preemption info to logs and RM web UI. Contributed by Wangda Tan (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609561) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttempt.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppBlock.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/MockAsm.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/MockRMApp.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestLeafQueue.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairSchedulerTestBase.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.5.0 Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container
[jira] [Commented] (YARN-2273) NPE in ContinuousScheduling Thread crippled RM after DN flap
[ https://issues.apache.org/jira/browse/YARN-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14057930#comment-14057930 ] Wei Yan commented on YARN-2273: --- Thanks for the catch, [~skeltoac]. A quick guess is that the NodeAvailableResourceComparator doesn't check whether the node is alive when does comparison. A node may be removed during the sorting process. I'll re-check it. NPE in ContinuousScheduling Thread crippled RM after DN flap Key: YARN-2273 URL: https://issues.apache.org/jira/browse/YARN-2273 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, resourcemanager Affects Versions: 2.3.0 Environment: cdh5.0.2 wheezy Reporter: Andy Skelton One DN experienced memory errors and entered a cycle of rebooting and rejoining the cluster. After the second time the node went away, the RM produced this: {code} 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1404858438119_4352_01 released container container_1404858438119_4352_01_04 on node: host: node-A16-R09-19.hadoop.dfw.wordpress.com:8041 #containers=0 available=memory:8192, vCores:8 used=memory:0, vCores:0 with event: KILL 2014-07-09 21:47:36,571 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Removed node node-A16-R09-19.hadoop.dfw.wordpress.com:8041 cluster capacity: memory:335872, vCores:328 2014-07-09 21:47:36,571 ERROR org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContinuousScheduling,5,main] threw an Exception. java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1044) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$NodeAvailableResourceComparator.compare(FairScheduler.java:1040) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:329) at java.util.TimSort.sort(TimSort.java:203) at java.util.TimSort.sort(TimSort.java:173) at java.util.Arrays.sort(Arrays.java:659) at java.util.Collections.sort(Collections.java:217) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousScheduling(FairScheduler.java:1012) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.access$600(FairScheduler.java:124) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$2.run(FairScheduler.java:1306) at java.lang.Thread.run(Thread.java:744) {code} A few cycles later YARN was crippled. The RM was running and jobs could be submitted but containers were not assigned and no progress was made. Restarting the RM resolved it. -- This message was sent by Atlassian JIRA (v6.2#6252)
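If that guess is right — a node is removed from the scheduler's node map while the ContinuousScheduling thread is sorting — one defensive pattern is to sort a snapshot of the node ids and have the comparator tolerate an id that no longer maps to a live node. The sketch below uses stand-in types and is not the actual FairScheduler code:
{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in types: a node id string and the per-node view the scheduler keeps.
public class NodeSortSketch {

  static class NodeInfo {
    final int availableMemoryMB;
    NodeInfo(int availableMemoryMB) { this.availableMemoryMB = availableMemoryMB; }
  }

  // Nodes can be removed concurrently by the RM event thread (e.g. a flapping DN).
  private final Map<String, NodeInfo> nodes = new ConcurrentHashMap<>();

  // Comparator that tolerates a node disappearing mid-sort instead of throwing NPE.
  private final Comparator<String> byAvailableResources = (a, b) -> {
    NodeInfo na = nodes.get(a);
    NodeInfo nb = nodes.get(b);
    int availA = (na == null) ? 0 : na.availableMemoryMB; // removed node counts as empty
    int availB = (nb == null) ? 0 : nb.availableMemoryMB;
    return Integer.compare(availB, availA); // most available resources first
  };

  // Continuous-scheduling style loop body: snapshot the ids, then sort the snapshot.
  List<String> sortedNodeIds() {
    List<String> ids = new ArrayList<>(nodes.keySet()); // snapshot, not a live view
    Collections.sort(ids, byAvailableResources);
    return ids;
  }
}
{code}
Whether the scheduler should instead skip removed nodes entirely or synchronize with node removal is a separate design choice; the point is only that the comparator must not assume every id still maps to a live node.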
[jira] [Created] (YARN-2274) FairScheduler: Add debug information about cluster capacity, availability and reservations
Karthik Kambatla created YARN-2274: -- Summary: FairScheduler: Add debug information about cluster capacity, availability and reservations Key: YARN-2274 URL: https://issues.apache.org/jira/browse/YARN-2274 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla FairScheduler logs have little information on cluster capacity and availability. Need this information to debug production issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-2131: Attachment: YARN-2131_addendum.patch The addendum patch renames the command. However, I was looking into making the ZK change, and I'm not sure it makes sense to do that. To build up the list of delete Ops, we need to get all of the children, and there's no get _all_ children call; so we have to recursively do this ourselves. And we can't use another list of Ops for this because it's a discovery operation. That is, if the structure looks like this: {noformat} - A | - B | - C {noformat} given that we start off only knowing A, we can't know that C exists until we know that B exists; and these each require a call to ZK. Because we already have to recursively call ZK to discover the nodes to delete, we may as well delete them at the same time, right? Also, I agree with Karthik's earlier comment that it would be good to eventually replace all of the ZooKeeper code with Curator code. It handles most if not all of the connection stuff, provides useful convenience methods, and implements a lot of useful recipes (e.g. leader latch, locks, etc). We've been using Curator extensively for Oozie HA. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
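For reference, the recursive discover-and-delete walk described above looks roughly like the following against the plain ZooKeeper client; the connect string and root path are placeholders, and Curator's delete().deletingChildrenIfNeeded() wraps essentially the same traversal:
{code}
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch of recursively deleting a znode subtree: children must be discovered
// level by level, so each node is deleted as soon as its subtree has been walked.
public class ZkRecursiveDelete {

  static void deleteRecursively(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    List<String> children = zk.getChildren(path, false);
    for (String child : children) {
      deleteRecursively(zk, path + "/" + child);
    }
    zk.delete(path, -1); // -1 means "any version"
  }

  public static void main(String[] args) throws Exception {
    // Placeholder connect string and root path, purely for illustration.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
    try {
      deleteRecursively(zk, "/rmstore");
    } finally {
      zk.close();
    }
  }
}
{code}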
[jira] [Updated] (YARN-2274) FairScheduler: Add debug information about cluster capacity, availability and reservations
[ https://issues.apache.org/jira/browse/YARN-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2274: --- Priority: Trivial (was: Major) FairScheduler: Add debug information about cluster capacity, availability and reservations -- Key: YARN-2274 URL: https://issues.apache.org/jira/browse/YARN-2274 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial FairScheduler logs have little information on cluster capacity and availability. Need this information to debug production issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2274) FairScheduler: Add debug information about cluster capacity, availability and reservations
[ https://issues.apache.org/jira/browse/YARN-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2274: --- Issue Type: Improvement (was: Bug) FairScheduler: Add debug information about cluster capacity, availability and reservations -- Key: YARN-2274 URL: https://issues.apache.org/jira/browse/YARN-2274 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial FairScheduler logs have little information on cluster capacity and availability. Need this information to debug production issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2274) FairScheduler: Add debug information about cluster capacity, availability and reservations
[ https://issues.apache.org/jira/browse/YARN-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2274: --- Attachment: yarn-2274-1.patch Reviewers - please feel free to suggest logging any other basic information. FairScheduler: Add debug information about cluster capacity, availability and reservations -- Key: YARN-2274 URL: https://issues.apache.org/jira/browse/YARN-2274 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Attachments: yarn-2274-1.patch FairScheduler logs have little information on cluster capacity and availability. Need this information to debug production issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058009#comment-14058009 ] Sandy Ryza commented on YARN-2026: -- I think Ashwin makes a good point. I think displaying both is reasonable if we present it in a careful way. For example, it might make sense to add tooltips that explain the difference. Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios -- Key: YARN-2026 URL: https://issues.apache.org/jira/browse/YARN-2026 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Ashwin Shankar Assignee: Ashwin Shankar Labels: scheduler Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt Problem1- While using hierarchical queues in fair scheduler,there are few scenarios where we have seen a leaf queue with least fair share can take majority of the cluster and starve a sibling parent queue which has greater weight/fair share and preemption doesn’t kick in to reclaim resources. The root cause seems to be that fair share of a parent queue is distributed to all its children irrespective of whether its an active or an inactive(no apps running) queue. Preemption based on fair share kicks in only if the usage of a queue is less than 50% of its fair share and if it has demands greater than that. When there are many queues under a parent queue(with high fair share),the child queue’s fair share becomes really low. As a result when only few of these child queues have apps running,they reach their *tiny* fair share quickly and preemption doesn’t happen even if other leaf queues(non-sibling) are hogging the cluster. This can be solved by dividing fair share of parent queue only to active child queues. Here is an example describing the problem and proposed solution: root.lowPriorityQueue is a leaf queue with weight 2 root.HighPriorityQueue is parent queue with weight 8 root.HighPriorityQueue has 10 child leaf queues : root.HighPriorityQueue.childQ(1..10) Above config,results in root.HighPriorityQueue having 80% fair share and each of its ten child queue would have 8% fair share. Preemption would happen only if the child queue is 4% (0.5*8=4). Lets say at the moment no apps are running in any of the root.HighPriorityQueue.childQ(1..10) and few apps are running in root.lowPriorityQueue which is taking up 95% of the cluster. Up till this point,the behavior of FS is correct. Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% of the cluster. It would get only the available 5% in the cluster and preemption wouldn't kick in since its above 4%(half fair share).This is bad considering childQ1 is under a highPriority parent queue which has *80% fair share*. Until root.lowPriorityQueue starts relinquishing containers,we would see the following allocation on the scheduler page: *root.lowPriorityQueue = 95%* *root.HighPriorityQueue.childQ1=5%* This can be solved by distributing a parent’s fair share only to active queues. So in the example above,since childQ1 is the only active queue under root.HighPriorityQueue, it would get all its parent’s fair share i.e. 80%. This would cause preemption to reclaim the 30% needed by childQ1 from root.lowPriorityQueue after fairSharePreemptionTimeout seconds. Problem2 - Also note that similar situation can happen between root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2,if childQ2 hogs the cluster. childQ2 can take up 95% cluster and childQ1 would be stuck at 5%,until childQ2 starts relinquishing containers. 
We would like each of childQ1 and childQ2 to get half of root.HighPriorityQueue fair share ie 40%,which would ensure childQ1 gets upto 40% resource if needed through preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
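The arithmetic in the example above can be made concrete with a small sketch: the 80% parent fair share divided over all 10 children versus over only the active one, together with the half-fair-share preemption threshold. The numbers follow the description; everything else is illustrative:
{code}
// Worked version of the example: fair share of root.HighPriorityQueue (80%)
// divided over all 10 children vs. over only the active ones.
public class FairShareSketch {

  static double childFairShare(double parentShare, int childCount) {
    return parentShare / childCount;
  }

  public static void main(String[] args) {
    double highPriorityQueueShare = 0.80; // weight 8 out of a total weight of 10
    int allChildren = 10;
    int activeChildren = 1;               // only childQ1 has apps running

    double perChildAll = childFairShare(highPriorityQueueShare, allChildren);
    double perChildActive = childFairShare(highPriorityQueueShare, activeChildren);

    // Fair-share preemption kicks in only below half of the fair share.
    System.out.printf("all children:    fair share = %.0f%%, preemption threshold = %.0f%%%n",
        perChildAll * 100, perChildAll * 50);       // 8% and 4%
    System.out.printf("active children: fair share = %.0f%%, preemption threshold = %.0f%%%n",
        perChildActive * 100, perChildActive * 50); // 80% and 40%
  }
}
{code}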
[jira] [Commented] (YARN-2204) TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058050#comment-14058050 ] Hudson commented on YARN-2204: -- FAILURE: Integrated in Hadoop-trunk-Commit #5862 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5862/]) YARN-2224. Fix CHANGES.txt. This was committed as YARN-2204 before. (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609582) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler --- Key: YARN-2204 URL: https://issues.apache.org/jira/browse/YARN-2204 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Robert Kanter Assignee: Robert Kanter Priority: Trivial Fix For: 2.5.0 Attachments: YARN-2204.patch, YARN-2204_addendum.patch, YARN-2204_addendum.patch TestAMRestart#testAMRestartWithExistingContainers assumes CapacityScheduler -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2224) Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow
[ https://issues.apache.org/jira/browse/YARN-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058049#comment-14058049 ] Hudson commented on YARN-2224: -- FAILURE: Integrated in Hadoop-trunk-Commit #5862 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5862/]) YARN-2224. Fix CHANGES.txt. This was committed as YARN-2204 before. (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609582) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Explicitly enable vmem check in TestContainersMonitor#testContainerKillOnMemoryOverflow --- Key: YARN-2224 URL: https://issues.apache.org/jira/browse/YARN-2224 Project: Hadoop YARN Issue Type: Test Components: nodemanager Affects Versions: 2.4.1 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Trivial Labels: newbie Fix For: 2.5.0 Attachments: YARN-2224.patch If the default setting DEFAULT_NM_VMEM_CHECK_ENABLED is set to false the test will fail. Make the test pass not rely on the default settings but just let it verify that once the setting is turned on it actually does the memory check. See YARN-2225 which suggests we turn the default off. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058056#comment-14058056 ] Jian He commented on YARN-2088: --- patch looks good, committing Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder Key: YARN-2088 URL: https://issues.apache.org/jira/browse/YARN-2088 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Attachments: YARN-2088.v1.patch Some fields(set,list) are added to proto builders many times, we need to clear those fields before add, otherwise the result proto contains more contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2274) FairScheduler: Add debug information about cluster capacity, availability and reservations
[ https://issues.apache.org/jira/browse/YARN-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058066#comment-14058066 ] Hadoop QA commented on YARN-2274: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655089/yarn-2274-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4262//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4262//console This message is automatically generated. FairScheduler: Add debug information about cluster capacity, availability and reservations -- Key: YARN-2274 URL: https://issues.apache.org/jira/browse/YARN-2274 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Attachments: yarn-2274-1.patch FairScheduler logs have little information on cluster capacity and availability. Need this information to debug production issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2208: Attachment: YARN-2208.7.patch AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058072#comment-14058072 ] Hudson commented on YARN-2088: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5863 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5863/]) YARN-2088. Fixed a bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder. Contributed by Binglin Chang (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609584) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/GetApplicationsRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/api/TestGetApplicationsRequest.java Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder Key: YARN-2088 URL: https://issues.apache.org/jira/browse/YARN-2088 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.5.0 Attachments: YARN-2088.v1.patch Some fields(set,list) are added to proto builders many times, we need to clear those fields before add, otherwise the result proto contains more contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058068#comment-14058068 ] Jian He commented on YARN-2131: --- bq. there's no get all children call; I see. I haven't used Curator myself; it makes sense to adopt it if it makes the code cleaner and simpler. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058069#comment-14058069 ] Xuan Gong commented on YARN-2208: - bq. 1. Maybe change passwords to use concurrentHashMap and use read/write lock guard nextMasterKey/currentMasterKey for better concurrency as this is a chatty class. DONE bq. 2. Put In the same line DONE bq. 3. Info level for easier debugging, while stabilizing this feature. DONE bq. 4. createAndGetAMRMToken, add info log here also. ADDED AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2244: Attachment: YARN-2244.002.patch The build seemed to fail on HDFS native generation, which is unrelated to this patch. Uploading the same patch again to retrigger the build. FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2274) FairScheduler: Add debug information about cluster capacity, availability and reservations
[ https://issues.apache.org/jira/browse/YARN-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058085#comment-14058085 ] Sandy Ryza commented on YARN-2274: -- Demanded resources could also be a useful statistic to report. The update thread typically runs twice every second, so it might make sense to log only every 5th update or so to avoid a flood of messages. FairScheduler: Add debug information about cluster capacity, availability and reservations -- Key: YARN-2274 URL: https://issues.apache.org/jira/browse/YARN-2274 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Attachments: yarn-2274-1.patch FairScheduler logs have little information on cluster capacity and availability. Need this information to debug production issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
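A minimal sketch of the "log only every Nth update" suggestion above, with the interval and the logged fields chosen arbitrarily for illustration rather than taken from the yarn-2274 patch:
{code}
// Sketch of rate-limiting a per-update debug line: the update thread calls
// update() on every pass, but only every Nth call actually emits the message.
public class ThrottledUpdateLogSketch {

  private static final int LOG_EVERY_N_UPDATES = 5; // placeholder interval
  private long updateCount = 0;

  void update(long clusterMemoryMB, long availableMemoryMB, long reservedMemoryMB) {
    // ... normal scheduler update work would happen here ...
    if (++updateCount % LOG_EVERY_N_UPDATES == 0) {
      System.out.println("Cluster capacity: " + clusterMemoryMB + " MB"
          + ", available: " + availableMemoryMB + " MB"
          + ", reserved: " + reservedMemoryMB + " MB");
    }
  }
}
{code}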
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058086#comment-14058086 ] Wangda Tan commented on YARN-2181: -- Thanks Jian and Vinod for review and commit! Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.5.0 Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2269) External links need to be removed from YARN UI
[ https://issues.apache.org/jira/browse/YARN-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2269: Assignee: Craig Welch External links need to be removed from YARN UI -- Key: YARN-2269 URL: https://issues.apache.org/jira/browse/YARN-2269 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Assignee: Craig Welch Labels: security Attachments: YARN-2269.0.patch Accessing external link from YARN UI can disclose delegation parameter to 3rd party in secure cluster. Thus, All external links must be deleted from Yarn Web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2269) External links need to be removed from YARN UI
[ https://issues.apache.org/jira/browse/YARN-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058114#comment-14058114 ] Xuan Gong commented on YARN-2269: - +1 LGTM. Committed to trunk and branch-2. Thanks Craig ! External links need to be removed from YARN UI -- Key: YARN-2269 URL: https://issues.apache.org/jira/browse/YARN-2269 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Assignee: Craig Welch Labels: security Fix For: 2.5.0 Attachments: YARN-2269.0.patch Accessing external link from YARN UI can disclose delegation parameter to 3rd party in secure cluster. Thus, All external links must be deleted from Yarn Web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2269) External links need to be removed from YARN UI
[ https://issues.apache.org/jira/browse/YARN-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058120#comment-14058120 ] Hudson commented on YARN-2269: -- FAILURE: Integrated in Hadoop-trunk-Commit #5865 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5865/]) YARN-2269. Remove external links from YARN UI. Contributed by Craig Welch (xgong: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1609590) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/view/FooterBlock.java External links need to be removed from YARN UI -- Key: YARN-2269 URL: https://issues.apache.org/jira/browse/YARN-2269 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Assignee: Craig Welch Labels: security Fix For: 2.5.0 Attachments: YARN-2269.0.patch Accessing external link from YARN UI can disclose delegation parameter to 3rd party in secure cluster. Thus, All external links must be deleted from Yarn Web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058123#comment-14058123 ] Hadoop QA commented on YARN-2208: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655103/YARN-2208.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestFSDownload {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4263//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4263//console This message is automatically generated. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1366) AM should implement Resync with the ApplicationMasterService instead of shutting down
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1366: -- Target Version/s: 2.6.0 (was: 2.5.0) AM should implement Resync with the ApplicationMasterService instead of shutting down - Key: YARN-1366 URL: https://issues.apache.org/jira/browse/YARN-1366 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Rohith Fix For: 2.5.0 Attachments: YARN-1366.1.patch, YARN-1366.10.patch, YARN-1366.11.patch, YARN-1366.12.patch, YARN-1366.13.patch, YARN-1366.2.patch, YARN-1366.3.patch, YARN-1366.4.patch, YARN-1366.5.patch, YARN-1366.6.patch, YARN-1366.7.patch, YARN-1366.8.patch, YARN-1366.9.patch, YARN-1366.patch, YARN-1366.prototype.patch, YARN-1366.prototype.patch The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to calling resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2088: -- Fix Version/s: (was: 2.5.0) 2.6.0 Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder Key: YARN-2088 URL: https://issues.apache.org/jira/browse/YARN-2088 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.6.0 Attachments: YARN-2088.v1.patch Some fields(set,list) are added to proto builders many times, we need to clear those fields before add, otherwise the result proto contains more contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
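A hedged illustration of the clear-before-add pattern the description calls for. This is a toy builder, not the actual GetApplicationsRequestPBImpl or generated protobuf code, but repeated protobuf fields behave the same way when mergeLocalToBuilder runs more than once.
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy builder that mimics a protobuf builder's addAll semantics on a repeated field. */
class RequestBuilder {
  private final List<String> applicationTypes = new ArrayList<>();
  void addAllApplicationTypes(Set<String> types) { applicationTypes.addAll(types); }
  void clearApplicationTypes() { applicationTypes.clear(); }
  List<String> build() { return new ArrayList<>(applicationTypes); }
}

public class MergeLocalToBuilderDemo {
  private static final RequestBuilder builder = new RequestBuilder();
  private static final Set<String> localTypes = new HashSet<>(Set.of("MAPREDUCE", "SPARK"));

  /** Buggy version: repeated merges keep appending, so the built message grows. */
  static void mergeWithoutClear() {
    builder.addAllApplicationTypes(localTypes);
  }

  /** Fixed version: clear the repeated field before re-adding the local view. */
  static void mergeWithClear() {
    builder.clearApplicationTypes();
    builder.addAllApplicationTypes(localTypes);
  }

  public static void main(String[] args) {
    mergeWithoutClear();
    mergeWithoutClear();
    System.out.println(builder.build().size()); // 4: duplicated contents
    mergeWithClear();
    System.out.println(builder.build().size()); // 2: the merge is now idempotent
  }
}
{code}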
[jira] [Commented] (YARN-2208) AMRMTokenManager need to have a way to roll over AMRMToken
[ https://issues.apache.org/jira/browse/YARN-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058124#comment-14058124 ] Xuan Gong commented on YARN-2208: - The test case failure, org.apache.hadoop.yarn.util.TestFSDownload, is unrelated. AMRMTokenManager need to have a way to roll over AMRMToken -- Key: YARN-2208 URL: https://issues.apache.org/jira/browse/YARN-2208 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2208.1.patch, YARN-2208.2.patch, YARN-2208.3.patch, YARN-2208.4.patch, YARN-2208.5.patch, YARN-2208.5.patch, YARN-2208.6.patch, YARN-2208.7.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2088) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder
[ https://issues.apache.org/jira/browse/YARN-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2088: -- Target Version/s: 2.6.0 (was: 2.5.0) Fix code bug in GetApplicationsRequestPBImpl#mergeLocalToBuilder Key: YARN-2088 URL: https://issues.apache.org/jira/browse/YARN-2088 Project: Hadoop YARN Issue Type: Bug Reporter: Binglin Chang Assignee: Binglin Chang Fix For: 2.6.0 Attachments: YARN-2088.v1.patch Some fields(set,list) are added to proto builders many times, we need to clear those fields before add, otherwise the result proto contains more contents. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2269) External links need to be removed from YARN UI
[ https://issues.apache.org/jira/browse/YARN-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2269: Fix Version/s: (was: 2.5.0) 2.6.0 External links need to be removed from YARN UI -- Key: YARN-2269 URL: https://issues.apache.org/jira/browse/YARN-2269 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Assignee: Craig Welch Labels: security Fix For: 2.6.0 Attachments: YARN-2269.0.patch Accessing an external link from the YARN UI can disclose the delegation parameter to a 3rd party in a secure cluster. Thus, all external links must be removed from the YARN web UI. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2244) FairScheduler missing handling of containers for unknown application attempts
[ https://issues.apache.org/jira/browse/YARN-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058128#comment-14058128 ] Hadoop QA commented on YARN-2244: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655104/YARN-2244.002.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4264//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4264//console This message is automatically generated. FairScheduler missing handling of containers for unknown application attempts -- Key: YARN-2244 URL: https://issues.apache.org/jira/browse/YARN-2244 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical Attachments: YARN-2224.patch, YARN-2244.001.patch, YARN-2244.002.patch We are missing changes in patch MAPREDUCE-3596 in FairScheduler. Among other fixes that were common across schedulers, there were some scheduler specific fixes added to handle containers for unknown application attempts. Without these fair scheduler simply logs that an unknown container was found and continues to let it run. -- This message was sent by Atlassian JIRA (v6.2#6252)
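As a rough, hedged sketch of what the missing handling amounts to (class and method names here are invented for illustration, not taken from MAPREDUCE-3596 or the attached patches): when a node reports a container whose application attempt the scheduler does not know, the container should be marked for kill rather than just logged.
{code}
import java.util.Set;

/** Sketch: containers that belong to unknown application attempts should be killed. */
class UnknownAttemptContainerCheck {
  private final Set<String> liveAttempts;   // attempts the scheduler is currently tracking

  UnknownAttemptContainerCheck(Set<String> liveAttempts) {
    this.liveAttempts = liveAttempts;
  }

  /** Called when a node reports a running container; true means the RM should kill it. */
  boolean shouldKill(String containerId, String attemptId) {
    if (liveAttempts.contains(attemptId)) {
      return false;                          // normal case: the attempt is known
    }
    // Previously the FairScheduler only logged the unknown container and let it keep running.
    System.out.println("Container " + containerId + " belongs to unknown attempt "
        + attemptId + "; marking it for kill");
    return true;
  }
}
{code}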
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058131#comment-14058131 ] Chris Nauroth commented on YARN-2181: - I'm seeing a compilation error on branch-2 that appears to be related to this patch. (See below.) I think YARN-2022 would need to get merged to branch-2 to resolve this. I'll comment over there too. {code} ERROR] /Users/chris/svn/hadoop-common-branch-2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] cannot find symbol [ERROR] symbol : method isAMContainer() [ERROR] location: interface org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer {code} Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.5.0 Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2263) CSQueueUtils.computeMaxActiveApplicationsPerUser may cause deadlock for nested MapReduce jobs
[ https://issues.apache.org/jira/browse/YARN-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058132#comment-14058132 ] Jason Lowe commented on YARN-2263: -- 1 is an appropriate lower bound since we don't ever want the maximum number of applications for a user to be zero or less. (That would be a worthless queue since we could submit jobs to it but no jobs would activate.) I'm assuming it only causes a deadlock in the case where the active job submits and waits for the completion of other jobs? If it simply submits jobs and exits then even if the queue is so tiny that only 1 active job per user is allowed then the jobs should eventually complete (assuming sufficient resources to launch an AM _and_ at least one task simultaneously if this is MapReduce). If the concern is that the queue can be too small to allow running more than one application simultaneously for a user and some app frameworks might not like that, then yes that could be an issue. However I'm not sure that is YARN's problem to solve. I could have an application framework that for whatever reason requires 10 jobs to be running simultaneously to work. There could definitely be a queue config that will not allow that to run properly because the queue is too small to support 10 simultaneous applications by a single user. Should YARN handle this scenario? If so, how would it detect it, and what should it do to mitigate it? I would argue the same applies to the simpler job-launching-job-and-waiting scenario. Some queues are going to be too small to support that. Users can work around issues like this with smarter queue setups. This is touched upon in MAPREDUCE-4304 and elsewhere for the Oozie case which is a similar scenario. We can setup a separate queue for the launcher jobs separate from a queue where the other jobs run. That way we can't accidentally fill the cluster/queue with just launcher jobs and deadlock. CSQueueUtils.computeMaxActiveApplicationsPerUser may cause deadlock for nested MapReduce jobs - Key: YARN-2263 URL: https://issues.apache.org/jira/browse/YARN-2263 Project: Hadoop YARN Issue Type: Bug Affects Versions: 0.23.10, 2.4.1 Reporter: Chen He computeMaxActiveApplicationsPerUser() has a lower bound 1. For a nested MapReduce job which files new mapreduce jobs in its mapper/reducer, it will cause job stuck. public static int computeMaxActiveApplicationsPerUser( int maxActiveApplications, int userLimit, float userLimitFactor) { return Math.max( (int)Math.ceil( maxActiveApplications * (userLimit / 100.0f) * userLimitFactor), 1); } -- This message was sent by Atlassian JIRA (v6.2#6252)
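For readers skimming the thread, the arithmetic behind the lower bound is easy to reproduce. The harness below is only an illustrative sketch, but the formula itself is the one quoted in the issue description.
{code}
public class MaxActiveAppsPerUserDemo {
  // Formula as quoted in the YARN-2263 description (CSQueueUtils).
  static int computeMaxActiveApplicationsPerUser(
      int maxActiveApplications, int userLimit, float userLimitFactor) {
    return Math.max(
        (int) Math.ceil(maxActiveApplications * (userLimit / 100.0f) * userLimitFactor), 1);
  }

  public static void main(String[] args) {
    // A very small queue that only supports one active application in total.
    int perUser = computeMaxActiveApplicationsPerUser(1, 100, 1.0f);
    System.out.println(perUser); // 1
    // With at most one active app per user, a job that launches a child job from its
    // tasks and then blocks waiting for it can never see the child activate.
  }
}
{code}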
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058133#comment-14058133 ] Chris Nauroth commented on YARN-2022: - Does this still need to be merged to branch-2? YARN-2181 was just committed to branch-2. It depends on the new {{RMContainer#isAMContainer}} method, so I'm seeing a compilation error on branch-2 now. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0 Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2263) CSQueueUtils.computeMaxActiveApplicationsPerUser may cause deadlock for nested MapReduce jobs
[ https://issues.apache.org/jira/browse/YARN-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He resolved YARN-2263. --- Resolution: Won't Fix CSQueueUtils.computeMaxActiveApplicationsPerUser may cause deadlock for nested MapReduce jobs - Key: YARN-2263 URL: https://issues.apache.org/jira/browse/YARN-2263 Project: Hadoop YARN Issue Type: Bug Affects Versions: 0.23.10, 2.4.1 Reporter: Chen He computeMaxActiveApplicationsPerUser() has a lower bound 1. For a nested MapReduce job which files new mapreduce jobs in its mapper/reducer, it will cause job stuck. public static int computeMaxActiveApplicationsPerUser( int maxActiveApplications, int userLimit, float userLimitFactor) { return Math.max( (int)Math.ceil( maxActiveApplications * (userLimit / 100.0f) * userLimitFactor), 1); } -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2263) CSQueueUtils.computeMaxActiveApplicationsPerUser may cause deadlock for nested MapReduce jobs
[ https://issues.apache.org/jira/browse/YARN-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058137#comment-14058137 ] Chen He commented on YARN-2263: --- Thank you for the comments, Jason Lowe. I will close it. CSQueueUtils.computeMaxActiveApplicationsPerUser may cause deadlock for nested MapReduce jobs - Key: YARN-2263 URL: https://issues.apache.org/jira/browse/YARN-2263 Project: Hadoop YARN Issue Type: Bug Affects Versions: 0.23.10, 2.4.1 Reporter: Chen He computeMaxActiveApplicationsPerUser() has a lower bound 1. For a nested MapReduce job which files new mapreduce jobs in its mapper/reducer, it will cause job stuck. public static int computeMaxActiveApplicationsPerUser( int maxActiveApplications, int userLimit, float userLimitFactor) { return Math.max( (int)Math.ceil( maxActiveApplications * (userLimit / 100.0f) * userLimitFactor), 1); } -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1367) After restart NM should resync with the RM without killing containers
[ https://issues.apache.org/jira/browse/YARN-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1367: -- Fix Version/s: (was: 2.5.0) 2.6.0 After restart NM should resync with the RM without killing containers - Key: YARN-1367 URL: https://issues.apache.org/jira/browse/YARN-1367 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Fix For: 2.6.0 Attachments: YARN-1367.001.patch, YARN-1367.002.patch, YARN-1367.003.patch, YARN-1367.prototype.patch After RM restart, the RM sends a resync response to NMs that heartbeat to it. Upon receiving the resync response, the NM kills all containers and re-registers with the RM. The NM should be changed to not kill the container and instead inform the RM about all currently running containers including their allocations etc. After the re-register, the NM should send all pending container completions to the RM as usual. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2181: -- Fix Version/s: (was: 2.5.0) 2.6.0 Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.6.0 Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-611: --- Attachment: YARN-611.4.patch Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, YARN-611.4.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time that would define when an AM is well behaved, and it's safe to reset its failure count back to zero. Every time an AM fails the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is max-retries, then the job should fail. If the AM has never failed, the retry count is max-retries, or if the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing mis-behaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail, if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the failure. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058161#comment-14058161 ] Xuan Gong commented on YARN-611: create new patch based on vinod's suggestion. Also move all logics about how to decide wether this is the last attempt from RMApp to ApplicationRetryPolicy. Add an AM retry count reset window to YARN RM - Key: YARN-611 URL: https://issues.apache.org/jira/browse/YARN-611 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha Reporter: Chris Riccomini Assignee: Xuan Gong Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, YARN-611.4.patch YARN currently has the following config: yarn.resourcemanager.am.max-retries This config defaults to 2, and defines how many times to retry a failed AM before failing the whole YARN job. YARN counts an AM as failed if the node that it was running on dies (the NM will timeout, which counts as a failure for the AM), or if the AM dies. This configuration is insufficient for long running (or infinitely running) YARN jobs, since the machine (or NM) that the AM is running on will eventually need to be restarted (or the machine/NM will fail). In such an event, the AM has not done anything wrong, but this is counted as a failure by the RM. Since the retry count for the AM is never reset, eventually, at some point, the number of machine/NM failures will result in the AM failure count going above the configured value for yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the job as failed, and shut it down. This behavior is not ideal. I propose that we add a second configuration: yarn.resourcemanager.am.retry-count-window-ms This configuration would define a window of time that would define when an AM is well behaved, and it's safe to reset its failure count back to zero. Every time an AM fails the RmAppImpl would check the last time that the AM failed. If the last failure was less than retry-count-window-ms ago, and the new failure count is max-retries, then the job should fail. If the AM has never failed, the retry count is max-retries, or if the last failure was OUTSIDE the retry-count-window-ms, then the job should be restarted. Additionally, if the last failure was outside the retry-count-window-ms, then the failure count should be set back to 0. This would give developers a way to have well-behaved AMs run forever, while still failing mis-behaving AMs after a short period of time. I think the work to be done here is to change the RmAppImpl to actually look at app.attempts, and see if there have been more than max-retries failures in the last retry-count-window-ms milliseconds. If there have, then the job should fail, if not, then the job should go forward. Additionally, we might also need to add an endTime in either RMAppAttemptImpl or RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the failure. Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
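A minimal sketch of the windowed failure count the description proposes. The class and method names are hypothetical; per the comment above, the actual patch moves this logic into an ApplicationRetryPolicy rather than RMAppImpl.
{code}
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch of the proposed retry-count window, not the YARN-611 patch itself. */
class AmRetryWindowPolicy {
  private final int maxAttemptFailures;       // cf. yarn.resourcemanager.am.max-retries
  private final long windowMs;                // proposed retry-count-window-ms
  private final Deque<Long> failureTimes = new ArrayDeque<>();

  AmRetryWindowPolicy(int maxAttemptFailures, long windowMs) {
    this.maxAttemptFailures = maxAttemptFailures;
    this.windowMs = windowMs;
  }

  /** Returns true if the app should fail, false if a new attempt should be started. */
  boolean shouldFail(long failureTimeMs) {
    failureTimes.addLast(failureTimeMs);
    // Failures older than the window no longer count against the AM.
    while (!failureTimes.isEmpty()
        && failureTimeMs - failureTimes.peekFirst() > windowMs) {
      failureTimes.removeFirst();
    }
    return failureTimes.size() >= maxAttemptFailures;
  }
}
{code}
With this shape, a well-behaved long-running AM whose node occasionally dies never accumulates enough failures inside the window to be marked failed, while a crash-looping AM still fails quickly.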
[jira] [Moved] (YARN-2275) When log aggregation not enabled, message should point to NM HTTP port, not IPC port
[ https://issues.apache.org/jira/browse/YARN-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli moved MAPREDUCE-5185 to YARN-2275: -- Component/s: (was: jobhistoryserver) log-aggregation Affects Version/s: (was: 2.0.4-alpha) 2.0.4-alpha Key: YARN-2275 (was: MAPREDUCE-5185) Project: Hadoop YARN (was: Hadoop Map/Reduce) When log aggregation not enabled, message should point to NM HTTP port, not IPC port - Key: YARN-2275 URL: https://issues.apache.org/jira/browse/YARN-2275 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE5185-01.patch When I try to get a container's logs in the JHS without log aggregation enabled, I get a message that looks like this: Aggregation is not enabled. Try the nodemanager at sandy-ThinkPad-T530:33224 This could be a lot more helpful by actually pointing the URL that would show the container logs on the NM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2275) When log aggregation not enabled, message should point to NM HTTP port, not IPC port
[ https://issues.apache.org/jira/browse/YARN-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058163#comment-14058163 ] Vinod Kumar Vavilapalli commented on YARN-2275: --- AggregatedLogsBlock is hosted on a server that is not the nodemanager - today the MR JobHistoryServer and the TimelineServer in the near future. So you cannot look into the config. When log aggregation not enabled, message should point to NM HTTP port, not IPC port - Key: YARN-2275 URL: https://issues.apache.org/jira/browse/YARN-2275 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.0.4-alpha Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE5185-01.patch When I try to get a container's logs in the JHS without log aggregation enabled, I get a message that looks like this: Aggregation is not enabled. Try the nodemanager at sandy-ThinkPad-T530:33224 This could be a lot more helpful by actually pointing the URL that would show the container logs on the NM. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2238) filtering on UI sticks even if I move away from the page
[ https://issues.apache.org/jira/browse/YARN-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058176#comment-14058176 ] Garth Goodson commented on YARN-2238: - We have the same issue and it is very annoying for our users. If the front page is being filtered, it should be able to be cleared from that page. filtering on UI sticks even if I move away from the page Key: YARN-2238 URL: https://issues.apache.org/jira/browse/YARN-2238 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.4.0 Reporter: Sangjin Lee Attachments: filtered.png The main data table in many web pages (RM, AM, etc.) seems to show an unexpected filtering behavior. If I filter the table by typing something in the key or value field (or I suspect any search field), the data table gets filtered. The example I used is the job configuration page for a MR job. That is expected. However, when I move away from that page and visit any other web page of the same type (e.g. a job configuration page), the page is rendered with the filtering! That is unexpected. What's even stranger is that it does not render the filtering term. As a result, I have a page that's mysteriously filtered but doesn't tell me what it's filtering on. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (YARN-2131) Add a way to format the RMStateStore
[ https://issues.apache.org/jira/browse/YARN-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reopened YARN-2131: --- Reopening for the addendum.. One other thing that occurred to me was running RM while the format is in progress or vice-versa. Namenode solves this issue by a lock file. We can do the same here. Irrespective of the approach, I think handling the above is a major blocker for this feature/patch. Let's try to do that here too.. Add a way to format the RMStateStore Key: YARN-2131 URL: https://issues.apache.org/jira/browse/YARN-2131 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.4.0 Reporter: Karthik Kambatla Assignee: Robert Kanter Fix For: 2.6.0 Attachments: YARN-2131.patch, YARN-2131.patch, YARN-2131_addendum.patch There are cases when we don't want to recover past applications, but recover applications going forward. To do this, one has to clear the store. Today, there is no easy way to do this and users should understand how each store works. -- This message was sent by Atlassian JIRA (v6.2#6252)
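On the lock-file idea, a hedged sketch of how a format guard could look for a local, file-backed store; real stores such as the ZooKeeper-based one would need a different mechanism, and the lock-file path here is purely illustrative.
{code}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/** Sketch of a lock-file guard so "format" and a running RM cannot touch the store at once. */
public class StateStoreLockDemo {
  public static void main(String[] args) throws IOException {
    Path lockFile = Paths.get("/tmp/rmstatestore.lock"); // illustrative location only
    try (FileChannel channel = FileChannel.open(lockFile,
             StandardOpenOption.CREATE, StandardOpenOption.WRITE);
         FileLock lock = channel.tryLock()) {
      if (lock == null) {
        System.err.println("State store is in use (RM running or format in progress)");
        return;
      }
      // Safe to format the store / start the RM while the lock is held.
      System.out.println("Acquired state-store lock, proceeding");
    }
  }
}
{code}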
[jira] [Commented] (YARN-2259) NM-Local dir cleanup failing when Resourcemanager switches
[ https://issues.apache.org/jira/browse/YARN-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058200#comment-14058200 ] Jason Lowe commented on YARN-2259: -- This sounds like the NM wasn't notified of the application completing and therefore didn't process the cleanup. Possibly a duplicate of YARN-1421? NM-Local dir cleanup failing when Resourcemanager switches -- Key: YARN-2259 URL: https://issues.apache.org/jira/browse/YARN-2259 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.0 Environment: Reporter: Nishan Shetty Attachments: Capture.PNG Induce RM switchover while job is in progress Observe that NM-Local dir cleanup failing when Resourcemanager switches. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1421) Node managers will not receive application finish event where containers ran before RM restart
[ https://issues.apache.org/jira/browse/YARN-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058201#comment-14058201 ] Jason Lowe commented on YARN-1421: -- Was this fixed by YARN-1885? Node managers will not receive application finish event where containers ran before RM restart -- Key: YARN-1421 URL: https://issues.apache.org/jira/browse/YARN-1421 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Priority: Critical Problem :- Today for every application we track the node managers where containers ran. So when application finishes it notifies all those node managers about application finish event (via node manager heartbeat). However if rm restarts then we forget this past information and those node managers will never get application finish event and will keep reporting finished applications. Proposed Solution :- Instead of remembering the node managers where containers ran for this particular application it would be better if we depend on node manager heartbeat to take this decision. i.e. when node manager heartbeats saying it is running application (app1, app2) then we should check those application's status in RM's memory {code}rmContext.getRMApps(){code} and if either they are not found (very old applications) or they are in their final state (FINISHED, KILLED, FAILED) then we should immediately notify the node manager about the application finish event. By doing this we are reducing the state which we need to store at RM after restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
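A hedged sketch of the proposed solution above; RMAppState and the map are simplified stand-ins for rmContext.getRMApps() and the real RM types.
{code}
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;

/** Sketch of heartbeat-driven cleanup: decide finish notifications from RM state, not history. */
class HeartbeatAppCleanup {
  enum RMAppState { RUNNING, FINISHED, KILLED, FAILED }
  private static final EnumSet<RMAppState> FINAL_STATES =
      EnumSet.of(RMAppState.FINISHED, RMAppState.KILLED, RMAppState.FAILED);

  private final Map<String, RMAppState> rmApps;   // stand-in for rmContext.getRMApps()

  HeartbeatAppCleanup(Map<String, RMAppState> rmApps) {
    this.rmApps = rmApps;
  }

  /** Given the app IDs an NM reports as running, return those it should clean up. */
  List<String> appsToCleanUp(List<String> runningAppsOnNode) {
    List<String> toCleanUp = new ArrayList<>();
    for (String appId : runningAppsOnNode) {
      RMAppState state = rmApps.get(appId);
      // Unknown (very old) apps or apps in a final state get a finish notification.
      if (state == null || FINAL_STATES.contains(state)) {
        toCleanUp.add(appId);
      }
    }
    return toCleanUp;
  }
}
{code}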
[jira] [Updated] (YARN-2274) FairScheduler: Add debug information about cluster capacity, availability and reservations
[ https://issues.apache.org/jira/browse/YARN-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2274: --- Attachment: yarn-2274-2.patch Thanks Sandy. Updated patch includes demand and skips a few updates before spitting out debug info. FairScheduler: Add debug information about cluster capacity, availability and reservations -- Key: YARN-2274 URL: https://issues.apache.org/jira/browse/YARN-2274 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Attachments: yarn-2274-1.patch, yarn-2274-2.patch FairScheduler logs have little information on cluster capacity and availability. Need this information to debug production issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
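A small, hedged sketch of the "skips a few updates" throttling mentioned above, assuming a simple counter-based approach; the actual patch may gate the debug output differently.
{code}
/** Sketch of throttled debug output: log cluster state only every Nth update call. */
class ThrottledDebugLogger {
  private final int updatesBetweenLogs;
  private int updatesSinceLastLog = 0;

  ThrottledDebugLogger(int updatesBetweenLogs) {
    this.updatesBetweenLogs = updatesBetweenLogs;
  }

  void onUpdate(String clusterCapacity, String available, String demand) {
    if (++updatesSinceLastLog < updatesBetweenLogs) {
      return;   // skip a few updates before emitting debug info
    }
    updatesSinceLastLog = 0;
    System.out.printf("Cluster capacity=%s, available=%s, demand=%s%n",
        clusterCapacity, available, demand);
  }
}
{code}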
[jira] [Created] (YARN-2276) Branch-2 cannot build
Fengdong Yu created YARN-2276: - Summary: Branch-2 cannot build Key: YARN-2276 URL: https://issues.apache.org/jira/browse/YARN-2276 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Fengdong Yu [ERROR] COMPILATION ERROR : [INFO] - [ERROR] /home/yufengdong/svn/letv-hadoop/hadoop-2.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] error: cannot find symbol -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2276) Branch-2 cannot build
[ https://issues.apache.org/jira/browse/YARN-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fengdong Yu updated YARN-2276: -- Description: [ERROR] COMPILATION ERROR : [INFO] - [ERROR] hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] error: cannot find symbol was: [ERROR] COMPILATION ERROR : [INFO] - [ERROR] /home/yufengdong/svn/letv-hadoop/hadoop-2.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] error: cannot find symbol Branch-2 cannot build - Key: YARN-2276 URL: https://issues.apache.org/jira/browse/YARN-2276 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Fengdong Yu [ERROR] COMPILATION ERROR : [INFO] - [ERROR] hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] error: cannot find symbol -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2276) Branch-2 cannot build
[ https://issues.apache.org/jira/browse/YARN-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058245#comment-14058245 ] Zhijie Shen commented on YARN-2276: --- It's related to [YARN-2181|https://issues.apache.org/jira/browse/YARN-2181?focusedCommentId=14058131page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14058131] Branch-2 cannot build - Key: YARN-2276 URL: https://issues.apache.org/jira/browse/YARN-2276 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Fengdong Yu [ERROR] COMPILATION ERROR : [INFO] - [ERROR] hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] error: cannot find symbol -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2276) Branch-2 cannot build
[ https://issues.apache.org/jira/browse/YARN-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen resolved YARN-2276. --- Resolution: Fixed Assignee: Zhijie Shen Merged YARN-2022 into branch-2. Branch-2 cannot build - Key: YARN-2276 URL: https://issues.apache.org/jira/browse/YARN-2276 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Fengdong Yu Assignee: Zhijie Shen [ERROR] COMPILATION ERROR : [INFO] - [ERROR] hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] error: cannot find symbol -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2022) Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058271#comment-14058271 ] Zhijie Shen commented on YARN-2022: --- I merged YARN-2022 to branch-2, and the compilation error was gone. Preempting an Application Master container can be kept as least priority when multiple applications are marked for preemption by ProportionalCapacityPreemptionPolicy - Key: YARN-2022 URL: https://issues.apache.org/jira/browse/YARN-2022 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Sunil G Assignee: Sunil G Fix For: 2.5.0 Attachments: YARN-2022-DesignDraft.docx, YARN-2022.10.patch, YARN-2022.2.patch, YARN-2022.3.patch, YARN-2022.4.patch, YARN-2022.5.patch, YARN-2022.6.patch, YARN-2022.7.patch, YARN-2022.8.patch, YARN-2022.9.patch, Yarn-2022.1.patch Cluster Size = 16GB [2NM's] Queue A Capacity = 50% Queue B Capacity = 50% Consider there are 3 applications running in Queue A which has taken the full cluster capacity. J1 = 2GB AM + 1GB * 4 Maps J2 = 2GB AM + 1GB * 4 Maps J3 = 2GB AM + 1GB * 2 Maps Another Job J4 is submitted in Queue B [J4 needs a 2GB AM + 1GB * 2 Maps ]. Currently in this scenario, Jobs J3 will get killed including its AM. It is better if AM can be given least priority among multiple applications. In this same scenario, map tasks from J3 and J2 can be preempted. Later when cluster is free, maps can be allocated to these Jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2274) FairScheduler: Add debug information about cluster capacity, availability and reservations
[ https://issues.apache.org/jira/browse/YARN-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058275#comment-14058275 ] Hadoop QA commented on YARN-2274: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12655131/yarn-2274-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4266//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4266//console This message is automatically generated. FairScheduler: Add debug information about cluster capacity, availability and reservations -- Key: YARN-2274 URL: https://issues.apache.org/jira/browse/YARN-2274 Project: Hadoop YARN Issue Type: Improvement Components: fairscheduler Affects Versions: 2.4.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Attachments: yarn-2274-1.patch, yarn-2274-2.patch FairScheduler logs have little information on cluster capacity and availability. Need this information to debug production issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2276) Branch-2 cannot build
[ https://issues.apache.org/jira/browse/YARN-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2276: -- Assignee: (was: Zhijie Shen) Branch-2 cannot build - Key: YARN-2276 URL: https://issues.apache.org/jira/browse/YARN-2276 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Fengdong Yu [ERROR] COMPILATION ERROR : [INFO] - [ERROR] hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] error: cannot find symbol -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058289#comment-14058289 ] Tsuyoshi OZAWA commented on YARN-2181: -- Hi Chris, good catch. I confirmed that branch-2 can be compiled and pass all tests by applying YARN-2022.10.patch. Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.6.0 Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2276) Branch-2 cannot build
[ https://issues.apache.org/jira/browse/YARN-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058291#comment-14058291 ] Tsuyoshi OZAWA commented on YARN-2276: -- As Chris mentioned on YARN-2181, we need to merge YARN-2022 into branch-2 for it to compile. If we track this problem on YARN-2181, we can close this JIRA as a duplicate. Branch-2 cannot build - Key: YARN-2276 URL: https://issues.apache.org/jira/browse/YARN-2276 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Fengdong Yu [ERROR] COMPILATION ERROR : [INFO] - [ERROR] hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] error: cannot find symbol -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058293#comment-14058293 ] Wangda Tan commented on YARN-2181: -- Hi [~ozawa], Zhijie has already committed YARN-2022 to branch-2. You can update and try. Thanks, Wangda Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.6.0 Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2276) Branch-2 cannot build
[ https://issues.apache.org/jira/browse/YARN-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058294#comment-14058294 ] Tsuyoshi OZAWA commented on YARN-2276: -- Zhijie, thank you for your work! Branch-2 cannot build - Key: YARN-2276 URL: https://issues.apache.org/jira/browse/YARN-2276 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.5.0 Reporter: Fengdong Yu [ERROR] COMPILATION ERROR : [INFO] - [ERROR] hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java:[61,18] error: cannot find symbol -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2181) Add preemption info to RM Web UI and add logs when preemption occurs
[ https://issues.apache.org/jira/browse/YARN-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14058295#comment-14058295 ] Jian He commented on YARN-2181: --- [~zjshen] merged YARN-2022 to branch-2. This issue should be solved Add preemption info to RM Web UI and add logs when preemption occurs Key: YARN-2181 URL: https://issues.apache.org/jira/browse/YARN-2181 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.6.0 Attachments: YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, YARN-2181.patch, application page-1.png, application page.png We need add preemption info to RM web page to make administrator/user get more understanding about preemption happened on app, etc. And RM logs should have following properties: * Logs are retrievable when an application is still running and often flushed. * Can distinguish between AM container preemption and task container preemption with container ID shown. * Should be INFO level log. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2277) Add JSONP support to the ATS REST API
Jonathan Eagles created YARN-2277: - Summary: Add JSONP support to the ATS REST API Key: YARN-2277 URL: https://issues.apache.org/jira/browse/YARN-2277 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles As the Application Timeline Server is provided with a built-in UI, it may make sense to enable JSONP REST API capabilities to allow a remote UI to access the data directly via JavaScript without cross-site browser restrictions coming into play. An example client may look like http://api.jquery.com/jQuery.getJSON/ This can alleviate the need to create a local proxy cache. -- This message was sent by Atlassian JIRA (v6.2#6252)
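To illustrate the general JSONP idea (not the attached YARN-2277 patch, which targets the ATS web services), here is a self-contained sketch using the JDK's built-in HTTP server; the path, port, and callback parameter name are assumptions for illustration only.
{code}
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

/** Minimal sketch of JSONP: wrap the JSON payload in the caller-supplied callback. */
public class JsonpSketch {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8188), 0); // port is illustrative
    server.createContext("/ws/v1/timeline", exchange -> {
      String query = exchange.getRequestURI().getQuery();
      String callback = null;
      if (query != null) {
        for (String kv : query.split("&")) {
          if (kv.startsWith("callback=")) {
            callback = kv.substring("callback=".length());
          }
        }
      }
      String json = "{\"entities\":[]}";                       // placeholder payload
      String body = callback == null ? json : callback + "(" + json + ");";
      String contentType = callback == null ? "application/json" : "application/javascript";
      exchange.getResponseHeaders().set("Content-Type", contentType);
      byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
      exchange.sendResponseHeaders(200, bytes.length);
      try (OutputStream os = exchange.getResponseBody()) {
        os.write(bytes);
      }
    });
    server.start();
  }
}
{code}
A browser-side client could then fetch the data with something like jQuery.getJSON and a callback=? parameter, along the lines of the link in the description.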
[jira] [Updated] (YARN-2277) Add JSONP support to the ATS REST API
[ https://issues.apache.org/jira/browse/YARN-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Eagles updated YARN-2277: -- Attachment: YARN-2277.patch Starter patch to get the conversation going. Add JSONP support to the ATS REST API - Key: YARN-2277 URL: https://issues.apache.org/jira/browse/YARN-2277 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jonathan Eagles Attachments: YARN-2277.patch As the Application Timeline Server is provided with a built-in UI, it may make sense to enable JSONP REST API capabilities to allow a remote UI to access the data directly via JavaScript without cross-site browser restrictions coming into play. An example client may look like http://api.jquery.com/jQuery.getJSON/ This can alleviate the need to create a local proxy cache. -- This message was sent by Atlassian JIRA (v6.2#6252)