[jira] [Commented] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922132#comment-13922132 ] bc Wong commented on YARN-1790: --- Seems that the fix of YARN-1407 forgot to change the FairSchedulerAppsBlock to use the user-facing app state. FairSchedule UI not showing apps table -- Key: YARN-1790 URL: https://issues.apache.org/jira/browse/YARN-1790 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Attachments: fs_ui.png There is a running job, which shows up in the summary table in the FairScheduler UI, the queue display, etc. Just not in the apps table at the bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
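For background on the comment above: YARN-563 separated the RM's internal application states from the user-facing application states, and the apps table in the web UI filters rows on the user-facing names, so rendering the internal state can make live apps vanish from the table. A minimal, self-contained sketch of the mapping idea follows; the enum names and collapsing rules here are illustrative assumptions, not the actual patch.
{code}
// Illustrative stand-ins for the real RMAppState (internal) and
// YarnApplicationState (user-facing) enums.
enum InternalState { NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHING, FINISHED, FAILED, KILLED }
enum UserFacingState { NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

final class AppStateView {
  // Internal-only states must collapse to a user-facing one, or UI
  // filters keyed on user-facing names drop the row entirely, which is
  // the symptom this issue describes.
  static UserFacingState userFacing(InternalState s) {
    switch (s) {
      case FINISHING: return UserFacingState.RUNNING; // still live to users
      case FINISHED:  return UserFacingState.FINISHED;
      case FAILED:    return UserFacingState.FAILED;
      case KILLED:    return UserFacingState.KILLED;
      default:        return UserFacingState.valueOf(s.name());
    }
  }
}
{code}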
[jira] [Commented] (YARN-1774) FS: Submitting to non-leaf queue throws NPE
[ https://issues.apache.org/jira/browse/YARN-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922175#comment-13922175 ] Hadoop QA commented on YARN-1774: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633063/yarn-1774-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3274//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3274//console This message is automatically generated. FS: Submitting to non-leaf queue throws NPE --- Key: YARN-1774 URL: https://issues.apache.org/jira/browse/YARN-1774 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Attachments: YARN-1774.patch, yarn-1774-2.patch If you create a hierarchy of queues and assign a job to a parent queue, FairScheduler quits with an NPE. -- This message was sent by Atlassian JIRA (v6.2#6252)
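The NPE described here comes from resolving the submission's queue name: a parent queue carries no app list, so treating it as a leaf fails later on a null or mis-typed reference. Below is a toy sketch of the guard such a fix implies, with stand-in classes named after, but not taken from, the FairScheduler's FSLeafQueue/FSParentQueue.
{code}
import java.util.HashMap;
import java.util.Map;

// Toy queue model; only the null/instanceof guard matters here.
abstract class FSQueueSketch { final String name; FSQueueSketch(String n) { name = n; } }
class FSParentQueueSketch extends FSQueueSketch { FSParentQueueSketch(String n) { super(n); } }
class FSLeafQueueSketch extends FSQueueSketch { FSLeafQueueSketch(String n) { super(n); } }

class QueuePlacementSketch {
  private final Map<String, FSQueueSketch> queues = new HashMap<String, FSQueueSketch>();

  FSLeafQueueSketch assignToQueue(String queueName, String appId) {
    FSQueueSketch queue = queues.get(queueName);
    if (!(queue instanceof FSLeafQueueSketch)) { // also true when queue == null
      // Reject the submission up front instead of NPE-ing downstream.
      throw new IllegalArgumentException("Application " + appId
          + " submitted to unknown or non-leaf queue: " + queueName);
    }
    return (FSLeafQueueSketch) queue;
  }
}
{code}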
[jira] [Commented] (YARN-1685) Bugs around log URL
[ https://issues.apache.org/jira/browse/YARN-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922177#comment-13922177 ] Hadoop QA commented on YARN-1685: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633060/YARN-1685.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3273//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3273//console This message is automatically generated. Bugs around log URL --- Key: YARN-1685 URL: https://issues.apache.org/jira/browse/YARN-1685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Zhijie Shen Attachments: YARN-1685-1.patch, YARN-1685.2.patch, YARN-1685.3.patch 1. Log URL should be different when the container is running and finished 2. Null case needs to be handled 3. The way of constructing log URL should be corrected -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-1790: -- Attachment: fs_ui_fixed.png 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch Trivial fix. Also ported YARN-563 to FairScheduler UI. Tested manually (see screenshot). FairSchedule UI not showing apps table -- Key: YARN-1790 URL: https://issues.apache.org/jira/browse/YARN-1790 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Attachments: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, fs_ui_fixed.png There is a running job, which shows up in the summary table in the FairScheduler UI, the queue display, etc. Just not in the apps table at the bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1788) AppsCompleted/AppsKilled metric is incorrect when MR job is killed with yarn application -kill
[ https://issues.apache.org/jira/browse/YARN-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev reassigned YARN-1788: --- Assignee: Varun Vasudev AppsCompleted/AppsKilled metric is incorrect when MR job is killed with yarn application -kill -- Key: YARN-1788 URL: https://issues.apache.org/jira/browse/YARN-1788 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Tassapol Athiapinya Assignee: Varun Vasudev Run MR sleep job. Kill the application in RUNNING state. Observe RM metrics. Expecting AppsCompleted = 0/AppsKilled = 1 Actual is AppsCompleted = 1/AppsKilled = 0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1788) AppsCompleted/AppsKilled metric is incorrect when MR job is killed with yarn application -kill
[ https://issues.apache.org/jira/browse/YARN-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-1788: Attachment: apache-yarn-1788.0.patch The fix is to use the finalState instead of the getState() function when dispatching the AppRemovedSchedulerEvent. The patch has the fix in the FinalTransition class in RMAppImpl and adds tests. AppsCompleted/AppsKilled metric is incorrect when MR job is killed with yarn application -kill -- Key: YARN-1788 URL: https://issues.apache.org/jira/browse/YARN-1788 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Tassapol Athiapinya Assignee: Varun Vasudev Attachments: apache-yarn-1788.0.patch Run MR sleep job. Kill the application in RUNNING state. Observe RM metrics. Expecting AppsCompleted = 0/AppsKilled = 1 Actual is AppsCompleted = 1/AppsKilled = 0 -- This message was sent by Atlassian JIRA (v6.2#6252)
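The miscount is a read-at-the-wrong-time bug: by the time the removal event is handled, the app's state machine can have advanced past KILLED, so getState() lands in the wrong metrics bucket. A self-contained toy sketch of the capture-the-final-state idea described above (stand-in types, not the patch itself):
{code}
// Toy model of the pattern: the kill decision fixes the app's final
// state; the removal event must carry that recorded state, because the
// state machine may report something else by dispatch time.
enum AppState { RUNNING, FINISHED, KILLED, FAILED }

class QueueMetricsSketch {
  int appsCompleted, appsKilled, appsFailed;

  void onAppRemoved(AppState finalState) {
    switch (finalState) {
      case KILLED: appsKilled++; break;
      case FAILED: appsFailed++; break;
      default:     appsCompleted++; break;
    }
  }
}

class RMAppSketch {
  private AppState state = AppState.RUNNING;

  AppState getState() { return state; }

  void finalTransition(QueueMetricsSketch metrics, AppState finalState) {
    state = AppState.FINISHED;        // the state machine keeps moving
    metrics.onAppRemoved(finalState); // fix: report the captured final
                                      // state, not getState(), which can
                                      // miscount a killed app as completed
  }
}
{code}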
[jira] [Created] (YARN-1791) Distributed cache issue using YARN
Ashish Kumar created YARN-1791: -- Summary: Distributed cache issue using YARN Key: YARN-1791 URL: https://issues.apache.org/jira/browse/YARN-1791 Project: Hadoop YARN Issue Type: Bug Reporter: Ashish Kumar If I want to have two cache files a/b/c and d/e/c for an MR job, is there any way to access the Path of these files while reading them in a Map or Reduce task? I'm using *job.addCacheFile(hdfsPath.toUri());* and then I'm accessing all cache file paths using *context.getLocalCacheFiles()*, which returns all paths as given below: /yarn/?/?/?/1234/c and /yarn/?/?/?/2345/c. But these paths don't have any folder-level info, so I'm not able to identify which path represents a/b/c. Is it a bug? Please help. Thanks, Ashish -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922332#comment-13922332 ] Hadoop QA commented on YARN-1790: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633086/fs_ui_fixed.png against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3275//console This message is automatically generated. FairSchedule UI not showing apps table -- Key: YARN-1790 URL: https://issues.apache.org/jira/browse/YARN-1790 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Attachments: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, fs_ui_fixed.png There is a running job, which shows up in the summary table in the FairScheduler UI, the queue display, etc. Just not in the apps table at the bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1752) Unexpected Unregistered event at Attempt Launched state
[ https://issues.apache.org/jira/browse/YARN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922344#comment-13922344 ] Hudson commented on YARN-1752: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1693 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1693/]) YARN-1752. Fixed ApplicationMasterService to reject unregister request if AM did not register before. Contributed by Rohith Sharma. (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1574623) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/InvalidApplicationMasterRequestException.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAuditLogger.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java Unexpected Unregistered event at Attempt Launched state --- Key: YARN-1752 URL: https://issues.apache.org/jira/browse/YARN-1752 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Fix For: 2.4.0 Attachments: YARN-1752.1.patch, YARN-1752.2.patch, YARN-1752.3.patch, YARN-1752.4.patch, YARN-1752.5.patch {code} 2014-02-21 14:56:03,453 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: UNREGISTERED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:647) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:103) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:695) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
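The commit message above implies the AM service now validates registration order before driving events into the attempt's state machine. A hedged, self-contained sketch of that bookkeeping; the real patch rejects with the new InvalidApplicationMasterRequestException rather than the generic exception used here.
{code}
import java.util.HashSet;
import java.util.Set;

// Stand-in for the AM service's registration bookkeeping.
class AMServiceSketch {
  private final Set<String> registeredAttempts = new HashSet<String>();

  synchronized void registerApplicationMaster(String attemptId) {
    registeredAttempts.add(attemptId);
  }

  synchronized void finishApplicationMaster(String attemptId) {
    if (!registeredAttempts.contains(attemptId)) {
      // Reject at the RPC boundary instead of driving an UNREGISTERED
      // event into an attempt still sitting at LAUNCHED.
      throw new IllegalStateException(
          "Attempt " + attemptId + " is unregistering before registering");
    }
    // ... proceed with the normal unregister flow ...
  }
}
{code}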
[jira] [Commented] (YARN-1761) RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby
[ https://issues.apache.org/jira/browse/YARN-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922345#comment-13922345 ] Hudson commented on YARN-1761: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1693 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1693/]) YARN-1761. Modified RMAdmin CLI to check whether HA is enabled or not before it executes any of the HA admin related commands. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1574661) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMAdminCLI.java RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby Key: YARN-1761 URL: https://issues.apache.org/jira/browse/YARN-1761 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.4.0 Attachments: YARN-1761.1.patch, YARN-1761.2.patch, YARN-1761.2.patch, YARN-1761.3.patch, YARN-1761.3.patch, YARN-1761.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1785) FairScheduler treats app lookup failures as ERRORs
[ https://issues.apache.org/jira/browse/YARN-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922351#comment-13922351 ] Hudson commented on YARN-1785: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1693 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1693/]) YARN-1785. FairScheduler treats app lookup failures as ERRORs. (bc Wong via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1574604) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java FairScheduler treats app lookup failures as ERRORs -- Key: YARN-1785 URL: https://issues.apache.org/jira/browse/YARN-1785 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Fix For: 2.4.0 Attachments: 0001-YARN-1785.-FairScheduler-treats-app-lookup-failures-.patch When invoking the /ws/v1/cluster/apps endpoint, RM will eventually get to RMAppImpl#createAndGetApplicationReport, which calls RMAppAttemptImpl#getApplicationResourceUsageReport, which looks up the app in the scheduler, which may or may not exist. So FairScheduler shouldn't log an error for every lookup failure: {noformat} 2014-02-17 08:23:21,240 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1392419715319_0135_01 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
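A lookup miss on this path is a routine race, since the web layer can ask about apps the scheduler has already forgotten. Below is a sketch of the quieter handling the fix calls for, with stand-in types and commons-logging as Hadoop used at the time; the exact log level chosen by the actual patch may differ.
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class SchedulerLookupSketch {
  private static final Log LOG = LogFactory.getLog(SchedulerLookupSketch.class);
  private final Map<String, Object> attempts = new ConcurrentHashMap<String, Object>();

  Object getAppResourceUsageReport(String appAttemptId) {
    Object attempt = attempts.get(appAttemptId);
    if (attempt == null) {
      // Expected condition, not an error: the app may have finished and
      // been removed between the web request and this lookup.
      LOG.debug("Request for appInfo of unknown attempt " + appAttemptId);
    }
    return attempt;
  }
}
{code}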
[jira] [Updated] (YARN-1781) NM should allow users to specify max disk utilization for local disks
[ https://issues.apache.org/jira/browse/YARN-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-1781: Attachment: apache-yarn-1781.2.patch Patch with code review comments incorporated. NM should allow users to specify max disk utilization for local disks - Key: YARN-1781 URL: https://issues.apache.org/jira/browse/YARN-1781 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-1781.0.patch, apache-yarn-1781.1.patch, apache-yarn-1781.2.patch This is related to YARN-257 (it's probably a sub-task?). Currently, the NM does not detect full disks and allows full disks to be used by containers, leading to repeated failures. YARN-257 deals with graceful handling of full disks. This ticket is only about detection of full disks by the disk health checkers. The NM should allow users to set a maximum disk utilization for local disks and mark disks as bad once they exceed that utilization. At the very least, the NM should detect full disks. -- This message was sent by Atlassian JIRA (v6.2#6252)
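The check being proposed can be stated in a few lines of plain java.io.File arithmetic. The sketch below is illustrative, with a made-up threshold parameter; the patch's actual configuration key and its wiring into the NM disk health checker are not shown.
{code}
import java.io.File;

class DiskUtilizationCheckSketch {
  private final float maxUsedPercent; // hypothetical threshold, e.g. 90f

  DiskUtilizationCheckSketch(float maxUsedPercent) {
    this.maxUsedPercent = maxUsedPercent;
  }

  // A dir fails the check once its filesystem's used space crosses the
  // threshold; the NM would then mark the local dir as bad.
  boolean isDiskUsable(File dir) {
    long total = dir.getTotalSpace();
    if (total == 0) {
      return false; // unreadable or vanished disk
    }
    float usedPercent = 100f * (total - dir.getUsableSpace()) / total;
    return usedPercent < maxUsedPercent;
  }
}
{code}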
[jira] [Commented] (YARN-1788) AppsCompleted/AppsKilled metric is incorrect when MR job is killed with yarn application -kill
[ https://issues.apache.org/jira/browse/YARN-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922390#comment-13922390 ] Hadoop QA commented on YARN-1788: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633091/apache-yarn-1788.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3276//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3276//console This message is automatically generated. AppsCompleted/AppsKilled metric is incorrect when MR job is killed with yarn application -kill -- Key: YARN-1788 URL: https://issues.apache.org/jira/browse/YARN-1788 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Tassapol Athiapinya Assignee: Varun Vasudev Attachments: apache-yarn-1788.0.patch Run MR sleep job. Kill the application in RUNNING state. Observe RM metrics. Expecting AppsCompleted = 0/AppsKilled = 1 Actual is AppsCompleted = 1/AppsKilled = 0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1761) RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby
[ https://issues.apache.org/jira/browse/YARN-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922400#comment-13922400 ] Hudson commented on YARN-1761: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #501 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/501/]) YARN-1761. Modified RMAdmin CLI to check whether HA is enabled or not before it executes any of the HA admin related commands. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1574661) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMAdminCLI.java RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby Key: YARN-1761 URL: https://issues.apache.org/jira/browse/YARN-1761 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.4.0 Attachments: YARN-1761.1.patch, YARN-1761.2.patch, YARN-1761.2.patch, YARN-1761.3.patch, YARN-1761.3.patch, YARN-1761.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1752) Unexpected Unregistered event at Attempt Launched state
[ https://issues.apache.org/jira/browse/YARN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922399#comment-13922399 ] Hudson commented on YARN-1752: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #501 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/501/]) YARN-1752. Fixed ApplicationMasterService to reject unregister request if AM did not register before. Contributed by Rohith Sharma. (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1574623) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/InvalidApplicationMasterRequestException.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAuditLogger.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java Unexpected Unregistered event at Attempt Launched state --- Key: YARN-1752 URL: https://issues.apache.org/jira/browse/YARN-1752 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Fix For: 2.4.0 Attachments: YARN-1752.1.patch, YARN-1752.2.patch, YARN-1752.3.patch, YARN-1752.4.patch, YARN-1752.5.patch {code} 2014-02-21 14:56:03,453 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: UNREGISTERED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:647) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:103) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:695) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1785) FairScheduler treats app lookup failures as ERRORs
[ https://issues.apache.org/jira/browse/YARN-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922406#comment-13922406 ] Hudson commented on YARN-1785: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #501 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/501/]) YARN-1785. FairScheduler treats app lookup failures as ERRORs. (bc Wong via kasha) (kasha: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1574604) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java FairScheduler treats app lookup failures as ERRORs -- Key: YARN-1785 URL: https://issues.apache.org/jira/browse/YARN-1785 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Fix For: 2.4.0 Attachments: 0001-YARN-1785.-FairScheduler-treats-app-lookup-failures-.patch When invoking the /ws/v1/cluster/apps endpoint, RM will eventually get to RMAppImpl#createAndGetApplicationReport, which calls RMAppAttemptImpl#getApplicationResourceUsageReport, which looks up the app in the scheduler, which may or may not exist. So FairScheduler shouldn't log an error for every lookup failure: {noformat} 2014-02-17 08:23:21,240 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1392419715319_0135_01 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1781) NM should allow users to specify max disk utilization for local disks
[ https://issues.apache.org/jira/browse/YARN-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922457#comment-13922457 ] Hadoop QA commented on YARN-1781: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633107/apache-yarn-1781.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3277//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3277//console This message is automatically generated. NM should allow users to specify max disk utilization for local disks - Key: YARN-1781 URL: https://issues.apache.org/jira/browse/YARN-1781 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-1781.0.patch, apache-yarn-1781.1.patch, apache-yarn-1781.2.patch This is related to YARN-257 (it's probably a sub-task?). Currently, the NM does not detect full disks and allows full disks to be used by containers, leading to repeated failures. YARN-257 deals with graceful handling of full disks. This ticket is only about detection of full disks by the disk health checkers. The NM should allow users to set a maximum disk utilization for local disks and mark disks as bad once they exceed that utilization. At the very least, the NM should detect full disks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1791) Distributed cache issue using YARN
[ https://issues.apache.org/jira/browse/YARN-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-1791. -- Resolution: Invalid The distributed cache only preserves the basename of files and links them into the container's working directory. If two names collide, one can use the URI fragment to provide an alternative name for the symlink. For example, hdfs:/a/b/c#d will be seen as d in the container's working directory rather than c. If you require paths to be preserved, you can specify an archive (e.g. .tar.gz, .zip, etc.), which will be expanded when localized, and paths can exist within it. In the future please use [mailto:u...@hadoop.apache.org] for asking questions. Apache JIRA is for reporting bugs and tracking features/improvements. Distributed cache issue using YARN -- Key: YARN-1791 URL: https://issues.apache.org/jira/browse/YARN-1791 Project: Hadoop YARN Issue Type: Bug Reporter: Ashish Kumar If I want to have two cache files a/b/c and d/e/c for an MR job, is there any way to access the Path of these files while reading them in a Map or Reduce task? I'm using *job.addCacheFile(hdfsPath.toUri());* and then I'm accessing all cache file paths using *context.getLocalCacheFiles()*, which returns all paths as given below: /yarn/?/?/?/1234/c and /yarn/?/?/?/2345/c. But these paths don't have any folder-level info, so I'm not able to identify which path represents a/b/c. Is it a bug? Please help. Thanks, Ashish -- This message was sent by Atlassian JIRA (v6.2#6252)
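Applying the resolution to the reporter's case looks roughly like the following; the fragment name d_e_c is an arbitrary choice, and the rest uses the standard MapReduce Job API.
{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileNames {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    // Both files share the basename "c". The #fragment renames the
    // second symlink, so both are reachable in the container's
    // working directory under distinct names.
    job.addCacheFile(new URI("hdfs:///a/b/c"));        // appears as "c"
    job.addCacheFile(new URI("hdfs:///d/e/c#d_e_c"));  // appears as "d_e_c"
  }
}
{code}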
[jira] [Commented] (YARN-1761) RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby
[ https://issues.apache.org/jira/browse/YARN-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922590#comment-13922590 ] Hudson commented on YARN-1761: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1718 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1718/]) YARN-1761. Modified RMAdmin CLI to check whether HA is enabled or not before it executes any of the HA admin related commands. Contributed by Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1574661) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMAdminCLI.java RMAdminCLI should check whether HA is enabled before executes transitionToActive/transitionToStandby Key: YARN-1761 URL: https://issues.apache.org/jira/browse/YARN-1761 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Fix For: 2.4.0 Attachments: YARN-1761.1.patch, YARN-1761.2.patch, YARN-1761.2.patch, YARN-1761.3.patch, YARN-1761.3.patch, YARN-1761.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1752) Unexpected Unregistered event at Attempt Launched state
[ https://issues.apache.org/jira/browse/YARN-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922589#comment-13922589 ] Hudson commented on YARN-1752: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1718 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1718/]) YARN-1752. Fixed ApplicationMasterService to reject unregister request if AM did not register before. Contributed by Rohith Sharma. (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1574623) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/exceptions/InvalidApplicationMasterRequestException.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAuditLogger.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockAM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockRM.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationMasterService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java Unexpected Unregistered event at Attempt Launched state --- Key: YARN-1752 URL: https://issues.apache.org/jira/browse/YARN-1752 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Fix For: 2.4.0 Attachments: YARN-1752.1.patch, YARN-1752.2.patch, YARN-1752.3.patch, YARN-1752.4.patch, YARN-1752.5.patch {code} 2014-02-21 14:56:03,453 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: UNREGISTERED at LAUNCHED at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:647) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:103) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:714) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:695) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1410) Handle RM fails over after getApplicationID() and before submitApplication().
[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1410: Attachment: YARN-1410.10.patch Handle RM fails over after getApplicationID() and before submitApplication(). - Key: YARN-1410 URL: https://issues.apache.org/jira/browse/YARN-1410 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, YARN-1410.10.patch, YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch, YARN-1410.5.patch, YARN-1410.6.patch, YARN-1410.7.patch, YARN-1410.8.patch, YARN-1410.9.patch Original Estimate: 48h Remaining Estimate: 48h App submission involves 1) creating appId 2) using that appId to submit an ApplicationSubmissionContext to the user. The client may have obtained an appId from an RM, the RM may have failed over, and the client may submit the app to the new RM. Since the new RM has a different notion of cluster timestamp (used to create app id) the new RM may reject the app submission resulting in unexpected failure on the client side. The same may happen for other 2 step client API operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1410) Handle RM fails over after getApplicationID() and before submitApplication().
[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-1410: Attachment: YARN-1410.10.patch Handle RM fails over after getApplicationID() and before submitApplication(). - Key: YARN-1410 URL: https://issues.apache.org/jira/browse/YARN-1410 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, YARN-1410.10.patch, YARN-1410.10.patch, YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch, YARN-1410.5.patch, YARN-1410.6.patch, YARN-1410.7.patch, YARN-1410.8.patch, YARN-1410.9.patch Original Estimate: 48h Remaining Estimate: 48h App submission involves 1) creating appId 2) using that appId to submit an ApplicationSubmissionContext to the user. The client may have obtained an appId from an RM, the RM may have failed over, and the client may submit the app to the new RM. Since the new RM has a different notion of cluster timestamp (used to create app id) the new RM may reject the app submission resulting in unexpected failure on the client side. The same may happen for other 2 step client API operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1410) Handle RM fails over after getApplicationID() and before submitApplication().
[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922832#comment-13922832 ] Xuan Gong commented on YARN-1410: - Created a new patch to address all the comments, which includes: 1. Created a new exception: ApplicationIdNotProvidedException. Before we do the application submission, we check the applicationId. If the applicationId is not provided in the ApplicationSubmissionContext, we throw ApplicationIdNotProvidedException. This requires that the client provide the ApplicationId before submission. 2. Added documentation in several places: GetNewApplicationResponse (explicitly saying the applicationId can be used to submit the application) and YarnClient#submitApplication. * SubmitApplicationResponse: nothing changes here, so no new documentation is added for this class. * ApplicationClientProtocol.getNewApplication(..) and ApplicationClientProtocol.submitApplication(..): no new documentation added; the current documents have enough information about what clients need to do when we return an appID. 3. Modified the test cases: * modified TestYarnClient#testSubmitApplication() to validate that we get ApplicationIdNotProvidedException if the applicationId is not provided * added a new test, TestSubmitApplicationWithRMHA, to test handling RM failover before submitApplication() Handle RM fails over after getApplicationID() and before submitApplication(). - Key: YARN-1410 URL: https://issues.apache.org/jira/browse/YARN-1410 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, YARN-1410.10.patch, YARN-1410.10.patch, YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch, YARN-1410.5.patch, YARN-1410.6.patch, YARN-1410.7.patch, YARN-1410.8.patch, YARN-1410.9.patch Original Estimate: 48h Remaining Estimate: 48h App submission involves 1) creating appId 2) using that appId to submit an ApplicationSubmissionContext to the user. The client may have obtained an appId from an RM, the RM may have failed over, and the client may submit the app to the new RM. Since the new RM has a different notion of cluster timestamp (used to create app id) the new RM may reject the app submission resulting in unexpected failure on the client side. The same may happen for other 2 step client API operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
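Item 1 reduces to a small precondition in the client's submit path. A toy sketch with stand-in types follows; the real exception lives in yarn-api and, per the usual convention there, would extend YarnException rather than RuntimeException.
{code}
// Toy stand-ins for ApplicationSubmissionContext and the new exception.
class ApplicationIdNotProvidedException extends RuntimeException {
  ApplicationIdNotProvidedException(String msg) { super(msg); }
}

class SubmissionContextSketch {
  private final String applicationId; // null when the client never set it
  SubmissionContextSketch(String applicationId) { this.applicationId = applicationId; }
  String getApplicationId() { return applicationId; }
}

class YarnClientSketch {
  void submitApplication(SubmissionContextSketch context) {
    if (context.getApplicationId() == null) {
      throw new ApplicationIdNotProvidedException(
          "ApplicationId is not provided in ApplicationSubmissionContext");
    }
    // ... hand the context off to the (possibly new) active RM ...
  }
}
{code}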
[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-1790: -- Attachment: (was: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch) FairSchedule UI not showing apps table -- Key: YARN-1790 URL: https://issues.apache.org/jira/browse/YARN-1790 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Attachments: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, fs_ui_fixed.png There is a running job, which shows up in the summary table in the FairScheduler UI, the queue display, etc. Just not in the apps table at the bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] bc Wong updated YARN-1790: -- Attachment: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch Same patch with --no-prefix. FairSchedule UI not showing apps table -- Key: YARN-1790 URL: https://issues.apache.org/jira/browse/YARN-1790 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Attachments: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, fs_ui_fixed.png There is a running job, which shows up in the summary table in the FairScheduler UI, the queue display, etc. Just not in the apps table at the bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922906#comment-13922906 ] Chris Trezzo commented on YARN-1492: bq. Do you want to go ahead and create sub-tasks? Will do. We have already made significant progress on implementation internally, so we should have a number of patches posted shortly. truly shared cache for jars (jobjar/libjar) --- Key: YARN-1492 URL: https://issues.apache.org/jira/browse/YARN-1492 Project: Hadoop YARN Issue Type: New Feature Affects Versions: 2.0.4-alpha Reporter: Sangjin Lee Assignee: Sangjin Lee Attachments: shared_cache_design.pdf, shared_cache_design_v2.pdf, shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, shared_cache_design_v5.pdf Currently there is the distributed cache that enables you to cache jars and files so that attempts from the same job can reuse them. However, sharing is limited with the distributed cache because it is normally on a per-job basis. On a large cluster, sometimes copying of jobjars and libjars becomes so prevalent that it consumes a large portion of the network bandwidth, not to speak of defeating the purpose of bringing compute to where data is. This is wasteful because in most cases code doesn't change much across many jobs. I'd like to propose and discuss feasibility of introducing a truly shared cache so that multiple jobs from multiple users can share and cache jars. This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1783) yarn application does not make any progress even when no other application is running when RM is being restarted in the background
[ https://issues.apache.org/jira/browse/YARN-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922920#comment-13922920 ] Xuan Gong commented on YARN-1783: - The logic to handle NodeAction.RESYNC looks good to me. But there is one more issue. It is very possible that there is a container whose state is not yet COMPLETE when we generate the NodeStatus and send it to the RM, but whose state becomes COMPLETE after we receive the response. In this patch, we remove all the completed containers. In that case, we will remove this container from the context, and this container’s status will be missed. yarn application does not make any progress even when no other application is running when RM is being restarted in the background -- Key: YARN-1783 URL: https://issues.apache.org/jira/browse/YARN-1783 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1783.1.patch, YARN-1783.2.patch Noticed that during HA tests some tests took over 3 hours to run when the test failed. Looking at the logs I see the application made no progress for a very long time. However, if I look at the application log from YARN, it actually ran in 5 mins. I am seeing the same behavior when RM was being restarted in the background and when both RM and AM were being restarted. This does not happen for all applications but a few will hit this in the nightly run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1466) implement the cleaner service for the shared cache
[ https://issues.apache.org/jira/browse/YARN-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee resolved YARN-1466. --- Resolution: Invalid I'll close out these JIRAs for YARN-1492, as the design has changed from the time these JIRAs were filed. implement the cleaner service for the shared cache -- Key: YARN-1466 URL: https://issues.apache.org/jira/browse/YARN-1466 Project: Hadoop YARN Issue Type: New Feature Reporter: Sangjin Lee Assignee: Sangjin Lee -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1467) implement checksum verification for resource localization service for the shared cache
[ https://issues.apache.org/jira/browse/YARN-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee resolved YARN-1467. --- Resolution: Invalid I'll close out these JIRAs for YARN-1492, as the design has changed from the time these JIRAs were filed. implement checksum verification for resource localization service for the shared cache -- Key: YARN-1467 URL: https://issues.apache.org/jira/browse/YARN-1467 Project: Hadoop YARN Issue Type: New Feature Reporter: Sangjin Lee Assignee: Sangjin Lee -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1788) AppsCompleted/AppsKilled metric is incorrect when MR job is killed with yarn application -kill
[ https://issues.apache.org/jira/browse/YARN-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1788: -- Priority: Critical (was: Major) Target Version/s: 2.4.0 Tx for the patch Varun! Marking it for 2.4 as it seems like a bad bug. The patch looks fine overall. But the test isn't very useful. TestRMAppTransitions and similar tests are basic unit tests that don't uncover a lot of bugs that happen during integration. You should imitate TestRMRestart.testQueueMetricsOnRMRestart() without the restart part; that should be a fine balance between a unit test, an integration test, and a real-life setup of starting clusters. AppsCompleted/AppsKilled metric is incorrect when MR job is killed with yarn application -kill -- Key: YARN-1788 URL: https://issues.apache.org/jira/browse/YARN-1788 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Tassapol Athiapinya Assignee: Varun Vasudev Priority: Critical Attachments: apache-yarn-1788.0.patch Run MR sleep job. Kill the application in RUNNING state. Observe RM metrics. Expecting AppsCompleted = 0/AppsKilled = 1 Actual is AppsCompleted = 1/AppsKilled = 0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1792) Add a CLI to kill yarn container
Tassapol Athiapinya created YARN-1792: - Summary: Add a CLI to kill yarn container Key: YARN-1792 URL: https://issues.apache.org/jira/browse/YARN-1792 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 2.4.0 Reporter: Tassapol Athiapinya One of my teammates saw an issue when there is dangling container. The reason could have been because of a bug in YARN application or unexpected environment failure. It is nice if YARN can handle this from YARN framework. I suggest YARN to provide a CLI to kill container(s). Security should be obeyed. In first phase, we could allow only YARN admin to kill container(s). The method should also work in both Linux and Windows platform. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1783) yarn application does not make any progress even when no other application is running when RM is being restarted in the background
[ https://issues.apache.org/jira/browse/YARN-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1783: -- Attachment: YARN-1783.3.patch Thanks for catching this! The new patch creates a separate collection for recording the previously completed containers when getNodeStatus is called, and removes from the context only those completed containers. yarn application does not make any progress even when no other application is running when RM is being restarted in the background -- Key: YARN-1783 URL: https://issues.apache.org/jira/browse/YARN-1783 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1783.1.patch, YARN-1783.2.patch, YARN-1783.3.patch Noticed that during HA tests some tests took over 3 hours to run when the test failed. Looking at the logs I see the application made no progress for a very long time. However, if I look at the application log from YARN, it actually ran in 5 mins. I am seeing the same behavior when RM was being restarted in the background and when both RM and AM were being restarted. This does not happen for all applications but a few will hit this in the nightly run. -- This message was sent by Atlassian JIRA (v6.2#6252)
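A self-contained sketch of that snapshot-then-remove pattern follows; the types are stand-ins (the real NM context tracks ContainerStatus objects), but the shape of the fix is as described above.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CompletedContainersSketch {
  // containerId -> final status; other NM threads keep adding here.
  private final Map<String, String> completed = new ConcurrentHashMap<String, String>();
  private final List<String> sentInLastHeartbeat = new ArrayList<String>();

  List<String> buildNodeStatus() {
    // Snapshot exactly what this heartbeat reports.
    sentInLastHeartbeat.clear();
    sentInLastHeartbeat.addAll(completed.keySet());
    return new ArrayList<String>(sentInLastHeartbeat);
  }

  void onHeartbeatResponse() {
    // Remove only what was actually reported; a container that completed
    // after the snapshot stays and goes out on the next heartbeat, so
    // its status is never lost.
    completed.keySet().removeAll(sentInLastHeartbeat);
    sentInLastHeartbeat.clear();
  }
}
{code}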
[jira] [Assigned] (YARN-1792) Add a CLI to kill yarn container
[ https://issues.apache.org/jira/browse/YARN-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-1792: --- Assignee: Xuan Gong Add a CLI to kill yarn container Key: YARN-1792 URL: https://issues.apache.org/jira/browse/YARN-1792 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 2.4.0 Reporter: Tassapol Athiapinya Assignee: Xuan Gong One of my teammates saw an issue when there is dangling container. The reason could have been because of a bug in YARN application or unexpected environment failure. It is nice if YARN can handle this from YARN framework. I suggest YARN to provide a CLI to kill container(s). Security should be obeyed. In first phase, we could allow only YARN admin to kill container(s). The method should also work in both Linux and Windows platform. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1792) Add a CLI to kill yarn container
[ https://issues.apache.org/jira/browse/YARN-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tassapol Athiapinya updated YARN-1792: -- Description: One of my teammates saw an issue when there was dangling container. The reason could have been because of a bug in YARN application or unexpected environment failure. It is nice if YARN can handle this from YARN framework. I suggest YARN to provide a CLI to kill container(s). Security should be obeyed. In first phase, we could allow only YARN admin to kill container(s). The method should also work in both Linux and Windows platform. was: One of my teammates saw an issue when there is dangling container. The reason could have been because of a bug in YARN application or unexpected environment failure. It is nice if YARN can handle this from YARN framework. I suggest YARN to provide a CLI to kill container(s). Security should be obeyed. In first phase, we could allow only YARN admin to kill container(s). The method should also work in both Linux and Windows platform. Add a CLI to kill yarn container Key: YARN-1792 URL: https://issues.apache.org/jira/browse/YARN-1792 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 2.4.0 Reporter: Tassapol Athiapinya Assignee: Xuan Gong One of my teammates saw an issue when there was dangling container. The reason could have been because of a bug in YARN application or unexpected environment failure. It is nice if YARN can handle this from YARN framework. I suggest YARN to provide a CLI to kill container(s). Security should be obeyed. In first phase, we could allow only YARN admin to kill container(s). The method should also work in both Linux and Windows platform. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1792) Add a CLI to kill yarn container
[ https://issues.apache.org/jira/browse/YARN-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922970#comment-13922970 ] Ramya Sunil commented on YARN-1792: --- Duplicate of YARN-1619 Add a CLI to kill yarn container Key: YARN-1792 URL: https://issues.apache.org/jira/browse/YARN-1792 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 2.4.0 Reporter: Tassapol Athiapinya Assignee: Xuan Gong One of my teammates saw an issue when there was dangling container. The reason could have been because of a bug in YARN application or unexpected environment failure. It is nice if YARN can handle this from YARN framework. I suggest YARN to provide a CLI to kill container(s). Security should be obeyed. In first phase, we could allow only YARN admin to kill container(s). The method should also work in both Linux and Windows platform. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1793) yarn application -kill doesn't kill UnmanagedAMs
Karthik Kambatla created YARN-1793: -- Summary: yarn application -kill doesn't kill UnmanagedAMs Key: YARN-1793 URL: https://issues.apache.org/jira/browse/YARN-1793 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Trying to kill an Unmanaged AM through the CLI (yarn application -kill id) logs a success, but doesn't actually kill the AM or reclaim the containers allocated to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1793) yarn application -kill doesn't kill UnmanagedAMs
[ https://issues.apache.org/jira/browse/YARN-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922989#comment-13922989 ] Karthik Kambatla commented on YARN-1793: {code} if (application.isAppSafeToTerminate()) { RMAuditLogger.logSuccess(callerUGI.getShortUserName(), AuditConstants.KILL_APP_REQUEST, "ClientRMService", applicationId); return KillApplicationResponse.newInstance(true); } else { this.rmContext.getDispatcher().getEventHandler() .handle(new RMAppEvent(applicationId, RMAppEventType.KILL)); return KillApplicationResponse.newInstance(false); } {code} Looks like we don't do anything but log and return if the app is unmanaged. If the AM continues to run, it continues to hold onto all the containers that were allocated to it. [~jianhe], [~vinodkv], [~bikassaha] - any thoughts off the top of your head? yarn application -kill doesn't kill UnmanagedAMs Key: YARN-1793 URL: https://issues.apache.org/jira/browse/YARN-1793 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Trying to kill an Unmanaged AM through the CLI (yarn application -kill id) logs a success, but doesn't actually kill the AM or reclaim the containers allocated to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
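Going by the excerpt above, one possible direction, offered here only as a sketch and not as the committed fix, is to short-circuit only when the app has genuinely reached a terminal state and otherwise always dispatch the KILL event, for unmanaged AMs too:
{code}
// Sketch against the excerpt above; isAppFinalStateStored() is an
// assumed predicate meaning "the app already reached and persisted a
// terminal state", replacing the too-permissive isAppSafeToTerminate().
if (application.isAppFinalStateStored()) {
  RMAuditLogger.logSuccess(callerUGI.getShortUserName(),
      AuditConstants.KILL_APP_REQUEST, "ClientRMService", applicationId);
  return KillApplicationResponse.newInstance(true);
} else {
  this.rmContext.getDispatcher().getEventHandler()
      .handle(new RMAppEvent(applicationId, RMAppEventType.KILL));
  return KillApplicationResponse.newInstance(false);
}
{code}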
[jira] [Commented] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13922990#comment-13922990 ] Hadoop QA commented on YARN-1790: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633204/0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3279//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3279//console This message is automatically generated. FairSchedule UI not showing apps table -- Key: YARN-1790 URL: https://issues.apache.org/jira/browse/YARN-1790 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Attachments: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, fs_ui_fixed.png There is a running job, which shows up in the summary table in the FairScheduler UI, the queue display, etc. Just not in the apps table at the bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1685) Bugs around log URL
[ https://issues.apache.org/jira/browse/YARN-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1685: -- Attachment: YARN-1685.4.patch Uploaded a new patch: 1. Moved the logic of constructing the log URL pointing to the current timeline server into ApplicationHistoryManagerImpl, which ensures the log URL is correct in the RPC interface as well and simplifies the changes. 2. Added a test case to verify the log URL delivered from the RPC interface. Bugs around log URL --- Key: YARN-1685 URL: https://issues.apache.org/jira/browse/YARN-1685 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Zhijie Shen Attachments: YARN-1685-1.patch, YARN-1685.2.patch, YARN-1685.3.patch, YARN-1685.4.patch 1. Log URL should be different when the container is running and finished 2. Null case needs to be handled 3. The way of constructing log URL should be corrected -- This message was sent by Atlassian JIRA (v6.2#6252)
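For readers following along, the kind of construction being centralized looks roughly like the sketch below. The helper name and the web path layout are illustrative assumptions, not the patch's actual code; the point of the change is that the link is derived from the timeline server's current address at read time rather than an address captured when the history was written.
{code}
// Hypothetical helper of the sort being moved into ApplicationHistoryManagerImpl.
// The method name and path segments are assumptions for illustration only.
public class LogUrlSketch {
  static String buildLogUrl(String timelineWebAddr, String nodeId,
      String containerId, String user) {
    // Point at the *current* timeline server so the URL stays valid
    // across restarts and is identical for web UI and RPC consumers.
    return "//" + timelineWebAddr + "/applicationhistory/logs/"
        + nodeId + "/" + containerId + "/" + containerId + "/" + user;
  }

  public static void main(String[] args) {
    System.out.println(buildLogUrl("ahs.example.com:8188",
        "nm1.example.com:45454",
        "container_1394064846476_0013_01_000003", "yarn"));
  }
}
{code}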
[jira] [Commented] (YARN-1525) Web UI should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923039#comment-13923039 ] Karthik Kambatla commented on YARN-1525: bq. In the case the standby RM, where isStandbyMode() returns true and we start to buildRedirectPath, switches to be active, keeping the variable RMWebApp#standbyMode allows the behavior to be consistent from the time we tested isStandbyMode() first in RMDispatcher.service(). I see. This can be a little confusing. Can we change {{boolean isStandbyMode}} to {{void checkStandbyMode}} and have it set the field standbyMode? We can then maybe access this field directly or through an accessor. bq. I've been using a temporary configuration (which you can find from RMWebApp.getRedirectPath()). I was actually resetting RMid back in the original code. It was necessary if I was not using a temporary configuration. But I'll remove them since I've been using temporary configuration. Now I see why it was working fine. Still, I think we should also fix RMHAUtils#findActiveRMHAId to use a copy and not mutate the conf that is passed to it, as this is a util method and will likely be used in other places. We should also get rid of getting and setting rm-id in RMWebApp#buildRedirectPath. Web UI should redirect to active RM when HA is enabled. --- Key: YARN-1525 URL: https://issues.apache.org/jira/browse/YARN-1525 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Cindy Li Attachments: YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch.v1, YARN1525.patch.v2, YARN1525.patch.v3, YARN1525.v7.patch, YARN1525.v7.patch, YARN1525.v8.patch, YARN1525.v9.patch, Yarn1525.secure.patch, Yarn1525.secure.patch When failover happens, web UI should redirect to the current active rm. -- This message was sent by Atlassian JIRA (v6.2#6252)
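The don't-mutate-the-passed-in-conf point is easy to get wrong, so here is a minimal sketch of the suggested shape of RMHAUtils#findActiveRMHAId. The isActive probe below is a placeholder for whatever the real method uses to query each RM's HA state; the Configuration copy constructor and the YarnConfiguration keys are real Hadoop APIs.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FindActiveRMSketch {
  // Sketch of the suggested fix: probe each rm-id against a defensive copy
  // so the caller's Configuration object is never mutated.
  public static String findActiveRMHAId(Configuration conf) {
    // The copy constructor clones all properties; mutations stay local.
    YarnConfiguration localConf = new YarnConfiguration(conf);
    for (String rmId : localConf.getStringCollection(YarnConfiguration.RM_HA_IDS)) {
      localConf.set(YarnConfiguration.RM_HA_ID, rmId); // local mutation only
      if (isActive(localConf)) {
        return rmId;
      }
    }
    return null; // no active RM found
  }

  // Placeholder: the real implementation asks the RM for its HA service state.
  private static boolean isActive(Configuration conf) {
    return false;
  }
}
{code}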
[jira] [Resolved] (YARN-1792) Add a CLI to kill yarn container
[ https://issues.apache.org/jira/browse/YARN-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli resolved YARN-1792. --- Resolution: Duplicate Closing as duplicate. Add a CLI to kill yarn container Key: YARN-1792 URL: https://issues.apache.org/jira/browse/YARN-1792 Project: Hadoop YARN Issue Type: Improvement Components: client Affects Versions: 2.4.0 Reporter: Tassapol Athiapinya Assignee: Xuan Gong One of my teammates saw an issue when there was a dangling container. The reason could have been a bug in the YARN application or an unexpected environment failure. It would be nice if YARN could handle this within the YARN framework. I suggest that YARN provide a CLI to kill container(s). Security should be obeyed: in the first phase, we could allow only the YARN admin to kill container(s). The method should also work on both Linux and Windows platforms. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1619) Add cli to kill yarn container
[ https://issues.apache.org/jira/browse/YARN-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-1619: --- Assignee: Xuan Gong Add cli to kill yarn container -- Key: YARN-1619 URL: https://issues.apache.org/jira/browse/YARN-1619 Project: Hadoop YARN Issue Type: New Feature Reporter: Ramya Sunil Assignee: Xuan Gong Fix For: 2.4.0 It will be useful to have a generic cli tool to kill containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1783) yarn application does not make any progress even when no other application is running when RM is being restarted in the background
[ https://issues.apache.org/jira/browse/YARN-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923073#comment-13923073 ] Hadoop QA commented on YARN-1783: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633212/YARN-1783.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3280//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3280//console This message is automatically generated. yarn application does not make any progress even when no other application is running when RM is being restarted in the background -- Key: YARN-1783 URL: https://issues.apache.org/jira/browse/YARN-1783 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Jian He Priority: Critical Attachments: YARN-1783.1.patch, YARN-1783.2.patch, YARN-1783.3.patch Noticed that during HA tests some tests took over 3 hours to run when the test failed. Looking at the logs I see the application made no progress for a very long time. However, if I look at the application log from YARN, it actually ran in 5 minutes. I am seeing the same behavior when the RM was being restarted in the background and when both the RM and AM were being restarted. This does not happen for all applications, but a few will hit this in the nightly run. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1793) yarn application -kill doesn't kill UnmanagedAMs
[ https://issues.apache.org/jira/browse/YARN-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923099#comment-13923099 ] Karthik Kambatla commented on YARN-1793: What do you think about getting rid of this if-else altogether and creating the new RMAppEvent for the kill in both cases? yarn application -kill doesn't kill UnmanagedAMs Key: YARN-1793 URL: https://issues.apache.org/jira/browse/YARN-1793 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Trying to kill an Unmanaged AM through the CLI (yarn application -kill id) logs a success, but doesn't actually kill the AM or reclaim the containers allocated to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
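Concretely, the suggestion would collapse the snippet quoted in the earlier comment to something like the fragment below. This is a sketch against the fields shown in that snippet, not the final patch; the RMApp state machine then decides what, if anything, remains to be done for managed and unmanaged AMs alike.
{code}
// Hypothetical simplification of the branch quoted above: always dispatch
// the KILL and let the RMApp state machine sort out what is left to do.
this.rmContext.getDispatcher().getEventHandler()
    .handle(new RMAppEvent(applicationId, RMAppEventType.KILL));
// Returning false tells the client the app is not terminated yet and it
// should poll for completion.
return KillApplicationResponse.newInstance(false);
{code}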
[jira] [Updated] (YARN-1793) yarn application -kill doesn't kill UnmanagedAMs
[ https://issues.apache.org/jira/browse/YARN-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1793: --- Attachment: yarn-1793-0.patch Simple patch that seems to fix the issue. Removed UnmanagedAM from the isAppSafeToTerminate check - there are only two uses of this method, and it looks like we want to treat UnmanagedAMs differently in both places. Looking for early feedback on whether this is an acceptable approach. In a way, this is similar to what ApplicationMasterService#finishApplicationMaster does. TODO: * Rename isAppSafeToTerminate - don't think it conveys what it is intended to do. * Simplify ApplicationMasterService#finishApplicationMaster. We seem to be doing the same thing for both managed and unmanaged AMs. The method can use some simplification. * Unit tests where possible. yarn application -kill doesn't kill UnmanagedAMs Key: YARN-1793 URL: https://issues.apache.org/jira/browse/YARN-1793 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-1793-0.patch Trying to kill an Unmanaged AM through the CLI (yarn application -kill id) logs a success, but doesn't actually kill the AM or reclaim the containers allocated to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923176#comment-13923176 ] Sangjin Lee commented on YARN-1771: --- I have been looking into this from the perspective of reducing the number of unnecessary getFileStatus calls (and thereby reducing the pressure on the name node). So for now I'm gravitating towards a solution that caches the getFileStatus calls for the duration of a container initialization (i.e. resource localization). It would be pretty effective, reducing the number of calls from (m + 3)*n to n + (a small constant). I'll upload a patch for your review shortly. Thanks! many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example:
{noformat}
2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-1771: -- Attachment: yarn-1771.patch many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: yarn-1771.patch We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example:
{noformat}
2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923226#comment-13923226 ] Sangjin Lee commented on YARN-1771: --- I have created a status cache at the LocalizerContext level, and let FSDownload utilize the cache when querying the file status for the parent directories. I considered using a simple synchronized map and ConcurrentHashMap, but settled on using Guava's LoadingCache. The issue with the localization pattern is that it is bursty. Most of the downloads happen in parallel, and thus most of these getFileStatus calls also go out in a burst. With a synchronized map, the problem is that these calls would be unnecessarily serialized (as each call needs to acquire a global lock for the map). With a ConcurrentHashMap, calls can be concurrent, but with simple ConcurrentMap usage it becomes harder to avoid extra getFileStatus calls. The LoadingCache maintains concurrency *and* limits the getFileStatus calls to strictly one call per path (I added a unit test to verify that). many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: yarn-1771.patch We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example:
{noformat}
2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
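A minimal sketch of the pattern described above, assuming Guava on the classpath; the class name and paths are illustrative, not the patch's actual code. LoadingCache's documented behavior is that concurrent get() calls for the same key wait on a single in-flight load, which is what bounds the name node traffic to one getFileStatus per path.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class StatCacheSketch {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());

    // Concurrent get() calls for the same Path block on one load, so each
    // path costs exactly one getFileStatus RPC for the cache's lifetime
    // (scoped, as in the approach above, to one container localization).
    LoadingCache<Path, FileStatus> statCache = CacheBuilder.newBuilder()
        .build(new CacheLoader<Path, FileStatus>() {
          @Override
          public FileStatus load(Path path) throws Exception {
            return fs.getFileStatus(path); // the single real RPC per path
          }
        });

    // Walking a resource's ancestors (as the public-ness check does) now
    // reuses cached entries instead of re-issuing getfileinfo for /tmp, /.
    Path jar = new Path("/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar");
    for (Path p = jar.getParent(); p != null; p = p.getParent()) {
      System.out.println(p + " -> " + statCache.get(p).getPermission());
    }
  }
}
{code}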
[jira] [Updated] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1341: - Attachment: YARN-1341.patch Patch to enable the recovery of NMTokens. Like YARN-1338, it uses leveldb as the state store, or a null state store if recovery is not enabled. Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Attachments: YARN-1341.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
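The leveldb-or-null arrangement mentioned above is a null-object pattern: when recovery is disabled, callers still get a store, just one whose writes do nothing. A compact sketch with hypothetical interface and class names (the actual YARN-1338/YARN-1341 state-store classes differ):
{code}
import java.io.IOException;

// Hypothetical names for illustration; the real NM state-store API differs.
interface TokenStateStore {
  void storeNMTokenMasterKey(byte[] key) throws IOException;
}

// Persists to leveldb so NMTokens survive a nodemanager restart.
class LeveldbTokenStateStore implements TokenStateStore {
  public void storeNMTokenMasterKey(byte[] key) throws IOException {
    // write-through to the leveldb database (elided in this sketch)
  }
}

// Used when recovery is disabled: callers never need an if-enabled check.
class NullTokenStateStore implements TokenStateStore {
  public void storeNMTokenMasterKey(byte[] key) { /* deliberate no-op */ }
}

class TokenStateStoreFactory {
  static TokenStateStore create(boolean recoveryEnabled) {
    return recoveryEnabled
        ? new LeveldbTokenStateStore() : new NullTokenStateStore();
  }
}
{code}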
[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923278#comment-13923278 ] Mayank Bansal commented on YARN-1389: - Thanks [~zjshen] for the review. bq. 1. In the javadoc of ApplicationClientProtocol, we shouldn't mention ApplicationHistoryServer, because the applications obtained from this protocol are all from the RM cache instead of the history store Done bq. 2. attempts won't be null, and getApplications doesn't throw ApplicationNotFoundException when getting an empty list of applications. Let's keep the behavior consistent. Same for getContainers. And in YarnClientImpl, don't process ApplicationAttemptNotFoundException and ContainerNotFoundException in the corresponding places. In some cases it can be null, so keeping it that way. We wanted to avoid a get-status call for the application. bq. 3. TestClientRMService needs more test cases as well, like what you did in TestYarnClient Done bq. 4. Please test the new APIs in a pseudo cluster to verify whether it works or not. Thanks! Done, it works :) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs - Key: YARN-1389 URL: https://issues.apache.org/jira/browse/YARN-1389 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, YARN-1389-4.patch As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923276#comment-13923276 ] Hadoop QA commented on YARN-1341: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633251/YARN-1341.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 11 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3283//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3283//console This message is automatically generated. Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923280#comment-13923280 ] Mayank Bansal commented on YARN-1389: - bq. I realize there will be an issue that may not have an immediate solution. Currently, if an application is finished, we can get all of its finished containers from the history store. However, if an application is still running, YarnScheduler is going to remove the container from its cache once the container is done. Therefore, we're unable to get the finished containers of a running application. It seems that we need to cache RMContainer until the application is finished. Thoughts? Yes, we should have this; right now there is an inconsistency in the finished containers reported for running apps. I will create another JIRA to track that. Thanks, Mayank ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs - Key: YARN-1389 URL: https://issues.apache.org/jira/browse/YARN-1389 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, YARN-1389-4.patch, YARN-1389-5.patch As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1389: Attachment: YARN-1389-5.patch Updating latest patch. Thanks, Mayank ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs - Key: YARN-1389 URL: https://issues.apache.org/jira/browse/YARN-1389 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, YARN-1389-4.patch, YARN-1389-5.patch As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923294#comment-13923294 ] Hadoop QA commented on YARN-1771: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633247/yarn-1771.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/3282//artifact/trunk/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3282//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3282//console This message is automatically generated. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: yarn-1771.patch We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example:
{noformat}
2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1341: - Attachment: YARN-1341v2.patch Revised patch without the addition of the state store to the NM context since it's not necessary for this change. Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch, YARN-1341v2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1794) Yarn CLI only shows running containers for Running Applications
Mayank Bansal created YARN-1794: --- Summary: Yarn CLI only shows running containers for Running Applications Key: YARN-1794 URL: https://issues.apache.org/jira/browse/YARN-1794 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1794) Yarn CLI only shows running containers for Running Applications
[ https://issues.apache.org/jira/browse/YARN-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-1794: Description: (was: As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users.) Yarn CLI only shows running containers for Running Applications --- Key: YARN-1794 URL: https://issues.apache.org/jira/browse/YARN-1794 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1794) Yarn CLI only shows running containers for Running Applications
[ https://issues.apache.org/jira/browse/YARN-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923297#comment-13923297 ] Mayank Bansal commented on YARN-1794: - After YARN-1389 we have the capability to show attempts and containers for a running application; however, we cannot show the finished containers of a running application until the app itself is finished. Thanks, Mayank Yarn CLI only shows running containers for Running Applications --- Key: YARN-1794 URL: https://issues.apache.org/jira/browse/YARN-1794 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923301#comment-13923301 ] Mayank Bansal commented on YARN-1389: - Opened the JIRA for this issue YARN-1794 Thanks, Mayank ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs - Key: YARN-1389 URL: https://issues.apache.org/jira/browse/YARN-1389 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, YARN-1389-4.patch, YARN-1389-5.patch As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of running instance to ApplicationClientProtocol, while that of finished instance to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1621) Add CLI to list states of yarn container-IDs/hosts
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923302#comment-13923302 ] Mayank Bansal commented on YARN-1621: - It should be covered by YARN-1389. Thanks, Mayank Add CLI to list states of yarn container-IDs/hosts -- Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Fix For: 2.4.0 As more applications are moved to YARN, we need a generic CLI to list the states of yarn containers and their hosts. Today, if a YARN application running in a container hangs, there is no way to stop it other than manually killing its process. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers.
{code:title=proposed yarn cli}
$ yarn application -list-containers appId status
where status is one of running/succeeded/killed/failed/all
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1790) FairSchedule UI not showing apps table
[ https://issues.apache.org/jira/browse/YARN-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923306#comment-13923306 ] Sandy Ryza commented on YARN-1790: -- +1 FairSchedule UI not showing apps table -- Key: YARN-1790 URL: https://issues.apache.org/jira/browse/YARN-1790 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.3.0 Reporter: bc Wong Assignee: bc Wong Attachments: 0001-YARN-1790.-FairScheduler-UI-not-showing-apps-table.patch, fs_ui.png, fs_ui_fixed.png There is a running job, which shows up in the summary table in the FairScheduler UI, the queue display, etc. Just not in the apps table at the bottom. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1525) Web UI should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cindy Li updated YARN-1525: --- Attachment: YARN1525.secure.v10.patch Made changes according to Karthik's comments. Web UI should redirect to active RM when HA is enabled. --- Key: YARN-1525 URL: https://issues.apache.org/jira/browse/YARN-1525 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Cindy Li Attachments: YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch.v1, YARN1525.patch.v2, YARN1525.patch.v3, YARN1525.secure.v10.patch, YARN1525.v7.patch, YARN1525.v7.patch, YARN1525.v8.patch, YARN1525.v9.patch, Yarn1525.secure.patch, Yarn1525.secure.patch When failover happens, web UI should redirect to the current active rm. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1771) many getFileStatus calls made from node manager for localizing a public distributed cache resource
[ https://issues.apache.org/jira/browse/YARN-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-1771: -- Attachment: yarn-1771.patch Fixed the javadoc warnings. many getFileStatus calls made from node manager for localizing a public distributed cache resource -- Key: YARN-1771 URL: https://issues.apache.org/jira/browse/YARN-1771 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0 Reporter: Sangjin Lee Assignee: Sangjin Lee Priority: Critical Attachments: yarn-1771.patch, yarn-1771.patch We're observing that the getFileStatus calls are putting a fair amount of load on the name node as part of checking the public-ness for localizing a resource that belongs in the public cache. We see 7 getFileStatus calls made for each of these resources. We should look into reducing the number of calls to the name node. One example:
{noformat}
2014-02-27 18:07:27,351 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,352 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724 ...
2014-02-27 18:07:27,353 INFO audit: ... cmd=getfileinfo src=/tmp ...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/...
2014-02-27 18:07:27,354 INFO audit: ... cmd=getfileinfo src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
2014-02-27 18:07:27,355 INFO audit: ... cmd=open src=/tmp/temp-887708724/tmp883330348/foo-0.0.44.jar ...
{noformat}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1795) Oozie tests are flakey after YARN-713
Robert Kanter created YARN-1795: --- Summary: Oozie tests are flakey after YARN-713 Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs:
{noformat}
2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206)
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196)
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}
I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched, it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? -- This message was sent by Atlassian JIRA (v6.2#6252)
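To make the reported failure mode concrete: the token cache is keyed by the NM's exact host:port string, so a launcher that resolved a slightly different port can never find the token. A toy illustration, with a plain map standing in for the real NMTokenCache:
{code}
import java.util.HashMap;
import java.util.Map;

public class TokenLookupSketch {
  public static void main(String[] args) {
    // Stand-in for the NM-token cache, keyed by "host:port".
    Map<String, String> tokens = new HashMap<String, String>();
    tokens.put("192.168.1.77:58217", "nm-token"); // address when token was issued

    // The launcher resolved a different port for (what should be) the same
    // NM, so the exact-string lookup misses and container launch fails.
    String launcherAddr = "192.168.1.77:58213";
    if (tokens.get(launcherAddr) == null) {
      System.out.println("No NMToken sent for " + launcherAddr);
    }
  }
}
{code}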
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923372#comment-13923372 ] Hadoop QA commented on YARN-1341: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633265/YARN-1341v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3285//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3285//console This message is automatically generated. Recover NMTokens upon nodemanager restart - Key: YARN-1341 URL: https://issues.apache.org/jira/browse/YARN-1341 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-1341.patch, YARN-1341v2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1795) Oozie tests are flakey after YARN-713
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1795: --- Priority: Critical (was: Major) Oozie tests are flakey after YARN-713 - Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Priority: Critical Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs:
{noformat}
2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206)
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196)
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}
I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched, it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1795) Oozie tests are flakey after YARN-713
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1795: --- Target Version/s: 2.4.0 Oozie tests are flakey after YARN-713 - Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Priority: Critical Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs:
{noformat}
2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206)
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196)
at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}
I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched, it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1796) container-executor shouldn't require o-r permissions
[ https://issues.apache.org/jira/browse/YARN-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron T. Myers updated YARN-1796: - Attachment: YARN-1796.patch Simple patch attached to relax the mode check in the container-executor. This patch also takes the liberty of fixing an inaccurate code comment that was nearby. container-executor shouldn't require o-r permissions Key: YARN-1796 URL: https://issues.apache.org/jira/browse/YARN-1796 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Aaron T. Myers Assignee: Aaron T. Myers Priority: Minor Attachments: YARN-1796.patch The container-executor currently checks that other users don't have read permissions. This is unnecessary and runs contrary to the debian packaging policy manual. This is the analogous fix for YARN that was done for MR1 in MAPREDUCE-2103. -- This message was sent by Atlassian JIRA (v6.2#6252)
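The check itself lives in the native (C) container-executor binary, but the gist of the relaxation can be rendered in Java terms: keep rejecting a binary that others can write, stop rejecting one that others can merely read. A sketch under that reading of the patch; the root-ownership and setuid checks the native code also performs are elided, and the install path is illustrative.
{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.util.Set;

public class ModeCheckSketch {
  // Java rendering of the relaxed rule; the real check is C code in the
  // native container-executor and also verifies ownership and setuid bits.
  static boolean permissionsAcceptable(Path binary) throws IOException {
    Set<PosixFilePermission> perms = Files.getPosixFilePermissions(binary);
    // World-writability is the real risk; o+r is harmless and is what
    // Debian packaging policy expects, so it is no longer rejected.
    return !perms.contains(PosixFilePermission.OTHERS_WRITE);
  }

  public static void main(String[] args) throws IOException {
    System.out.println(permissionsAcceptable(
        Paths.get("/usr/lib/hadoop-yarn/bin/container-executor")));
  }
}
{code}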
[jira] [Created] (YARN-1796) container-executor shouldn't require o-r permissions
Aaron T. Myers created YARN-1796: Summary: container-executor shouldn't require o-r permissions Key: YARN-1796 URL: https://issues.apache.org/jira/browse/YARN-1796 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Aaron T. Myers Assignee: Aaron T. Myers Priority: Minor The container-executor currently checks that other users don't have read permissions. This is unnecessary and runs contrary to the debian packaging policy manual. This is the analogous fix for YARN that was done for MR1 in MAPREDUCE-2103. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1525) Web UI should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923395#comment-13923395 ] Hadoop QA commented on YARN-1525: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633269/YARN1525.secure.v10.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.api.impl.TestNMClient {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3286//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3286//console This message is automatically generated. Web UI should redirect to active RM when HA is enabled. --- Key: YARN-1525 URL: https://issues.apache.org/jira/browse/YARN-1525 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Cindy Li Attachments: YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch.v1, YARN1525.patch.v2, YARN1525.patch.v3, YARN1525.secure.v10.patch, YARN1525.v7.patch, YARN1525.v7.patch, YARN1525.v8.patch, YARN1525.v9.patch, Yarn1525.secure.patch, Yarn1525.secure.patch When failover happens, web UI should redirect to the current active rm. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1525) Web UI should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923398#comment-13923398 ] Cindy Li commented on YARN-1525: The test org.apache.hadoop.yarn.client.api.impl.TestNMClient is irrelevant. Web UI should redirect to active RM when HA is enabled. --- Key: YARN-1525 URL: https://issues.apache.org/jira/browse/YARN-1525 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Cindy Li Attachments: YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch.v1, YARN1525.patch.v2, YARN1525.patch.v3, YARN1525.secure.v10.patch, YARN1525.v7.patch, YARN1525.v7.patch, YARN1525.v8.patch, YARN1525.v9.patch, Yarn1525.secure.patch, Yarn1525.secure.patch When failover happens, web UI should redirect to the current active rm. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1525) Web UI should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1525: --- Attachment: YARN1525.secure.v11.patch Thanks Cindy. Posting a patch with cosmetic changes (formatting etc.); also, removed changes to ResourceTrackerPBClientImpl which seemed spurious. Web UI should redirect to active RM when HA is enabled. --- Key: YARN-1525 URL: https://issues.apache.org/jira/browse/YARN-1525 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Cindy Li Attachments: YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch.v1, YARN1525.patch.v2, YARN1525.patch.v3, YARN1525.secure.v10.patch, YARN1525.secure.v11.patch, YARN1525.v7.patch, YARN1525.v7.patch, YARN1525.v8.patch, YARN1525.v9.patch, Yarn1525.secure.patch, Yarn1525.secure.patch When failover happens, web UI should redirect to the current active rm. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1525) Web UI should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923406#comment-13923406 ] Cindy Li commented on YARN-1525: That was for another patch... Ok. I should've removed that. Thanks for removing that. Web UI should redirect to active RM when HA is enabled. --- Key: YARN-1525 URL: https://issues.apache.org/jira/browse/YARN-1525 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Cindy Li Attachments: YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch.v1, YARN1525.patch.v2, YARN1525.patch.v3, YARN1525.secure.v10.patch, YARN1525.secure.v11.patch, YARN1525.v7.patch, YARN1525.v7.patch, YARN1525.v8.patch, YARN1525.v9.patch, Yarn1525.secure.patch, Yarn1525.secure.patch When failover happens, web UI should redirect to the current active rm. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1796) container-executor shouldn't require o-r permissions
[ https://issues.apache.org/jira/browse/YARN-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923433#comment-13923433 ] Vinod Kumar Vavilapalli commented on YARN-1796: --- I think I originally did that code in 1.x and 2.x. I know we were being excessively paranoid, but I haven't seen a reason why it should be opened up either. Where is the problem as it exists today? container-executor shouldn't require o-r permissions Key: YARN-1796 URL: https://issues.apache.org/jira/browse/YARN-1796 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.4.0 Reporter: Aaron T. Myers Assignee: Aaron T. Myers Priority: Minor Attachments: YARN-1796.patch The container-executor currently checks that other users don't have read permissions. This is unnecessary and runs contrary to the debian packaging policy manual. This is the analogous fix for YARN that was done for MR1 in MAPREDUCE-2103. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1410) Handle RM failover after getApplicationID() and before submitApplication().
[ https://issues.apache.org/jira/browse/YARN-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923434#comment-13923434 ] Vinod Kumar Vavilapalli commented on YARN-1410: --- Okay, I went back and reread the discussion. It seems like we diverged again. The approach in the latest patch doesn't seem to be the same as what Bikas and you agreed upon. Is that true? [~bikassaha], can you confirm if it is fine? We now blindly accept appIDs generated by a previous RM. Clearly, there is the possibility of malicious users generating appIDs (which exists today) - but there are a couple of ways in which we can fix that. Originally, it was also suggested that we add the app-ID to the SubmitResponse - which we aren't doing anymore, since the latest patch blindly accepts IDs from previous RMs. Is that fine? Handle RM failover after getApplicationID() and before submitApplication(). - Key: YARN-1410 URL: https://issues.apache.org/jira/browse/YARN-1410 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Xuan Gong Attachments: YARN-1410-outline.patch, YARN-1410.1.patch, YARN-1410.10.patch, YARN-1410.10.patch, YARN-1410.2.patch, YARN-1410.2.patch, YARN-1410.3.patch, YARN-1410.4.patch, YARN-1410.5.patch, YARN-1410.6.patch, YARN-1410.7.patch, YARN-1410.8.patch, YARN-1410.9.patch Original Estimate: 48h Remaining Estimate: 48h App submission involves 1) creating an appId and 2) using that appId to submit an ApplicationSubmissionContext to the RM. The client may have obtained an appId from an RM, the RM may have failed over, and the client may then submit the app to the new RM. Since the new RM has a different notion of cluster timestamp (used to create the app id), the new RM may reject the app submission, resulting in an unexpected failure on the client side. The same may happen for other 2-step client API operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
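For readers following along, the two-step flow at issue looks roughly like this with the public YarnClient API (a minimal sketch; populating the submission context is elided, and the failover behavior in the comments reflects the discussion above, not a settled design):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class TwoStepSubmit {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();

    // Step 1: ask the RM for an application ID. The ID embeds the RM's
    // cluster timestamp, so an ID minted by RM1 looks foreign to RM2
    // after a failover.
    YarnClientApplication app = client.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ApplicationId appId = ctx.getApplicationId();
    // ... populate ctx (app name, AM container spec, resources) ...

    // Step 2: submit with that ID. If the RM failed over between steps
    // 1 and 2, the behavior debated above determines whether the new RM
    // accepts this "foreign" ID or rejects the submission.
    client.submitApplication(ctx);

    client.stop();
  }
}
{code}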
[jira] [Commented] (YARN-1795) Oozie tests are flakey after YARN-713
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923457#comment-13923457 ] Vinod Kumar Vavilapalli commented on YARN-1795: --- Per [~sseth], it is likely that you are confusing the ports because this is a MiniYarnCluster setup where you are running multiple NMs on the same machine? The bug seems valid, but maybe the analysis isn't. Not completely sure either way. It'll be useful if you can capture RM logs specifically for this container. Oozie tests are flakey after YARN-713 - Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Priority: Critical Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs:
{noformat}
2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{noformat}
I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1342) Recover container tokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-1342: - Attachment: YARN-1342.patch Patch to recover container tokens after a restart. This is very similar to the patch for YARN-1341 but for container tokens instead of NM tokens. Recover container tokens upon nodemanager restart - Key: YARN-1342 URL: https://issues.apache.org/jira/browse/YARN-1342 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Attachments: YARN-1342.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1525) Web UI should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923492#comment-13923492 ] Hadoop QA commented on YARN-1525: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633287/YARN1525.secure.v11.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3288//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3288//console This message is automatically generated. Web UI should redirect to active RM when HA is enabled. --- Key: YARN-1525 URL: https://issues.apache.org/jira/browse/YARN-1525 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Cindy Li Attachments: YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch.v1, YARN1525.patch.v2, YARN1525.patch.v3, YARN1525.secure.v10.patch, YARN1525.secure.v11.patch, YARN1525.v7.patch, YARN1525.v7.patch, YARN1525.v8.patch, YARN1525.v9.patch, Yarn1525.secure.patch, Yarn1525.secure.patch When failover happens, web UI should redirect to the current active rm. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1793) yarn application -kill doesn't kill UnmanagedAMs
[ https://issues.apache.org/jira/browse/YARN-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923491#comment-13923491 ] Jian He commented on YARN-1793: --- Took a quick look at the patch. If I remember correctly, the special check in isAppSafeToTerminate for unmanaged AMs was added for this reason: https://issues.apache.org/jira/browse/YARN-540?focusedCommentId=13762533page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13762533 yarn application -kill doesn't kill UnmanagedAMs Key: YARN-1793 URL: https://issues.apache.org/jira/browse/YARN-1793 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-1793-0.patch Trying to kill an Unmanaged AM though CLI (yarn application -kill id) logs a success, but doesn't actually kill the AM or reclaim the containers allocated to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1780) Improve logging in timeline service
[ https://issues.apache.org/jira/browse/YARN-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923507#comment-13923507 ] Vinod Kumar Vavilapalli commented on YARN-1780: --- This looks good, +1. Checking this in. Improve logging in timeline service --- Key: YARN-1780 URL: https://issues.apache.org/jira/browse/YARN-1780 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-1780.1.patch, YARN-1780.1.patch It's difficult to trace whether the client has successfully posted the entity to the timeline service or not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1787) yarn applicationattempt/container print wrong usage information
[ https://issues.apache.org/jira/browse/YARN-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923521#comment-13923521 ] Vinod Kumar Vavilapalli commented on YARN-1787: --- I am fairly sure you broke bin/yarn application etc. with the patch, can you please verify? The patch looks fine overall other than the bin/yarn changes. Ideally, we should split the CLI into separate classes for app, appattempts etc. Will file a ticket. The other thing is that -queue <Queue Name> shouldn't be an option; it should just be an argument to -movetoqueue. Will file a ticket for that also. yarn applicationattempt/container print wrong usage information --- Key: YARN-1787 URL: https://issues.apache.org/jira/browse/YARN-1787 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: ApplicationCLI.java.rej, YARN-1787.1.patch, YARN-1787.2.patch yarn applicationattempt prints:
{code}
Invalid Command Usage :
usage: application
 -appStates <States>             Works with -list to filter applications
                                 based on input comma-separated list of
                                 application states. The valid application
                                 state can be one of the following:
                                 ALL,NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUN
                                 NING,FINISHED,FAILED,KILLED
 -appTypes <Types>               Works with -list to filter applications
                                 based on input comma-separated list of
                                 application types.
 -help                           Displays help for all commands.
 -kill <Application ID>          Kills the application.
 -list <arg>                     List application attempts for aplication
                                 from AHS.
 -movetoqueue <Application ID>   Moves the application to a different
                                 queue.
 -queue <Queue Name>             Works with the movetoqueue command to
                                 specify which queue to move an
                                 application to.
 -status <Application ID>        Prints the status of the application.
{code}
yarn container prints:
{code}
Invalid Command Usage :
usage: application
 -appStates <States>             Works with -list to filter applications
                                 based on input comma-separated list of
                                 application states. The valid application
                                 state can be one of the following:
                                 ALL,NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUN
                                 NING,FINISHED,FAILED,KILLED
 -appTypes <Types>               Works with -list to filter applications
                                 based on input comma-separated list of
                                 application types.
 -help                           Displays help for all commands.
 -kill <Application ID>          Kills the application.
 -list <arg>                     List application attempts for aplication
                                 from AHS.
 -movetoqueue <Application ID>   Moves the application to a different
                                 queue.
 -queue <Queue Name>             Works with the movetoqueue command to
                                 specify which queue to move an
                                 application to.
 -status <Application ID>        Prints the status of the application.
{code}
Both commands print irrelevant yarn application usage information. -- This message was sent by Atlassian JIRA (v6.2#6252)
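As an illustration of the suggested split into separate CLI classes per command, a hypothetical sketch using Apache Commons CLI (class and option names here are illustrative, not the actual YARN code):
{code}
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;

public class UsagePerCommand {
  // Each subcommand gets its own Options object, so "yarn container"
  // prints container-specific usage instead of the application usage.
  static Options containerOptions() {
    Options opts = new Options();
    opts.addOption("status", true, "Prints the status of the container.");
    opts.addOption("list", true,
        "List containers for the application attempt.");
    opts.addOption("help", false, "Displays help for all commands.");
    return opts;
  }

  public static void main(String[] args) {
    new HelpFormatter().printHelp("container", containerOptions());
  }
}
{code}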
[jira] [Updated] (YARN-1795) Oozie tests are flakey after YARN-713
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-1795: Attachment: syslog org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt I've attached the output from one of the tests; the RM logs are intermixed in it, but it's easy to grep for the container in question. I've also attached the syslog from one of the containers ({{container_1394161202967_0004_01_04}}) that had the problem. I modified the NMTokenCache to print out the tokens whenever getToken is called, so that's in there too. Oozie tests are flakey after YARN-713 - Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Priority: Critical Attachments: org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt, syslog Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs:
{noformat}
2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{noformat}
I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1525) Web UI should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923531#comment-13923531 ] Karthik Kambatla commented on YARN-1525: Thanks Cindy. +1 on the latest patch. Committing this shortly. We can address any improvements in follow-up JIRAs. Web UI should redirect to active RM when HA is enabled. --- Key: YARN-1525 URL: https://issues.apache.org/jira/browse/YARN-1525 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Cindy Li Attachments: YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch.v1, YARN1525.patch.v2, YARN1525.patch.v3, YARN1525.secure.v10.patch, YARN1525.secure.v11.patch, YARN1525.v7.patch, YARN1525.v7.patch, YARN1525.v8.patch, YARN1525.v9.patch, Yarn1525.secure.patch, Yarn1525.secure.patch When failover happens, web UI should redirect to the current active rm. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-1389: -- Target Version/s: 2.4.0 This is an important part of the generic-history feature for 2.4. ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs - Key: YARN-1389 URL: https://issues.apache.org/jira/browse/YARN-1389 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, YARN-1389-4.patch, YARN-1389-5.patch As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of a running instance to ApplicationClientProtocol, while that of a finished instance goes to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1621) Add CLI to list states of yarn container-IDs/hosts
[ https://issues.apache.org/jira/browse/YARN-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923540#comment-13923540 ] Vinod Kumar Vavilapalli commented on YARN-1621: --- It doesn't look like YARN-1389 is tracking filters for containers, so we need to track this separately. Add CLI to list states of yarn container-IDs/hosts -- Key: YARN-1621 URL: https://issues.apache.org/jira/browse/YARN-1621 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.2.0 Reporter: Tassapol Athiapinya Fix For: 2.4.0 As more applications are moved to YARN, we need generic CLI to list states of yarn containers and their hosts. Today if YARN application running in a container does hang, there is no way other than to manually kill its process. For each running application, it is useful to differentiate between running/succeeded/failed/killed containers. {code:title=proposed yarn cli} $ yarn application -list-containers appId status where status is one of running/succeeded/killed/failed/all {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1797) TestNodeManagerResync
Tsuyoshi OZAWA created YARN-1797: Summary: TestNodeManagerResync Key: YARN-1797 URL: https://issues.apache.org/jira/browse/YARN-1797 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1798) TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fail on Linux
Tsuyoshi OZAWA created YARN-1798: Summary: TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fail on Linux Key: YARN-1798 URL: https://issues.apache.org/jira/browse/YARN-1798 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1798) TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fail on Linux
[ https://issues.apache.org/jira/browse/YARN-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923545#comment-13923545 ] Tsuyoshi OZAWA commented on YARN-1798: -- Here is the failure summary:
{code}
Failed tests:
  TestContainerLaunch.testDelayedKill:723->internalKillTest:679->BaseContainerManagerTest.waitForContainerState:254->BaseContainerManagerTest.waitForContainerState:276 ContainerState is not correct (timedout) expected:<COMPLETE> but was:<RUNNING>
  TestContainerLaunch.testImmediateKill:728->internalKillTest:679->BaseContainerManagerTest.waitForContainerState:254->BaseContainerManagerTest.waitForContainerState:276 ContainerState is not correct (timedout) expected:<COMPLETE> but was:<RUNNING>
  TestContainerLaunch.testContainerEnvVariables:557 Process is not alive!
  TestContainerManager.testContainerLaunchAndStop:333 Process is not alive!
  TestContainersMonitor.testContainerKillOnMemoryOverflow:273 expected:<143> but was:<0>
  TestNodeManagerShutdown.testKillContainersOnShutdown:153 Did not find sigterm message
  TestNodeStatusUpdater.testNodeStatusUpdaterRetryAndNMShutdown:1186 Containers not cleaned up when NM stopped

Tests in error:
  TestNodeManagerResync.testKillContainersOnResync:91 » Metrics Metrics source J...

Tests run: 203, Failures: 7, Errors: 1, Skipped: 1
{code}
TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fail on Linux - Key: YARN-1798 URL: https://issues.apache.org/jira/browse/YARN-1798 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1798) TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fail on Linux
[ https://issues.apache.org/jira/browse/YARN-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923546#comment-13923546 ] Tsuyoshi OZAWA commented on YARN-1798: -- The results above are from running the tests locally. TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fail on Linux - Key: YARN-1798 URL: https://issues.apache.org/jira/browse/YARN-1798 Project: Hadoop YARN Issue Type: Bug Reporter: Tsuyoshi OZAWA -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1795) Oozie tests are flakey after YARN-713
[ https://issues.apache.org/jira/browse/YARN-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923552#comment-13923552 ] Robert Kanter commented on YARN-1795: - Looking at the printouts I added to the NMTokenCache, I think I've figured out some more: during the tests, we run 2 NodeManagers. With YARN-713, the NMTokenCache only ever has 1 token in it; in the cases where a container is trying to use one NM while the token is for the other, we get the InvalidToken error. I tried running without YARN-713, and the NMTokenCache usually has 2 tokens in it, so the containers are able to find the token in the NMTokenCache. I haven't had a chance to look into it more yet, but I did notice that YARN-713 changes NMTokenSecretManagerInRM's createAndGetNMTokens method, which returns a list of tokens, to createAndGetNMToken, which returns a single token. Perhaps that has something to do with this? Oozie tests are flakey after YARN-713 - Key: YARN-1795 URL: https://issues.apache.org/jira/browse/YARN-1795 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.4.0 Reporter: Robert Kanter Priority: Critical Attachments: org.apache.oozie.action.hadoop.TestMapReduceActionExecutor-output.txt, syslog Running the Oozie unit tests against a Hadoop build with YARN-713 causes many of the tests to be flakey. Doing some digging, I found that they were failing because some of the MR jobs were failing; I found this in the syslog of the failed jobs:
{noformat}
2014-03-05 16:18:23,452 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1394064846476_0013_m_00_0: Container launch failed for container_1394064846476_0013_01_03 : org.apache.hadoop.security.token.SecretManager$InvalidToken: No NMToken sent for 192.168.1.77:50759
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.newProxy(ContainerManagementProtocolProxy.java:206)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy$ContainerManagementProtocolProxyData.init(ContainerManagementProtocolProxy.java:196)
    at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.getProxy(ContainerManagementProtocolProxy.java:117)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:403)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:138)
    at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{noformat}
I did some debugging and found that the NMTokenCache has a different port number than what's being looked up. For example, the NMTokenCache had one token with address 192.168.1.77:58217 but ContainerManagementProtocolProxy.java:119 is looking for 192.168.1.77:58213. The 58213 address comes from ContainerLauncherImpl's constructor. So when the Container is being launched it somehow has a different port than when the token was created. Any ideas why the port numbers wouldn't match? -- This message was sent by Atlassian JIRA (v6.2#6252)
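To make the failure mode concrete: the cache is keyed by the NM's host:port string, so a token cached under one port cannot be found when the launcher looks up another. A small sketch of that lookup, assuming the Hadoop 2.x NMTokenCache singleton API and reusing the addresses from the report above:
{code}
import org.apache.hadoop.yarn.api.records.Token;
import org.apache.hadoop.yarn.client.api.NMTokenCache;

public class NMTokenDebug {
  // The cache maps "host:port" -> NMToken. If the launcher was built
  // with port 58213 but the token was cached under 58217, this lookup
  // returns null and the proxy later fails with InvalidToken.
  static void checkToken(String nmHostPort) {
    Token token = NMTokenCache.getSingleton().getToken(nmHostPort);
    System.out.println(nmHostPort + " -> "
        + (token == null ? "NO TOKEN CACHED" : token.getService()));
  }

  public static void main(String[] args) {
    checkToken("192.168.1.77:58213"); // address the launcher looks up
    checkToken("192.168.1.77:58217"); // address the token was cached under
  }
}
{code}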
[jira] [Commented] (YARN-1525) Web UI should redirect to active RM when HA is enabled.
[ https://issues.apache.org/jira/browse/YARN-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923551#comment-13923551 ] Hadoop QA commented on YARN-1525: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12633287/YARN1525.secure.v11.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3291//console This message is automatically generated. Web UI should redirect to active RM when HA is enabled. --- Key: YARN-1525 URL: https://issues.apache.org/jira/browse/YARN-1525 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Cindy Li Attachments: YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch, YARN1525.patch.v1, YARN1525.patch.v2, YARN1525.patch.v3, YARN1525.secure.v10.patch, YARN1525.secure.v11.patch, YARN1525.v7.patch, YARN1525.v7.patch, YARN1525.v8.patch, YARN1525.v9.patch, Yarn1525.secure.patch, Yarn1525.secure.patch When failover happens, web UI should redirect to the current active rm. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1799) Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff
Sunil G created YARN-1799: - Summary: Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff Key: YARN-1799 URL: https://issues.apache.org/jira/browse/YARN-1799 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Sunil G LocalDirAllocator provides paths to all tasks for their local writes. It considers the good list of directories selected by the health-check mechanism in LocalDirsHandlerService. getLocalPathForWrite() checks whether the requested size fits within the capacity of the last-accessed directory. When more tasks ask LocalDirAllocator for a path, each allocation is based on the disk availability at that instant, but the same directory may already have been handed to other tasks that are still writing to it sequentially. It would be better to also check against an upper cutoff on disk utilization. -- This message was sent by Atlassian JIRA (v6.2#6252)
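To illustrate the proposed cutoff, a minimal sketch of such a check (a hypothetical helper, not the actual LocalDirAllocator code; the 90% threshold and the /tmp path are placeholders):
{code}
import java.io.File;

public class DiskCutoffCheck {
  // Treat a local dir as unusable once projected utilization exceeds
  // maxUtilizationPercent, regardless of the instantaneous free space
  // a single request happens to observe.
  static boolean hasRoom(File dir, long requestedBytes,
                         float maxUtilizationPercent) {
    long total = dir.getTotalSpace();
    long usedAfter = total - dir.getUsableSpace() + requestedBytes;
    return (100f * usedAfter / total) <= maxUtilizationPercent;
  }

  public static void main(String[] args) {
    File localDir = new File("/tmp"); // stand-in for an NM local dir
    System.out.println(hasRoom(localDir, 64L << 20, 90f)); // 64 MB request
  }
}
{code}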
[jira] [Commented] (YARN-1781) NM should allow users to specify max disk utilization for local disks
[ https://issues.apache.org/jira/browse/YARN-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923563#comment-13923563 ] Sunil G commented on YARN-1781: --- I have created a separate JIRA, YARN-1799, as per the comment from Vinod. NM should allow users to specify max disk utilization for local disks - Key: YARN-1781 URL: https://issues.apache.org/jira/browse/YARN-1781 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: apache-yarn-1781.0.patch, apache-yarn-1781.1.patch, apache-yarn-1781.2.patch This is related to YARN-257 (it's probably a sub-task?). Currently, the NM does not detect full disks and allows full disks to be used by containers, leading to repeated failures. YARN-257 deals with graceful handling of full disks. This ticket is only about detection of full disks by the disk health checkers. The NM should allow users to set a maximum disk utilization for local disks and mark disks as bad once they exceed that utilization. At the very least, the NM should detect full disks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1780) Improve logging in timeline service
[ https://issues.apache.org/jira/browse/YARN-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923564#comment-13923564 ] Hudson commented on YARN-1780: -- SUCCESS: Integrated in Hadoop-trunk-Commit #5280 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5280/]) YARN-1780. Improved logging in the Timeline client and server. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1575141) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/webapp/TimelineWebServices.java Improve logging in timeline service --- Key: YARN-1780 URL: https://issues.apache.org/jira/browse/YARN-1780 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.4.0 Attachments: YARN-1780.1.patch, YARN-1780.1.patch It's difficult to trace whether the client has successfully posted the entity to the timeline service or not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1793) yarn application -kill doesn't kill UnmanagedAMs
[ https://issues.apache.org/jira/browse/YARN-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1793: --- Attachment: yarn-1793-1.patch yarn application -kill doesn't kill UnmanagedAMs Key: YARN-1793 URL: https://issues.apache.org/jira/browse/YARN-1793 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-1793-0.patch, yarn-1793-1.patch Trying to kill an Unmanaged AM though CLI (yarn application -kill id) logs a success, but doesn't actually kill the AM or reclaim the containers allocated to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1764) Handle RM failovers after the submitApplication call.
[ https://issues.apache.org/jira/browse/YARN-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923566#comment-13923566 ] Vinod Kumar Vavilapalli commented on YARN-1764: --- I think we should also mark getApplicationReport() as idempotent in this patch itself, as the RM can fail over after submitApplication() has returned but *during* a getApplicationReport(). We will need to add some tests for this too. Handle RM failovers after the submitApplication call. -- Key: YARN-1764 URL: https://issues.apache.org/jira/browse/YARN-1764 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-1764.1.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
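In Hadoop's RPC layer, idempotency is declared with the retry annotations from org.apache.hadoop.io.retry, which lets a failover retry policy safely re-invoke the call against the newly active RM. A sketch of the kind of declaration being suggested (illustrative interface and types, not the actual ApplicationClientProtocol signature):
{code}
import java.io.IOException;
import org.apache.hadoop.io.retry.Idempotent;

// Illustrative fragment of a client-facing protocol. Marking the method
// @Idempotent tells the failover retry proxy it can re-invoke the call
// on the new active RM without side effects.
interface ExampleClientProtocol {
  @Idempotent
  ExampleReport getApplicationReport(String appId) throws IOException;
}

class ExampleReport {
}
{code}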
[jira] [Commented] (YARN-1793) yarn application -kill doesn't kill UnmanagedAMs
[ https://issues.apache.org/jira/browse/YARN-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923568#comment-13923568 ] Karthik Kambatla commented on YARN-1793: Thanks for digging up the reason, [~jianhe]. Can you take a look at the updated patch? I looked into this more carefully; the updated patch changes both ClientRMService and ApplicationMasterService. I believe we have three cases based on the state and kind of application:
* Applications that have already reached a final state - do nothing, trivially log success.
* Applications that aren't in a final state yet - kill / unregister the application:
** UnmanagedAM - falsely acknowledge the kill / unregister so they don't retry
** ManagedAM - return false, so they keep retrying
Submitted the patch to see what Jenkins has to say. Still need to add unit tests. yarn application -kill doesn't kill UnmanagedAMs Key: YARN-1793 URL: https://issues.apache.org/jira/browse/YARN-1793 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Critical Attachments: yarn-1793-0.patch, yarn-1793-1.patch Trying to kill an Unmanaged AM though CLI (yarn application -kill id) logs a success, but doesn't actually kill the AM or reclaim the containers allocated to it. -- This message was sent by Atlassian JIRA (v6.2#6252)
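The three cases reduce to a small decision table; a schematic sketch (hypothetical names, the actual change lives in ClientRMService and ApplicationMasterService):
{code}
public class KillDecision {
  enum Outcome { LOG_SUCCESS, ACK_AND_KILL, REJECT_SO_CLIENT_RETRIES }

  // inFinalState: the app already finished/failed/was killed.
  // unmanagedAM: the AM runs outside the RM's control.
  static Outcome onKillRequest(boolean inFinalState, boolean unmanagedAM) {
    if (inFinalState) {
      return Outcome.LOG_SUCCESS;            // nothing left to do
    }
    if (unmanagedAM) {
      return Outcome.ACK_AND_KILL;           // ack so the caller stops retrying
    }
    return Outcome.REJECT_SO_CLIENT_RETRIES; // managed AM: caller retries
  }
}
{code}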
[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923569#comment-13923569 ] Zhijie Shen commented on YARN-1389: --- Thanks for the update, Mayank! The patch is generally fine. Here are some additional comments:
1. Is it simpler to use {{e instanceof NotFoundException}}?
{code}
+    // Even if history-service is enabled, treat all exceptions still the same
+    // except the following
+    if (e.getClass() != ApplicationNotFoundException.class
+        && e.getClass() != ApplicationAttemptNotFoundException.class) {
+      throw e;
+    }
{code}
2. getFinishedStatus() is not necessary. You can directly do when() on getDiagnostics/getExitStatus/getState.
{code}
+    ContainerStatus cs = mock(ContainerStatus.class);
+    when(containerimpl.getFinishedStatus()).thenReturn(cs);
+    when(containerimpl.getFinishedStatus().getDiagnostics()).thenReturn("N/A");
+    when(containerimpl.getFinishedStatus().getExitStatus()).thenReturn(0);
+    when(containerimpl.getFinishedStatus().getState()).thenReturn(
+        ContainerState.COMPLETE);
{code}
3. There's a lot of code duplication in TestClientRMService. You can move the common code into a private createClientRMService method, which is called by your test methods.
4. Shouldn't we remove throw new YarnException("History service is not enabled."); in YarnClientImpl?
5. Shouldn't we assert fail here, because the exception is not expected? Similar in other test cases.
{code}
+    } catch (ApplicationNotFoundException ex) {
+      Assert.assertEquals(ex.getMessage(),
+          "Application with id '" + request.getApplicationAttemptId()
+              + "' doesn't exist in RM.");
+    }
{code}
In addition to that, personally, I still object to throwing AppAttempt/Container not-found exceptions when getting an empty appattempt or container list. Let's assume the history service is disabled. Then getting an empty application list is allowed, while getting an empty appattempt/container list will result in an exception. The inconsistent behavior is going to confuse users. In particular, it is likely that a running application doesn't have any appattempt yet (e.g. the app is before ACCEPTED and it is the first attempt). ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs - Key: YARN-1389 URL: https://issues.apache.org/jira/browse/YARN-1389 Project: Hadoop YARN Issue Type: Sub-task Reporter: Mayank Bansal Assignee: Mayank Bansal Attachments: YARN-1389-1.patch, YARN-1389-2.patch, YARN-1389-3.patch, YARN-1389-4.patch, YARN-1389-5.patch As we plan to have the APIs in ApplicationHistoryProtocol to expose the reports of *finished* application attempts and containers, we should do the same for ApplicationClientProtocol, which will return the reports of *running* attempts and containers. Later on, we can improve YarnClient to direct the query of a running instance to ApplicationClientProtocol, while that of a finished instance goes to ApplicationHistoryProtocol, making it transparent to the users. -- This message was sent by Atlassian JIRA (v6.2#6252)
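Comment 1 above amounts to the following simplification (a sketch, assuming both not-found exceptions share a common NotFoundException superclass, as the comment implies):
{code}
// Instead of enumerating the concrete classes, rethrow anything that is
// not a "not found" condition. Assumes ApplicationNotFoundException and
// ApplicationAttemptNotFoundException both extend NotFoundException.
if (!(e instanceof NotFoundException)) {
  throw e;
}
// else: fall through to querying the history service
{code}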
[jira] [Commented] (YARN-1799) Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff
[ https://issues.apache.org/jira/browse/YARN-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923570#comment-13923570 ] Sunil G commented on YARN-1799: --- I would like to take up this JIRA. Enhance LocalDirAllocator in NM to consider DiskMaxUtilization cutoff - Key: YARN-1799 URL: https://issues.apache.org/jira/browse/YARN-1799 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Sunil G LocalDirAllocator provides paths to all tasks for their local writes. It considers the good list of directories selected by the health-check mechanism in LocalDirsHandlerService. getLocalPathForWrite() checks whether the requested size fits within the capacity of the last-accessed directory. When more tasks ask LocalDirAllocator for a path, each allocation is based on the disk availability at that instant, but the same directory may already have been handed to other tasks that are still writing to it sequentially. It would be better to also check against an upper cutoff on disk utilization. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1389) ApplicationClientProtocol and ApplicationHistoryProtocol should expose analogous APIs
[ https://issues.apache.org/jira/browse/YARN-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13923574#comment-13923574 ] Zhijie Shen commented on YARN-1389: --- I've tested the patch locally. yarn applicationattempt seems to be able to get and list the attempts of an RM-cached application. yarn container results in the following crash:
{code}
zjshen-mac-pc:Deployment zshen$ yarn container -status container_1394168341541_0003_01_01
14/03/06 21:06:51 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:9104
14/03/06 21:06:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/03/06 21:06:51 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
Exception in thread "main" java.lang.NullPointerException: java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.getDiagnosticsInfo(RMContainerImpl.java:253)
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.createContainerReport(RMContainerImpl.java:439)
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:413)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:364)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:349)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:932)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2071)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2067)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:394)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1597)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2065)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
    at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:107)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getContainerReport(ApplicationClientProtocolPBClientImpl.java:375)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:189)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy10.getContainerReport(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getContainerReport(YarnClientImpl.java:519)
    at org.apache.hadoop.yarn.client.cli.ApplicationCLI.printContainerReport(ApplicationCLI.java:292)
    at org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:150)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:76)
Caused by: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.getDiagnosticsInfo(RMContainerImpl.java:253)
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.createContainerReport(RMContainerImpl.java:439)
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:413)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:364)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:349)
    at