[jira] [Commented] (YARN-400) RM can return null application resource usage report leading to NPE in client
[ https://issues.apache.org/jira/browse/YARN-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583095#comment-13583095 ] Hudson commented on YARN-400: - Integrated in Hadoop-Yarn-trunk #134 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/134/]) YARN-400. RM can return null application resource usage report leading to NPE in client (Jason Lowe via tgraves) (Revision 1448241) Result = SUCCESS tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1448241 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java RM can return null application resource usage report leading to NPE in client - Key: YARN-400 URL: https://issues.apache.org/jira/browse/YARN-400 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha, 0.23.6 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Fix For: 3.0.0, 0.23.7, 2.0.4-beta Attachments: YARN-400-branch-0.23.patch, YARN-400.patch RMAppImpl.createAndGetApplicationReport can return a report with a null resource usage report if full access to the app is allowed but the application has no current attempt. This leads to NPEs in client code that assumes an app report will always have at least an empty resource usage report. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
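The defensive pattern the issue asks for is to never leave the usage field null. A minimal sketch of that guard, assuming the YARN record factory (Records.newRecord) is available; this is an illustration of the idea, not the committed RMAppImpl patch, and the helper class/method names here are made up for the example:

{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationResourceUsageReport;
import org.apache.hadoop.yarn.util.Records;

public class UsageReportGuard {

  /**
   * If the application has no current attempt (so no real usage numbers
   * exist), fall back to an empty record instead of null, so clients that
   * dereference the usage report inside an ApplicationReport never hit an NPE.
   */
  public static ApplicationResourceUsageReport orEmpty(
      ApplicationResourceUsageReport fromCurrentAttempt) {
    if (fromCurrentAttempt != null) {
      return fromCurrentAttempt;
    }
    // Counters in a freshly created record default to zero/unset.
    return Records.newRecord(ApplicationResourceUsageReport.class);
  }
}
{code}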
[jira] [Commented] (YARN-236) RM should point tracking URL to RM web page when app fails to start
[ https://issues.apache.org/jira/browse/YARN-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583097#comment-13583097 ] Hudson commented on YARN-236: - Integrated in Hadoop-Yarn-trunk #134 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/134/]) YARN-236. RM should point tracking URL to RM web page when app fails to start (Jason Lowe via jeagles) (Revision 1448406) Result = SUCCESS jeagles : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1448406 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java RM should point tracking URL to RM web page when app fails to start --- Key: YARN-236 URL: https://issues.apache.org/jira/browse/YARN-236 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 0.23.4 Reporter: Jason Lowe Assignee: Jason Lowe Labels: usability Fix For: 3.0.0, 0.23.7, 2.0.4-beta Attachments: YARN-236.patch Similar to YARN-165, the RM should redirect the tracking URL to the specific app page on the RM web UI when the application fails to start. For example, if the AM completely fails to start due to bad AM config or bad job config like invalid queuename, then the user gets the unhelpful "The requested application exited before setting a tracking URL." Usually the diagnostic string on the RM app page has something useful, so we might as well point there. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
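The change touches WebAppProxyServlet; the general shape of the behaviour described above can be sketched as below. This is a hedged illustration, not the actual patch: the "/cluster/app/<appId>" path is the usual RM web UI location for an application and is an assumption of this sketch, as are the class and method names.

{code:java}
import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

public class TrackingUrlFallbackSketch {

  /**
   * If the app never registered a tracking URL (e.g. the AM failed to start),
   * send the browser to the RM's per-application page, where the diagnostics
   * string is shown, instead of the generic "exited before setting a tracking
   * URL" message.
   */
  static void redirectToRmAppPage(HttpServletResponse resp,
      String rmWebAddress, String appId) throws IOException {
    resp.sendRedirect("http://" + rmWebAddress + "/cluster/app/" + appId);
  }
}
{code}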
[jira] [Commented] (YARN-400) RM can return null application resource usage report leading to NPE in client
[ https://issues.apache.org/jira/browse/YARN-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583154#comment-13583154 ] Hudson commented on YARN-400: - Integrated in Hadoop-Hdfs-0.23-Build #532 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/532/]) YARN-400. RM can return null application resource usage report leading to NPE in client (Jason Lowe via tgraves) (Revision 1448244) Result = SUCCESS tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1448244 Files : * /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt * /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java RM can return null application resource usage report leading to NPE in client - Key: YARN-400 URL: https://issues.apache.org/jira/browse/YARN-400 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha, 0.23.6 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Fix For: 3.0.0, 0.23.7, 2.0.4-beta Attachments: YARN-400-branch-0.23.patch, YARN-400.patch RMAppImpl.createAndGetApplicationReport can return a report with a null resource usage report if full access to the app is allowed but the application has no current attempt. This leads to NPEs in client code that assumes an app report will always have at least an empty resource usage report. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-236) RM should point tracking URL to RM web page when app fails to start
[ https://issues.apache.org/jira/browse/YARN-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583155#comment-13583155 ] Hudson commented on YARN-236: - Integrated in Hadoop-Hdfs-0.23-Build #532 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/532/]) YARN-236. RM should point tracking URL to RM web page when app fails to start (Jason Lowe via jeagles) (Revision 1448411) Result = SUCCESS jeagles : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1448411 Files : * /hadoop/common/branches/branch-0.23/hadoop-yarn-project/CHANGES.txt * /hadoop/common/branches/branch-0.23/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java RM should point tracking URL to RM web page when app fails to start --- Key: YARN-236 URL: https://issues.apache.org/jira/browse/YARN-236 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 0.23.4 Reporter: Jason Lowe Assignee: Jason Lowe Labels: usability Fix For: 3.0.0, 0.23.7, 2.0.4-beta Attachments: YARN-236.patch Similar to YARN-165, the RM should redirect the tracking URL to the specific app page on the RM web UI when the application fails to start. For example, if the AM completely fails to start due to bad AM config or bad job config like invalid queuename, then the user gets the unhelpful "The requested application exited before setting a tracking URL." Usually the diagnostic string on the RM app page has something useful, so we might as well point there. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-400) RM can return null application resource usage report leading to NPE in client
[ https://issues.apache.org/jira/browse/YARN-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583174#comment-13583174 ] Hudson commented on YARN-400: - Integrated in Hadoop-Mapreduce-trunk #1351 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1351/]) YARN-400. RM can return null application resource usage report leading to NPE in client (Jason Lowe via tgraves) (Revision 1448241) Result = FAILURE tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1448241 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java RM can return null application resource usage report leading to NPE in client - Key: YARN-400 URL: https://issues.apache.org/jira/browse/YARN-400 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.3-alpha, 0.23.6 Reporter: Jason Lowe Assignee: Jason Lowe Priority: Critical Fix For: 3.0.0, 0.23.7, 2.0.4-beta Attachments: YARN-400-branch-0.23.patch, YARN-400.patch RMAppImpl.createAndGetApplicationReport can return a report with a null resource usage report if full access to the app is allowed but the application has no current attempt. This leads to NPEs in client code that assumes an app report will always have at least an empty resource usage report. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-380) yarn node -status prints Last-Last-Node-Status
[ https://issues.apache.org/jira/browse/YARN-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-380: - Labels: usability (was: ) yarn node -status prints Last-Last-Node-Status -- Key: YARN-380 URL: https://issues.apache.org/jira/browse/YARN-380 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.0.3-alpha Reporter: Thomas Graves Labels: usability I assume the Last-Last-NodeStatus is a typo and it should just be Last-Node-Status. $ yarn node -status foo.com:8041 Node Report : Node-Id : foo.com:8041 Rack : /10.10.10.0 Node-State : RUNNING Node-Http-Address : foo.com:8042 Health-Status(isNodeHealthy) : true Last-Last-Health-Update : 1360118400219 Health-Report : Containers : 0 Memory-Used : 0M Memory-Capacity : 24576 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-379) yarn [node,application] command print logger info messages
[ https://issues.apache.org/jira/browse/YARN-379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-379: - Labels: usability (was: ) yarn [node,application] command print logger info messages -- Key: YARN-379 URL: https://issues.apache.org/jira/browse/YARN-379 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.0.3-alpha Reporter: Thomas Graves Labels: usability Running the yarn node and yarn applications command results in annoying log info messages being printed: $ yarn node -list 13/02/06 02:36:50 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited. 13/02/06 02:36:50 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started. Total Nodes:1 Node-IdNode-State Node-Http-Address Health-Status(isNodeHealthy)Running-Containers foo:8041RUNNING foo:8042 true 0 13/02/06 02:36:50 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is stopped. $ yarn application 13/02/06 02:38:47 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is inited. 13/02/06 02:38:47 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is started. Invalid Command Usage : usage: application -kill arg Kills the application. -list Lists all the Applications from RM. -status arg Prints the status of the application. 13/02/06 02:38:47 INFO service.AbstractService: Service:org.apache.hadoop.yarn.client.YarnClientImpl is stopped. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
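Until the client itself is cleaned up, the INFO lines shown above can be silenced at the logging layer. A minimal workaround sketch, assuming the CLI uses log4j and that the logger name matches the class printed in the output; this is not the eventual fix for the issue, just a way to suppress the noise:

{code:java}
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class QuietYarnCliSketch {

  /**
   * Raise the level of the YARN service-framework logger so the
   * "Service:...YarnClientImpl is inited/started/stopped" INFO lines are
   * suppressed before the CLI command runs.
   */
  public static void silenceServiceLogs() {
    Logger.getLogger("org.apache.hadoop.yarn.service.AbstractService")
        .setLevel(Level.WARN);
  }
}
{code}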
[jira] [Updated] (YARN-198) If we are navigating to Nodemanager UI from Resourcemanager,then there is not link to navigate back to Resource manager
[ https://issues.apache.org/jira/browse/YARN-198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-198: - Labels: usability (was: ) If we are navigating to Nodemanager UI from Resourcemanager,then there is not link to navigate back to Resource manager --- Key: YARN-198 URL: https://issues.apache.org/jira/browse/YARN-198 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Ramgopal N Assignee: Senthil V Kumar Priority: Minor Labels: usability If we are navigating to Nodemanager by clicking on the node link in RM,there is no link provided on the NM to navigate back to RM. If there is a link to navigate back to RM it would be good -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-410) Miscellaneous web UI issues
[ https://issues.apache.org/jira/browse/YARN-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-410: - Labels: usability (was: ) Miscellaneous web UI issues --- Key: YARN-410 URL: https://issues.apache.org/jira/browse/YARN-410 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Labels: usability We need to fix the following issues on YARN web-UI: - Remove the Note column from the application list. When a failure happens, this Note spoils the table layout. - When the Application is still not running, the Tracking UI should be titled UNASSIGNED; for some reason it is titled ApplicationMaster but (correctly) links to #. - The per-application page has all the RM related information like version, start-time etc. Must be some accidental change by one of the patches. - The diagnostics for a failed app on the per-application page don't retain new lines and wrap'em around - looks hard to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Moved] (YARN-410) Miscellaneous web UI issues
[ https://issues.apache.org/jira/browse/YARN-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli moved MAPREDUCE-3152 to YARN-410: - Component/s: (was: mrv2) Fix Version/s: (was: 0.24.0) Assignee: (was: Subroto Sanyal) Affects Version/s: (was: 0.23.0) Key: YARN-410 (was: MAPREDUCE-3152) Project: Hadoop YARN (was: Hadoop Map/Reduce) Miscellaneous web UI issues --- Key: YARN-410 URL: https://issues.apache.org/jira/browse/YARN-410 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli We need to fix the following issues on YARN web-UI: - Remove the Note column from the application list. When a failure happens, this Note spoils the table layout. - When the Application is still not running, the Tracking UI should be titled UNASSIGNED; for some reason it is titled ApplicationMaster but (correctly) links to #. - The per-application page has all the RM related information like version, start-time etc. Must be some accidental change by one of the patches. - The diagnostics for a failed app on the per-application page don't retain new lines and wrap'em around - looks hard to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-69) RM should throw different exceptions for while querying app/node/queue
[ https://issues.apache.org/jira/browse/YARN-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-69: Issue Type: Sub-task (was: Bug) Parent: YARN-386 RM should throw different exceptions for while querying app/node/queue -- Key: YARN-69 URL: https://issues.apache.org/jira/browse/YARN-69 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli We should distinguish the exceptions for absent app/node/queue, illegally accessed app/node/queue etc. Today everything is a {{YarnRemoteException}}. We should extend {{YarnRemoteException}} to add {{NotFoundException}}, {{AccessControlException}} etc. Today, {{AccessControlException}} exists but not as part of the protocol descriptions (i.e. only available to Java). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
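A rough sketch of the exception hierarchy the description asks for. The names NotFoundException and AccessControlException come from the issue text; here they extend a placeholder base class so the sketch compiles standalone, whereas in YARN they would extend (or be carried by) {{YarnRemoteException}} and would also need protocol-level definitions so non-Java clients can distinguish them:

{code:java}
/** Placeholder standing in for YarnRemoteException in this sketch. */
class YarnRemoteExceptionBase extends Exception {
  YarnRemoteExceptionBase(String message) { super(message); }
}

/** Thrown when a queried app/node/queue does not exist. */
class NotFoundException extends YarnRemoteExceptionBase {
  NotFoundException(String message) { super(message); }
}

/** Thrown when the caller is not allowed to access the app/node/queue. */
class AccessControlException extends YarnRemoteExceptionBase {
  AccessControlException(String message) { super(message); }
}
{code}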
[jira] [Moved] (YARN-411) Per-state RM app-pages should have search ala JHS pages
[ https://issues.apache.org/jira/browse/YARN-411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli moved MAPREDUCE-3778 to YARN-411: - Component/s: (was: webapps) (was: mrv2) Affects Version/s: (was: 0.23.0) Key: YARN-411 (was: MAPREDUCE-3778) Project: Hadoop YARN (was: Hadoop Map/Reduce) Per-state RM app-pages should have search ala JHS pages --- Key: YARN-411 URL: https://issues.apache.org/jira/browse/YARN-411 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-412) FifoScheduler incorrectly checking for node locality
Roger Hoover created YARN-412: - Summary: FifoScheduler incorrectly checking for node locality Key: YARN-412 URL: https://issues.apache.org/jira/browse/YARN-412 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Roger Hoover Priority: Minor In the FifoScheduler, the assignNodeLocalContainers method is checking if the data is local to a node by searching for the nodeAddress of the node in the set of outstanding requests for the app. This seems to be incorrect as it should be checking hostname instead. The offending line of code is 455: application.getResourceRequest(priority, node.getRMNode().getNodeAddress()); Requests are formatted by hostname (e.g. host1.foo.com) whereas node addresses are a concatenation of hostname and command port (e.g. host1.foo.com:1234) In the CapacityScheduler, it's done using hostname. See LeafQueue.assignNodeLocalContainers, line 1129 application.getResourceRequest(priority, node.getHostName()); Note that this bug does not affect the actual scheduling decisions made by the FifoScheduler because even though it incorrectly determines that a request is not local to the node, it will still schedule the request immediately because it's rack-local. However, this bug may be adversely affecting the reporting of job status by underreporting the number of tasks that were node local. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
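The mismatch the report describes can be seen in isolation with a toy example: outstanding requests are keyed by plain hostname, so a lookup by the host:port node address never matches. The map below simply stands in for the app's per-priority request table; it is an illustration of the bug, not the scheduler code itself.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class LocalityLookupSketch {

  public static void main(String[] args) {
    Map<String, Integer> outstandingRequests = new HashMap<>();
    outstandingRequests.put("host1.foo.com", 3);  // requests keyed by hostname

    String nodeAddress = "host1.foo.com:1234";    // hostname + command port
    String hostName = "host1.foo.com";

    // Buggy lookup (what the issue says FifoScheduler does): always misses.
    System.out.println("by node address: " + outstandingRequests.get(nodeAddress));

    // Fixed lookup (what CapacityScheduler does): finds the node-local request.
    System.out.println("by hostname:     " + outstandingRequests.get(hostName));
  }
}
{code}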
[jira] [Updated] (YARN-412) FifoScheduler incorrectly checking for node locality
[ https://issues.apache.org/jira/browse/YARN-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roger Hoover updated YARN-412: -- Attachment: YARN-412.patch Please review this patch for the fix plus a unit test case FifoScheduler incorrectly checking for node locality Key: YARN-412 URL: https://issues.apache.org/jira/browse/YARN-412 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Roger Hoover Priority: Minor Labels: patch Attachments: YARN-412.patch In the FifoScheduler, the assignNodeLocalContainers method is checking if the data is local to a node by searching for the nodeAddress of the node in the set of outstanding requests for the app. This seems to be incorrect as it should be checking hostname instead. The offending line of code is 455: application.getResourceRequest(priority, node.getRMNode().getNodeAddress()); Requests are formatted by hostname (e.g. host1.foo.com) whereas node addresses are a concatenation of hostname and command port (e.g. host1.foo.com:1234) In the CapacityScheduler, it's done using hostname. See LeafQueue.assignNodeLocalContainers, line 1129 application.getResourceRequest(priority, node.getHostName()); Note that this bug does not affect the actual scheduling decisions made by the FifoScheduler because even though it incorrectly determines that a request is not local to the node, it will still schedule the request immediately because it's rack-local. However, this bug may be adversely affecting the reporting of job status by underreporting the number of tasks that were node local. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-412) FifoScheduler incorrectly checking for node locality
[ https://issues.apache.org/jira/browse/YARN-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roger Hoover updated YARN-412: -- Attachment: YARN-412.patch Added a timeout on the unit test FifoScheduler incorrectly checking for node locality Key: YARN-412 URL: https://issues.apache.org/jira/browse/YARN-412 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Roger Hoover Priority: Minor Labels: patch Attachments: YARN-412.patch In the FifoScheduler, the assignNodeLocalContainers method is checking if the data is local to a node by searching for the nodeAddress of the node in the set of outstanding requests for the app. This seems to be incorrect as it should be checking hostname instead. The offending line of code is 455: application.getResourceRequest(priority, node.getRMNode().getNodeAddress()); Requests are formatted by hostname (e.g. host1.foo.com) whereas node addresses are a concatenation of hostname and command port (e.g. host1.foo.com:1234) In the CapacityScheduler, it's done using hostname. See LeafQueue.assignNodeLocalContainers, line 1129 application.getResourceRequest(priority, node.getHostName()); Note that this bug does not affect the actual scheduling decisions made by the FifoScheduler because even though it incorrectly determines that a request is not local to the node, it will still schedule the request immediately because it's rack-local. However, this bug may be adversely affecting the reporting of job status by underreporting the number of tasks that were node local. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-412) FifoScheduler incorrectly checking for node locality
[ https://issues.apache.org/jira/browse/YARN-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583522#comment-13583522 ] Hadoop QA commented on YARN-412: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12570348/YARN-412.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 one of tests included doesn't have a timeout.{color} {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/416//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/416//console This message is automatically generated. FifoScheduler incorrectly checking for node locality Key: YARN-412 URL: https://issues.apache.org/jira/browse/YARN-412 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Roger Hoover Priority: Minor Labels: patch Attachments: YARN-412.patch In the FifoScheduler, the assignNodeLocalContainers method is checking if the data is local to a node by searching for the nodeAddress of the node in the set of outstanding requests for the app. This seems to be incorrect as it should be checking hostname instead. The offending line of code is 455: application.getResourceRequest(priority, node.getRMNode().getNodeAddress()); Requests are formatted by hostname (e.g. host1.foo.com) whereas node addresses are a concatenation of hostname and command port (e.g. host1.foo.com:1234) In the CapacityScheduler, it's done using hostname. See LeafQueue.assignNodeLocalContainers, line 1129 application.getResourceRequest(priority, node.getHostName()); Note that this bug does not affect the actual scheduling decisions made by the FifoScheduler because even though it incorrectly determines that a request is not local to the node, it will still schedule the request immediately because it's rack-local. However, this bug may be adversely affecting the reporting of job status by underreporting the number of tasks that were node local. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-413) With log aggregation on, nodemanager dies on startup if it can't connect to HDFS
Sandy Ryza created YARN-413: --- Summary: With log aggregation on, nodemanager dies on startup if it can't connect to HDFS Key: YARN-413 URL: https://issues.apache.org/jira/browse/YARN-413 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza If log aggregation is on, when the nodemanager starts up, it tries to create the remote log directory. If this fails, it kills itself. It doesn't seem like turning log aggregation on should ever cause the nodemanager to die. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-408) Capacity Scheduler delay scheduling should not be disabled by default
[ https://issues.apache.org/jira/browse/YARN-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583625#comment-13583625 ] Mayank Bansal commented on YARN-408: Yeah sure. There are two intents behind changing this value: 1. Make it enabled by default. 2. The algorithm we use to give applications a scheduling opportunity is driven by heartbeats from the NMs, so if we just use the number of racks it will not add much value toward actually achieving node locality. The intent behind the delay count is to wait for at least one heartbeat from each node in the cluster before moving a task to the next locality level, so I generally defaulted it to the number of nodes in one rack. Thanks, Mayank Capacity Scheduler delay scheduling should not be disabled by default - Key: YARN-408 URL: https://issues.apache.org/jira/browse/YARN-408 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 3.0.0, 2.0.3-alpha Reporter: Mayank Bansal Assignee: Mayank Bansal Priority: Minor Attachments: YARN-408-trunk.patch Capacity Scheduler delay scheduling should not be disabled by default. Enable it and set it to the number of nodes in one rack. Thanks, Mayank -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
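For context, the knob being discussed is the CapacityScheduler's node-locality delay. A minimal sketch of setting it to the number of nodes in one rack, assuming the property name "yarn.scheduler.capacity.node-locality-delay" (normally configured in capacity-scheduler.xml rather than in code) and a nominal rack size of 40 nodes; both are assumptions of this sketch, not values taken from the attached patch:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class NodeLocalityDelaySketch {

  /**
   * Returns a Configuration with the delay-scheduling count set to the number
   * of nodes in one rack, so the scheduler waits roughly one heartbeat from
   * each node in the rack before relaxing locality.
   */
  public static Configuration withNodeLocalityDelay(int nodesPerRack) {
    Configuration conf = new Configuration();
    conf.setInt("yarn.scheduler.capacity.node-locality-delay", nodesPerRack);
    return conf;
  }

  public static void main(String[] args) {
    Configuration conf = withNodeLocalityDelay(40);  // assumed rack size
    System.out.println(conf.get("yarn.scheduler.capacity.node-locality-delay"));
  }
}
{code}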
[jira] [Commented] (YARN-413) With log aggregation on, nodemanager dies on startup if it can't connect to HDFS
[ https://issues.apache.org/jira/browse/YARN-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13583660#comment-13583660 ] Sandy Ryza commented on YARN-413: - 2013-02-21 13:27:24,307 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager org.apache.hadoop.yarn.YarnException: Failed to Start org.apache.hadoop.yarn.server.nodemanager.NodeManager at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:78) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:199) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:322) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:359) Caused by: org.apache.hadoop.yarn.YarnException: Failed to Start org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:78) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.start(ContainerManagerImpl.java:248) at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68) ... 3 more Caused by: org.apache.hadoop.yarn.YarnException: Failed to create remoteLogDir [/tmp/logs] at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:207) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.start(LogAggregationService.java:132) at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68) ... 5 more Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /tmp/logs. Name node is in safe mode. The reported blocks 7 has reached the threshold 0.9990 of total blocks 7. Safe mode will be turned off automatically in 25 seconds. 
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3067) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3045) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3024) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:667) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:468) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:40995) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:482) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1018) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1778) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1774) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1488) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1772) at org.apache.hadoop.ipc.Client.call(Client.java:1237) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202) at $Proxy9.mkdirs(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:163) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:82) at $Proxy9.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:450) at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2115) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2086) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:540) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:204) ... 7 more 2013-02-21 13:27:24,308 INFO org.apache.hadoop.ipc.Server: Stopping server on 47223 2013-02-21 13:27:24,308 INFO
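One possible mitigation consistent with the report (not necessarily the eventual fix, since the issue was closed as a duplicate below) is to retry remote log-directory creation instead of letting a transient HDFS failure such as safe mode kill the NodeManager at startup. A hedged sketch, with the interface standing in for LogAggregationService.verifyAndCreateRemoteLogDir and the retry count and sleep chosen purely for illustration:

{code:java}
import java.io.IOException;

public class RemoteLogDirRetrySketch {

  /** Hook standing in for the remote log directory creation step. */
  interface RemoteDirCreator {
    void createRemoteLogDir() throws IOException;
  }

  /**
   * Retries directory creation with a fixed backoff and only surfaces the
   * error after all attempts, rather than failing service start on the first
   * IOException.
   */
  static void createWithRetries(RemoteDirCreator creator, int attempts,
      long sleepMillis) throws IOException, InterruptedException {
    IOException last = null;
    for (int i = 0; i < attempts; i++) {
      try {
        creator.createRemoteLogDir();
        return;
      } catch (IOException e) {
        last = e;                 // remember the failure and try again
        Thread.sleep(sleepMillis);
      }
    }
    throw last;                   // all attempts failed
  }
}
{code}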
[jira] [Resolved] (YARN-413) With log aggregation on, nodemanager dies on startup if it can't connect to HDFS
[ https://issues.apache.org/jira/browse/YARN-413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-413. - Resolution: Duplicate With log aggregation on, nodemanager dies on startup if it can't connect to HDFS Key: YARN-413 URL: https://issues.apache.org/jira/browse/YARN-413 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza If log aggregation is on, when the nodemanager starts up, it tries to create the remote log directory. If this fails, it kills itself. It doesn't seem like turning log aggregation on should ever cause the nodemanager to die. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality
[ https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13583713#comment-13583713 ] Bikas Saha commented on YARN-392: - From what I understand this seems to be tangentially going down the path of the discussion that happened in YARN-371. The crucial point is that the YARN resource scheduler is *not* a task scheduler. So introducing concepts that directly or indirectly make it do task scheduling would be inconsistent with the design. It's a coarse-grained resource allocator that gives the app containers that represent chunks of resources using which the app can schedule its tasks. Different versions of the scheduler change the way the resource sharing is being done, Fair/Capacity or otherwise. Ideally we should have only 1 scheduler that has hooks to change the sharing policy. The code kind of reflects that because there is so much common code/logic between both implementations. Unfortunately, in both the Fair and Capacity Scheduler the implementations have mixed up 1) the decision to allocate at and below a given topology level [say the * level] with 2) whether there are resource requests at that level. E.g. when the allocation cycle is started for an app, the logic starts at * and checks if the resource request count is greater than 0. If yes then it goes into racks and then nodes. Which means that if an application wants resources only at a node then it has to create requests at the rack and * level too. This is because locality relaxation has gotten mixed up with being schedulable, if you catch my drift. My strong belief is that if we can fix this overload then we won't need to fix this jira. However I can see that fixing the overload will be a very complicated knot to untie and perhaps impossible to do now because it may be inextricably linked with the API. Which is why I created this jira. Now, if the problem is the * overload that I describe above, then the problem is the entanglement of delay scheduling (for locality). Here is an alternative proposal that addresses this problem. Let's make the delay of delay scheduling specifiable by the application. So an application can specify how long to wait before relaxing its node requests to rack and *. When an app wants containers on specific nodes it basically means that it does not want the RM to automatically relax its locality - thus specifying a large value for the delay. The end result being allocation on specific nodes if resources become available on those nodes. This also serves as a useful extension of delay scheduling. Short apps can be aggressive in relaxing locality while long+large jobs can be more conservative in trading off scheduling speed against network IO. The catch in the proposal is that such requests have to be made at a different priority level. Resource requests at the same priority level get aggregated and we don't want to aggregate relaxable resource requests with non-relaxable resource requests. I think this is a good thing to do anyway because it makes the application think and decide which kind of tasks it needs to get running first. An extension of this approach also ties in nicely with the API enhancement suggested by YARN-394. The RM could actually inform the app that it has not been able to allocate a resource request on a node and the time limit has elapsed. At which point, the app could cancel that request and ask for an alternative set of nodes. I agree I am hand-waving in this paragraph. Thoughts? 
Make it possible to schedule to specific nodes without dropping locality Key: YARN-392 URL: https://issues.apache.org/jira/browse/YARN-392 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Sandy Ryza Attachments: YARN-392.patch Currently it's not possible to specify scheduling requests for specific nodes and nowhere else. The RM automatically relaxes locality to rack and * and assigns non-specified machines to the app. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
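To make the proposal in the preceding comment concrete, here is a toy model of a request that carries its own relaxation policy. The class, field names, and the two-step relaxation timing are purely illustrative and are not the YARN API of the time; they only mirror what the comment proposes (an app-specified delay, and a strict "do not relax" mode for node-only requests).

{code:java}
public class LocalityAwareRequestSketch {

  enum Locality { NODE, RACK, ANY }

  static class Request {
    final String resourceName;      // a hostname, a rack, or "*"
    final int priority;
    final boolean relaxLocality;    // false = "only on the named nodes"
    final long relaxDelayMillis;    // app-chosen delay before relaxing

    Request(String resourceName, int priority, boolean relaxLocality,
        long relaxDelayMillis) {
      this.resourceName = resourceName;
      this.priority = priority;
      this.relaxLocality = relaxLocality;
      this.relaxDelayMillis = relaxDelayMillis;
    }
  }

  /** Widest locality the scheduler may use for this request right now. */
  static Locality allowedLocality(Request r, long waitedMillis) {
    if (!r.relaxLocality) {
      return Locality.NODE;                 // strict: never widen
    }
    if (waitedMillis < r.relaxDelayMillis) {
      return Locality.NODE;                 // still inside the app's delay
    }
    if (waitedMillis < 2 * r.relaxDelayMillis) {
      return Locality.RACK;                 // relax one level
    }
    return Locality.ANY;                    // fully relaxed
  }
}
{code}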
[jira] [Created] (YARN-414) [Umbrella] Usability issues in YARN
Hitesh Shah created YARN-414: Summary: [Umbrella] Usability issues in YARN Key: YARN-414 URL: https://issues.apache.org/jira/browse/YARN-414 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Priority: Blocker Umbrella jira to track all forms of usability issues in YARN that need to be addressed before YARN can be considered stable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-404) Node Manager leaks Data Node connections
[ https://issues.apache.org/jira/browse/YARN-404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-404: - Priority: Blocker (was: Critical) Node Manager leaks Data Node connections Key: YARN-404 URL: https://issues.apache.org/jira/browse/YARN-404 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 2.0.2-alpha, 0.23.6 Reporter: Devaraj K Assignee: Devaraj K Priority: Blocker The RM sometimes fails to hand some applications to the NM for cleanup; because of this, log aggregation does not happen for those applications and data node connections are leaked on the NM side. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-386) [Umbrella] YARN API cleanup
[ https://issues.apache.org/jira/browse/YARN-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-386: - Priority: Blocker (was: Major) [Umbrella] YARN API cleanup --- Key: YARN-386 URL: https://issues.apache.org/jira/browse/YARN-386 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Priority: Blocker This is the umbrella ticket to capture any and every API cleanup that we wish to do before YARN can be deemed beta/stable. Doing this API cleanup now and ASAP will help us escape the pain of supporting bad APIs in beta/stable releases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-397) RM Scheduler api enhancements
[ https://issues.apache.org/jira/browse/YARN-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-397: - Priority: Blocker (was: Major) RM Scheduler api enhancements - Key: YARN-397 URL: https://issues.apache.org/jira/browse/YARN-397 Project: Hadoop YARN Issue Type: Bug Reporter: Arun C Murthy Priority: Blocker Umbrella jira tracking enhancements to RM apis. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-41: Priority: Blocker (was: Major) The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 2.0.0-alpha Reporter: Ravi Teja Ch N V Assignee: Devaraj K Priority: Blocker Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch Instead of waiting for the NM expiry, RM should remove and handle the NM, which is shutdown gracefully. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-142) Change YARN APIs to throw IOException
[ https://issues.apache.org/jira/browse/YARN-142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-142: - Priority: Blocker (was: Critical) Change YARN APIs to throw IOException - Key: YARN-142 URL: https://issues.apache.org/jira/browse/YARN-142 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 0.23.3, 2.0.0-alpha Reporter: Siddharth Seth Assignee: Xuan Gong Priority: Blocker Attachments: YARN-142.1.patch, YARN-142.2.patch, YARN-142.3.patch, YARN-142.4.patch Ref: MAPREDUCE-4067 All YARN APIs currently throw YarnRemoteException. 1) This cannot be extended in its current form. 2) The RPC layer can throw IOExceptions. These end up showing up as UndeclaredThrowableExceptions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-69) RM should throw different exceptions for while querying app/node/queue
[ https://issues.apache.org/jira/browse/YARN-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-69: Priority: Blocker (was: Major) RM should throw different exceptions for while querying app/node/queue -- Key: YARN-69 URL: https://issues.apache.org/jira/browse/YARN-69 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Priority: Blocker We should distinguish the exceptions for absent app/node/queue, illegally accessed app/node/queue etc. Today everything is a {{YarnRemoteException}}. We should extend {{YarnRemoteException}} to add {{NotFoundException}}, {{AccessControlException}} etc. Today, {{AccessControlException}} exists but not as part of the protocol descriptions (i.e. only available to Java). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-85) Allow per job log aggregation configuration
[ https://issues.apache.org/jira/browse/YARN-85?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-85: Priority: Critical (was: Major) Allow per job log aggregation configuration --- Key: YARN-85 URL: https://issues.apache.org/jira/browse/YARN-85 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Siddharth Seth Assignee: Siddharth Seth Priority: Critical Currently, if log aggregation is enabled for a cluster - logs for all jobs will be aggregated - leading to a whole bunch of files on hdfs which users may not want. Users should be able to control this along with the aggregation policy - failed only, all, etc. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-117: - Priority: Blocker (was: Major) Enhance YARN service model -- Key: YARN-117 URL: https://issues.apache.org/jira/browse/YARN-117 Project: Hadoop YARN Issue Type: Improvement Reporter: Steve Loughran Assignee: Steve Loughran Priority: Blocker Having played with the YARN service model, there are some issues that I've identified based on past work and initial use. This JIRA issue is an overall one to cover the issues, with solutions pushed out to separate JIRAs. h2. state model prevents stopped state being entered if you could not successfully start the service. In the current lifecycle you cannot stop a service unless it was successfully started, but * {{init()}} may acquire resources that need to be explicitly released * if the {{start()}} operation fails partway through, the {{stop()}} operation may be needed to release resources. *Fix:* make {{stop()}} a valid state transition from all states and require the implementations to be able to stop safely without requiring all fields to be non null. Before anyone points out that the {{stop()}} operations assume that all fields are valid; and if called before a {{start()}} they will NPE; MAPREDUCE-3431 shows that this problem arises today, MAPREDUCE-3502 is a fix for this. It is independent of the rest of the issues in this doc but it will aid making {{stop()}} execute from all states other than stopped. MAPREDUCE-3502 is too big a patch and needs to be broken down for easier review and take up; this can be done with issues linked to this one. h2. AbstractService doesn't prevent duplicate state change requests. The {{ensureState()}} checks to verify whether or not a state transition is allowed from the current state are performed in the base {{AbstractService}} class -yet subclasses tend to call this *after* their own {{init()}}, {{start()}} {{stop()}} operations. This means that these operations can be performed out of order, and even if the outcome of the call is an exception, all actions performed by the subclasses will have taken place. MAPREDUCE-3877 demonstrates this. This is a tricky one to address. In HADOOP-3128 I used a base class instead of an interface and made the {{init()}}, {{start()}} {{stop()}} methods {{final}}. These methods would do the checks, and then invoke protected inner methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to retrofit the same behaviour to everything that extends {{AbstractService}} -something that must be done before the class is considered stable (because once the lifecycle methods are declared final, all subclasses that are out of the source tree will need fixing by the respective developers). h2. AbstractService state change doesn't defend against race conditions. There are no concurrency locks on the state transitions. Whatever fix for wrong state calls is added should correct this to prevent re-entrancy, such as {{stop()}} being called from two threads. h2. Static methods to choreograph lifecycle operations Helper methods to move things through lifecycles. init-start is common, stop-if-service!=null another. Some static methods can execute these, and even call {{stop()}} if {{init()}} raises an exception. These could go into a class {{ServiceOps}} in the same package. These can be used by those services that wrap other services, and help manage more robust shutdowns. h2. state transition failures are something that registered service listeners may wish to be informed of. When a state transition fails a {{RuntimeException}} can be thrown -and the service listeners are not informed as the notification point isn't reached. They may wish to know this, especially for management and diagnostics. *Fix:* extend {{ServiceStateChangeListener}} with a callback such as {{stateChangeFailed(Service service,Service.State targeted-state, RuntimeException e)}} that is invoked from the (final) state change methods in the {{AbstractService}} class (once they delegate to their inner {{innerStart()}}, {{innerStop()}} methods); make a no-op on the existing implementations of the interface. h2. Service listener failures not handled Is this an error or not? Log and ignore may not be what is desired. *Proposed:* during {{stop()}} any exception by a listener is caught and discarded, to increase the likelihood of a better shutdown, but do not add try-catch clauses to the other state changes. h2. Support static listeners for all AbstractServices Add support to {{AbstractService}} that allows callers to register listeners for all instances. The existing listener interface could be used. This allows
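The "final lifecycle method delegating to a protected inner method" pattern described above can be sketched as follows. This is a simplified standalone illustration (the state set, locking, and method names other than innerInit/innerStart/innerStop are reduced for brevity), not the actual AbstractService code:

{code:java}
public abstract class SketchAbstractService {

  public enum State { NOTINITED, INITED, STARTED, STOPPED }

  private State state = State.NOTINITED;

  /** Public transitions are final: they validate, delegate, then record state. */
  public final synchronized void init() {
    if (state != State.NOTINITED) {
      throw new IllegalStateException("Cannot init from " + state);
    }
    innerInit();
    state = State.INITED;
  }

  public final synchronized void start() {
    if (state != State.INITED) {
      throw new IllegalStateException("Cannot start from " + state);
    }
    innerStart();
    state = State.STARTED;
  }

  /** stop() is valid from every state and must tolerate null fields. */
  public final synchronized void stop() {
    if (state == State.STOPPED) {
      return;                       // ignore duplicate stop requests
    }
    try {
      innerStop();
    } finally {
      state = State.STOPPED;        // always end in STOPPED
    }
  }

  /** Subclasses override only the inner methods. */
  protected void innerInit() {}
  protected void innerStart() {}
  protected void innerStop() {}
}
{code}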
[jira] [Updated] (YARN-99) Jobs fail during resource localization when directories in file cache reaches to unix directory limit
[ https://issues.apache.org/jira/browse/YARN-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-99: Priority: Blocker (was: Major) Jobs fail during resource localization when directories in file cache reaches to unix directory limit - Key: YARN-99 URL: https://issues.apache.org/jira/browse/YARN-99 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.0.0-alpha Reporter: Devaraj K Assignee: Devaraj K Priority: Blocker If we have multiple jobs which uses distributed cache with small size of files, the directory limit reaches before reaching the cache size and fails to create any directories in file cache. The jobs start failing with the below exception. {code:xml} java.io.IOException: mkdir of /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703) at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {code} We should have a mechanism to clean the cache files if it crosses specified number of directories like cache size. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-34) Split/Cleanup YARN and MAPREDUCE documentation
[ https://issues.apache.org/jira/browse/YARN-34?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-34: Priority: Blocker (was: Major) Split/Cleanup YARN and MAPREDUCE documentation -- Key: YARN-34 URL: https://issues.apache.org/jira/browse/YARN-34 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Priority: Blocker Post YARN-1, we need to have clear separation between YARN and mapreduce. We need to have separate sections on site and docs - we already have separate documents. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-387) Fix inconsistent protocol naming
[ https://issues.apache.org/jira/browse/YARN-387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-387: - Priority: Blocker (was: Major) Fix inconsistent protocol naming Key: YARN-387 URL: https://issues.apache.org/jira/browse/YARN-387 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Priority: Blocker Labels: incompatible We now have different and inconsistent naming schemes for various protocols. It was hard to explain to users, mainly in direct interactions at talks/presentations and user group meetings, with such naming. We should fix these before we go beta. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-378) ApplicationMaster retry times should be set by Client
[ https://issues.apache.org/jira/browse/YARN-378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-378: - Priority: Major (was: Minor) ApplicationMaster retry times should be set by Client - Key: YARN-378 URL: https://issues.apache.org/jira/browse/YARN-378 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Environment: suse Reporter: xieguiming Labels: usability We should support different ApplicationMaster retry times for different clients or users. That is to say, yarn.resourcemanager.am.max-retries should be settable by the client. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-71) Ensure/confirm that the NodeManager cleanup their local filesystem when they restart
[ https://issues.apache.org/jira/browse/YARN-71?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-71: Issue Type: Bug (was: Test) Ensure/confirm that the NodeManager cleanup their local filesystem when they restart Key: YARN-71 URL: https://issues.apache.org/jira/browse/YARN-71 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Vinod Kumar Vavilapalli Assignee: Xuan Gong Attachments: YARN-71.1.patch, YARN-71.2.patch We have to make sure that NodeManagers cleanup their local files on restart. It may already be working like that in which case we should have tests validating this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-71) Ensure/confirm that the NodeManager cleanup their local filesystem when they restart
[ https://issues.apache.org/jira/browse/YARN-71?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-71: Priority: Critical (was: Major) Ensure/confirm that the NodeManager cleanup their local filesystem when they restart Key: YARN-71 URL: https://issues.apache.org/jira/browse/YARN-71 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Vinod Kumar Vavilapalli Assignee: Xuan Gong Priority: Critical Attachments: YARN-71.1.patch, YARN-71.2.patch We have to make sure that NodeManagers clean up their local files on restart. It may already be working like that, in which case we should have tests validating this. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
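If the cleanup is already happening, a test along these lines could lock it in; this is only a rough sketch with illustrative paths, and a real test would drive a NodeManager instance through the existing NM test utilities rather than inspect directories directly.
{code}
import java.io.File;
import org.junit.Assert;
import org.junit.Test;

public class TestLocalDirCleanupSketch {
  @Test
  public void localDirsAreEmptyAfterRestart() {
    // Illustrative path only; a real test would use the configured
    // yarn.nodemanager.local-dirs of the restarted NodeManager.
    File usercacheDir = new File("target/nm-local-dir/usercache");
    // ... stop and restart the NodeManager here ...
    String[] leftovers = usercacheDir.list();
    Assert.assertTrue("stale files survived NM restart",
        leftovers == null || leftovers.length == 0);
  }
}
{code}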
[jira] [Updated] (YARN-226) Log aggregation should not assume an AppMaster will have containerId 1
[ https://issues.apache.org/jira/browse/YARN-226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-226: - Priority: Blocker (was: Major) Log aggregation should not assume an AppMaster will have containerId 1 -- Key: YARN-226 URL: https://issues.apache.org/jira/browse/YARN-226 Project: Hadoop YARN Issue Type: Bug Reporter: Siddharth Seth Priority: Blocker In case of reservations, etc - AppMasters may not get container id 1. We likely need additional info in the CLC / tokens indicating whether a container is an AM or not. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
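For context, the fragile convention being called out looks roughly like the check below; the explicit AM flag shown next to it is purely hypothetical and is only meant to illustrate the kind of information a CLC/token change would carry.
{code}
import org.apache.hadoop.yarn.api.records.ContainerId;

public class AmDetectionSketch {
  // Fragile: assumes the AM always gets the first container of the attempt,
  // which reservations etc. can break.
  static boolean looksLikeAmByConvention(ContainerId containerId) {
    return containerId.getId() == 1;
  }

  // Hypothetical alternative: an explicit flag carried in the container
  // launch context or container token instead of being inferred from the id.
  static boolean isAm(boolean amFlagFromClcOrToken) {
    return amFlagFromClcOrToken;
  }
}
{code}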
[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality
[ https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13583792#comment-13583792 ] Sandy Ryza commented on YARN-392: - The proposal of per-app delay-scheduling parameters is one I hadn't thought of, and I think it is a good one for many use cases. Do you mean that the delay threshold would be configurable per-app or per-priority? The cases that I don't think it supports are: * If the delay threshold is only configurable per app, an app needs some containers strictly on specific nodes, and for other containers only has loose preferences. * An application wants two containers, the first on only node1 or node2 and the second on only node3 or node4. What tells the scheduler not to assign both of the containers on node1 and node2? These containers could be requested at different priorities, but that would essentially be using priorities to do task-centric scheduling. Are these use cases non-goals for YARN? Correct me if I'm wrong, but my understanding was that the primary reason that the resource scheduler is not a task scheduler is performance. If we can allow it to be task-centric when necessary, but avoid the performance impact of making it task-centric all the time, it will support location-specific scheduling in the most flexible and intuitive way. I hope this isn't rehashing the debate from YARN-371. For anybody who will be at the YARN meetup tomorrow, it would be great to chat about this for a couple of minutes. Make it possible to schedule to specific nodes without dropping locality Key: YARN-392 URL: https://issues.apache.org/jira/browse/YARN-392 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Sandy Ryza Attachments: YARN-392.patch Currently it's not possible to specify scheduling requests for specific nodes and nowhere else. The RM automatically relaxes locality to rack and * and assigns non-specified machines to the app. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
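To make the hard-locality case concrete, a node-specific request today looks roughly like the sketch below; the commented-out relaxLocality call is the assumed/proposed knob this JIRA is about, not an existing API, and without it the RM silently relaxes the request to rack and *.
{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.util.Records;

public class HardLocalityRequestSketch {
  // Sketch of a request that should only ever be satisfied on the given host.
  static ResourceRequest onNodeOnly(String host, int memoryMb, int containers) {
    ResourceRequest req = Records.newRecord(ResourceRequest.class);
    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(1);
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(memoryMb);
    req.setPriority(priority);
    req.setHostName(host);
    req.setCapability(capability);
    req.setNumContainers(containers);
    // Hypothetical knob proposed here: do not fall back to rack or ANY.
    // req.setRelaxLocality(false);
    return req;
  }
}
{code}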
[jira] [Commented] (YARN-196) Nodemanager if started before starting Resource manager is getting shutdown.But if both RM and NM are started and then after if RM is going down,NM is retrying for the RM
[ https://issues.apache.org/jira/browse/YARN-196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13583823#comment-13583823 ] Hitesh Shah commented on YARN-196: -- Patch still has trailing whitespace issues. In YarnConfiguration.java: + /** Max time to wait for ResourceManger start + */ Please rephrase to "max time to wait to establish a connection to the ResourceManager when the NodeManager starts". + /** Time interval for each NM attempt to connect RM + */ Rephrase to "Time interval between each NM attempt to connect to the ResourceManager". You can use the same descriptions in yarn-default.xml. Not sure if "After that period of time, NM will throw out exceptions" is valid in yarn-default.xml. A better description could mention that the NM will shut down if it cannot connect to the RM within the specified max time period. The description should also mention how to use -1 to retry forever. An earlier comment made the point of switching to SECONDS instead of MS so that users can understand the values more easily. + // this.hostName = InetAddress.getLocalHost().getCanonicalHostName(); - Please remove commented-out code if it is not being used. The unit test does not really seem to be testing the flow of RESOURCEMANAGER_CONNECT_WAIT_MS being set to -1. waitForEver is being explicitly set to true/false based on the updater's ctor and not really based on the config value. If that flow cannot be tested, it might be better to remove the additional complexity from the test. Also, the patch will need to be updated due to https://issues.apache.org/jira/browse/HADOOP-9112. Nodemanager if started before starting Resource manager is getting shutdown.But if both RM and NM are started and then after if RM is going down,NM is retrying for the RM. --- Key: YARN-196 URL: https://issues.apache.org/jira/browse/YARN-196 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.0.0-alpha Reporter: Ramgopal N Assignee: Xuan Gong Attachments: MAPREDUCE-3676.patch, YARN-196.1.patch, YARN-196.2.patch, YARN-196.3.patch, YARN-196.4.patch If the NM is started before the RM, the NM shuts down with the following error {code} ERROR org.apache.hadoop.yarn.service.CompositeService: Error starting services org.apache.hadoop.yarn.server.nodemanager.NodeManager org.apache.avro.AvroRuntimeException: java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:149) at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:167) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:242) Caused by: java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:66) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:182) at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:145) ...
3 more Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:131) at $Proxy23.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59) ... 5 more Caused by: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:857) at org.apache.hadoop.ipc.Client.call(Client.java:1141) at org.apache.hadoop.ipc.Client.call(Client.java:1100) at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:128) ... 7 more Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at
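To make the suggested wording concrete, the yarn-default.xml entries could read something like the following; the property names, units and defaults are illustrative only and should match whatever the patch actually defines for the connect wait and retry interval settings.
{code:xml}
<!-- Illustrative names and defaults only; use the names the patch introduces. -->
<property>
  <name>yarn.nodemanager.resourcemanager.connect.wait.secs</name>
  <value>900</value>
  <description>Max time, in seconds, to wait to establish a connection to the
  ResourceManager when the NodeManager starts. The NodeManager shuts down if it
  cannot connect to the ResourceManager within this period. Set to -1 to retry
  forever.</description>
</property>
<property>
  <name>yarn.nodemanager.resourcemanager.connect.retry-interval.secs</name>
  <value>30</value>
  <description>Time interval, in seconds, between each NodeManager attempt to
  connect to the ResourceManager.</description>
</property>
{code}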
[jira] [Updated] (YARN-47) Security issues in YARN
[ https://issues.apache.org/jira/browse/YARN-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-47: Priority: Major (was: Blocker) Security issues in YARN Key: YARN-47 URL: https://issues.apache.org/jira/browse/YARN-47 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli JIRA tracking YARN-related security issues. Moving over the YARN-only items from MAPREDUCE-3101. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-396) Rationalize AllocateResponse in RM scheduler API
[ https://issues.apache.org/jira/browse/YARN-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-396: - Priority: Major (was: Blocker) Rationalize AllocateResponse in RM scheduler API Key: YARN-396 URL: https://issues.apache.org/jira/browse/YARN-396 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Arun C Murthy AllocateResponse contains an AMResponse and the cluster node count; the AMResponse holds the rest of the data. Unless there is a good reason for this object structure, there should be either AMResponse or AllocateResponse. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
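One possible shape of the flattened record, sketched purely for discussion: the AMResponse fields are folded straight into AllocateResponse so callers only ever deal with a single object. The method names are illustrative, not a committed API.
{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.Resource;

// Illustrative only: AMResponse fields merged into the allocate response.
public interface FlattenedAllocateResponseSketch {
  int getResponseId();
  boolean getReboot();
  List<Container> getAllocatedContainers();
  List<ContainerStatus> getCompletedContainersStatuses();
  Resource getAvailableResources();
  int getNumClusterNodes();
}
{code}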
[jira] [Updated] (YARN-99) Jobs fail during resource localization when directories in file cache reaches to unix directory limit
[ https://issues.apache.org/jira/browse/YARN-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-99: Priority: Major (was: Blocker) Jobs fail during resource localization when directories in file cache reaches to unix directory limit - Key: YARN-99 URL: https://issues.apache.org/jira/browse/YARN-99 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.0.0, 2.0.0-alpha Reporter: Devaraj K Assignee: Devaraj K If we have multiple jobs which use the distributed cache with small files, the directory limit is reached before the cache size limit is, and the NM fails to create any more directories in the file cache. The jobs start failing with the below exception. {code:xml} java.io.IOException: mkdir of /tmp/nm-local-dir/usercache/root/filecache/1701886847734194975 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:909) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:143) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:706) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:703) at org.apache.hadoop.fs.FileContext$FSLinkResolver.resolve(FileContext.java:2325) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:703) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:147) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {code} We should have a mechanism to clean the cache files when the number of directories crosses a specified limit, just as we do for the cache size. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
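One common way around a flat-directory fanout limit, shown here only to illustrate the idea and not as what the NodeManager does today, is to hash each cache entry into a small fixed-depth subdirectory tree so that no single directory ever accumulates more than a bounded number of children.
{code}
import java.io.File;

public class HierarchicalCachePathSketch {
  // Illustrative: spread cache ids over two levels of 36-way buckets so any
  // single directory stays far below the filesystem's subdirectory limit.
  static File cacheDirFor(File fileCacheRoot, long cacheId) {
    long id = Math.abs(cacheId);
    String bucket1 = Long.toString(id % 36, 36);
    String bucket2 = Long.toString((id / 36) % 36, 36);
    return new File(new File(new File(fileCacheRoot, bucket1), bucket2),
        Long.toString(cacheId));
  }
}
{code}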
[jira] [Updated] (YARN-34) Split/Cleanup YARN and MAPREDUCE documentation
[ https://issues.apache.org/jira/browse/YARN-34?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-34: Priority: Major (was: Blocker) Split/Cleanup YARN and MAPREDUCE documentation -- Key: YARN-34 URL: https://issues.apache.org/jira/browse/YARN-34 Project: Hadoop YARN Issue Type: Bug Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli Post YARN-1, we need to have a clear separation between YARN and MapReduce. We need to have separate sections on the site and in the docs - we already have separate documents. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-41) The RM should handle the graceful shutdown of the NM.
[ https://issues.apache.org/jira/browse/YARN-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-41: Priority: Major (was: Blocker) The RM should handle the graceful shutdown of the NM. - Key: YARN-41 URL: https://issues.apache.org/jira/browse/YARN-41 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 2.0.0-alpha Reporter: Ravi Teja Ch N V Assignee: Devaraj K Attachments: MAPREDUCE-3494.1.patch, MAPREDUCE-3494.2.patch, MAPREDUCE-3494.patch Instead of waiting for the NM expiry, the RM should remove and handle an NM that is shut down gracefully. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
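Purely as a sketch of the direction, the ResourceTracker protocol could gain an explicit unregister call that the NM issues from its stop path, letting the RM deactivate the node immediately instead of waiting for liveness expiry; the interface and method names below are hypothetical.
{code}
import org.apache.hadoop.yarn.api.records.NodeId;

// Hypothetical addition, named for illustration only: a new RPC on the
// ResourceTracker protocol that the NodeManager calls during graceful stop.
public interface GracefulNodeShutdownSketch {
  void unRegisterNodeManager(NodeId nodeId);
}
{code}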
[jira] [Updated] (YARN-414) [Umbrella] Usability issues in YARN
[ https://issues.apache.org/jira/browse/YARN-414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-414: - Priority: Major (was: Blocker) [Umbrella] Usability issues in YARN --- Key: YARN-414 URL: https://issues.apache.org/jira/browse/YARN-414 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Umbrella jira to track all forms of usability issues in YARN that need to be addressed before YARN can be considered stable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-69) RM should throw different exceptions for while querying app/node/queue
[ https://issues.apache.org/jira/browse/YARN-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-69: Priority: Major (was: Blocker) RM should throw different exceptions for while querying app/node/queue -- Key: YARN-69 URL: https://issues.apache.org/jira/browse/YARN-69 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli We should distinguish the exceptions for absent app/node/queue, illegally accessed app/node/queue etc. Today everything is a {{YarnRemoteException}}. We should extend {{YarnRemoteException}} to add {{NotFoundException}}, {{AccessControlException}} etc. Today, {{AccessControlException}} exists but not as part of the protocol descriptions (i.e. only available to Java). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
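A minimal sketch of the proposed hierarchy; the base class below is only a stand-in for the real YarnRemoteException so the sketch is self-contained, and the real change would also have to surface these types in the protocol descriptions rather than just to Java callers.
{code}
// "YarnRemoteException" here is a local stand-in for
// org.apache.hadoop.yarn.exceptions.YarnRemoteException; only the shape of
// the hierarchy is the point, not the exact constructors.
class YarnRemoteException extends Exception {
  YarnRemoteException(String message) { super(message); }
}

class NotFoundException extends YarnRemoteException {
  NotFoundException(String message) { super(message); }
}

class AccessControlException extends YarnRemoteException {
  AccessControlException(String message) { super(message); }
}
{code}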
[jira] [Updated] (YARN-397) RM Scheduler api enhancements
[ https://issues.apache.org/jira/browse/YARN-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-397: - Priority: Major (was: Blocker) RM Scheduler api enhancements - Key: YARN-397 URL: https://issues.apache.org/jira/browse/YARN-397 Project: Hadoop YARN Issue Type: Bug Reporter: Arun C Murthy Umbrella jira tracking enhancements to RM apis. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira