[jira] [Created] (YARN-1005) Log aggregators should check for FSDataOutputStream close before renaming to aggregated file.
Rohith Sharma K S created YARN-1005:
------------------------------------

             Summary: Log aggregators should check for FSDataOutputStream close before renaming to aggregated file.
                 Key: YARN-1005
                 URL: https://issues.apache.org/jira/browse/YARN-1005
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.1.0-beta, 2.0.5-alpha
            Reporter: Rohith Sharma K S

If AggregatedLogFormat.LogWriter.closeWriter() is interrupted, remoteNodeTmpLogFileForApp is still renamed to the remoteNodeLogFileForApp file. The renamed file does not contain valid aggregated logs, and may not even be in BCFile format. This causes failures when viewing logs from the JobHistoryServer web page.

{noformat}
2013-07-27 18:51:14,787 ERROR org.apache.hadoop.yarn.webapp.View: Error getting logs for job_1374918614757_0002
java.io.IOException: Not a valid BCFile.
	at org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927)
	at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:628)
	at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
	at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:337)
	at org.apache.hadoop.yarn.webapp.log.AggregatedLogsBlock.render(AggregatedLogsBlock.java:89)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:64)
	at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:74)
{noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
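The fix direction described in the report -- only promote the temp file once the output stream has demonstrably closed -- can be sketched with plain java.nio file APIs. This is an illustration only, not the Hadoop aggregation code; `SafeRename`, `writeAndPublish`, and `demo` are hypothetical names:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeRename {
    // Rename the temp log to its final name only after the writer has been
    // closed successfully. If the write or close fails (e.g. is interrupted),
    // the rename is skipped, so a corrupt file is never published.
    static boolean writeAndPublish(Path tmp, Path dst, String data) {
        try {
            try (Writer w = Files.newBufferedWriter(tmp)) {
                w.write(data);
            } // try-with-resources closes here; an exception skips the rename
            Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING);
            return true;
        } catch (IOException e) {
            return false; // temp file is left behind, never promoted
        }
    }

    // Small self-check exercising the happy path.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("agg");
            Path tmp = dir.resolve("app.tmp");
            Path dst = dir.resolve("app.log");
            boolean ok = writeAndPublish(tmp, dst, "aggregated logs");
            return ok && Files.exists(dst) && !Files.exists(tmp)
                && new String(Files.readAllBytes(dst)).equals("aggregated logs");
        } catch (IOException e) {
            return false;
        }
    }
}
```

The key design point is that the rename sits after the try-with-resources block, so it is structurally unreachable unless close() returned normally.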
[jira] [Created] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.
Rohith Sharma K S created YARN-1061:
------------------------------------

             Summary: NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.
                 Key: YARN-1061
                 URL: https://issues.apache.org/jira/browse/YARN-1061
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.0.5-alpha
            Reporter: Rohith Sharma K S

In one observed scenario, the NodeManager waits indefinitely for the nodeHeartbeat response while the ResourceManager is in a hung state. The NodeManager should get a timeout exception instead of waiting indefinitely.
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737990#comment-13737990 ]

Rohith Sharma K S commented on YARN-1061:
-----------------------------------------

Extracted thread dump from the NodeManager:

{noformat}
"Node Status Updater" prio=10 tid=0x414dc000 nid=0x1d754 in Object.wait() [0x7fefa2dec000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:485)
	at org.apache.hadoop.ipc.Client.call(Client.java:1231)
	- locked <0xdef4f158> (a org.apache.hadoop.ipc.Client$Call)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
	at $Proxy28.nodeHeartbeat(Unknown Source)
	at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
	at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
	at $Proxy30.nodeHeartbeat(Unknown Source)
	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:348)
{noformat}
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739178#comment-13739178 ]

Rohith Sharma K S commented on YARN-1061:
-----------------------------------------

The actual issue was observed in a 5-node cluster (1 RM and 5 NMs). It is hard to reproduce the hung-ResourceManager scenario in a real cluster, but it can be simulated by manually bringing the ResourceManager to a hung state with the Linux command kill -STOP RM_PID. All NM-RM calls then wait indefinitely. Another case where the indefinite wait can be observed is adding a new NodeManager while the ResourceManager is hung.
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748469#comment-13748469 ]

Rohith Sharma K S commented on YARN-1061:
-----------------------------------------

I added all the IPC configurations to the log4j.properties file; still the same issue recurred.

bq. How can NM wait infinitely? I mean what is your connection timeout set to?

When I debugged the issue, I found that it is an issue in the IPC layer. The problem occurs in DataNode-to-NameNode communication as well. When the process is in the T state (for a running process, the state is S; this can be seen with ps -p pid -o pid,stat), i.e. the process has been stopped using kill -STOP pid, the IPC proxy does not throw any timeout exception. This is because, during proxy creation, the RPC timeout is hardcoded to zero in the RPC.waitForProtocolProxy method. Setting the RPC timeout to zero means the IPC call never throws a timeout exception; the IPC client always retries with sendPing to the server (RM). This can be seen in the Client.handleTimeout method:

{noformat}
private void handleTimeout(SocketTimeoutException e) throws IOException {
  if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
    throw e;
  } else {
    sendPing();
  }
}
{noformat}

I think the RPC timeout should be taken from configuration instead of being hardcoded to 0.
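The behaviour of handleTimeout described above can be reduced to a small decision function. This is a sketch of the branch logic only (`TimeoutPolicy` and `onSocketTimeout` are illustrative names, not the Hadoop source): with rpcTimeout == 0, a socket timeout never propagates, so the client pings and keeps waiting forever.

```java
public class TimeoutPolicy {
    enum Action { RETHROW, SEND_PING }

    // Mirrors the branch in Client.handleTimeout: a SocketTimeoutException is
    // only rethrown when the connection is closing, the client has stopped,
    // or a positive rpcTimeout was configured. With rpcTimeout == 0 the
    // client sends a ping and goes back to waiting, indefinitely.
    static Action onSocketTimeout(boolean closing, boolean running, int rpcTimeout) {
        if (closing || !running || rpcTimeout > 0) {
            return Action.RETHROW;
        }
        return Action.SEND_PING;
    }
}
```

The proposed fix amounts to making the third operand of that condition come from configuration, so it is positive and the RETHROW branch becomes reachable on a hung server.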
[jira] [Moved] (YARN-1112) MR AppMaster command options does not replace @taskid@ with the current task ID.
[ https://issues.apache.org/jira/browse/YARN-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S moved MAPREDUCE-5460 to YARN-1112:
----------------------------------------------------

          Component/s:     (was: applicationmaster)
                           (was: mrv2)
             Assignee:     (was: Rohith Sharma K S)
     Target Version/s:     (was: 3.0.0, 2.1.1-beta)
    Affects Version/s:     (was: 2.1.1-beta)
                           (was: 3.0.0)
                           2.1.1-beta
                           3.0.0
                  Key: YARN-1112  (was: MAPREDUCE-5460)
              Project: Hadoop YARN  (was: Hadoop Map/Reduce)

MR AppMaster command options does not replace @taskid@ with the current task ID.
--------------------------------------------------------------------------------

                 Key: YARN-1112
                 URL: https://issues.apache.org/jira/browse/YARN-1112
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.0.0, 2.1.1-beta
            Reporter: Chris Nauroth

The description of {{yarn.app.mapreduce.am.command-opts}} in mapred-default.xml states that occurrences of {{@taskid@}} will be replaced by the current task ID. This substitution is not happening.
[jira] [Updated] (YARN-1112) MR AppMaster command options does not replace @taskid@ with the current task ID.
[ https://issues.apache.org/jira/browse/YARN-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-1112:
------------------------------------

    Attachment: YARN-1112.patch

Attaching a patch for the replacement of @appid@ in am.command_opts. @appid@ is replaced with the app attempt id.
[jira] [Updated] (YARN-1145) Potential file handle leak in aggregated logs web ui
[ https://issues.apache.org/jira/browse/YARN-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-1145:
------------------------------------

    Attachment: YARN-1145.patch

Thank you Vinod Kumar Vavilapalli and Jason Lowe for reviewing the patch :-) I have addressed Vinod's comments and attached an updated patch. Please review the updated patch.

Potential file handle leak in aggregated logs web ui
----------------------------------------------------

                 Key: YARN-1145
                 URL: https://issues.apache.org/jira/browse/YARN-1145
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.0.5-alpha, 0.23.9, 2.1.1-beta
            Reporter: Rohith Sharma K S
            Assignee: Rohith Sharma K S
         Attachments: MAPREDUCE-5486.patch, YARN-1145.patch

If there is any problem getting aggregated logs for rendering on the web UI, the LogReader is not closed. The unclosed reader leaves many connections in CLOSE_WAIT state.

hadoopuser@hadoopuser: jps
*27909* JobHistoryServer

The DataNode port is 50010. When grepped for the DataNode port, many connections from the JHS are in CLOSE_WAIT:

hadoopuser@hadoopuser: netstat -tanlp | grep 50010
tcp  0  0  10.18.40.48:50010  0.0.0.0:*           LISTEN      21453/java
tcp  1  0  10.18.40.48:20596  10.18.40.48:50010   CLOSE_WAIT  *27909*/java
tcp  1  0  10.18.40.48:19667  10.18.40.152:50010  CLOSE_WAIT  *27909*/java
tcp  1  0  10.18.40.48:20593  10.18.40.48:50010   CLOSE_WAIT  *27909*/java
tcp  1  0  10.18.40.48:12290  10.18.40.48:50010   CLOSE_WAIT  *27909*/java
tcp  1  0  10.18.40.48:19662  10.18.40.152:50010  CLOSE_WAIT  *27909*/java
[jira] [Updated] (YARN-1145) Potential file handle leak in aggregated logs web ui
[ https://issues.apache.org/jira/browse/YARN-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-1145:
------------------------------------

    Attachment: YARN-1145.1.patch

Handled cleanup during reader creation. The previous patch missed this cleanup.
[jira] [Updated] (YARN-1145) Potential file handle leak in aggregated logs web ui
[ https://issues.apache.org/jira/browse/YARN-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-1145:
------------------------------------

    Attachment: YARN-1145.2.patch

Please ignore YARN-1145.1.patch. All the comments have been addressed in YARN-1145.2.patch. Please consider this patch for review.
[jira] [Updated] (YARN-1145) Potential file handle leak in aggregated logs web ui
[ https://issues.apache.org/jira/browse/YARN-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-1145:
------------------------------------

    Attachment: YARN-1145.3.patch

Modified the patch to close the streams only on return from the render method.
[jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13809382#comment-13809382 ]

Rohith Sharma K S commented on YARN-1366:
-----------------------------------------

Hi Bikas, I have gone through the PDF file attached to YARN-556 and understood the overall idea behind this subtask. I have some doubts, please clarify:

1. "Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM" -- I understood this as: reset lastResponseID to 0, and do not clear ask, release, blacklistAdditions and blacklistRemovals. Am I correct?
2. During RM restart, the RM gets a new AMRMTokenSecretManager, so the passwords will differ. Is this handled on the RM side during recovery for each individual application? Otherwise the impact is that the heartbeat to the restarted RM fails with an authentication error ("password does not match").

ApplicationMasterService should Resync with the AM upon allocate call after restart
-----------------------------------------------------------------------------------

                 Key: YARN-1366
                 URL: https://issues.apache.org/jira/browse/YARN-1366
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: resourcemanager
            Reporter: Bikas Saha

The ApplicationMasterService currently sends a resync response to which the AM responds by shutting down. The AM behavior is expected to change to resyncing with the RM. Resync means resetting the allocate RPC sequence number to 0 and the AM should send its entire outstanding request to the RM. Note that if the AM is making its first allocate call to the RM then things should proceed like normal without needing a resync. The RM will return all containers that have completed since the RM last synced with the AM. Some container completions may be reported more than once.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
[jira] [Updated] (YARN-1145) Potential file handle leak in aggregated logs web ui
[ https://issues.apache.org/jira/browse/YARN-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-1145:
------------------------------------

    Attachment: YARN-1145.4.patch

Apologies for the delayed response. Thank you Vinod for reviewing the patch :-) Attaching a patch addressing all of Vinod's comments. For the 5th comment, I added try{}finally{} around the whole render method in AggregatedLogsBlock.java. Even though the patch shows a large diff (since try/finally was added around the whole render method), the modified code is:

{noformat}
protected void render(Block html) {
+  AggregatedLogFormat.LogReader reader = null;
+  try {
     // render block : NO CHANGE
     Path remoteRootLogDir = new Path(conf.get(
         YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
         YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
-    AggregatedLogFormat.LogReader reader = null;
     // render block : NO CHANGE
+  } finally {
+    if (reader != null) {
+      reader.close();
+    }
+  }
}
{noformat}
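The pattern in that patch -- declare the reader outside the try, close it in finally if it was ever opened -- can be sketched in plain Java. This is an illustration of the cleanup shape only; `ReaderCleanup`, `LeakyReader`, and `renderLogs` are hypothetical names, not the Hadoop classes:

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicInteger;

public class ReaderCleanup {
    // Counts currently-open readers, so a handle leak is observable.
    static final AtomicInteger OPEN = new AtomicInteger();

    static class LeakyReader implements Closeable {
        LeakyReader() { OPEN.incrementAndGet(); }
        String read(boolean fail) {
            if (fail) throw new IllegalStateException("bad aggregated log");
            return "logs";
        }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    // Mirrors the fixed render(): the reader is closed on every exit path,
    // including when reading the aggregated log throws mid-render.
    static String renderLogs(boolean fail) {
        LeakyReader reader = null;
        try {
            reader = new LeakyReader();
            return reader.read(fail);
        } catch (IllegalStateException e) {
            return "error page";
        } finally {
            if (reader != null) {
                reader.close();
            }
        }
    }
}
```

Without the finally block, every failed render would leave one reader (and its underlying DataNode connection) open, which is exactly the CLOSE_WAIT buildup reported in this issue.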
[jira] [Updated] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
[ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohith Sharma K S updated YARN-1366:
------------------------------------

    Attachment: YARN-1366.patch

Correct me if I am wrong; I have prepared an initial patch and attached it. The RM should differentiate between the Resync and Shutdown commands. Please review whether this fulfills the expectations mentioned in the JIRA.
[jira] [Commented] (YARN-1398) Deadlock in capacity scheduler leaf queue and parent queue for getQueueInfo and completedContainer call
[ https://issues.apache.org/jira/browse/YARN-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826421#comment-13826421 ]

Rohith Sharma K S commented on YARN-1398:
-----------------------------------------

Hi Sunil, I think this is the same as https://issues.apache.org/jira/browse/YARN-325.

Deadlock in capacity scheduler leaf queue and parent queue for getQueueInfo and completedContainer call
-------------------------------------------------------------------------------------------------------

                 Key: YARN-1398
                 URL: https://issues.apache.org/jira/browse/YARN-1398
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.2.0
            Reporter: Sunil G
            Priority: Critical

getQueueInfo in ParentQueue calls child.getQueueInfo(), which tries to acquire the leaf queue lock while holding the parent queue lock. If at the same time a completedContainer call has acquired the LeafQueue lock and is waiting on the ParentQueue's completedContainer call, the two threads deadlock. The lock acquisition order is not consistent and can lead to deadlock. With JCarder, this shows up as a potential deadlock scenario.
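The inconsistent-ordering problem described above can be sketched with two plain locks. This is a minimal illustration of the standard remedy -- every code path acquires the locks in the same order, here parent before child -- and not the actual CapacityScheduler fix; `LockOrdering` and its methods are hypothetical names:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrdering {
    static final ReentrantLock PARENT = new ReentrantLock();
    static final ReentrantLock CHILD = new ReentrantLock();

    // Both operations take parent -> child, so no wait cycle can form.
    static void getQueueInfo() {
        PARENT.lock();
        try {
            CHILD.lock();
            try { /* read child queue info */ } finally { CHILD.unlock(); }
        } finally { PARENT.unlock(); }
    }

    static void completedContainer() {
        // The buggy order would be CHILD then PARENT; the consistent order is:
        PARENT.lock();
        try {
            CHILD.lock();
            try { /* update child, then parent accounting */ } finally { CHILD.unlock(); }
        } finally { PARENT.unlock(); }
    }

    // Hammer both operations from two threads; with consistent ordering,
    // both threads always finish instead of deadlocking.
    static boolean demo() {
        Thread a = new Thread(() -> { for (int i = 0; i < 2000; i++) getQueueInfo(); });
        Thread b = new Thread(() -> { for (int i = 0; i < 2000; i++) completedContainer(); });
        a.start(); b.start();
        try { a.join(10_000); b.join(10_000); } catch (InterruptedException e) { return false; }
        return !a.isAlive() && !b.isAlive();
    }
}
```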
[jira] [Commented] (YARN-1469) ApplicationMaster crash cause the TaskAttemptImpl couldn't handle the TA_TOO_MANY_FETCH_FAILURE at KILLED
[ https://issues.apache.org/jira/browse/YARN-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837533#comment-13837533 ]

Rohith Sharma K S commented on YARN-1469:
-----------------------------------------

This is a duplicate of https://issues.apache.org/jira/browse/MAPREDUCE-5409.

ApplicationMaster crash cause the TaskAttemptImpl couldn't handle the TA_TOO_MANY_FETCH_FAILURE at KILLED
---------------------------------------------------------------------------------------------------------

                 Key: YARN-1469
                 URL: https://issues.apache.org/jira/browse/YARN-1469
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: qus-jiawei
         Attachments: job_1384857622207_15-amlog.txt

This bug can happen when using the decommission command to decommission a NodeManager. The details are below:
1. A job is running happily on the YARN cluster, and some MapTasks finish on machine A; the reduce tasks then begin to be scheduled. At this point the MapTasks' state is SUCCEEDED.
2. The Hadoop admin decommissions machine A's NodeManager.
3. The ApplicationMaster finds that some MapTasks finished on a decommissioned NodeManager and changes those MapTasks' state to KILLED.
4. Some running ReduceTasks cannot fetch the data from the MapTask and throw a TA_TOO_MANY_FETCH_FAILURE event to TaskAttemptImpl.
5. TaskAttemptImpl cannot handle TA_TOO_MANY_FETCH_FAILURE in the KILLED state and throws an exception, causing the ApplicationMaster to go to ERROR.

I think TaskAttemptImpl could just ignore the TA_TOO_MANY_FETCH_FAILURE event in the KILLED state.
[jira] [Commented] (YARN-3790) TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595372#comment-14595372 ]

Rohith Sharma K S commented on YARN-3790:
-----------------------------------------

[~jianhe] Do you have any comments on the patch?

TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler
------------------------------------------------------------------------------------------------

                 Key: YARN-3790
                 URL: https://issues.apache.org/jira/browse/YARN-3790
             Project: Hadoop YARN
          Issue Type: Bug
          Components: fairscheduler, test
            Reporter: Rohith Sharma K S
            Assignee: zhihai xu
         Attachments: YARN-3790.000.patch

The failure trace is as follows:

{noformat}
Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart
testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart)  Time elapsed: 6.502 sec  <<< FAILURE!
java.lang.AssertionError: expected:<6144> but was:<8192>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853)
	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342)
	at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241)
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3001) RM dies because of divide by zero
[ https://issues.apache.org/jira/browse/YARN-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595325#comment-14595325 ]

Rohith Sharma K S commented on YARN-3001:
-----------------------------------------

Hi [~huizane], thanks for the reply. Would you please attach the RM logs if you have them?

RM dies because of divide by zero
---------------------------------

                 Key: YARN-3001
                 URL: https://issues.apache.org/jira/browse/YARN-3001
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.5.1
            Reporter: hoelog
            Assignee: Rohith Sharma K S

The RM dies because of a divide-by-zero exception.

{code}
2014-12-31 21:27:05,022 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.ArithmeticException: / by zero
	at org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator.computeAvailableContainers(DefaultResourceCalculator.java:37)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1332)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1218)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1177)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:877)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:656)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:570)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:851)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:900)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599)
	at java.lang.Thread.run(Thread.java:745)
2014-12-31 21:27:05,023 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
{code}
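The failing computation at the top of that stack -- available resources divided by the requested container size -- can be sketched as follows. This is an illustration with a hypothetical guard, not the actual DefaultResourceCalculator source; `AvailableContainers` is an assumed name:

```java
public class AvailableContainers {
    // Sketch of the failing computation: dividing available memory by the
    // requested container memory throws ArithmeticException when the request
    // is 0 MB, which is what kills the scheduler event dispatcher here.
    static int computeAvailableContainers(int availableMem, int requiredMem) {
        if (requiredMem <= 0) {
            // Defensive guard: treat a non-positive request as unsatisfiable
            // instead of crashing the ResourceManager.
            return 0;
        }
        return availableMem / requiredMem;
    }
}
```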
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603382#comment-14603382 ]

Rohith Sharma K S commented on YARN-3849:
-----------------------------------------

I mean for TestProportionalPreemptionPolicy.

Too much of preemption activity causing continuous killing of containers across queues
--------------------------------------------------------------------------------------

                 Key: YARN-3849
                 URL: https://issues.apache.org/jira/browse/YARN-3849
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacityscheduler
    Affects Versions: 2.7.0
            Reporter: Sunil G
            Assignee: Sunil G
            Priority: Critical

Two queues are used, each given a capacity of 0.5. The Dominant Resource policy is used.
1. An app is submitted in QueueA which consumes the full cluster capacity.
2. After submitting an app in QueueB, there is some demand, invoking preemption in QueueA.
3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are killed in QueueA.
4. Now the app in QueueB tries to take over the cluster with the current free space. But there is updated demand from the app in QueueA, which lost its containers earlier, and preemption kicks in for QueueB.
The scenario in steps 3 and 4 happens continuously in a loop, so neither app completes.
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603308#comment-14603308 ] Rohith Sharma K S commented on YARN-3849: - The below is the log trace for the issue. In our cluster, there are 3 NodeManager and each with resource {{memory:327680, vCores:35}}. Total cluster resource is {{clusterResource: memory:983040, vCores:105}} with CapacityScheduler configured queue's with name *default* and *QueueA*. # Application app-1 is submitted to queue default and containers are started running the applications with 10 containers,each with {{resource: memory:1024, vCores:10}}. so total used is {{usedResources=memory:10240, vCores:91}} {noformat} default user=spark used=memory:10240, vCores:91 numContainers=10 headroom = memory:1024, vCores:10 user-resources=memory:10240, vCores:91 Re-sorting assigned queue: root.default stats: default: capacity=0.5, absoluteCapacity=0.5, usedResources=memory:10240, vCores:91, usedCapacity=1.733, absoluteUsedCapacity=0.867, numApps=1, numContainers=10 {noformat} *NOTE : Resource allocation is by CPU DOMINANT* After 10 container running, available NodeManagers memory is {noformat} linux-174, available: memory:323584, vCores:4 linux-175, available: memory:324608, vCores:5 linux-223, available: memory:324608, vCores:5 {noformat} # Application app-2 is submitted to QueueA. 
The ApplicationMaster container starts running, and the NodeManager availability is {{available: memory:322560, vCores:3}} {noformat} Assigned container container_1435072598099_0002_01_01 of capacity memory:1024, vCores:1 on host linux-174:26009, which has 5 containers, memory:5120, vCores:32 used and memory:322560, vCores:3 available after allocation | SchedulerNode.java:154 linux-174, available: memory:322560, vCores:3 {noformat} # The preemption policy does the calculation below {noformat} 2015-06-23 23:20:51,127 NAME: QueueA CUR: memory:0, vCores:0 PEN: memory:0, vCores:0 GAR: memory:491520, vCores:52 NORM: NaN IDEAL_ASSIGNED: memory:0, vCores:0 IDEAL_PREEMPT: memory:0, vCores:0 ACTUAL_PREEMPT: memory:0, vCores:0 UNTOUCHABLE: memory:0, vCores:0 PREEMPTABLE: memory:0, vCores:0 2015-06-23 23:20:51,128 NAME: default CUR: memory:851968, vCores:91 PEN: memory:0, vCores:0 GAR: memory:491520, vCores:52 NORM: 1.0 IDEAL_ASSIGNED: memory:851968, vCores:91 IDEAL_PREEMPT: memory:0, vCores:0 ACTUAL_PREEMPT: memory:0, vCores:0 UNTOUCHABLE: memory:0, vCores:0 PREEMPTABLE: memory:360448, vCores:39 {noformat} In the above log, observe that for the queue default *CUR is memory:851968, vCores:91*, but actually *usedResources=memory:10240, vCores:91*. Here only CPU matches, not MEMORY. CUR is calculated with the formulas below #* CUR= {{clusterResource: memory:983040, vCores:105}} * {{absoluteUsedCapacity(0.867)}} = {{memory:851968, vCores:91}} #* GAR= {{clusterResource: memory:983040, vCores:105}} * {{absoluteCapacity(0.5)}} = {{ memory:491520, vCores:52}} #* PREEMPTABLE= CUR - GAR = {{memory:360448, vCores:39}} # App-2 requests containers with {{resource: memory:1024, vCores:10}}. 
So the preemption cycle computes how much to preempt {noformat} 2015-06-23 23:21:03,131 | DEBUG | SchedulingMonitor (ProportionalCapacityPreemptionPolicy) | 1435072863131: NAME: default CUR: memory:851968, vCores:91 PEN: memory:0, vCores:0 GAR: memory:491520, vCores:52 NORM: NaN IDEAL_ASSIGNED: memory:491520, vCores:52 IDEAL_PREEMPT: memory:97043, vCores:10 ACTUAL_PREEMPT: memory:0, vCores:0 UNTOUCHABLE: memory:0, vCores:0 PREEMPTABLE: memory:360448, vCores:39 {noformat} Observe *IDEAL_PREEMPT: memory:97043, vCores:10*: app-2 in QueueA needs only 10 vCores to be preempted, yet 97043 memory is marked for preemption even though memory is sufficiently available. Below are the calculations that produce IDEAL_PREEMPT, #* totalPreemptionAllowed = clusterResource: memory:983040, vCores:105 * 0.1 = memory:98304, vCores:10.5 #* totPreemptionNeeded = CUR - IDEAL_ASSIGNED = CUR: memory:851968, vCores:91 #* scalingFactor = Resources.divide(drc, memory:491520, vCores:52, memory:98304, vCores:10.5, memory:851968, vCores:91); scalingFactor = 0.114285715 #* toBePreempted = CUR: memory:851968, vCores:91 * scalingFactor(0.1139045128455529) = memory:97368, vCores:10 {{resource-to-obtain = memory:97043, vCores:10}} *So the problem is in one of the steps below* # As [~sunilg] said, usedResources=memory:10240, vCores:91, but the preemption policy wrongly calculates the current used capacity as {{memory:851968, vCores:91}}. This is mainly because the preemption policy uses absoluteUsedCapacity to calculate current usage, which always gives a wrong result for one of the resources when the DominantResourceCalculator is used. I think a fraction should not be used, which causes the problem in DRC (multi-dimensional resources); instead we should use usedResources from CSQueue. # Even bypassing
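The CUR mismatch described above can be reproduced with plain arithmetic. The sketch below is hypothetical stand-in code (the class and method names are not from Hadoop) that applies the same formula, CUR = clusterResource * absoluteUsedCapacity, component-wise to the numbers from the log:

```java
// Standalone sketch (hypothetical class, not the actual Hadoop code) of the
// CUR calculation described above: the preemption policy scales the whole
// cluster resource by the dominant used fraction, so the non-dominant
// resource (memory here) gets wildly inflated.
public class CurCalculationSketch {

    // Scale one resource dimension by a fraction and round, as the policy does.
    static long multiplyAndRound(long value, double by) {
        return Math.round(value * by);
    }

    public static void main(String[] args) {
        long clusterMemory = 983040, clusterVcores = 105; // from the log
        long usedMemory = 10240, usedVcores = 91;         // actual usedResources

        // Under DRC, absoluteUsedCapacity is the max of the per-resource ratios.
        double absUsedCapacity = Math.max(
            (double) usedMemory / clusterMemory,   // ~0.0104
            (double) usedVcores / clusterVcores);  // ~0.867 (dominant)

        long curMemory = multiplyAndRound(clusterMemory, absUsedCapacity);
        long curVcores = multiplyAndRound(clusterVcores, absUsedCapacity);

        // vCores match actual usage, but memory is 851968 instead of 10240.
        System.out.println("CUR = memory:" + curMemory + ", vCores:" + curVcores);
    }
}
```

Running it prints {{CUR = memory:851968, vCores:91}}, matching the log exactly while the actual used memory is only 10240, which is the discrepancy the comment points out.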
[jira] [Commented] (YARN-3849) Too much preemption activity causing continuous killing of containers across queues
[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603375#comment-14603375 ] Rohith Sharma K S commented on YARN-3849: - For the test, how about using a parameterized test class that runs with both defaultRC and dominantRC? Too much preemption activity causing continuous killing of containers across queues - Key: YARN-3849 URL: https://issues.apache.org/jira/browse/YARN-3849 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used. 1. An app is submitted in QueueA, which consumes the full cluster capacity 2. After an app is submitted in QueueB, there is some demand, and preemption is invoked in QueueA 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are killed in QueueA 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is updated demand from the app in QueueA, which lost its containers earlier, and preemption now kicks in against QueueB. Steps 3 and 4 keep happening in a loop; thus none of the apps complete. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
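One way to read the suggestion: run the same test body once per resource calculator. The sketch below is a plain-Java stand-in for that idea (the real test would use JUnit's Parameterized runner with DefaultResourceCalculator and DominantResourceCalculator; the fraction functions here are simplified assumptions of their behavior, not the Hadoop implementations):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Plain-Java sketch of parameterizing one test body over two resource
// calculators. The fraction functions are simplified stand-ins for
// DefaultResourceCalculator (memory only) and DominantResourceCalculator
// (max ratio across resources).
public class CalculatorParameterizationSketch {

    static double defaultFraction(long[] used, long[] cluster) {
        return (double) used[0] / cluster[0]; // memory only
    }

    static double dominantFraction(long[] used, long[] cluster) {
        return Math.max((double) used[0] / cluster[0],
                        (double) used[1] / cluster[1]);
    }

    public static void main(String[] args) {
        long[] cluster = { 983040, 105 }; // memory, vCores from the log
        long[] used = { 10240, 91 };

        Map<String, BiFunction<long[], long[], Double>> calculators = new LinkedHashMap<>();
        calculators.put("defaultRC", CalculatorParameterizationSketch::defaultFraction);
        calculators.put("dominantRC", CalculatorParameterizationSketch::dominantFraction);

        // The same "test body" runs once per parameter, like @Parameterized.
        for (Map.Entry<String, BiFunction<long[], long[], Double>> e : calculators.entrySet()) {
            double fraction = e.getValue().apply(used, cluster);
            System.out.printf("%s: usedCapacity=%.3f%n", e.getKey(), fraction);
        }
    }
}
```

With the cluster numbers from the log, the two calculators produce very different used-capacity fractions (~0.01 vs ~0.867), which is why running the preemption test under both would catch DRC-specific bugs like this one.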
[jira] [Updated] (YARN-3790) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container
[ https://issues.apache.org/jira/browse/YARN-3790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3790: Summary: usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container (was: TestWorkPreservingRMRestart#testSchedulerRecovery fails intermittently in trunk for FS scheduler) usedResource from rootQueue metrics may get stale data for FS scheduler after recovering the container -- Key: YARN-3790 URL: https://issues.apache.org/jira/browse/YARN-3790 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, test Reporter: Rohith Sharma K S Assignee: zhihai xu Attachments: YARN-3790.000.patch Failure trace is as follows {noformat} Tests run: 28, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 284.078 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart testSchedulerRecovery[1](org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart) Time elapsed: 6.502 sec FAILURE! java.lang.AssertionError: expected:6144 but was:8192 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.assertMetrics(TestWorkPreservingRMRestart.java:853) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.checkFSQueue(TestWorkPreservingRMRestart.java:342) at org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart.testSchedulerRecovery(TestWorkPreservingRMRestart.java:241) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662754#comment-14662754 ] Rohith Sharma K S commented on YARN-3250: - Thanks [~eepayne] and [~leftnoteasy] for the suggestion. I have taken care of this pattern in the ApplicationCLI change. Since the current JIRA is only for the admin proto changes and RMAdminCLI, the ApplicationCLI changes are done in YARN-4014. I have updated version-1 patches for both, i.e. the current JIRA and YARN-4014; kindly review both patches. Support admin cli interface in for Application Priority --- Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Attachments: 0001-YARN-3250-V1.patch The current Application Priority Manager supports configuration only via file. To support runtime configuration via admin CLI and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4034) Render cluster Max Priority in scheduler metrics in RM web UI
Rohith Sharma K S created YARN-4034: --- Summary: Render cluster Max Priority in scheduler metrics in RM web UI Key: YARN-4034 URL: https://issues.apache.org/jira/browse/YARN-4034 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Currently the Scheduler Metrics section renders the common scheduler metrics in the RM web UI. It would be helpful for the user to know the configured cluster max priority from the web UI. So, on the RM web UI front page, Scheduler Metrics can render the configured max cluster priority. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662757#comment-14662757 ] Rohith Sharma K S commented on YARN-4014: - Should a {{getClusterMaxPriority}} API be exposed to users, i.e. via ApplicationClientProtocol, even though the RM takes care of resetting to the cluster max priority? Any thoughts? Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4034) Render cluster Max Priority in scheduler metrics in RM web UI
[ https://issues.apache.org/jira/browse/YARN-4034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4034: Priority: Minor (was: Major) Render cluster Max Priority in scheduler metrics in RM web UI - Key: YARN-4034 URL: https://issues.apache.org/jira/browse/YARN-4034 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Priority: Minor Currently the Scheduler Metrics section renders the common scheduler metrics in the RM web UI. It would be helpful for the user to know the configured cluster max priority from the web UI. So, on the RM web UI front page, Scheduler Metrics can render the configured max cluster priority. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4035) Some tests in TestRMAdminService fail with NPE
Rohith Sharma K S created YARN-4035: --- Summary: Some tests in TestRMAdminService fail with NPE Key: YARN-4035 URL: https://issues.apache.org/jira/browse/YARN-4035 Project: Hadoop YARN Issue Type: Bug Reporter: Rohith Sharma K S It is observed that after YARN-4019 some tests in TestRMAdminService are failing with null pointer exceptions; see the [build failure |https://builds.apache.org/job/PreCommit-YARN-Build/8792/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt] {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService Tests run: 19, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 11.541 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService testModifyLabelsOnNodesWithDistributedConfigurationDisabled(org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService) Time elapsed: 0.132 sec ERROR! 
java.lang.NullPointerException: null at org.apache.hadoop.util.JvmPauseMonitor.stop(JvmPauseMonitor.java:86) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:601) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:983) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1038) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1085) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.AbstractService.close(AbstractService.java:250) at org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService.testRemoveClusterNodeLabelsWithDistributedConfigurationEnabled(TestRMAdminService.java:867) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4035) Some tests in TestRMAdminService fail with NPE
[ https://issues.apache.org/jira/browse/YARN-4035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4035: Affects Version/s: 2.8.0 Some tests in TestRMAdminService fail with NPE Key: YARN-4035 URL: https://issues.apache.org/jira/browse/YARN-4035 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.8.0 Reporter: Rohith Sharma K S It is observed that after YARN-4019 some tests in TestRMAdminService are failing with null pointer exceptions; see the [build failure |https://builds.apache.org/job/PreCommit-YARN-Build/8792/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt] {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService Tests run: 19, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 11.541 sec FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService testModifyLabelsOnNodesWithDistributedConfigurationDisabled(org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService) Time elapsed: 0.132 sec ERROR! 
java.lang.NullPointerException: null at org.apache.hadoop.util.JvmPauseMonitor.stop(JvmPauseMonitor.java:86) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:601) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:983) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1038) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1085) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.AbstractService.close(AbstractService.java:250) at org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService.testModifyLabelsOnNodesWithDistributedConfigurationDisabled(TestRMAdminService.java:824) testRemoveClusterNodeLabelsWithDistributedConfigurationEnabled(org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService) Time elapsed: 0.121 sec ERROR! 
java.lang.NullPointerException: null at org.apache.hadoop.util.JvmPauseMonitor.stop(JvmPauseMonitor.java:86) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:601) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.stopActiveServices(ResourceManager.java:983) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1038) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStop(ResourceManager.java:1085) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.AbstractService.close(AbstractService.java:250) at org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService.testRemoveClusterNodeLabelsWithDistributedConfigurationEnabled(TestRMAdminService.java:867) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695038#comment-14695038 ] Rohith Sharma K S commented on YARN-4014: - Tried the syntax app-id, but the options parser does not take app-id as valid input. Maybe this is the reason other commands use camel case. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4014: Attachment: 0001-YARN-4014.patch Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695085#comment-14695085 ] Rohith Sharma K S commented on YARN-4014: - Updated the working patch with test cases, kindly review it. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14695084#comment-14695084 ] Rohith Sharma K S commented on YARN-4014: - Thanks Sunil G for the review. bq. In ApplicationCLI, public static final String SET_PRIORITY = setPriority; Done, changed to updatePriority. bq. In future --appId can be used with other parameters also, correct? Yes, done. bq. updateApplicationPriority can throw NumberFormatException Since the exception is thrown directly back to the client CLI, I think this should be fine. bq. ClientRMService.java has few commented code. Yes; since YARN-3887 was not committed, I used that patch to compile, but while uploading the patch I commented those lines out for the HadoopQA compilation. Now I have uncommented them. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3924) Submitting an application to standby ResourceManager should respond better than Connection Refused
[ https://issues.apache.org/jira/browse/YARN-3924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694745#comment-14694745 ] Rohith Sharma K S commented on YARN-3924: - bq. A more informative error message might be enough here? Yes, the user wants to differentiate RM states like *standby RM* vs. *not-started RM / attempting to connect to invalid RM ha-ids*. So a better error message would help more. Submitting an application to standby ResourceManager should respond better than Connection Refused -- Key: YARN-3924 URL: https://issues.apache.org/jira/browse/YARN-3924 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Dustin Cote Assignee: Ajith S Priority: Minor When submitting an application directly to a standby resource manager, the resource manager responds with 'Connection Refused' rather than indicating that it is a standby resource manager. Because the resource manager is aware of its own state, I feel like we can have the 8032 port open for standby resource managers and reject the request with something like 'Cannot process application submission from this standby resource manager'. This would be especially helpful for debugging oozie problems when users put in the wrong address for the 'jobtracker' (i.e. they don't put the logical RM address but rather point to a specific resource manager). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3924) Submitting an application to standby ResourceManager should respond better than Connection Refused
[ https://issues.apache.org/jira/browse/YARN-3924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694770#comment-14694770 ] Rohith Sharma K S commented on YARN-3924: - bq. None of the RMs specified by ha-ids appear to be active. This error message would be more appropriate to me. Submitting an application to standby ResourceManager should respond better than Connection Refused -- Key: YARN-3924 URL: https://issues.apache.org/jira/browse/YARN-3924 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Dustin Cote Assignee: Ajith S Priority: Minor When submitting an application directly to a standby resource manager, the resource manager responds with 'Connection Refused' rather than indicating that it is a standby resource manager. Because the resource manager is aware of its own state, I feel like we can have the 8032 port open for standby resource managers and reject the request with something like 'Cannot process application submission from this standby resource manager'. This would be especially helpful for debugging oozie problems when users put in the wrong address for the 'jobtracker' (i.e. they don't put the logical RM address but rather point to a specific resource manager). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3689) FifoComparator logic is wrong. In the compare method in FifoPolicy.java, s1 and s2 should change position when comparing priority
[ https://issues.apache.org/jira/browse/YARN-3689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14694687#comment-14694687 ] Rohith Sharma K S commented on YARN-3689: - As per the application priority design, a *higher integer* indicates a *higher priority*, so the comparator implementation seems fine to me. And the test by [~ajithshetty] also proves the higher priority value, i.e. 2, is first in the list. FifoComparator logic is wrong. In the compare method in FifoPolicy.java, s1 and s2 should change position when comparing priority - Key: YARN-3689 URL: https://issues.apache.org/jira/browse/YARN-3689 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler, scheduler Affects Versions: 2.5.0 Reporter: zhoulinlin Assignee: Ajith S In the compare method in FifoPolicy.java, s1 and s2 should change position when comparing priority. I did a test: I configured the scheduler policy to fifo and submitted 2 jobs to the same queue. The result is below: 2015-05-20 11:57:41,449 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: before sort -- 2015-05-20 11:57:41,449 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: appName:application_1432094103221_0001 appPririty:4 appStartTime:1432094170038 2015-05-20 11:57:41,449 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: appName:application_1432094103221_0002 appPririty:2 appStartTime:1432094173131 2015-05-20 11:57:41,449 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: after sort % 2015-05-20 11:57:41,449 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: appName:application_1432094103221_0001 appPririty:4 appStartTime:1432094170038 2015-05-20 11:57:41,449 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: appName:application_1432094103221_0002 appPririty:2 appStartTime:1432094173131 But when changing the s1 and s2 
position like below: public int compare(Schedulable s1, Schedulable s2) { int res = s2.getPriority().compareTo(s1.getPriority()); ... } The result: 2015-05-20 11:36:37,119 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: before sort -- 2015-05-20 11:36:37,119 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: appName:application_1432090734333_0009 appPririty:4 appStartTime:1432092992503 2015-05-20 11:36:37,119 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: appName:application_1432090734333_0010 appPririty:2 appStartTime:1432092996437 2015-05-20 11:36:37,119 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: after sort % 2015-05-20 11:36:37,119 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: appName:application_1432090734333_0010 appPririty:2 appStartTime:1432092996437 2015-05-20 11:36:37,119 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue: appName:application_1432090734333_0009 appPririty:4 appStartTime:1432092992503 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
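As an illustration of the ordering question, here is a standalone sketch (stand-in types, not Hadoop's FifoPolicy or Schedulable) of a comparator that follows the stated design, higher integer means higher priority, with start time as the tie-breaker. With the values from the first log it keeps the priority-4 app first, matching the reported "after sort" order:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Stand-in sketch of FIFO ordering under the design "higher integer means
// higher priority": larger priority value sorts first, earlier start time
// breaks ties. Not the actual Hadoop FifoPolicy code.
public class FifoOrderingSketch {

    static final class App {
        final String name;
        final int priority;
        final long startTime;
        App(String name, int priority, long startTime) {
            this.name = name;
            this.priority = priority;
            this.startTime = startTime;
        }
    }

    // Descending by priority, then ascending by start time.
    static final Comparator<App> HIGHER_PRIORITY_FIRST =
        Comparator.<App>comparingInt(a -> -a.priority)
                  .thenComparingLong(a -> a.startTime);

    public static void main(String[] args) {
        List<App> apps = new ArrayList<>();
        apps.add(new App("application_1432094103221_0001", 4, 1432094170038L));
        apps.add(new App("application_1432094103221_0002", 2, 1432094173131L));
        apps.sort(HIGHER_PRIORITY_FIRST);
        System.out.println(apps.get(0).name); // the priority-4 app stays first
    }
}
```

Swapping s1 and s2 in such a comparator reverses the order, which is exactly the difference between the two log samples in the report.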
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700755#comment-14700755 ] Rohith Sharma K S commented on YARN-4014: - Updated the patch, fixing a race condition between updating the priority and SchedulerApplicationAttempt creation, which would pick up the old priority rather than the updated one. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch, 0003-YARN-4014.patch, 0004-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4014: Attachment: 0002-YARN-4017.patch Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4017.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699591#comment-14699591 ] Rohith Sharma K S commented on YARN-4014: - bq. we can make updateApplicationPriority throw an ApplicationNotRunningException and let client catch the exception and prints "Application not running" msg In {{ClientRMService#updateApplicationPriority}}, the priority update to the scheduler will also not be called if the application is in NEW or NEW_SAVING. So I feel a new ApplicationNotRunningException would lead to confusion. I think we can throw a YarnException with the message "Application in app-state state cannot update priority". Any thoughts? Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4014: Attachment: 0004-YARN-4014.patch Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch, 0003-YARN-4014.patch, 0004-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700753#comment-14700753 ] Rohith Sharma K S commented on YARN-4014: - bq. That means the updated priority is lost Discussed offline with Jian He; the updated priority won't be lost if the application is in ACCEPTED state. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch, 0003-YARN-4014.patch, 0004-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699930#comment-14699930 ] Rohith Sharma K S commented on YARN-4014: - I left SUBMITTED, ACCEPTED and RUNNING out of the above check because I think application priority should be updatable in these states. Should we allow the update only for RUNNING? I feel all of these states should be allowed to change priority. What do you think? Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
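The state check debated above can be sketched roughly as follows. This is a hypothetical simplification, not the actual ClientRMService code: {{RMAppState}} and the plain {{Exception}} here are stand-ins for the real YARN classes, and the set of updatable states follows the SUBMITTED/ACCEPTED/RUNNING proposal in this comment.

```java
// Hypothetical sketch of the state check discussed above -- not the actual
// ClientRMService code. RMAppState and the exception are simplified
// stand-ins for the real YARN classes.
import java.util.EnumSet;

public class PriorityUpdateCheck {
    enum RMAppState { NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED }

    // States in which a priority update is assumed meaningful; the comment
    // above argues for SUBMITTED, ACCEPTED and RUNNING.
    static final EnumSet<RMAppState> UPDATABLE =
        EnumSet.of(RMAppState.SUBMITTED, RMAppState.ACCEPTED, RMAppState.RUNNING);

    static void checkUpdatable(RMAppState state) throws Exception {
        if (!UPDATABLE.contains(state)) {
            // A generic exception with a descriptive message, rather than a
            // new ApplicationNotRunningException type.
            throw new Exception("Priority of an application in " + state
                + " state cannot be updated");
        }
    }
}
```

Rejecting NEW and NEW_SAVING up front matches the observation that the scheduler is never called for those states anyway.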
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702421#comment-14702421 ] Rohith Sharma K S commented on YARN-4014: - test failures are unrelated to this patch. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch, 0003-YARN-4014.patch, 0004-YARN-4014.patch, 0004-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14708470#comment-14708470 ] Rohith Sharma K S commented on YARN-3893: - I had a closer look at both of the solutions above. The potential issues with each are: # Moving createAndInitService to just before starting activeServices in transitionToActive. ## Switchover time will be impacted, since every transitionToActive would initialize the active services. ## RMWebApp depends on clientRMService for starting the webapps; without clientRMService initialization, RMWebApp cannot be started. # Moving refreshAll before transitionToActive in AdminService is the same as triggering RMAdminCli on the standby node. This call throws a StandbyException and is retried against the active RM in RMAdminCli. When it comes to AdminService#transitionedToActive(), refreshing before {{rm.transitionedToActive}} throws a StandbyException. Both RM in active state when Admin#transitionToActive failure from refeshAll() -- Key: YARN-3893 URL: https://issues.apache.org/jira/browse/YARN-3893 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.7.1 Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml Cases that can cause this. # Capacity scheduler xml is wrongly configured during switch # Refresh ACL failure due to configuration # Refresh User group failure due to configuration Continuously both RM will try to be active {code} dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm1 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable active dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm2 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable active {code} # Both Web UI active # Status shown as active for both RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14708471#comment-14708471 ] Rohith Sharma K S commented on YARN-3893: - I think for any configuration issue while transitioning to active, AdminService should not allow the JVM to continue. If AdminService throws an exception back to the elector, the elector tries to make the RM active again, which loops forever and fills the logs. There are two calls that can be points of failure: first {{rm.transitionedToActive}}, second {{refreshAll()}}. # If {{rm.transitionedToActive}} fails, the RM services are stopped and the RM ends up in STANDBY state. # If {{refreshAll()}} fails, BOTH RMs end up in ACTIVE state, as per this defect. Continuing the RM services with an invalid configuration is not a good idea; moreover, invalid configurations should be reported to the user immediately. So it would be better to use the fail-fast configuration to exit the RM JVM. If this configuration is set to false, then call {{rm.handleTransitionToStandBy}}. Both RM in active state when Admin#transitionToActive failure from refeshAll() -- Key: YARN-3893 URL: https://issues.apache.org/jira/browse/YARN-3893 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.7.1 Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml Cases that can cause this. # Capacity scheduler xml is wrongly configured during switch # Refresh ACL failure due to configuration # Refresh User group failure due to configuration Continuously both RM will try to be active {code} dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm1 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable active dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm2 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable active {code} # Both Web UI active # Status shown as active for both RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4044) Running applications information changes such as movequeue is not published to TimeLine server
[ https://issues.apache.org/jira/browse/YARN-4044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706224#comment-14706224 ] Rohith Sharma K S commented on YARN-4044: - Thanks [~sunilg] for the patch.. The patch mostly looks good to me.. Have you verified in the real cluster? Running applications information changes such as movequeue is not published to TimeLine server -- Key: YARN-4044 URL: https://issues.apache.org/jira/browse/YARN-4044 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.7.0 Reporter: Sunil G Assignee: Sunil G Priority: Critical Attachments: 0001-YARN-4044.patch SystemMetricsPublisher need to expose an appUpdated api to update any change for a running application. Events can be - change of queue for a running application. - change of application priority for a running application. This ticket intends to handle both RM and timeline side changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3896) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset
[ https://issues.apache.org/jira/browse/YARN-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706319#comment-14706319 ] Rohith Sharma K S commented on YARN-3896: - Thanks [~hex108] for the patch; overall the patch looks good to me. I verified the test without the source change, and it fails every time. nit: Can you add the public modifier to the interface API, i.e. {{void resetLastNodeHeartBeatResponse();}}? RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset --- Key: YARN-3896 URL: https://issues.apache.org/jira/browse/YARN-3896 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3896.01.patch, YARN-3896.02.patch, YARN-3896.03.patch, YARN-3896.04.patch, YARN-3896.05.patch, YARN-3896.06.patch {noformat} 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 10.208.132.153 to /default-rack 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Reconnect from the node at: 10.208.132.153 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node 10.208.132.153(cmPort: 8041 httpPort: 8080) registered with capability: memory:6144, vCores:60, diskCapacity:213, assigned nodeId 10.208.132.153:8041 2015-07-03 16:49:39,104 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Too far behind rm response id:2506413 nm response id:0 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node 10.208.132.153:8041 as it is now REBOOTED 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 10.208.132.153:8041 Node Transitioned from RUNNING to REBOOTED {noformat} The node(10.208.132.153) reconnected with RM. When it registered with RM, RM set its lastNodeHeartbeatResponse's id to 0 asynchronously. 
But the node's heartbeat come before RM succeeded setting the id to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3896) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset
[ https://issues.apache.org/jira/browse/YARN-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706464#comment-14706464 ] Rohith Sharma K S commented on YARN-3896: - Thanks for the clarification.. RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset --- Key: YARN-3896 URL: https://issues.apache.org/jira/browse/YARN-3896 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: YARN-3896.01.patch, YARN-3896.02.patch, YARN-3896.03.patch, YARN-3896.04.patch, YARN-3896.05.patch, YARN-3896.06.patch {noformat} 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 10.208.132.153 to /default-rack 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Reconnect from the node at: 10.208.132.153 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node 10.208.132.153(cmPort: 8041 httpPort: 8080) registered with capability: memory:6144, vCores:60, diskCapacity:213, assigned nodeId 10.208.132.153:8041 2015-07-03 16:49:39,104 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Too far behind rm response id:2506413 nm response id:0 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node 10.208.132.153:8041 as it is now REBOOTED 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 10.208.132.153:8041 Node Transitioned from RUNNING to REBOOTED {noformat} The node(10.208.132.153) reconnected with RM. When it registered with RM, RM set its lastNodeHeartbeatResponse's id to 0 asynchronously. But the node's heartbeat come before RM succeeded setting the id to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3896) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset
[ https://issues.apache.org/jira/browse/YARN-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3896: Attachment: 0001-YARN-3896.patch When applying the patch, two chunks in RMNodeImpl failed to apply, so I rebased the patch against trunk and am uploading it to check the Jenkins result. Once HadoopQA runs, I will commit it. RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset --- Key: YARN-3896 URL: https://issues.apache.org/jira/browse/YARN-3896 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: 0001-YARN-3896.patch, YARN-3896.01.patch, YARN-3896.02.patch, YARN-3896.03.patch, YARN-3896.04.patch, YARN-3896.05.patch, YARN-3896.06.patch {noformat} 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 10.208.132.153 to /default-rack 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Reconnect from the node at: 10.208.132.153 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node 10.208.132.153(cmPort: 8041 httpPort: 8080) registered with capability: memory:6144, vCores:60, diskCapacity:213, assigned nodeId 10.208.132.153:8041 2015-07-03 16:49:39,104 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Too far behind rm response id:2506413 nm response id:0 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node 10.208.132.153:8041 as it is now REBOOTED 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 10.208.132.153:8041 Node Transitioned from RUNNING to REBOOTED {noformat} The node(10.208.132.153) reconnected with RM. When it registered with RM, RM set its lastNodeHeartbeatResponse's id to 0 asynchronously. 
But the node's heartbeat come before RM succeeded setting the id to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
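The race in this issue can be sketched with a toy model. This is a simplification with assumed field and method names, not the real ResourceTrackerService/RMNodeImpl code: the fix discussed above amounts to resetting the stored response id synchronously during re-registration, so a heartbeat arriving immediately afterwards cannot be compared against the stale pre-reconnect id.

```java
// Toy model of the heartbeat response-id race (assumed names, not the real
// YARN classes). The fix: reset the id synchronously inside registerNode()
// instead of via an asynchronous RECONNECTED event handled later.
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatTracker {
    final AtomicInteger lastResponseId = new AtomicInteger();

    // Synchronous reset on (re)registration.
    void registerNode() {
        lastResponseId.set(0);
    }

    // Returns true when the heartbeat is accepted; a mismatch between the
    // RM-side and NM-side response ids would mark the node REBOOTED
    // ("Too far behind rm response id:... nm response id:0").
    boolean heartbeat(int nmResponseId) {
        if (nmResponseId != lastResponseId.get()) {
            return false;
        }
        lastResponseId.incrementAndGet();
        return true;
    }
}
```

With the asynchronous reset, the first post-reconnect heartbeat (id 0) could race ahead of the reset and be rejected against the old id; with the synchronous reset it is always accepted.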
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706269#comment-14706269 ] Rohith Sharma K S commented on YARN-4014: - bq. When 2nd or subsequent AM attempt is spawned, we are never setting the old attempt as null in SchedulerApplication, correct? Hence there is a chance that we set priority to old attempt while new attempt is getting created.. Right.. Since the latest priority is re-applied to the attempt after the attempt is updated in SchedulerApplication#setCurrentAttempt, I think there is NO possibility of currentAttempt having an old priority. So I believe currentAttempt NEED NOT be volatile. [~jianhe] Could you give your opinion on this? Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch, 0003-YARN-4014.patch, 0004-YARN-4014.patch, 0004-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
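The ordering argued in the comment above can be illustrated with a small sketch. This is a hypothetical simplification (the class, field, and method names are assumptions, not the real SchedulerApplication): because the latest application-level priority is re-applied inside the attempt setter, a newly created attempt can never retain a stale priority.

```java
// Hypothetical sketch (assumed names, not the real SchedulerApplication):
// the latest priority is re-applied to every new attempt inside
// setCurrentAttempt, so an attempt created after a priority update still
// picks up the updated value.
public class SchedulerApplicationSketch {
    static class Attempt { int priority; }

    private int applicationPriority;
    private Attempt currentAttempt;

    void updatePriority(int priority) {
        this.applicationPriority = priority;
        if (currentAttempt != null) {
            currentAttempt.priority = priority; // update the live attempt
        }
    }

    void setCurrentAttempt(Attempt attempt) {
        this.currentAttempt = attempt;
        // Re-apply the latest application priority to the new attempt.
        attempt.priority = applicationPriority;
    }

    Attempt getCurrentAttempt() { return currentAttempt; }
}
```

This shows only the happy-path ordering; whether the field additionally needs volatile for cross-thread visibility is exactly the question posed to [~jianhe] above.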
[jira] [Updated] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4014: Attachment: 0002-YARN-4014.patch Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch, 0002-YARN-4017.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4014: Attachment: (was: 0002-YARN-4017.patch) Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699628#comment-14699628 ] Rohith Sharma K S commented on YARN-4014: - Updating the modified patch, kindly review the patch. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4014: Attachment: 0004-YARN-4014.patch Updating the same with fixing java doc issues.. Kick off jenkins Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch, 0003-YARN-4014.patch, 0004-YARN-4014.patch, 0004-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701292#comment-14701292 ] Rohith Sharma K S commented on YARN-3250: - [~sunilg] [~jianhe] could you have a look at the patch please? I will rebase the patch based on the review comments. Support admin cli interface in for Application Priority --- Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Attachments: 0001-YARN-3250-V1.patch, 0002-YARN-3250.patch Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3986) getTransferredContainers in AbstractYarnScheduler should be present in YarnScheduler interface instead
[ https://issues.apache.org/jira/browse/YARN-3986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14701334#comment-14701334 ] Rohith Sharma K S commented on YARN-3986: - +1 for the latest patch.. getTransferredContainers in AbstractYarnScheduler should be present in YarnScheduler interface instead -- Key: YARN-3986 URL: https://issues.apache.org/jira/browse/YARN-3986 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-3986.01.patch, YARN-3986.02.patch, YARN-3986.03.patch Currently getTransferredContainers is present in {{AbstractYarnScheduler}}. *But in ApplicationMasterService, while registering AM, we are calling this method by typecasting it to AbstractYarnScheduler, which is incorrect.* This method should be moved to YarnScheduler. Because if a custom scheduler is to be added, it will implement YarnScheduler, not AbstractYarnScheduler. As ApplicationMasterService is calling getTransferredContainers by typecasting it to AbstractYarnScheduler, it is imposing an indirect dependency on AbstractYarnScheduler for any pluggable custom scheduler. We can move the method to YarnScheduler and leave the definition in AbstractYarnScheduler as it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
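The refactoring approved above can be sketched with simplified types. This is not the actual Hadoop code (the container type and method body are stand-ins): declaring {{getTransferredContainers}} on the scheduler interface lets the ApplicationMasterService-style caller avoid the downcast to the abstract base class, so a custom scheduler that implements only the interface still works.

```java
// Sketch of moving getTransferredContainers onto the interface (simplified
// types, not the real Hadoop classes). The caller needs no cast, and a
// custom scheduler need not extend AbstractYarnScheduler.
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SchedulerInterfaceSketch {
    interface YarnScheduler {
        List<String> getTransferredContainers(String appAttemptId);
    }

    // The shared default implementation stays in the abstract base class.
    static abstract class AbstractYarnScheduler implements YarnScheduler {
        public List<String> getTransferredContainers(String appAttemptId) {
            return Collections.emptyList();
        }
    }

    // A pluggable scheduler implementing only the interface.
    static class CustomScheduler implements YarnScheduler {
        public List<String> getTransferredContainers(String appAttemptId) {
            return Arrays.asList("container_1");
        }
    }

    // ApplicationMasterService-style caller: programs against the
    // interface, with no downcast to AbstractYarnScheduler.
    static int transferredCount(YarnScheduler scheduler, String appAttemptId) {
        return scheduler.getTransferredContainers(appAttemptId).size();
    }
}
```

With the old downcast, passing {{CustomScheduler}} here would have thrown a ClassCastException; against the interface both schedulers work.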
[jira] [Updated] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-4014: Attachment: 0003-YARN-4014.patch Updating the patch to check only for the ACCEPTED and RUNNING application states before updating the priority of an application. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch, 0003-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1461#comment-1461 ] Rohith Sharma K S commented on YARN-4014: - If the application is in SUBMITTED state, the priority update should not be called because the application has not yet been added to the scheduler. In ACCEPTED state, the priority update can be called. One doubt Jian He has is that if the application is in ACCEPTED state, the application attempt has not yet been created. I rechecked the code flow; we can update in ACCEPTED state even though the attempt is not created. IIRC, while doing YARN-3887 we discussed this specific scenario and handled adding a *null* entry to SchedulableEntity. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Attachments: 0001-YARN-4014-V1.patch, 0001-YARN-4014.patch, 0002-YARN-4014.patch Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14698908#comment-14698908 ] Rohith Sharma K S commented on YARN-3893: - Sorry for coming in very late; this issue has become stale and we need to move it forward! Regarding the patch: # Instead of setting a boolean flag for reinitActiveServices in AdminService and the other changes, moving {{createAndInitActiveServices();}} from transitionedToStandby to just before starting activeServices would solve such issues. On an exception while transitioning to active, add a stopActiveServices method and handle it in ResourceManager#transitioningToActive() only. # With the above approach, we can probably remove refreshAll() from AdminService#transitionToActive. Any thoughts? Both RM in active state when Admin#transitionToActive failure from refeshAll() -- Key: YARN-3893 URL: https://issues.apache.org/jira/browse/YARN-3893 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bibin A Chundatt Assignee: Bibin A Chundatt Priority: Critical Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, 0003-YARN-3893.patch, 0004-YARN-3893.patch, yarn-site.xml Cases that can cause this. # Capacity scheduler xml is wrongly configured during switch # Refresh ACL failure due to configuration # Refresh User group failure due to configuration Continuously both RM will try to be active {code} dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm1 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable active dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm2 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable active {code} # Both Web UI active # Status shown as active for both RM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3896) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset synchronously
[ https://issues.apache.org/jira/browse/YARN-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3896: Labels: resourcemanager (was: ) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset synchronously - Key: YARN-3896 URL: https://issues.apache.org/jira/browse/YARN-3896 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Labels: resourcemanager Fix For: 2.8.0 Attachments: 0001-YARN-3896.patch, YARN-3896.01.patch, YARN-3896.02.patch, YARN-3896.03.patch, YARN-3896.04.patch, YARN-3896.05.patch, YARN-3896.06.patch, YARN-3896.07.patch {noformat} 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 10.208.132.153 to /default-rack 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Reconnect from the node at: 10.208.132.153 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node 10.208.132.153(cmPort: 8041 httpPort: 8080) registered with capability: memory:6144, vCores:60, diskCapacity:213, assigned nodeId 10.208.132.153:8041 2015-07-03 16:49:39,104 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Too far behind rm response id:2506413 nm response id:0 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node 10.208.132.153:8041 as it is now REBOOTED 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 10.208.132.153:8041 Node Transitioned from RUNNING to REBOOTED {noformat} The node(10.208.132.153) reconnected with RM. When it registered with RM, RM set its lastNodeHeartbeatResponse's id to 0 asynchronously. But the node's heartbeat come before RM succeeded setting the id to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3896) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset synchronously
[ https://issues.apache.org/jira/browse/YARN-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3896: Summary: RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset synchronously (was: RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset synchronously - Key: YARN-3896 URL: https://issues.apache.org/jira/browse/YARN-3896 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Fix For: 2.8.0 Attachments: 0001-YARN-3896.patch, YARN-3896.01.patch, YARN-3896.02.patch, YARN-3896.03.patch, YARN-3896.04.patch, YARN-3896.05.patch, YARN-3896.06.patch, YARN-3896.07.patch {noformat} 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 10.208.132.153 to /default-rack 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Reconnect from the node at: 10.208.132.153 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node 10.208.132.153(cmPort: 8041 httpPort: 8080) registered with capability: memory:6144, vCores:60, diskCapacity:213, assigned nodeId 10.208.132.153:8041 2015-07-03 16:49:39,104 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Too far behind rm response id:2506413 nm response id:0 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node 10.208.132.153:8041 as it is now REBOOTED 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 10.208.132.153:8041 Node Transitioned from RUNNING to REBOOTED {noformat} The node(10.208.132.153) reconnected with RM. When it registered with RM, RM set its lastNodeHeartbeatResponse's id to 0 asynchronously. 
But the node's heartbeat come before RM succeeded setting the id to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3896) RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset
[ https://issues.apache.org/jira/browse/YARN-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14708810#comment-14708810 ] Rohith Sharma K S commented on YARN-3896: - Test failures are unrelated to the patch.. committing shortly.. RMNode transitioned from RUNNING to REBOOTED because its response id had not been reset --- Key: YARN-3896 URL: https://issues.apache.org/jira/browse/YARN-3896 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Jun Gong Assignee: Jun Gong Attachments: 0001-YARN-3896.patch, YARN-3896.01.patch, YARN-3896.02.patch, YARN-3896.03.patch, YARN-3896.04.patch, YARN-3896.05.patch, YARN-3896.06.patch, YARN-3896.07.patch {noformat} 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.util.RackResolver: Resolved 10.208.132.153 to /default-rack 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Reconnect from the node at: 10.208.132.153 2015-07-03 16:49:39,075 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: NodeManager from node 10.208.132.153(cmPort: 8041 httpPort: 8080) registered with capability: memory:6144, vCores:60, diskCapacity:213, assigned nodeId 10.208.132.153:8041 2015-07-03 16:49:39,104 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Too far behind rm response id:2506413 nm response id:0 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node 10.208.132.153:8041 as it is now REBOOTED 2015-07-03 16:49:39,137 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: 10.208.132.153:8041 Node Transitioned from RUNNING to REBOOTED {noformat} The node(10.208.132.153) reconnected with RM. When it registered with RM, RM set its lastNodeHeartbeatResponse's id to 0 asynchronously. But the node's heartbeat come before RM succeeded setting the id to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-842) Resource Manager Node Manager UI's doesn't work with IE
[ https://issues.apache.org/jira/browse/YARN-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14709098#comment-14709098 ] Rohith Sharma K S commented on YARN-842: I verified with IE9 and later and was able to view the applications. Is anyone in the community still facing this issue? If not, can it be closed? Resource Manager Node Manager UI's doesn't work with IE - Key: YARN-842 URL: https://issues.apache.org/jira/browse/YARN-842 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 2.0.4-alpha Reporter: Devaraj K {code:xml} Webpage error details User Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0) Timestamp: Mon, 17 Jun 2013 12:06:03 UTC Message: 'JSON' is undefined Line: 41 Char: 218 Code: 0 URI: http://10.18.40.24:8088/cluster/apps {code} RM NM UI's are not working with IE and showing the above error for every link on the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3919: Attachment: 0003-YARN-3919.patch The current patch does not apply on my machine, so I am regenerating the same patch from my machine and uploading it for HadoopQA to kick off before commit.. NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. Although, this doesn't cause any functional issue but its a nuisance and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
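The fix direction discussed in this issue is a defensive null check in serviceStop, since stop() may run against a service whose start() failed partway. A minimal sketch of that pattern (hypothetical class and field names, not the actual FileSystemNodeLabelsStore code):

```java
import java.io.Closeable;
import java.io.IOException;

// Illustration of the null-guard fix direction: resources opened in start()
// may still be null in stop() when start() failed early (e.g. ConnectException).
class SketchNodeLabelsStore {
    private Closeable editlogOs; // would be opened in start(); null if start() failed

    void start() throws IOException {
        // simulate the ConnectException thrown before editlogOs is assigned
        throw new IOException("simulated ConnectException during serviceStart");
    }

    void stop() {
        if (editlogOs != null) { // the missing null check that caused the NPE
            try {
                editlogOs.close();
            } catch (IOException ignored) {
                // best-effort close during shutdown
            }
        }
    }
}

public class NullGuardDemo {
    public static void main(String[] args) {
        SketchNodeLabelsStore store = new SketchNodeLabelsStore();
        try {
            store.start();
        } catch (IOException e) {
            // serviceStart failed; the stop path must still be safe
        }
        store.stop(); // no NullPointerException thanks to the guard
        System.out.println("stopped cleanly");
    }
}
```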
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646448#comment-14646448 ] Rohith Sharma K S commented on YARN-3919: - No... git apply --whitespace=fix patch-file NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. Although, this doesn't cause any functional issue but its a nuisance and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644008#comment-14644008 ] Rohith Sharma K S commented on YARN-3887: - Hi [~sunilg], for REST support no proto changes are needed, but for the admin/user CLI proto changes have to be done. So I mean it can be done in a separate JIRA. Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3948) Display Application Priority in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644150#comment-14644150 ] Rohith Sharma K S commented on YARN-3948: - +1, LGTM. [~sunilg], would you have a look at the findbugs failures? Display Application Priority in RM Web UI - Key: YARN-3948 URL: https://issues.apache.org/jira/browse/YARN-3948 Project: Hadoop YARN Issue Type: Sub-task Components: webapp Affects Versions: 2.7.1 Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3948.patch, 0002-YARN-3948.patch, 0003-YARN-3948.patch, ApplicationPage.png, ClusterPage.png Application Priority can be displayed in RM Web UI Application page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14644169#comment-14644169 ] Rohith Sharma K S commented on YARN-3919: - +1 for trivial change, lgtm.. will commit it.. NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Attachments: YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. Although, this doesn't cause any functional issue but its a nuisance and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646510#comment-14646510 ] Rohith Sharma K S commented on YARN-3979: - Oops, 50 lakh events! I checked the attached logs, but since you have attached only the ERROR logs I was not able to trace it. One observation is that there are many InvalidStateTransition events for CLEAN_UP in RMNodeImpl. # Would you please provide the RM logs? If you are not able to attach them to the JIRA, could you send them to me through mail? # Would you give more info, such as: what is the cluster size? How many apps are running? How many were completed? What is the state of the NodeManagers, i.e. are they running or in some other state? Which version of Hadoop are you using? Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558 _104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_0 1 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3919) NPEs' while stopping service after exception during CommonNodeLabelsManager#start
[ https://issues.apache.org/jira/browse/YARN-3919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3919: Priority: Trivial (was: Major) NPEs' while stopping service after exception during CommonNodeLabelsManager#start - Key: YARN-3919 URL: https://issues.apache.org/jira/browse/YARN-3919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Varun Saxena Assignee: Varun Saxena Priority: Trivial Fix For: 2.8.0 Attachments: 0003-YARN-3919.patch, YARN-3919.01.patch, YARN-3919.02.patch We get NPE during CommonNodeLabelsManager#serviceStop and AsyncDispatcher#serviceStop if ConnectException on call to CommonNodeLabelsManager#serviceStart occurs. {noformat} 2015-07-10 19:39:37,825 WARN main-EventThread org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.close(FileSystemNodeLabelsStore.java:99) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStop(CommonNodeLabelsManager.java:278) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:588) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:998) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1039) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1035) {noformat} {noformat} java.lang.NullPointerException at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:142) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157) at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) {noformat} These NPEs' fill up the logs. Although, this doesn't cause any functional issue but its a nuisance and we ideally should have null checks in serviceStop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646490#comment-14646490 ] Rohith Sharma K S commented on YARN-3250: - Adding to User API discussion, the ApplicationCLI command can be {{./yarn application appId --set-priority ApplicationId --priority value}} Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646598#comment-14646598 ] Rohith Sharma K S commented on YARN-3543: - I got what you mean!! Right.. I think modifying the other files like *ApplicationStartData* is related to the applicationhistoryservice. Is that so? ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646478#comment-14646478 ] Rohith Sharma K S commented on YARN-3250: - Hi [~sunilg] As part of this JIRA, # User API : ## I am planning to introduce {{ApplicationClientProtocol#setPriority(SetApplicationPriorityRequest)}}. *SetApplicationPriorityRequest* comprises of an ApplicationId and a Priority. ClientRMService invokes the API introduced by YARN-3887, i.e. updateApplicationPriority(). ## Is getPriority required on the user side? I feel that, since ApplicationReport can give the priority of an application, this API is NOT required. What do you suggest, any thoughts? # Admin API : ## As admin, one should be able to change the *cluster-max-application-priority* value. Having an rmadmin API would be great!! But one issue with that API is that cluster-max-application-priority is in-memory; when rmadmin updates it, the in-memory value can be updated, but in HA/restart cases the configuration value is taken. So I suggest storing cluster-max-application-priority in the state store and, whenever RM is switched/restarted, giving higher preference to the store. What do you think about this approach? Apart from the above APIs, should any new APIs be added? Kindly share your thoughts. Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
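A hypothetical sketch of the proposed user-side API shape (names taken from the comment above with the spelling normalized to SetApplicationPriorityRequest; this is an illustration of the proposal, not the committed Hadoop interface):

```java
// Hypothetical shape of the proposed request object and protocol method.
final class SetApplicationPriorityRequest {
    private final String applicationId;
    private final int priority;

    SetApplicationPriorityRequest(String applicationId, int priority) {
        this.applicationId = applicationId;
        this.priority = priority;
    }

    String getApplicationId() { return applicationId; }
    int getPriority() { return priority; }
}

interface ApplicationClientProtocolSketch {
    // ClientRMService would delegate to the scheduler-side
    // updateApplicationPriority() introduced by YARN-3887.
    void setPriority(SetApplicationPriorityRequest request);
}

public class PriorityApiDemo {
    public static void main(String[] args) {
        SetApplicationPriorityRequest req =
            new SetApplicationPriorityRequest("application_1437735375558_0001", 10);
        System.out.println(req.getApplicationId() + " -> " + req.getPriority());
    }
}
```

As the comment notes, no getPriority counterpart is planned, since ApplicationReport already exposes the priority.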
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646568#comment-14646568 ] Rohith Sharma K S commented on YARN-3543: - Thanks [~xgong] for the review.. bq. But we still made some un-necessary changes. Sorry, I could not get what the unnecessary changes are. Could you please explain? ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646495#comment-14646495 ] Rohith Sharma K S commented on YARN-3250: - A small correction to the above syntax. The correct syntax is {{./yarn application --set-priority ApplicationId --priority value}} Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646622#comment-14646622 ] Rohith Sharma K S commented on YARN-3543: - I have one doubt about whether it is able to render on the timeline web UI. I remember that I made these changes for the timeline web UI fetching the data. Anyway, I will verify it tomorrow and confirm whether it is required. ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647507#comment-14647507 ] Rohith Sharma K S commented on YARN-3250: - bq. I think one problem is that if there's ever a value set in state-store, RM cannot pick up the value using the config any more I see, I agree. Configuration files would become stale after one restart/switch. How about having a command that re-reads specific configurations from yarn-site.xml, much like {{./yarn rmadmin refreshAdminAcls}}? That command re-reads *yarn.admin.acl* from the yarn-site.xml configuration when refreshAdminAcls is invoked. Along the same lines, setting cluster-max-application-priority would be {{./yarn rmadmin refreshClusterMaxPriority}} or {{./yarn rmadmin refreshClusterPriority}}. Thoughts? bq. How about yarn application ApplicationId -setPriority priority ? Makes sense. Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647177#comment-14647177 ] Rohith Sharma K S commented on YARN-3979: - Thanks for the information!! bq. NodeManager in one times all lost and recovery for a monment I can think of a scenario very close to YARN-3990. Since you have 2 lakh completed apps and 1600 NodeManagers, when all the nodes are lost and reconnect, the number of events generated is {{(2 lakh completed + 550 running = 200550) * 1600 (number of NodeManagers) = 320880000}} events.. Ooops!!! Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao Attachments: ERROR103.log 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558 _104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_0 1 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
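The back-of-envelope event count in the comment works out as follows: 200,550 app reports per reconnecting node times 1,600 nodes.

```java
public class EventCountEstimate {
    public static void main(String[] args) {
        long appsPerNode = 200_000 + 550; // 2 lakh completed + 550 running apps
        long nodes = 1_600;               // NodeManagers reconnecting at once
        long events = appsPerNode * nodes;
        System.out.println(events);       // 320880000
    }
}
```

That is roughly 3.2 x 10^8 events flooding the dispatcher in one burst.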
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653002#comment-14653002 ] Rohith Sharma K S commented on YARN-3887: - One comment # TreeSet will throw a NullPointerException while adding/removing a null object. Suppose the SchedulerApplicationAttempt is not created; then {{application.getCurrentAppAttempt()}} will be null, which would throw an NPE. I think this has to be handled in {{AbstractComparatorOrderingPolicy#removeSchedulableEntity}} and {{AbstractComparatorOrderingPolicy#addSchedulableEntity}} Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch, 0003-YARN-3887.patch, 0004-YARN-3887.patch, 0005-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
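The TreeSet behavior referenced here is standard java.util behavior: when the set's comparator dereferences its arguments (as a scheduler's priority comparator would dereference a SchedulableEntity), both add(null) and remove(null) throw NullPointerException. A minimal demonstration with a String comparator standing in for the entity comparator:

```java
import java.util.Comparator;
import java.util.TreeSet;

public class TreeSetNullDemo {
    public static void main(String[] args) {
        // Comparator that dereferences its arguments, like an ordering
        // policy comparing entities by priority would.
        TreeSet<String> set = new TreeSet<>(Comparator.comparing(String::length));
        set.add("app1");
        try {
            set.add(null);    // comparator dereferences null -> NPE
        } catch (NullPointerException e) {
            System.out.println("add(null) threw NPE");
        }
        try {
            set.remove(null); // lookup also runs the comparator -> NPE
        } catch (NullPointerException e) {
            System.out.println("remove(null) threw NPE");
        }
    }
}
```

Hence the suggestion to null-check the schedulable entity before calling into the ordering policy's backing TreeSet.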
[jira] [Commented] (YARN-4014) Support user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653078#comment-14653078 ] Rohith Sharma K S commented on YARN-4014: - The basic API discussions were done in YARN-3250, [comment1|https://issues.apache.org/jira/browse/YARN-3250?focusedCommentId=14646478page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14646478]. Just reiterating the discussion summary here # User API : ## For changing the priority of an application, the API {{ApplicationClientProtocol#setApplicationPriority(SetApplicationPriorityRequest)}} will be added. *SetApplicationPriorityRequest comprises of an ApplicationId and a Priority*. ClientRMService invokes the API introduced by YARN-3887, i.e. updateApplicationPriority(). ## For getting the priority of an application, NO API will be added. Retrieving the priority of any application can be done using ApplicationReport after YARN-3948 is committed. Support user cli interface in for Application Priority -- Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3948) Display Application Priority in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653024#comment-14653024 ] Rohith Sharma K S commented on YARN-3948: - Hi [~sunilg], would you rebase the patch since YARN-3543 has been committed? Display Application Priority in RM Web UI - Key: YARN-3948 URL: https://issues.apache.org/jira/browse/YARN-3948 Project: Hadoop YARN Issue Type: Sub-task Components: webapp Affects Versions: 2.7.1 Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3948.patch, 0002-YARN-3948.patch, 0003-YARN-3948.patch, ApplicationPage.png, ClusterPage.png Application Priority can be displayed in RM Web UI Application page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653033#comment-14653033 ] Rohith Sharma K S commented on YARN-3250: - How about passing an option for specifying the applicationId, i.e. {{./yarn application --appId ApplicationId --setPriority value}}? Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653023#comment-14653023 ] Rohith Sharma K S commented on YARN-3543: - Thanks [~xgong] for review and commit. I really appreciate your detailed review :-) ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Sub-task Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Fix For: 2.8.0 Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, 0007-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4014) Support user cli interface in for Application Priority
Rohith Sharma K S created YARN-4014: --- Summary: Support user cli interface in for Application Priority Key: YARN-4014 URL: https://issues.apache.org/jira/browse/YARN-4014 Project: Hadoop YARN Issue Type: Sub-task Components: client, resourcemanager Reporter: Rohith Sharma K S Assignee: Rohith Sharma K S Track the changes for user-RM client protocol i.e ApplicationClientProtocol changes and discussions in this jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3250) Support admin cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3250: Summary: Support admin cli interface in for Application Priority (was: Support admin/user cli interface in for Application Priority) Support admin cli interface in for Application Priority --- Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3250) Support admin/user cli interface in for Application Priority
[ https://issues.apache.org/jira/browse/YARN-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14653072#comment-14653072 ] Rohith Sharma K S commented on YARN-3250: - Moving the user CLI (ApplicationClientProtocol) changes to a separate jira, YARN-4014, so the discussions and reviews stay focused. Support admin/user cli interface in for Application Priority Key: YARN-3250 URL: https://issues.apache.org/jira/browse/YARN-3250 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Sunil G Assignee: Rohith Sharma K S Current Application Priority Manager supports only configuration via file. To support runtime configurations for admin cli and REST, a common management interface has to be added which can be shared with NodeLabelsManager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648817#comment-14648817 ] Rohith Sharma K S commented on YARN-3543: - Thanks [~xgong] for identifying that the ApplicationHistoryServer modifications are not required at all. Updated the patch by removing the ApplicationHistoryServer modifications; this patch contains only the TimelineServer modifications. I verified the patch in a cluster to check that the Timeline web UI renders *unmanagedApplication*, and also verified obtaining the ApplicationReport via the REST APIs. [~xgong], would you have a look at the updated patch please? ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, 0007-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3543) ApplicationReport should be able to tell whether the Application is AM managed or not.
[ https://issues.apache.org/jira/browse/YARN-3543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3543: Attachment: 0007-YARN-3543.patch ApplicationReport should be able to tell whether the Application is AM managed or not. --- Key: YARN-3543 URL: https://issues.apache.org/jira/browse/YARN-3543 Project: Hadoop YARN Issue Type: Improvement Components: api Affects Versions: 2.6.0 Reporter: Spandan Dutta Assignee: Rohith Sharma K S Attachments: 0001-YARN-3543.patch, 0001-YARN-3543.patch, 0002-YARN-3543.patch, 0002-YARN-3543.patch, 0003-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0004-YARN-3543.patch, 0005-YARN-3543.patch, 0006-YARN-3543.patch, 0007-YARN-3543.patch, YARN-3543-AH.PNG, YARN-3543-RM.PNG Currently we can know whether the application submitted by the user is AM managed from the applicationSubmissionContext. This can be only done at the time when the user submits the job. We should have access to this info from the ApplicationReport as well so that we can check whether an app is AM managed or not anytime. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected
Rohith Sharma K S created YARN-3990: --- Summary: AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Priority: Critical Whenever a node is added or removed, NodeListManager sends an RMAppNodeUpdateEvent to all the applications in the rmContext. But for finished/killed/failed applications it is not required to send these events. An additional check for whether the app is finished/killed/failed would minimize the unnecessary events:
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
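The additional state check proposed in the YARN-3990 description above could look roughly like the following. This is a hypothetical sketch: the enum and helper stand in for RMAppState and the real event loop, only to illustrate filtering terminal applications before dispatching node-update events.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;

public class NodeUpdateFilter {
  // Stand-in for RMAppState; FINISHED/FAILED/KILLED are the terminal states
  // for which RMAppNodeUpdateEvent serves no purpose.
  enum AppState { NEW, RUNNING, FINISHED, FAILED, KILLED }

  static final EnumSet<AppState> TERMINAL =
      EnumSet.of(AppState.FINISHED, AppState.FAILED, AppState.KILLED);

  // Return only the application ids that still need a node-update event.
  static List<String> appsToNotify(Map<String, AppState> apps) {
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, AppState> e : apps.entrySet()) {
      if (!TERMINAL.contains(e.getValue())) {  // skip finished/killed/failed
        out.add(e.getKey());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, AppState> apps = new java.util.LinkedHashMap<>();
    apps.put("app_1", AppState.RUNNING);
    apps.put("app_2", AppState.FINISHED);
    apps.put("app_3", AppState.KILLED);
    System.out.println(appsToNotify(apps));  // only app_1 gets an event
  }
}
```

With many completed apps retained in the rmContext, skipping terminal apps cuts the per-node-transition event count from (running + completed) down to just the running apps.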
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645425#comment-14645425 ] Rohith Sharma K S commented on YARN-3887: - thanks [~sunilg] for updating the patch. Some comments # The invocation {{rmContext.getStateStore().updateApplicationState(appState);}} is asynchronous, so I feel there is still a corner case where the priority has been set in the scheduler but not yet updated in the RMStateStore. Any RM switch/restart would then end up with the old priority set. I think this particular invocation should be synchronous like the other APIs, e.g. {{storeRMDelegationToken}}, {{storeRMDTMasterKey}}. Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
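The synchronous behavior requested above can be sketched with plain java.util.concurrent: the caller blocks on a future that completes only after the (simulated) store write, so nothing proceeds while the store still holds the old priority. This is an illustration of the idea, not the RMStateStore API; the class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SyncStoreUpdate {
  // Single store-dispatcher thread, like the event-driven state store.
  private final ExecutorService dispatcher =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "store-dispatcher");
        t.setDaemon(true);  // don't keep the JVM alive after main exits
        return t;
      });

  // Asynchronous path: fire and forget, like updateApplicationState today.
  CompletableFuture<Void> updateAsync(String appState) {
    return CompletableFuture.runAsync(() -> {
      // pretend this persists appState to the state store
    }, dispatcher);
  }

  // Synchronous variant: block until the store thread has applied the update,
  // so a subsequent RM switch/restart cannot observe the old priority.
  void updateSync(String appState) throws Exception {
    updateAsync(appState).get();  // also rethrows store failures to the caller
  }

  public static void main(String[] args) throws Exception {
    SyncStoreUpdate store = new SyncStoreUpdate();
    store.updateSync("priority=5");
    System.out.println("update durable before returning");
  }
}
```

The trade-off is the usual one: the caller now waits on store latency, which is exactly why YARN keeps most state-store writes asynchronous and reserves blocking variants for updates that must be durable before acknowledging.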
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645404#comment-14645404 ] Rohith Sharma K S commented on YARN-3979: - [~piaoyu zhang] In the description you have given NM logs, but in the previous comment you have given the stack trace of the RM. It would be easier to analyze if you can provide more info like RM logs, NM logs and AM logs if started. An NM stack trace would help most, since the NM side is holding for 10 minutes. Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558_104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-3990: Summary: AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected (was: AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected ) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Whenever a node is added or removed, NodeListManager sends an RMAppNodeUpdateEvent to all the applications in the rmContext. But for finished/killed/failed applications it is not required to send these events. An additional check for whether the app is finished/killed/failed would minimize the unnecessary events:
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3979) Am in ResourceLocalizationService hang 10 min cause RM kill AM
[ https://issues.apache.org/jira/browse/YARN-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645535#comment-14645535 ] Rohith Sharma K S commented on YARN-3979: - How many applications have completed? How many applications are running? How many NMs are running? When does this event queue become full? Any observations you made? Am in ResourceLocalizationService hang 10 min cause RM kill AM --- Key: YARN-3979 URL: https://issues.apache.org/jira/browse/YARN-3979 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Environment: CentOS 6.5 Hadoop-2.2.0 Reporter: zhangyubiao 2015-07-27 02:46:17,348 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1437735375558_104282_01_01 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1437735375558_104282_01 (auth:SIMPLE) 2015-07-27 02:56:18,510 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for appattempt_1437735375558_104282_01 (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.api.ContainerManagementProtocolPB -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645566#comment-14645566 ] Rohith Sharma K S commented on YARN-3887: - Your understanding is correct. I meant to have a new synchronous API like {{updateApplicationStateSynchronously}} in RMStateStore. [~jianhe], what do you think about having a new synchronous API in RMStateStore? Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3990) AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected
[ https://issues.apache.org/jira/browse/YARN-3990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14645583#comment-14645583 ] Rohith Sharma K S commented on YARN-3990: - thanks [~bibinchundatt] for reproducing the issue. I believe in your cluster appsCompleted/appsRunning are 2, and the max number of completed apps to keep is set to 20k? AsyncDispatcher may overloaded with RMAppNodeUpdateEvent when Node is connected/disconnected Key: YARN-3990 URL: https://issues.apache.org/jira/browse/YARN-3990 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Rohith Sharma K S Assignee: Bibin A Chundatt Priority: Critical Whenever a node is added or removed, NodeListManager sends an RMAppNodeUpdateEvent to all the applications in the rmContext. But for finished/killed/failed applications it is not required to send these events. An additional check for whether the app is finished/killed/failed would minimize the unnecessary events:
{code}
public void handle(NodesListManagerEvent event) {
  RMNode eventNode = event.getNode();
  switch (event.getType()) {
  case NODE_UNUSABLE:
    LOG.debug(eventNode + " reported unusable");
    unusableRMNodesConcurrentSet.add(eventNode);
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_UNUSABLE));
    }
    break;
  case NODE_USABLE:
    if (unusableRMNodesConcurrentSet.contains(eventNode)) {
      LOG.debug(eventNode + " reported usable");
      unusableRMNodesConcurrentSet.remove(eventNode);
    }
    for (RMApp app : rmContext.getRMApps().values()) {
      this.rmContext
          .getDispatcher()
          .getEventHandler()
          .handle(
              new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
                  RMAppNodeUpdateType.NODE_USABLE));
    }
    break;
  default:
    LOG.error("Ignoring invalid eventtype " + event.getType());
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643931#comment-14643931 ] Rohith Sharma K S commented on YARN-3887: - Hi Jian He, bq. Do you plan to do client side changes as part of this jira ? YARN-3250 is planning to do the changes for the admin and user CLI, i.e. ApplicationClientProtocol. This jira is intended only for the scheduler side changes supporting the APIs. YARN-3250 will use these exposed APIs and implement on top of them. Current plan: Admin and User both have privileges to change the priority of applications. More APIs for Admin and User are to be discussed in YARN-3250. Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-4015) Is there any way to dynamically change container size after allocation.
[ https://issues.apache.org/jira/browse/YARN-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S resolved YARN-4015. - Resolution: Invalid Hi [~dhruv007], for any queries please post to the Hadoop user mailing list u...@hadoop.apache.org; JIRA is for tracking development issues. Is there any way to dynamically change container size after allocation. --- Key: YARN-4015 URL: https://issues.apache.org/jira/browse/YARN-4015 Project: Hadoop YARN Issue Type: Wish Reporter: dhruv Priority: Minor Hadoop YARN assumes that the container size won't be changed after allocation. It is possible that a job does not fully use the resources allocated, or requires more resources for a container. So is there any way for the container size to change at run time after allocation of the container, meaning elasticity for both memory and CPU? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3992) TestApplicationPriority.testApplicationPriorityAllocation fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14654769#comment-14654769 ] Rohith Sharma K S commented on YARN-3992: - The patch looks good overall. nit: Can you add a new API with an additional parameter for host instead of changing the existing {{allocateAndWaitForContainers}} arguments? TestApplicationPriority.testApplicationPriorityAllocation fails intermittently -- Key: YARN-3992 URL: https://issues.apache.org/jira/browse/YARN-3992 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Sunil G Attachments: 0001-YARN-3992.patch, 0002-YARN-3992.patch
{code}
java.lang.AssertionError: expected:<7> but was:<5>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testApplicationPriorityAllocation(TestApplicationPriority.java:182)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3887) Support for changing Application priority during runtime
[ https://issues.apache.org/jira/browse/YARN-3887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648969#comment-14648969 ] Rohith Sharma K S commented on YARN-3887: - [~sunilg] thanks for updating the patch. One comment # The below code should not be synchronized. If it is synchronized, there is a very high chance of deadlock. The locking order should always be {{stateMachine -> RMStateStore}}, but the below code locks in the order {{RMStateStore -> stateMachine -> RMStateStore}}, which can cause a deadlock. For more discussion refer to YARN-2946.
{code}
+  public synchronized void updateApplicationStateSynchronously(
+      ApplicationStateData appState) {
+    handleStoreEvent(new RMStateUpdateAppEvent(appState));
+  }
{code}
Support for changing Application priority during runtime Key: YARN-3887 URL: https://issues.apache.org/jira/browse/YARN-3887 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Sunil G Assignee: Sunil G Attachments: 0001-YARN-3887.patch, 0002-YARN-3887.patch, 0003-YARN-3887.patch, 0004-YARN-3887.patch After YARN-2003, adding support to change priority of an application after submission. This ticket will handle the server side implementation for same. A new RMAppEvent will be created to handle this, and will be common for all schedulers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
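The lock-ordering hazard called out above can be illustrated with two plain monitors: as long as every thread acquires them in the same {{stateMachine -> RMStateStore}} order, deadlock is impossible; a synchronized store method that calls back into the state machine inverts that order and can deadlock. A minimal hypothetical sketch, not RM code:

```java
public class LockOrdering {
  static final Object STATE_MACHINE = new Object();  // stand-in for the app state machine lock
  static final Object STATE_STORE = new Object();    // stand-in for the RMStateStore lock

  // Safe: every caller takes stateMachine before stateStore, matching the
  // "stateMachine -> RMStateStore" order the comment requires.
  static void updatePriority() {
    synchronized (STATE_MACHINE) {
      synchronized (STATE_STORE) {
        // persist the new priority
      }
    }
  }

  // Unsafe shape (not run here): a synchronized store method that calls back
  // into the state machine takes "RMStateStore -> stateMachine", the reverse
  // order, and can deadlock against updatePriority() under concurrency.

  public static void main(String[] args) throws InterruptedException {
    Thread t1 = new Thread(LockOrdering::updatePriority);
    Thread t2 = new Thread(LockOrdering::updatePriority);
    t1.start(); t2.start();
    t1.join(); t2.join();   // always terminates: both threads use one order
    System.out.println("consistent lock order: no deadlock");
  }
}
```

This is why the suggestion is to make the store method not synchronized: dispatching the store event without holding the store's monitor keeps the acquisition order one-directional.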
[jira] [Commented] (YARN-3996) YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305
[ https://issues.apache.org/jira/browse/YARN-3996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648944#comment-14648944 ] Rohith Sharma K S commented on YARN-3996: - Correct me if I am wrong: SchedulerUtils.normalizeRequests() is called in allocate() in both CS and FS, where the resourceRequest is normalized (reset) to minimumAllocation. So it should not matter for the AM container resource request, where normalization is done in RMAppManager; instead of normalizing at the scheduler, the normalization is done at RMAppManager. Is this having an impact? YARN-789 (Support for zero capabilities in fairscheduler) is broken after YARN-3305 --- Key: YARN-3996 URL: https://issues.apache.org/jira/browse/YARN-3996 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Critical RMAppManager#validateAndCreateResourceRequest calls into normalizeRequest with mininumResource for the incrementResource. This causes normalize to return zero if minimum is set to zero as per YARN-789 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
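For reference, the normalization under discussion is essentially round-up-to-increment plus clamping to the minimum/maximum allocation. The sketch below shows the arithmetic (it is not the actual SchedulerUtils/ResourceCalculator code) plus an illustrative guard for the zero increment that arises when the minimum resource is passed as the increment, the situation YARN-3996 describes.

```java
public class Normalize {
  // Round memory up to a multiple of increment, then clamp to [min, max].
  // The zero-increment fallback is an illustrative defensive choice, not
  // what Hadoop does; with increment == 0 the round-up is undefined.
  static int normalizeMemory(int requested, int min, int increment, int max) {
    if (increment <= 0) {
      return Math.min(Math.max(requested, min), max);
    }
    int stepped =
        ((Math.max(requested, min) + increment - 1) / increment) * increment;
    return Math.min(stepped, max);
  }

  public static void main(String[] args) {
    // With min=1024 and increment=1024, a 1500 MB ask rounds up to 2048 MB.
    System.out.println(normalizeMemory(1500, 1024, 1024, 8192));  // 2048
    // An oversized ask is clamped to the maximum allocation.
    System.out.println(normalizeMemory(10240, 1024, 1024, 8192)); // 8192
  }
}
```

The bug shape in the comment: when min is 0 and min is also used as the increment, a naive normalize collapses every request, which is why where the normalization happens (RMAppManager vs. scheduler) and what increment it uses both matter.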
[jira] [Commented] (YARN-3992) TestApplicationPriority.testApplicationPriorityAllocation fails intermittently
[ https://issues.apache.org/jira/browse/YARN-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649001#comment-14649001 ] Rohith Sharma K S commented on YARN-3992: - Thanks [~sunilg] for providing the patch!! One comment # Instead of rewriting the below code twice, can you use the method {{MockAM#allocateAndWaitForContainers}} so many lines of code can be avoided?
{code}
+    int NUM_CONTAINERS = 7;
+    // allocate NUM_CONTAINERS containers
+    am1.allocate("127.0.0.1", 2 * GB, NUM_CONTAINERS,
+        new ArrayList<ContainerId>());
     nm1.nodeHeartbeat(true);
-    while (alloc1Response.getAllocatedContainers().size() < 1) {
-      LOG.info("Waiting for containers to be created for app 1...");
-      Thread.sleep(100);
-      alloc1Response = am1.schedule();
+
+    // wait for containers to be allocated.
+    List<Container> allocated1 = am1.allocate(new ArrayList<ResourceRequest>(),
+        new ArrayList<ContainerId>()).getAllocatedContainers();
+    while (allocated1.size() != NUM_CONTAINERS) {
+      nm1.nodeHeartbeat(true);
+      allocated1.addAll(am1.allocate(new ArrayList<ResourceRequest>(),
+          new ArrayList<ContainerId>()).getAllocatedContainers());
+      Thread.sleep(200);
     }
{code}
TestApplicationPriority.testApplicationPriorityAllocation fails intermittently -- Key: YARN-3992 URL: https://issues.apache.org/jira/browse/YARN-3992 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen Assignee: Sunil G Attachments: 0001-YARN-3992.patch
{code}
java.lang.AssertionError: expected:<7> but was:<5>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:555)
	at org.junit.Assert.assertEquals(Assert.java:542)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestApplicationPriority.testApplicationPriorityAllocation(TestApplicationPriority.java:182)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)