[jira] [Created] (YARN-3735) Retain JRE Fatal error logs upon container failure
Srikanth Sundarrajan created YARN-3735:
--

Summary: Retain JRE Fatal error logs upon container failure
Key: YARN-3735
URL: https://issues.apache.org/jira/browse/YARN-3735
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Reporter: Srikanth Sundarrajan

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3735) Retain JRE Fatal error logs upon container failure
[ https://issues.apache.org/jira/browse/YARN-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562734#comment-14562734 ]

Srikanth Sundarrajan commented on YARN-3735:

When the JRE fails with a fatal error during any of the container launches, an error report file is created in the container's working directory, and the node manager removes it during container cleanup. This makes such failures challenging to debug. It would be useful to collect this file and append it to the container's stderr when such errors are encountered.
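As a rough illustration of what the proposal above could look like, here is a minimal, hypothetical sketch (this is not actual NodeManager code; the class and method names are invented for illustration): before container cleanup deletes the working directory, scan it for JVM fatal-error reports (`hs_err_pid<pid>.log` by default) and append their contents to the container's stderr log so they survive log aggregation.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not NodeManager code: preserve JVM fatal-error
// reports from a container work dir by appending them to the stderr log.
public class FatalErrorLogRetainer {

    public static void retain(Path workDir, Path stderrLog) throws IOException {
        File[] errFiles = workDir.toFile().listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File dir, String name) {
                // hs_err_pid<pid>.log is the JVM's default error-report name
                return name.startsWith("hs_err_pid") && name.endsWith(".log");
            }
        });
        if (errFiles == null) {
            return; // workDir missing or not a directory: nothing to retain
        }
        for (File errFile : errFiles) {
            byte[] contents = Files.readAllBytes(errFile.toPath());
            // append so an existing stderr log is kept intact
            Files.write(stderrLog, contents,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```

In the real NodeManager this would have to run in the container cleanup path, before the local directories are deleted and before log aggregation uploads the container logs.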
[jira] [Commented] (YARN-3735) Retain JRE Fatal error logs upon container failure
[ https://issues.apache.org/jira/browse/YARN-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562803#comment-14562803 ]

Srikanth Sundarrajan commented on YARN-3735:

There have been cases in our cluster where, for instance, the MRAppMaster fails sporadically with SIGBUS errors.

{noformat}
sudo -u UUU yarn logs -applicationId application_1432020518439_802161
Unable to get ApplicationState. Attempting to fetch logs directly from the filesystem.

Container: container_1432020518439_802161_02_01 on host.grid.com_45454
===
LogType:stderr
Log Upload Time:26-May-2015 16:42:53
LogLength:0
Log Contents:

LogType:stdout
Log Upload Time:26-May-2015 16:42:53
LogLength:954
Log Contents:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0x7) at pc=0x7fce44882aad, pid=8391, tid=140523938055936
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libzip.so+0x5aad] readCEN+0x79d
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again
#
# An error report file with more information is saved as:
# /data/d1/yarn/local/usercache/UUU/appcache/application_1432020518439_802161/container_1432020518439_802161_02_01/hs_err_pid8391.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Container: container_1432020518439_802161_01_01 on host.grid.com_45454
===
LogType:stderr
Log Upload Time:26-May-2015 16:42:53
LogLength:0
Log Contents:

LogType:stdout
Log Upload Time:26-May-2015 16:42:53
LogLength:954
Log Contents:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0x7) at pc=0x7f2360144aad, pid=8077, tid=139789960816384
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libzip.so+0x5aad] readCEN+0x79d
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again
#
# An error report file with more information is saved as:
# /data/d1/yarn/local/usercache/UUU/appcache/application_1432020518439_802161/container_1432020518439_802161_01_01/hs_err_pid8077.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
{noformat}
[jira] [Commented] (YARN-3735) Retain JRE Fatal error logs upon container failure
[ https://issues.apache.org/jira/browse/YARN-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562814#comment-14562814 ]

Srikanth Sundarrajan commented on YARN-3735:

Sorry about the multiple posts; I had issues with JIRA.
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547506#comment-14547506 ]

Srikanth Sundarrajan commented on YARN-3644:

[~vinodkv], YARN-3644 is independent of this. In our setup we ran into this before we ran into YARN-3646. With default settings, the NM keeps trying for roughly 30-odd minutes before *attempting* to shut itself down. Would there be an issue if this wait time were much (even infinitely) longer, for both HA and non-HA setups? An orthogonal issue is that when the NM attempts to shut itself down, it doesn't actually go down; it lingers for days without accepting any containers unless restarted (will file another issue for this).

Node manager shuts down if unable to connect with RM
--
Key: YARN-3644
URL: https://issues.apache.org/jira/browse/YARN-3644
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Srikanth Sundarrajan

When the NM is unable to connect to the RM, the NM shuts itself down.

{code}
} catch (ConnectException e) {
  //catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
{code}

In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring them back up. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy).
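The alternative the comment argues for can be sketched as follows. This is a hypothetical illustration, not NodeManager code (the `RetryForeverSketch` class and `Connector` interface are invented names): on a ConnectException the NM would keep retrying the RM with capped exponential backoff indefinitely, instead of dispatching a SHUTDOWN event once the configured maximum wait is exhausted.

```java
import java.net.ConnectException;

// Hypothetical sketch of "retry forever" NM-to-RM connection behavior,
// in place of shutting the NM down after the configured max wait.
public class RetryForeverSketch {

    interface Connector {
        void connect() throws ConnectException;
    }

    /** Retries until connect() succeeds; returns the number of attempts. */
    static int connectWithRetries(Connector connector,
                                  long initialDelayMs, long maxDelayMs)
            throws InterruptedException {
        long delay = initialDelayMs;
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                connector.connect();
                return attempts;          // connected: resume registration
            } catch (ConnectException e) {
                Thread.sleep(delay);      // RM still down: wait and retry
                delay = Math.min(delay * 2, maxDelayMs); // capped backoff
            }
        }
    }
}
```

With a capped backoff, an NM waiting out a long RM maintenance window costs only a periodic connection attempt, rather than a cluster-wide restart of every NM afterwards.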
[jira] [Commented] (YARN-3646) Applications are getting stuck sometimes in case of retry policy forever
[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545548#comment-14545548 ]

Srikanth Sundarrajan commented on YARN-3646:

{quote}
You can probably avoid this situation by setting a bigger value
{quote}

Would this not cause the client to wait for too long (well after the RM has come back online)?

Applications are getting stuck sometimes in case of retry policy forever
--
Key: YARN-3646
URL: https://issues.apache.org/jira/browse/YARN-3646
Project: Hadoop YARN
Issue Type: Bug
Components: client
Reporter: Raju Bairishetti

We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use the FOREVER retry policy. The YARN client retries infinitely on exceptions from the RM, since it uses the FOREVER retry policy. The problem is that it retries for all kinds of exceptions (like ApplicationNotFoundException), even when they are not connection failures. Due to this, my application does not progress further. *The YARN client should not retry infinitely for non-connection failures.*

We have written a simple yarn-client which tries to get an application report for an invalid or older appId. The ResourceManager throws an ApplicationNotFoundException, as this is an invalid or older appId. But because of the FOREVER retry policy, the client keeps retrying to get the application report, and the ResourceManager keeps throwing ApplicationNotFoundException.
{code}
private void testYarnClientRetryPolicy() throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();
    ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645);
    ApplicationReport report = yarnClient.getApplicationReport(appId);
}
{code}

*RM logs:*

{noformat}
15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM.
	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0
{noformat}
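The fix the report argues for amounts to an exception filter in the retry policy. A minimal, hypothetical sketch (the `RetryDecision` class is an invented name, not RMProxy's actual API): a FOREVER policy should keep retrying only genuine transport-level failures, while an application-level error such as ApplicationNotFoundException is a real answer from the RM and must propagate to the caller immediately.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;

// Hypothetical sketch of a retry filter: retry only connection failures,
// fail fast on everything else (e.g. ApplicationNotFoundException).
public class RetryDecision {

    static boolean shouldRetry(Exception e) {
        // transport-level failures: the RM may simply be down, keep trying
        return e instanceof ConnectException
            || e instanceof NoRouteToHostException;
    }
}
```

A real implementation would plug such a predicate into the RetryPolicy that RMProxy builds, so that `yarn.resourcemanager.connect.wait-ms = -1` only affects how long connection failures are retried.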
[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
[ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544971#comment-14544971 ]

Srikanth Sundarrajan commented on YARN-3644:

{quote}
Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non connection failures are being retried infinitely by all YarnClients (via RMProxy).
{quote}

See YARN-3646
[jira] [Created] (YARN-3644) Node manager shuts down if unable to connect with RM
Srikanth Sundarrajan created YARN-3644:
--

Summary: Node manager shuts down if unable to connect with RM
Key: YARN-3644
URL: https://issues.apache.org/jira/browse/YARN-3644
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Reporter: Srikanth Sundarrajan

When the NM is unable to connect to the RM, the NM shuts itself down.

{code}
} catch (ConnectException e) {
  //catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
{code}

In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut themselves down, requiring additional work to bring them back up. Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where non-connection failures are retried infinitely by all YarnClients (via RMProxy).