[jira] [Created] (YARN-3735) Retain JRE Fatal error logs upon container failure

2015-05-28 Thread Srikanth Sundarrajan (JIRA)
Srikanth Sundarrajan created YARN-3735:
--

 Summary: Retain JRE Fatal error logs upon container failure
 Key: YARN-3735
 URL: https://issues.apache.org/jira/browse/YARN-3735
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Srikanth Sundarrajan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3735) Retain JRE Fatal error logs upon container failure

2015-05-28 Thread Srikanth Sundarrajan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562734#comment-14562734
 ] 

Srikanth Sundarrajan commented on YARN-3735:


When the JRE fails with a fatal error during a container launch, an error 
report file is created in the container's working directory, and the node 
manager removes it during container cleanup. This makes such failures hard to 
debug. It would be useful to collect this file and append it to the container's 
stderr when such errors are encountered.
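
A rough sketch of what that could look like (plain java.nio, outside any 
NodeManager internals; the class and method names here are hypothetical): 
before cleanup deletes the working directory, append any JRE error report 
(hs_err_pid*.log) to the container's stderr log so it survives into log 
aggregation.

{code}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch only: append any hs_err_pid*.log from the container work dir to the
// container's stderr log before cleanup removes the work dir. Class and
// method names are illustrative, not NodeManager internals.
public class FatalErrorLogRetainer {
  public static void appendFatalErrorReports(Path workDir, Path stderrLog)
      throws IOException {
    try (DirectoryStream<Path> reports =
             Files.newDirectoryStream(workDir, "hs_err_pid*.log")) {
      for (Path report : reports) {
        // Appending (rather than moving) leaves work-dir cleanup untouched
        // while the report survives in the aggregated container logs.
        Files.write(stderrLog, Files.readAllBytes(report),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
      }
    }
  }
}
{code}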

 Retain JRE Fatal error logs upon container failure
 --

 Key: YARN-3735
 URL: https://issues.apache.org/jira/browse/YARN-3735
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Srikanth Sundarrajan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3735) Retain JRE Fatal error logs upon container failure

2015-05-28 Thread Srikanth Sundarrajan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562803#comment-14562803
 ] 

Srikanth Sundarrajan commented on YARN-3735:


For instance, there have been sporadic cases in our cluster where the 
MRAppMaster fails with SIGBUS errors.

{noformat}
sudo -u UUU yarn logs -applicationId application_1432020518439_802161
Unable to get ApplicationState. Attempting to fetch logs directly from the 
filesystem.


Container: container_1432020518439_802161_02_01 on host.grid.com_45454
===
LogType:stderr
Log Upload Time:26-May-2015 16:42:53
LogLength:0
Log Contents:

LogType:stdout
Log Upload Time:26-May-2015 16:42:53
LogLength:954
Log Contents:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7fce44882aad, pid=8391, tid=140523938055936
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0x5aad]  readCEN+0x79d
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /data/d1/yarn/local/usercache/UUU/appcache/application_1432020518439_802161/container_1432020518439_802161_02_01/hs_err_pid8391.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Container: container_1432020518439_802161_01_01 on host.grid.com_45454
===
LogType:stderr
Log Upload Time:26-May-2015 16:42:53
LogLength:0
Log Contents:

LogType:stdout
Log Upload Time:26-May-2015 16:42:53
LogLength:954
Log Contents:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7f2360144aad, pid=8077, tid=139789960816384
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libzip.so+0x5aad]  readCEN+0x79d
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /data/d1/yarn/local/usercache/UUU/appcache/application_1432020518439_802161/container_1432020518439_802161_01_01/hs_err_pid8077.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
{noformat}


 Retain JRE Fatal error logs upon container failure
 --

 Key: YARN-3735
 URL: https://issues.apache.org/jira/browse/YARN-3735
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Srikanth Sundarrajan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3735) Retain JRE Fatal error logs upon container failure

2015-05-28 Thread Srikanth Sundarrajan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562814#comment-14562814
 ] 

Srikanth Sundarrajan commented on YARN-3735:


Sorry about the multiple posts; I had issues with JIRA.

 Retain JRE Fatal error logs upon container failure
 --

 Key: YARN-3735
 URL: https://issues.apache.org/jira/browse/YARN-3735
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Srikanth Sundarrajan





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread Srikanth Sundarrajan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547506#comment-14547506
 ] 

Srikanth Sundarrajan commented on YARN-3644:


[~vinodkv], YARN-3644 is independent of this. In our setup we ran into this 
before we ran into YARN-3646. By default, the NM gives up after retrying for 
about 30-odd minutes before *attempting* to shut itself down. Is there an 
issue if this wait time is much (even infinitely) longer (for both HA & 
non-HA setups)? An orthogonal issue is that when the NM attempts to shut 
itself down, it doesn't actually go down; it lingers for days without 
accepting any containers unless restarted (will file another issue for this).
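
For reference, a minimal sketch of the connect-retry knobs under discussion, 
using the same YarnConfiguration constants as the reproduction code quoted 
later in this digest; the values are illustrative, and whether -1 is safe for 
the NM is exactly the open question above.

{code}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative values only: -1 makes connect attempts wait forever; whether
// that is safe for the NM (HA and non-HA) is the open question above.
public class ConnectRetrySettings {
  public static YarnConfiguration tunedConf() {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);
    conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS,
        30 * 1000L);
    return conf;
  }
}
{code}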

 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan

 When the NM is unable to connect to the RM, it shuts itself down.
 {code}
 } catch (ConnectException e) {
   // catch and throw the exception if tried MAX wait time to connect RM
   dispatcher.getEventHandler().handle(
       new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
   throw new YarnRuntimeException(e);
 {code}
 In large clusters, if the RM is down for maintenance for a longer period, all 
 the NMs shut themselves down, requiring additional work to bring them back up.
 Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
 effects: non-connection failures are retried infinitely by all 
 YarnClients (via RMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3646) Applications are getting stuck some times in case of retry policy forever

2015-05-15 Thread Srikanth Sundarrajan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545548#comment-14545548
 ] 

Srikanth Sundarrajan commented on YARN-3646:


{quote}
You can probably avoid this situation by setting a bigger value
{quote}

Would this not cause the client to wait for too long (well after the RM has 
come back online)?

 Applications are getting stuck some times in case of retry policy forever
 -

 Key: YARN-3646
 URL: https://issues.apache.org/jira/browse/YARN-3646
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Raju Bairishetti

 We have set *yarn.resourcemanager.connect.wait-ms* to -1 to use the FOREVER 
 retry policy.
 The YARN client retries infinitely on exceptions from the RM because it uses 
 the FOREVER retry policy. The problem is that it retries on all kinds of 
 exceptions (like ApplicationNotFoundException), even when the exception is 
 not a connection failure. Because of this, my application makes no progress.
 *The YARN client should not retry infinitely on non-connection failures.*
 We have written a simple YARN client that requests an application report for 
 an invalid or older appId. The ResourceManager throws an 
 ApplicationNotFoundException since the appId is invalid or old, but because 
 of the FOREVER retry policy the client keeps retrying and the 
 ResourceManager keeps throwing ApplicationNotFoundException continuously.
 {code}
 private void testYarnClientRetryPolicy() throws Exception {
   YarnConfiguration conf = new YarnConfiguration();
   conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);
   YarnClient yarnClient = YarnClient.createYarnClient();
   yarnClient.init(conf);
   yarnClient.start();
   ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645);
   ApplicationReport report = yarnClient.getApplicationReport(appId);
 }
 {code}
 *RM logs:*
 {noformat}
 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0
 org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM.
   at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
   at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
   at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0
 
 {noformat}
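
A sketch of one way to get the behaviour the description asks for, assuming 
Hadoop's RetryPolicies.retryByException utility (this is not the actual 
RMProxy policy construction): retry forever only on connection failures, and 
fail fast on everything else, such as ApplicationNotFoundException.

{code}
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

// Sketch, not the actual RMProxy code: retry forever on ConnectException,
// fail immediately on anything else (e.g. ApplicationNotFoundException).
public class ConnectionOnlyRetry {
  public static RetryPolicy create() {
    Map<Class<? extends Exception>, RetryPolicy> policies = new HashMap<>();
    // RETRY_FOREVER retries without sleeping; a fixed-sleep variant would be
    // kinder to the RM, but this keeps the sketch minimal.
    policies.put(ConnectException.class, RetryPolicies.RETRY_FOREVER);
    return RetryPolicies.retryByException(
        RetryPolicies.TRY_ONCE_THEN_FAIL, policies);
  }
}
{code}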



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-14 Thread Srikanth Sundarrajan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544971#comment-14544971
 ] 

Srikanth Sundarrajan commented on YARN-3644:


{quote}
Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects: 
non-connection failures are retried infinitely by all YarnClients 
(via RMProxy).
{quote}
See YARN-3646

 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan

 When the NM is unable to connect to the RM, it shuts itself down.
 {code}
 } catch (ConnectException e) {
   // catch and throw the exception if tried MAX wait time to connect RM
   dispatcher.getEventHandler().handle(
       new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
   throw new YarnRuntimeException(e);
 {code}
 In large clusters, if the RM is down for maintenance for a longer period, all 
 the NMs shut themselves down, requiring additional work to bring them back up.
 Setting yarn.resourcemanager.connect.wait-ms to -1 has other side 
 effects: non-connection failures are retried infinitely by all 
 YarnClients (via RMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-13 Thread Srikanth Sundarrajan (JIRA)
Srikanth Sundarrajan created YARN-3644:
--

 Summary: Node manager shuts down if unable to connect with RM
 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan


When the NM is unable to connect to the RM, it shuts itself down.

{code}
} catch (ConnectException e) {
  // catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
{code}

In large clusters, if the RM is down for maintenance for a longer period, all 
the NMs shut themselves down, requiring additional work to bring them back up.

Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects: 
non-connection failures are retried infinitely by all YarnClients 
(via RMProxy).
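
A minimal sketch of the alternative behaviour argued for here, with a 
hypothetical registration callback standing in for the real NodeStatusUpdater 
logic: keep retrying while the RM is unreachable instead of dispatching 
SHUTDOWN.

{code}
import java.net.ConnectException;

// Sketch only: keep retrying registration while the RM is down for
// maintenance, rather than shutting the NM down after the max wait.
public class RetryingRegistration {
  interface Registration {
    void registerWithRM() throws ConnectException; // hypothetical stand-in
  }

  public static void registerUntilRMReturns(Registration r, long retryIntervalMs)
      throws InterruptedException {
    while (true) {
      try {
        r.registerWithRM();
        return; // registered successfully
      } catch (ConnectException e) {
        // RM unreachable (e.g. long maintenance window): wait and retry
        // instead of dispatching NodeManagerEventType.SHUTDOWN.
        Thread.sleep(retryIntervalMs);
      }
    }
  }
}
{code}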



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)