[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-02-07 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1490:
--

Fix Version/s: (was: 2.3.0)
   2.4.0

Reverted from branch-2.3. Setting fix-version as 2.4.

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Fix For: 2.4.0

 Attachments: YARN-1490.1.patch, YARN-1490.10.patch, 
 YARN-1490.11.patch, YARN-1490.11.patch, YARN-1490.12.patch, 
 YARN-1490.2.patch, YARN-1490.3.patch, YARN-1490.4.patch, YARN-1490.5.patch, 
 YARN-1490.6.patch, YARN-1490.7.patch, YARN-1490.8.patch, YARN-1490.9.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-02-07 Thread Robert Kanter (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-1490:


Attachment: org.apache.oozie.service.TestRecoveryService_thread-dump.txt

As reported in the Re-swizzle 2.3 email thread on the mailing lists, when 
running the Oozie unit tests we saw some weird behavior after YARN-1490:
Basically, we use a single MiniMRCluster and MiniDFSCluster across all unit 
tests in a module.  With YARN-1490 we saw that, regardless of test order, the 
last few tests would timeout waiting for an MR job to finish; on slower 
machines, the entire test suite would timeout.  Through some digging, I found 
that we were getting a ton of “Connection refused” Exceptions on LeaseRenewer 
talking to the NN and a few on the AM talking to the RM.
So it sounds like there's something that happens over time...

I've attached a thread dump 
(org.apache.oozie.service.TestRecoveryService_thread-dump.txt) taken during the 
test where we saw the timeout; though its possible that the issue manifests 
itself earlier but isn't noticeable until then.

And here is one of the exceptions that we see in the in the MiniMRCluster's 
syslog for the container used during that test; it repeats many many times:
{noformat}
2014-02-07 14:42:22,998 WARN [LeaseRenewer:test@localhost:56186] 
org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for 
[DFSClient_NONMAPREDUCE_-1380838220_1] for 2419 seconds.  Will retry shortly ...
java.net.ConnectException: Call From rkanter-mbp.local/172.16.1.64 to 
localhost:56186 failed on connection exception: java.net.ConnectException: 
Connection refused; For more details see:  
http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.GeneratedConstructorAccessor17.newInstance(Unknown 
Source)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1359)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy9.renewLease(Unknown Source)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy9.renewLease(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:519)
at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:773)
at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:601)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:696)
at org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1458)
at org.apache.hadoop.ipc.Client.call(Client.java:1377)
... 16 more
{noformat}

I'm going to continue looking into why YARN-1490 is causing this behavior, but 
I thought I'd post this info here in case anyone has any ideas.

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Fix For: 2.4.0

 Attachments: 

[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-10 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.12.patch

Fixed the test failure

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.10.patch, 
 YARN-1490.11.patch, YARN-1490.11.patch, YARN-1490.12.patch, 
 YARN-1490.2.patch, YARN-1490.3.patch, YARN-1490.4.patch, YARN-1490.5.patch, 
 YARN-1490.6.patch, YARN-1490.7.patch, YARN-1490.8.patch, YARN-1490.9.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-09 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.6.patch

New patch added the test for testing reserved container to be killed by the 
previous attempt.

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch, 
 YARN-1490.4.patch, YARN-1490.5.patch, YARN-1490.6.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-09 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.7.patch

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch, 
 YARN-1490.4.patch, YARN-1490.5.patch, YARN-1490.6.patch, YARN-1490.7.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-09 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.9.patch

Fixed one test name in TestRMAppAttemptTransitions

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch, 
 YARN-1490.4.patch, YARN-1490.5.patch, YARN-1490.6.patch, YARN-1490.7.patch, 
 YARN-1490.8.patch, YARN-1490.9.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-09 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.10.patch

fix the find bug

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.10.patch, 
 YARN-1490.2.patch, YARN-1490.3.patch, YARN-1490.4.patch, YARN-1490.5.patch, 
 YARN-1490.6.patch, YARN-1490.7.patch, YARN-1490.8.patch, YARN-1490.9.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-09 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.11.patch

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.10.patch, 
 YARN-1490.11.patch, YARN-1490.2.patch, YARN-1490.3.patch, YARN-1490.4.patch, 
 YARN-1490.5.patch, YARN-1490.6.patch, YARN-1490.7.patch, YARN-1490.8.patch, 
 YARN-1490.9.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-09 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.11.patch

submit the same patch to kick jenkins

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.10.patch, 
 YARN-1490.11.patch, YARN-1490.11.patch, YARN-1490.2.patch, YARN-1490.3.patch, 
 YARN-1490.4.patch, YARN-1490.5.patch, YARN-1490.6.patch, YARN-1490.7.patch, 
 YARN-1490.8.patch, YARN-1490.9.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-08 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.4.patch

Thanks Vinod for such detailed comments !



 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch, 
 YARN-1490.4.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-08 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.5.patch

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch, 
 YARN-1490.4.patch, YARN-1490.5.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-03 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.2.patch

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.2.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-03 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.3.patch

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Jian He
 Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (YARN-1490) RM should optionally not kill all containers when an ApplicationMaster exits

2014-01-02 Thread Jian He (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian He updated YARN-1490:
--

Attachment: YARN-1490.1.patch

 RM should optionally not kill all containers when an ApplicationMaster exits
 

 Key: YARN-1490
 URL: https://issues.apache.org/jira/browse/YARN-1490
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
 Attachments: YARN-1490.1.patch


 This is needed to enable work-preserving AM restart. Some apps can chose to 
 reconnect with old running containers, some may not want to. This should be 
 an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)