[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-13 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625088#comment-14625088
 ] 

Varun Vasudev commented on YARN-3644:
-

One other minor comment - can you please change 
yarn.nodemanager.shutdown.on.RM.connection.failures to 
yarn.nodemanager.shutdown-on-rm-connection-failures? 

 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan
Assignee: Raju Bairishetti
 Attachments: YARN-3644.001.patch, YARN-3644.001.patch, 
 YARN-3644.002.patch, YARN-3644.003.patch, YARN-3644.patch


 When NM is unable to connect to RM, NM shuts itself down.
 {code}
   } catch (ConnectException e) {
     // catch and throw the exception if tried MAX wait time to connect RM
     dispatcher.getEventHandler().handle(
         new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
     throw new YarnRuntimeException(e);
 {code}
 In large clusters, if the RM is down for maintenance for a long period, all 
 the NMs shut themselves down, requiring additional work to bring them back up.
 Setting yarn.resourcemanager.connect.max-wait.ms to -1 has other side 
 effects: non-connection failures are then retried indefinitely by all 
 YarnClients (via RMProxy).
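
 The behaviour requested in this issue can be modelled roughly as follows. This 
 is a minimal sketch, not the patch itself: {{RmClient}}, {{HeartbeatLoop}} and 
 the {{shutdownOnConnectFailure}} flag are hypothetical names, and no Hadoop 
 classes are used. With the flag off, a connection failure is recorded and the 
 loop simply tries again on the next heartbeat instead of shutting down.

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the RM-facing client; not a Hadoop type. */
interface RmClient {
  void heartbeat() throws ConnectException;
}

/** Toy heartbeat loop with a switch for shutdown-on-connection-failure. */
class HeartbeatLoop {
  private final RmClient rm;
  private final boolean shutdownOnConnectFailure; // hypothetical flag
  final List<String> events = new ArrayList<>();

  HeartbeatLoop(RmClient rm, boolean shutdownOnConnectFailure) {
    this.rm = rm;
    this.shutdownOnConnectFailure = shutdownOnConnectFailure;
  }

  /** Runs at most maxBeats heartbeats; returns false if the loop shut down. */
  boolean run(int maxBeats) {
    for (int i = 0; i < maxBeats; i++) {
      try {
        rm.heartbeat();
        events.add("OK");
      } catch (ConnectException e) {
        if (shutdownOnConnectFailure) {
          events.add("SHUTDOWN");
          return false;
        }
        // Stay alive; a real loop would sleep for the heartbeat interval here.
        events.add("RETRY");
      }
    }
    return true;
  }
}
```

 In this model an RM outage only delays heartbeats; it does not require the 
 node to be restarted afterwards.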



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-13 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624909#comment-14624909
 ] 

Varun Vasudev commented on YARN-3644:
-

[~raju.bairishetti] the latest patch also conflicts with a recent commit in 
NodeStatusUpdaterImpl. Can you please check it out? Thanks!



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-11 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623462#comment-14623462
 ] 

Raju Bairishetti commented on YARN-3644:


Thanks [~varun_saxena] for the review and comments.

bq. The config name is yarn.nodemanager.shutdown.on.RM.connection.failures. All 
our config names are in lowercase, just for the sake of consistency, maybe RM 
can be in lowercase too. Thoughts?
  Agreed. I will change it to lowercase.

bq. The test doesn't really check whether ConnectException was thrown or 
whether the NM shutdown event was raised.
   I also ran the test in debugger mode, and it hits all the source changes. 
*I agree; I will rewrite this test using Mockito to make it more generic.*



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-11 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623616#comment-14623616
 ] 

Varun Saxena commented on YARN-3644:


[~raju.bairishetti],
bq. I also ran the test in debugger mode, and it hits all the source changes
I did not mean that the test is not hitting the change. It does, and I 
verified that from the logs as well.
What I meant is that some Mockito#verify statements, or other assertions, can 
be added to check whether the required functions or flows are being hit, 
because the assertion that the service state is STARTED can hold irrespective 
of whether your code is hit or not.
Let us say somebody changes the code in future so that your part of the code 
becomes conditional. Unlikely, but you never know what happens 6 months down 
the line.
If you have verification statements checking whether your code has been 
called, your test case will fail after such a change if those parts of the 
code are no longer called. This forces the developer to revisit your test case 
as well.
If you only check that the service is STARTED, your test may still pass after 
any future change, whether or not the relevant flow is hit, and this may mask 
mistakes made in this or related parts of the main code. So the test case 
should verify that the flow is being hit, either by checking function 
invocations or by having a set of assertions that are specific to this test.
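
The difference between a state assertion and an invocation check can be 
sketched without any Hadoop code. In the example below, {{ToyService}}, 
{{FailureHandler}} and {{RecordingHandler}} are hypothetical names, and the 
hand-rolled recording double only stands in for what Mockito#verify would do 
in the real test.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical collaborator whose invocation the test wants to verify. */
interface FailureHandler {
  void onConnectFailure();
}

/** Hand-rolled recording double; plays the role of a Mockito spy here. */
class RecordingHandler implements FailureHandler {
  final AtomicInteger calls = new AtomicInteger();

  @Override
  public void onConnectFailure() {
    calls.incrementAndGet();
  }
}

/** Toy service that reaches STARTED whether or not the failure path runs. */
class ToyService {
  enum State { INITED, STARTED }

  State state = State.INITED;
  private final FailureHandler handler;

  ToyService(FailureHandler handler) {
    this.handler = handler;
  }

  void start(boolean rmReachable) {
    if (!rmReachable) {
      handler.onConnectFailure(); // the flow under test
    }
    state = State.STARTED;
  }
}
```

Both {{start(true)}} and {{start(false)}} leave the service in STARTED, so 
that assertion alone cannot tell the two runs apart. Only the call count on 
the recording handler (1 versus 0) shows whether the failure flow was hit.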



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-10 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622459#comment-14622459
 ] 

Varun Saxena commented on YARN-3644:


Thanks [~raju.bairishetti] for working on this.

A few comments:
# No need to override {{MyNodeManager3#getNodeStatusUpdater}} and 
{{MyNodeManager3#serviceStop}}. serviceStop, for instance, only calls 
{{super.serviceStop()}}.
# {{MyNodeStatusUpdater6#context}} is not required.
# The test doesn't really check whether ConnectException was thrown or whether 
the NM shutdown event was raised. I think you check whether the flow has been 
hit or not using Mockito mock or spy. For instance, if you call 
{{NodeManager#start}}, the service state will be STARTED irrespective of 
whether the code written by you has been hit or not.
# I think the added log message can be logged at WARN level instead of ERROR.
# Also, the log says "Not shutting down NodeManager. Retry after default 
heartbeat interval time." We can instead say something like "Unable to connect 
to RM... Retry after default heartbeat time."
# The config name is {{yarn.nodemanager.shutdown.on.RM.connection.failures}}. 
All our config names are in lowercase, so just for the sake of consistency, 
maybe RM can be in lowercase too. Thoughts?
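
The lowercase naming convention mentioned in the last comment is easy to check 
mechanically. The helper below is only an illustrative sketch: 
{{ConfigKeyStyle}} is a hypothetical name and is not part of Hadoop.

```java
import java.util.Locale;

/** Hypothetical helper for checking the lowercase config-name convention. */
class ConfigKeyStyle {
  /** Returns true when the key contains no uppercase letters. */
  static boolean isLowercase(String key) {
    return key.equals(key.toLowerCase(Locale.ROOT));
  }
}
```

With this check, {{yarn.nodemanager.shutdown.on.RM.connection.failures}} is 
rejected because of the uppercase "RM", while an all-lowercase spelling of the 
same key passes.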





[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-10 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622477#comment-14622477
 ] 

Varun Saxena commented on YARN-3644:


A typo. I meant: "I think you *can* check whether the flow has been hit or not 
using Mockito mock or spy."



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-10 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622468#comment-14622468
 ] 

Varun Saxena commented on YARN-3644:


Moreover, I think most, if not all, of the My classes added by you are not 
required. You can use mocking with the existing classes to achieve the same 
result.
You just need to throw an exception when heartbeat is called, which is easy to 
do with Mockito.
We can probably change the visibility of the {{getRMClient}} method in one of 
the MyNodeStatusUpdater* classes so that it is visible for use with Mockito.
This will greatly reduce unnecessary code.

And IMHO, changing a method of a private class in test scope to public 
shouldn't be an issue. Thoughts?
You can probably explore this option to refactor your code.
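
As a rough illustration of "throw an exception while calling heartbeat" 
without writing another My* class, the sketch below uses a JDK dynamic proxy 
as a stand-in for a Mockito mock. The {{Tracker}} interface and 
{{ThrowingTrackers}} are hypothetical simplifications; the real test would 
mock the actual ResourceTracker instead.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.net.ConnectException;

/** Hypothetical, simplified tracker interface; not the Hadoop ResourceTracker. */
interface Tracker {
  String register() throws Exception;

  String nodeHeartbeat() throws Exception;
}

class ThrowingTrackers {
  /** Builds a Tracker whose nodeHeartbeat() always fails with ConnectException. */
  static Tracker failingHeartbeat() {
    // Only the two Tracker methods are expected to be called in this sketch.
    InvocationHandler handler = (proxy, method, args) -> {
      if (method.getName().equals("nodeHeartbeat")) {
        throw new ConnectException("simulated: RM unreachable");
      }
      return "ok";
    };
    return (Tracker) Proxy.newProxyInstance(
        Tracker.class.getClassLoader(), new Class<?>[] {Tracker.class}, handler);
  }
}
```

The failing behaviour lives in a few lines next to the test rather than in a 
dedicated subclass, which is the reduction in code the comment above asks for.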



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-06 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615631#comment-14615631
 ] 

Naganarasimha G R commented on YARN-3644:
-

Thanks [~raju.bairishetti] for working on this. The latest patch looks fine to 
me!




[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-07-06 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14614591#comment-14614591
 ] 

Raju Bairishetti commented on YARN-3644:


[~Naganarasimha] Could you kindly review the latest patch?



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603064#comment-14603064
 ] 

Hadoop QA commented on YARN-3644:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 18m 30s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 34s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 44s | The applied patch generated 1 new checkstyle issues (total was 211, now 211). |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 35s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 4m 18s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests | 6m 16s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 53m 37s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12742125/YARN-3644.003.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 8ef07f7 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8357/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8357/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8357/console |


This message was automatically generated.



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-26 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603363#comment-14603363
 ] 

Raju Bairishetti commented on YARN-3644:


It seems the checkstyle error was not introduced by this patch; the file 
already had more than 2,000 lines :)
*Checkstyle error:* YarnConfiguration.java:1: File length is 2,036 lines (max 
allowed is 2,000).



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-25 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602006#comment-14602006
 ] 

Naganarasimha G R commented on YARN-3644:
-

As long as the refactoring is taken care of in YARN-3847, I don't mind! I will 
try to review the patch as soon as possible.



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600710#comment-14600710
 ] 

Hadoop QA commented on YARN-3644:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 18m 30s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:red}-1{color} | javac | 2m 51s | The patch appears to cause the build to fail. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12741784/YARN-3644.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a815cc1 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8338/console |


This message was automatically generated.



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600740#comment-14600740
 ] 

Hadoop QA commented on YARN-3644:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 18m 18s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:red}-1{color} | javac | 3m 0s | The patch appears to cause the build to fail. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12741790/YARN-3644.002.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a815cc1 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8340/console |


This message was automatically generated.



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-12 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583093#comment-14583093
 ] 

Naganarasimha G R commented on YARN-3644:
-

Hi [~raju.bairishetti],
IIUC, the intention of this jira is only to make the NM wait for the RM 
indefinitely, and hence we don't want to set 
{{yarn.resourcemanager.connect.max-wait.ms}} to the FOREVER retry policy, which 
might affect other clients connecting to the RM. Right?
If so, I feel the overall approach is fine, except for the cosmetic comments 
below:
# {{NM_SHUTSDWON_ON_RM_CONNECTION_FAILURES}} has a typo: SHUTSDWON should be 
SHUTDOWN.
# If you agree with the previous point, then 
{{DEFAULT_NM_SHUTSDOWN_ON_RM_CONNECTION_FAILURES}} should become 
{{DEFAULT_NM_SHUTDOWN_ON_RM_CONNECTION_FAILURES}}.
# The configuration {{yarn.nodemanager.shutdown.on.connection.failures}} could 
be {{yarn.nodemanager.shutdown.on.RM.connection.failures}}. Correct the name 
and the description in yarn-default.xml as well.
# The test case introduces a new {{MyNodeStatusUpdater6}} whose only change is 
to get the new ResourceTracker for the test case. This is becoming more and 
more duplicate code for NodeStatusUpdater, as most of the other overriding 
NodeStatusUpdater classes do the same. Can we bring in a common 
NodeStatusUpdater class which accepts a ResourceTracker as a constructor 
parameter? (Refactoring the other classes can be taken up in another jira if 
required.)
# {{MyResourceTracker8}} could extend {{MyResourceTracker5}} and just override 
the required methods. I would also appreciate some documentation above these 
classes, so that they are easier to reuse in future if required.
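
The last two suggestions can be sketched as follows. All class names here are 
hypothetical simplifications of the test doubles discussed above and do not 
come from the patch: one updater class takes its tracker through the 
constructor, and a tracker variant overrides only the method whose behaviour 
differs.

```java
/** Hypothetical, simplified tracker; stands in for the test ResourceTracker doubles. */
class BaseTracker {
  String registerNodeManager() {
    return "registered";
  }

  String nodeHeartbeat() {
    return "heartbeat-ok";
  }
}

/** Overrides only the method that must behave differently in this test. */
class FailingHeartbeatTracker extends BaseTracker {
  @Override
  String nodeHeartbeat() {
    throw new IllegalStateException("simulated connection failure");
  }
}

/** One reusable updater: the tracker is injected rather than hard-wired in a subclass. */
class ConfigurableStatusUpdater {
  private final BaseTracker tracker;

  ConfigurableStatusUpdater(BaseTracker tracker) {
    this.tracker = tracker;
  }

  /** Sends one heartbeat and reports the outcome as a string. */
  String beat() {
    try {
      return tracker.nodeHeartbeat();
    } catch (IllegalStateException e) {
      return "failed: " + e.getMessage();
    }
  }
}
```

A test that needs different tracker behaviour then passes a different tracker 
to the same updater class instead of adding another numbered updater subclass.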



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-12 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583189#comment-14583189
 ] 

Raju Bairishetti commented on YARN-3644:


[~amareshwari] [~Naganarasimha] Thanks for the review and comments.

[~Naganarasimha] Yes,  this jira is only to make NM wait for RM.



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-11 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582981#comment-14582981
 ] 

Amareshwari Sriramadasu commented on YARN-3644:
---

Patch looks good to me.

[~vinodkv], [~sandflee] - do you see any issue with adding the new config 
yarn.nodemanager.shutdown.on.connection.failures for this behavior? I 
understand it is one more config getting added, but other than that I don't see 
any issue. Or do you know of any other solution?



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-06-11 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582898#comment-14582898
 ] 

Raju Bairishetti commented on YARN-3644:


Could anyone please review the patch?



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-27 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562206#comment-14562206
 ] 

Raju Bairishetti commented on YARN-3644:


[~hex108] Is there any pending work on this JIRA that led you to assign it to 
yourself, or was it assigned to you by mistake?


 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan
Assignee: Jun Gong
 Attachments: YARN-3644.001.patch, YARN-3644.patch




[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-27 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562210#comment-14562210
 ] 

Jun Gong commented on YARN-3644:


Sorry, by mistake...



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-27 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14562209#comment-14562209
 ] 

Jun Gong commented on YARN-3644:


Sorry, by mistake...



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558902#comment-14558902
 ] 

Hadoop QA commented on YARN-3644:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 38s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 32s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 35s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 1m 43s | The applied patch generated 1 new checkstyle issues (total was 214, now 215). |
| {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 3m 48s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 0m 26s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 1m 56s | Tests passed in hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests | 6m 15s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 49m 2s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12735276/YARN-3644.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 56996a6 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/8082/artifact/patchprocess/diffcheckstylehadoop-yarn-api.txt |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8082/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8082/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8082/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8082/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8082/console |


This message was automatically generated.

 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan
Assignee: Raju Bairishetti
 Attachments: YARN-3644.patch




[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558955#comment-14558955
 ] 

Hadoop QA commented on YARN-3644:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 36s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 35s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 32s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 2m 20s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 1s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 33s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 3m 49s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 0m 27s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests | 1m 57s | Tests passed in hadoop-yarn-common. |
| {color:green}+1{color} | yarn tests | 6m 26s | Tests passed in hadoop-yarn-server-nodemanager. |
| | | 49m 19s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12735288/YARN-3644.001.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 56996a6 |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/8083/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/8083/artifact/patchprocess/testrun_hadoop-yarn-common.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8083/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8083/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8083/console |


This message was automatically generated.



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-26 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560111#comment-14560111
 ] 

sandflee commented on YARN-3644:


Thanks [~vinodkv]. My concern is that long-running containers shouldn't be 
killed even if all RMs crash. In our production environment, we encountered the 
problem that all running containers were killed because the NM couldn't connect 
to the RM (the RM can't be 100% available), and the applications running on our 
cluster became unavailable.



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547489#comment-14547489
 ] 

Vinod Kumar Vavilapalli commented on YARN-3644:
---

bq. In large clusters, if RM is down for maintenance for longer period, all the 
NMs shuts themselves down, requiring additional work to bring up the NMs.
bq. Right now, NM shuts down itself only in case of connection failures. NM 
ignores all other kinds of exceptions and errors while sending heartbeats.
This path usually shouldn't happen at all as the RMProxy layer is supposed to 
retry _enough_, except perhaps for the bug at YARN-3646. We eventually want to 
give up if the retry layer itself gives up. Given that, is this JIRA simply a 
dup of YARN-3646? /cc [~jianhe] [~xgong]

 Node manager shuts down if unable to connect with RM
 

 Key: YARN-3644
 URL: https://issues.apache.org/jira/browse/YARN-3644
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Srikanth Sundarrajan



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread Srikanth Sundarrajan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547506#comment-14547506
 ] 

Srikanth Sundarrajan commented on YARN-3644:


[~vinodkv], YARN-3644 is independent of this. In our setup we ran into this 
before we ran into YARN-3646. With the default settings, the NM keeps retrying 
for about 30 minutes before *attempting* to shut itself down. Is there an 
issue if this wait time is much (infinitely) longer, for both HA and non-HA 
setups? An orthogonal issue is that when the NM attempts to shut itself down, it 
doesn't actually go down; it lingers for days without accepting any 
containers unless restarted (I will file another issue for this).



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547492#comment-14547492
 ] 

Vinod Kumar Vavilapalli commented on YARN-3644:
---

Actually, for all the above cases, we want NMs to just continue for a while 
without losing any work and finally give up after some time. The only 
difference between a HA vs non-HA setup is that in HA setup NMs will just wait 
many times over trying each of the RMs.

Getting into the business of detecting and acting on partitions is best left up 
to admins/tools.
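
A minimal sketch of the "keep trying for a while, then give up" behaviour described above, under stated assumptions: the class, the method {{connectWithBudget}}, and the round-based budget are all hypothetical and are not the RMProxy retry policy. A real NM would bound retries by elapsed time and sleep between attempts; rounds are used here only to keep the example deterministic. With several RM addresses (HA), each round simply tries every RM once, so the NM waits "many times over".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

public class BoundedRetrySketch {

    /**
     * Tries each RM address in turn, for at most maxRounds full passes.
     * Returns the first address that accepted the connection, or empty
     * once the retry budget is exhausted (the caller may then give up).
     * Hypothetical helper for illustration only.
     */
    static Optional<String> connectWithBudget(
            List<String> rmAddresses, Predicate<String> tryConnect, int maxRounds) {
        for (int round = 0; round < maxRounds; round++) {
            for (String rm : rmAddresses) {
                if (tryConnect.test(rm)) {
                    return Optional.of(rm);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> rms = Arrays.asList("rm1", "rm2");
        // Simulated cluster in which only rm2 is reachable.
        System.out.println(connectWithBudget(rms, rm -> rm.equals("rm2"), 3));
        // Simulated full outage: the budget runs out and the result is empty.
        System.out.println(connectWithBudget(rms, rm -> false, 3));
    }
}
```

A non-HA setup is the same loop with a single address in the list.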



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547544#comment-14547544
 ] 

Raju Bairishetti commented on YARN-3644:


We could have a new config like NODEMANAGER_ALIVE_ON_RM_CONNECTION_FAILURES. Based 
on this config value, the NM decides whether to shut down. This way we can honour 
the existing behaviour as well.

I will provide a patch shortly. I am not able to assign this to myself. Can 
anyone help me with the assignment?
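
The config-gated decision proposed above could be sketched roughly as follows. This is an illustration only, not the patch: the class, the property key {{nodemanager.alive-on-rm-connection-failures}}, and the use of {{java.util.Properties}} in place of the YARN configuration are all assumptions. The default keeps the existing behaviour (shut down on connection failure), and other errors are left to the existing log-and-continue path.

```java
import java.net.ConnectException;
import java.util.Properties;

public class ShutdownPolicySketch {

    // Hypothetical property key, used for illustration only.
    static final String KEEP_ALIVE_KEY = "nodemanager.alive-on-rm-connection-failures";

    enum Action { SHUTDOWN, KEEP_RUNNING }

    /**
     * Decides what the NM does after a heartbeat failure.
     * Only connection failures can trigger a shutdown, and only when the
     * keep-alive flag is unset or false (the existing behaviour).
     */
    static Action onHeartbeatFailure(Throwable error, Properties conf) {
        boolean keepAlive =
            Boolean.parseBoolean(conf.getProperty(KEEP_ALIVE_KEY, "false"));
        if (error instanceof ConnectException && !keepAlive) {
            return Action.SHUTDOWN;
        }
        return Action.KEEP_RUNNING;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Flag unset: existing behaviour.
        System.out.println(onHeartbeatFailure(new ConnectException("RM down"), conf));
        // Flag set: the NM stays up and keeps waiting for the RM.
        conf.setProperty(KEEP_ALIVE_KEY, "true");
        System.out.println(onHeartbeatFailure(new ConnectException("RM down"), conf));
    }
}
```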



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547155#comment-14547155
 ] 

sandflee commented on YARN-3644:


[~raju.bairishetti] thanks for your reply. If RM HA is not enabled, we can fix 
it like this. But with RM HA, there are some conditions to consider:
1. Both RM A and RM B reset the connection: the RMs seem to be in trouble, so 
the NM keeps containers alive.
2. Both RM A and RM B hit socket timeouts: either the NM is network-partitioned 
from the RMs or the RM machines have all crashed (is there any way to 
distinguish these?), so the NM kills all containers.
3. One RM resets the connection and the other hits a socket timeout: this is 
difficult to handle, since we know nothing about which RM is active; both RMs 
may have crashed, or just the active RM may be network-partitioned.
I suggest that the backup RM also responds and tells the NM that it is the 
backup RM. Case 3 then becomes:
3.1 One RM resets the connection and the other hits a socket timeout: the RMs 
seem to be in trouble, so just keep containers alive.
3.2 One RM is the backup and the other RM hits a socket timeout: the NM seems 
to be network-partitioned from the active RM, so kill all containers.
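
The cases listed above can be written down as a small decision function. This is only a sketch of the proposal as worded in this comment: the enum names and the {{decide}} method are hypothetical, and combinations the comment does not cover (for example, two RMs that both answer as backup) are returned as UNSPECIFIED rather than guessed.

```java
public class HaFailureDecisionSketch {

    // What the NM observed when contacting one RM (hypothetical names).
    enum Probe { RESET, TIMEOUT, STANDBY }

    enum Decision { KEEP_CONTAINERS, KILL_CONTAINERS, UNSPECIFIED }

    static Decision decide(Probe a, Probe b) {
        boolean anyReset = a == Probe.RESET || b == Probe.RESET;
        boolean anyTimeout = a == Probe.TIMEOUT || b == Probe.TIMEOUT;
        boolean anyStandby = a == Probe.STANDBY || b == Probe.STANDBY;

        if (a == Probe.RESET && b == Probe.RESET) {
            return Decision.KEEP_CONTAINERS;   // case 1: RMs in trouble
        }
        if (a == Probe.TIMEOUT && b == Probe.TIMEOUT) {
            return Decision.KILL_CONTAINERS;   // case 2: partition or all RM hosts down
        }
        if (anyReset && anyTimeout) {
            return Decision.KEEP_CONTAINERS;   // case 3.1
        }
        if (anyStandby && anyTimeout) {
            return Decision.KILL_CONTAINERS;   // case 3.2: cut off from the active RM
        }
        return Decision.UNSPECIFIED;           // not covered by the proposal
    }

    public static void main(String[] args) {
        System.out.println(decide(Probe.RESET, Probe.TIMEOUT));
        System.out.println(decide(Probe.STANDBY, Probe.TIMEOUT));
    }
}
```

The function is symmetric in its two arguments, so it does not matter which RM is probed first.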



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-17 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547159#comment-14547159
 ] 

sandflee commented on YARN-3644:


We also face this problem in our cluster. I'd like to work on this if 
possible; more comments are welcome!



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-16 Thread Raju Bairishetti (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547040#comment-14547040
 ] 

Raju Bairishetti commented on YARN-3644:


[~sandflee] Yes, the NM should catch the exception and keep itself alive.

Right now, the NM shuts itself down only in case of connection failures. It 
ignores all other kinds of exceptions and errors while sending heartbeats.
{code}
} catch (ConnectException e) {
  // catch and throw the exception if tried MAX wait time to connect RM
  dispatcher.getEventHandler().handle(
      new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
  throw new YarnRuntimeException(e);
} catch (Throwable e) {
  // TODO Better error handling. Thread can die with the rest of the
  // NM still running.
  LOG.error("Caught exception in status-updater", e);
}
{code}



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-16 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546971#comment-14546971
 ] 

sandflee commented on YARN-3644:


If the RM is down, the NM's connection will be reset by the RM machine. Could 
we catch this exception and keep the NM alive?



[jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM

2015-05-14 Thread Srikanth Sundarrajan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544971#comment-14544971
 ] 

Srikanth Sundarrajan commented on YARN-3644:


{quote}
Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side effects, 
where non connection failures are being retried infinitely by all YarnClients 
(via RMProxy).
{quote}
See YARN-3646
