[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773513#comment-16773513 ] Rayman commented on YARN-3554: -- The RetryUpToMaximumTimeWithFixedSleep policy takes as input a maxTime and a sleepTime. and internally is implemented as a RetryUpToMaximumCountWithFixedSleep with maxCount = maxTime / sleepTime. This has a problem It does not account for the time spent while performing the actual retry. For example, RetryUpToMaximumTimeWithFixedSleep with maxTime = 30 sec and sleepTime = 1sec. Will takeupto 90 seconds, if each retry (e.g., ConnectionTimeout) takes 2 seconds to return. 30 * (2 +1). A policy claiming to be RetryUpToMaximumTimeWithFixedSleep, should *actually* respect the *maximum time*, e.g., by recording a timestamp/timer. > Default value for maximum nodemanager connect wait time is too high > --- > > Key: YARN-3554 > URL: https://issues.apache.org/jira/browse/YARN-3554 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Naganarasimha G R >Priority: Major > Labels: BB2015-05-RFC, newbie > Fix For: 2.8.0, 2.7.1, 2.6.2, 3.0.0-alpha1 > > Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch > > > The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 > msec or 15 minutes, which is way too high. The default container expiry time > from the RM and the default task timeout in MapReduce are both only 10 > minutes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934124#comment-14934124 ] Sangjin Lee commented on YARN-3554: --- The change applied cleanly to branch-2.6 for 2.6.2. > Default value for maximum nodemanager connect wait time is too high > --- > > Key: YARN-3554 > URL: https://issues.apache.org/jira/browse/YARN-3554 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Jason Lowe >Assignee: Naganarasimha G R > Labels: BB2015-05-RFC, newbie > Fix For: 2.7.1, 2.6.2 > > Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch > > > The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 > msec or 15 minutes, which is way too high. The default container expiry time > from the RM and the default task timeout in MapReduce are both only 10 > minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533959#comment-14533959 ] Naganarasimha G R commented on YARN-3554: - Hi [~jlowe], As 3 mins is fine with [~vinodkv] can we have this patch in ? Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-TBR, newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534710#comment-14534710 ] Naganarasimha G R commented on YARN-3554: - Thanks for reviewing and committing the patch [~jlowe], [~vinodkv] [~gtCarrera9] Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-RFC, newbie Fix For: 2.7.1 Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534728#comment-14534728 ] Hudson commented on YARN-3554: -- FAILURE: Integrated in Hadoop-trunk-Commit #7775 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7775/]) YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-RFC, newbie Fix For: 2.7.1 Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534554#comment-14534554 ] Jason Lowe commented on YARN-3554: -- +1, committing this. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-RFC, newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534901#comment-14534901 ] Hudson commented on YARN-3554: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2137 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2137/]) YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-RFC, newbie Fix For: 2.7.1 Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534972#comment-14534972 ] Hudson commented on YARN-3554: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #189 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/189/]) YARN-3554. Default value for maximum nodemanager connect wait time is too high. Contributed by Naganarasimha G R (jlowe: rev 9757864fd662b69445e0c600aedbe307a264982e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-RFC, newbie Fix For: 2.7.1 Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531329#comment-14531329 ] Li Lu commented on YARN-3554: - I think the current conclusion on HADOOP-11398 is that we need to make some non-trivial changes in the retry mechanism to make it work accurately. We may want to have some quick fix before that. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-TBR, newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531213#comment-14531213 ] Vinod Kumar Vavilapalli commented on YARN-3554: --- I am good with either. I think the more important fix is HADOOP-11398 to be able to configure things in a predictable manner. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-TBR, newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14531224#comment-14531224 ] Naganarasimha G R commented on YARN-3554: - if there are plans to get HADOOP-11398 faster then we can directly modify yarn.client.nodemanager-connect.max-wait-ms to 3mins, if not i feel it would be better to modify it to 1min and drop a comment in HADOOP-11398 to remodify this value back to 3 once its in . thoughts? Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: BB2015-05-TBR, newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529591#comment-14529591 ] Naganarasimha G R commented on YARN-3554: - Hi [~vinodkv] [~jlowe], So would configuring yarn.client.nodemanager-connect.max-wait-ms as 1 min better ? Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526775#comment-14526775 ] Jason Lowe commented on YARN-3554: -- YARN-3518 is a separate concern with different ramifications. We should discuss it there and not mix these two. bq. set this to a bigger value maybe based on network partition considerations not only for nm restart. What value do you propose? As pointed out earlier, anything over 10 minutes is pointless since the container allocation expires in that time. Is it common for network partitions to take longer than 3 minutes but less than 10 minutes? If so we should tune the value for that. If not then making the value larger just slows recovery time. bq. 3 mins seems dangerous, If rm fails over and the recover takes serval mins, nm maybe kill all containers, in production env, it's not expected. This JIRA is configuring the amount of time NM clients (i.e.: primarily ApplicationMasters and the RM when launching ApplicationMasters) will try to connect to a particular NM before failing. I'm missing how RM failover leads to a mass killing of containers due to this proposed change. This is not a property used by the NM, so the NM is not going to start killing all containers differently based on an updated value for it. The only case where the RM will use this property is when connecting to NMs to launch AM containers, and it will only do so for NMs that have recently heartbeated. Could you explain how this leads to all containers getting killed on a particular node? Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526934#comment-14526934 ] Naganarasimha G R commented on YARN-3554: - Hi [~jlowe], earlier my query of ideal time and [~sandflee]'s comment is related to yarn.resourcemanager.connect.max-wait.ms and as [~gtCarrera] mentioned its just for discussion purpose. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526983#comment-14526983 ] Jason Lowe commented on YARN-3554: -- Ah, thanks [~Naganarasimha], sorry I missed that. We can continue discussing the proper RM connect wait time over at YARN-3518, as obviously I cannot keep them straight here. ;-) Are there still objections to lowering it from 15 mins to 3 mins? I'm +1 for the second patch, but I'll wait a few days before committing to give time for alternate proposals. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527072#comment-14527072 ] Vinod Kumar Vavilapalli commented on YARN-3554: --- HADOOP-11398 and YARN-3238 relevant in that they caused AM-NM communication take a long time to timeout. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527087#comment-14527087 ] Vinod Kumar Vavilapalli commented on YARN-3554: --- bq. Are there still objections to lowering it from 15 mins to 3 mins? I'm +1 for the second patch, but I'll wait a few days before committing to give time for alternate proposals. For our users, we explicitly set yarn.client.nodemanager-connect.max-wait-ms to 60,000 (one minute). As HADOOP-11398 is still not in, this ends up becoming 6 minutes timeout (assuming each of the underlying rpc retries takes 1 sec * 50 times to finish (50 secs), plus 10 seconds retry interval, causing 1min per retry and 6 retries overall). Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525143#comment-14525143 ] sandflee commented on YARN-3554: set this to a bigger value maybe based on network partition considerations not only for nm restart. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525245#comment-14525245 ] Naganarasimha G R commented on YARN-3554: - Hi [~gtCarrera9], Thanks for commenting on this jira but did not get the intention completely, whether you are expecting me to merge the changes required for 3518 here ? if so i had few questions 1. yarn-3518 tries to modify default value of yarn.resourcemanager.connect.max-wait.ms from 90 to 60, which not only impacts timeout from AM - RM but also NM - RM and client(cli, web, application report etc..) - RM. Is that ok ? (I am ok with it but just wanted to point it out) 2. Given the current high availability, is it required to wait for 10 mins to detect that RM has failed is valid or shall i decrease that too to 3 mins ? If you inform i can merge the changes of 3518 and also update in yarn-default.xml which is missing in 3518. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525264#comment-14525264 ] sandflee commented on YARN-3554: Hi [~Naganarasimha] 3 mins seems dangerous, If rm fails over and the recover takes serval mins , nm maybe kill all containers, in production env, it's not expected. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525406#comment-14525406 ] Li Lu commented on YARN-3554: - Hi [~Naganarasimha], I just wanted to bring that JIRA into attention. We may want to share some discussions for both JIRAs. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524197#comment-14524197 ] Li Lu commented on YARN-3554: - This is just a quick note that there's another JIRA YARN-3518 that fixes the similar problem for the RM. We may probably want to consider both of them. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Labels: newbie Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519576#comment-14519576 ] Naganarasimha G R commented on YARN-3554: - Agree with you [~jlowe], but what do you feel the ideal timeout should be, 3 mins / 5 mins ? May be as you guys would have better experience with large number of nodes and see frequent NM failures you can suggest a better value here . Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Attachments: YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519603#comment-14519603 ] Jason Lowe commented on YARN-3554: -- I suggest we go with 3 minutes. The retry interval is 10 seconds, so we'll get plenty of retries in that time if the failure is fast (e.g.: unknown host, connection refused) and still get a few retries in if the failure is slow (e.g.: connection timeout). Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Attachments: YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519840#comment-14519840 ] Hadoop QA commented on YARN-3554: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 14m 34s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 30s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 38s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 7m 35s | There were no new checkstyle issues. | | {color:green}+1{color} | install | 1m 33s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 46s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 30s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 55s | Tests passed in hadoop-yarn-common. | | | | 46m 59s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12729227/YARN-3554-20150429-2.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 8f82970 | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7543/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7543/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7543/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7543/console | This message was automatically generated. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Attachments: YARN-3554-20150429-2.patch, YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519360#comment-14519360 ] Jason Lowe commented on YARN-3554: -- I think 10 minutes is still too high. We didn't even have this functionality until 2.6 because of rolling upgrades, and NMs don't take that long to recover in a rolling upgrade. They recover in tens of seconds rather than tens of minutes. Therefore I don't think it makes much sense to spend a lot of time trying to connect to an NM beyond a few minutes. The chances of successfully connecting after a few minutes of trying is going to be very low, and NMs fail all the time anyway. So if we spend all that extra time trying for essentially no benefit, all we've done is prolonged the application recovery time for no good reason. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Attachments: YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3554) Default value for maximum nodemanager connect wait time is too high
[ https://issues.apache.org/jira/browse/YARN-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518559#comment-14518559 ] Hadoop QA commented on YARN-3554: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 15m 14s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | | {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | javac | 7m 48s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 9m 59s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 5m 32s | There were no new checkstyle issues. | | {color:green}+1{color} | install | 1m 34s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 2m 48s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. | | {color:green}+1{color} | yarn tests | 0m 21s | Tests passed in hadoop-yarn-api. | | {color:green}+1{color} | yarn tests | 1m 58s | Tests passed in hadoop-yarn-common. | | | | 46m 11s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12728970/YARN-3554.20150429-1.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / c79e7f7 | | hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7531/artifact/patchprocess/testrun_hadoop-yarn-api.txt | | hadoop-yarn-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7531/artifact/patchprocess/testrun_hadoop-yarn-common.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7531/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf904.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7531/console | This message was automatically generated. Default value for maximum nodemanager connect wait time is too high --- Key: YARN-3554 URL: https://issues.apache.org/jira/browse/YARN-3554 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Jason Lowe Assignee: Naganarasimha G R Attachments: YARN-3554.20150429-1.patch The default value for yarn.client.nodemanager-connect.max-wait-ms is 90 msec or 15 minutes, which is way too high. The default container expiry time from the RM and the default task timeout in MapReduce are both only 10 minutes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)