[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-07 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533389#comment-14533389
 ] 

Jian He commented on YARN-3480:
---

[~hex108], thanks for your explanations.
I can see this will be a problem for long-running apps, as the number of 
attempts will just keep increasing.
To be consistent with the attemptFailureValidityWindow for long-running apps, 
instead of introducing a global hard limit, how about removing the attempt 
records that are beyond the validity window?
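
A minimal sketch of the pruning idea suggested above, not the actual patch. 
AttemptRecord, prune_attempts and the millisecond fields are hypothetical names 
used only for illustration; it assumes every stored record belongs to a 
finished attempt and that a non-positive window means "no window".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AttemptRecord:
    """Hypothetical stand-in for a stored attempt; not a YARN class."""
    attempt_id: int
    finish_time_ms: int  # when the attempt finished, in milliseconds


def prune_attempts(attempts: List[AttemptRecord],
                   now_ms: int,
                   validity_window_ms: int) -> List[AttemptRecord]:
    """Return only the attempts that finished inside the validity window.

    Assumption: a non-positive window disables pruning entirely.
    """
    if validity_window_ms <= 0:
        return list(attempts)
    cutoff = now_ms - validity_window_ms
    # Records older than the cutoff no longer count toward the failure limit,
    # so they are the ones that could be dropped from the store.
    return [a for a in attempts if a.finish_time_ms >= cutoff]
```

With this shape the number of stored records is bounded by how many attempts 
can finish inside one window, rather than by the lifetime of the app.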

 Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable
 

 Key: YARN-3480
 URL: https://issues.apache.org/jira/browse/YARN-3480
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 2.6.0
Reporter: Jun Gong
Assignee: Jun Gong
 Attachments: YARN-3480.01.patch, YARN-3480.02.patch, 
 YARN-3480.03.patch


 When RM HA is enabled and running containers are kept across attempts, apps 
 are more likely to finish successfully with more retries (attempts), so it 
 is better to set 'yarn.resourcemanager.am.max-attempts' to a larger value. 
 However, that makes RMStateStore (FileSystem/HDFS/ZK) store more attempts and 
 makes the RM recovery process much slower. It might be better to make the 
 maximum number of attempts stored in RMStateStore configurable.
 BTW: When 'attemptFailuresValidityInterval' (introduced in YARN-611) is set to 
 a small value, the number of retried attempts might be very large. So we need 
 to delete some attempts stored in RMAppImpl and RMStateStore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-07 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533731#comment-14533731
 ] 

Jun Gong commented on YARN-3480:


[~jianhe] I just caught your opinion. Do you mean that the configured value for 
the max attempts stored in RMAppImpl and RMStateStore is not needed at all, and 
we could just remove the attempt records that are beyond the validity window? 
If so, I will update the patch.



[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-07 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14533705#comment-14533705
 ] 

Jun Gong commented on YARN-3480:


[~jianhe], thanks for your comments and suggestion.

The latest patch (YARN-3480.03.patch) already works as expected: it just 
removes the attempt records that are beyond the validity window. For the 
special case where the validity window is -1, it might be better to remove 
attempts until the number of attempts is less than a hard limit, because 
yarn.resourcemanager.am.max-attempts might be very large (as in our scenario). 
What's your opinion?
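
A rough sketch of the two cases described above, under stated assumptions and 
not taken from the patch itself. The (attempt_id, finish_time_ms) tuple shape, 
the function name and the hard_limit parameter are all hypothetical, and 
"less than hard limit" is read here as "keep at most hard_limit records".

```python
from typing import List, Tuple

# Hypothetical record shape: (attempt_id, finish_time_ms).
Attempt = Tuple[int, int]


def select_attempts_to_keep(attempts: List[Attempt],
                            now_ms: int,
                            validity_window_ms: int,
                            hard_limit: int) -> List[Attempt]:
    """Pick the attempt records to keep, oldest first.

    With a positive window, keep attempts that finished inside it.
    With the window disabled (e.g. -1), keep only the newest hard_limit.
    """
    ordered = sorted(attempts, key=lambda a: a[1])
    if validity_window_ms > 0:
        cutoff = now_ms - validity_window_ms
        return [a for a in ordered if a[1] >= cutoff]
    # Window disabled: fall back to the hard limit on stored records.
    keep = min(len(ordered), max(hard_limit, 0))
    return ordered[len(ordered) - keep:]
```

The fallback drops the oldest records first, on the assumption that the most 
recent attempts are the ones most useful for debugging.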



[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-07 Thread sandflee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532755#comment-14532755
 ] 

sandflee commented on YARN-3480:


One benefit of [~hex108]'s work is that we wouldn't have to worry about AM 
failures. In our production environment, we don't expect apps to be killed 
because of an AM failure.



[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-04 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527069#comment-14527069
 ] 

Jian He commented on YARN-3480:
---

[~hex108], generally, it's better to avoid a global config for an outlier app. 
1. How often do you see an app fail with a large number of attempts? If it's 
limited to a few apps, I wouldn't worry so much.
bq.  make RM recover process much slower.
2. How much slower is it in reality in your case? We've done some benchmarks: 
recovering 10k apps (with 1 attempt each) on ZK is pretty fast, within 20 
seconds or so.
3. Limiting the attempts to be recorded means we are losing history. It's a 
trade-off.

My main point is that if you can provide some real numbers showing how slow the 
recovery process is in a real scenario, we can figure out where the bottleneck 
is and how to improve it.



[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-04 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527847#comment-14527847
 ] 

Jun Gong commented on YARN-3480:


[~jianhe], sorry for not specifying our scenario: RM HA is enabled, ZK is used 
to store apps' info, and most apps running in the cluster are long-running 
(service) apps. yarn.resourcemanager.am.max-attempts is set to 1 because we 
have not patched YARN-611 and we want apps to retry more times. There are 10K 
apps with 1~1 attempts stored in ZK. It takes about 6 mins to recover those 
apps when an RM HA failover happens.

{quote}
1. How often do you see an app fail with a large number of attempts? If it's 
limited to a few apps, I wouldn't worry so much.
2. How much slower is it in reality in your case? We've done some benchmarks: 
recovering 10k apps (with 1 attempt each) on ZK is pretty fast, within 20 
seconds or so.
{quote}
Please see above. I think it will be OK for map-reduce jobs. But it might not 
be OK for service apps which have been running for several months.

{quote}
3. Limiting the attempts to be recorded means we are losing history. It's a 
trade-off.
{quote}
Yes, I agree.



[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-03 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14526195#comment-14526195
 ] 

Jun Gong commented on YARN-3480:


[~vinodkv] Thank you for the comments.

{quote}
No, as you noted later, the right solution is for apps to set the 
attempt-failures validity-interval.
{quote}
Yes, I agree with it.

{quote}
We already have a yarn.resourcemanager.am.max-attempts that acts as a global 
limit. Is that not sufficient? A more practical problem is the number of apps 
itself. And we do have an upper limit of 10K by default for this. Is that not 
enough? Are you seeing issues in a real-life scenario?
{quote}
yarn.resourcemanager.am.max-attempts only limits the number of failed attempts 
within the time window configured through 'attemptFailuresValidityInterval'. 
Suppose the following scenario: the app's am.max-attempts is set to 2 and its 
attemptFailuresValidityInterval is set to 30. If the app fails at 00:00, 00:31, 
00:62..., it will continue to retry and run, because its number of failed 
attempts within the time window (attemptFailuresValidityInterval) is always 1. 
The total number of attempts will then increase continuously.
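
The scenario above can be worked through with a small sketch. It is a 
simplified model, not the ResourceManager's code: the function names are 
hypothetical, times are in the same abstract units as the example (0, 31, 62 
with an interval of 30), and a failure is assumed to count only while it is 
strictly younger than the interval.

```python
from typing import List


def failures_in_window(failure_times: List[int], now: int,
                       interval: int) -> int:
    """Count failures that are still inside the validity interval at `now`."""
    return sum(1 for t in failure_times if 0 <= now - t < interval)


def should_retry(failure_times: List[int], now: int,
                 max_attempts: int, interval: int) -> bool:
    """Retry while the in-window failure count is below the limit."""
    return failures_in_window(failure_times, now, interval) < max_attempts


# The scenario from the comment: limit 2, interval 30, failures 31 units apart.
failure_times: List[int] = []
for failed_at in (0, 31, 62):
    failure_times.append(failed_at)
    in_window = failures_in_window(failure_times, failed_at, 30)
    retry = should_retry(failure_times, failed_at, 2, 30)
    print(failed_at, in_window, retry, len(failure_times))
```

Each earlier failure has aged out by the time the next one occurs, so the 
in-window count never reaches the limit while the number of stored attempts 
keeps growing.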

{quote}
I think we need to have a lower limit on the failure-validity interval to avoid 
situations like this. If others agree too, will file a ticket.
{quote}
Please see the above scenario.



[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-05-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14525491#comment-14525491
 ] 

Hadoop QA commented on YARN-3480:
-

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  14m 37s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included |   0m  0s | The patch appears to include 7 new or modified test files. |
| {color:green}+1{color} | javac |   7m 33s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc |   9m 37s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   1m 34s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace |   0m 13s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install |   1m 36s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 32s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   2m 37s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests |   0m 25s | Tests passed in hadoop-yarn-api. |
| {color:green}+1{color} | yarn tests |  52m 17s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | |  91m 30s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12729945/YARN-3480.03.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 6ae2a0d |
| hadoop-yarn-api test log | https://builds.apache.org/job/PreCommit-YARN-Build/7659/artifact/patchprocess/testrun_hadoop-yarn-api.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7659/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7659/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7659/console |


This message was automatically generated.



[jira] [Commented] (YARN-3480) Make AM max attempts stored in RMAppImpl and RMStateStore to be configurable

2015-04-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509263#comment-14509263
 ] 

Hadoop QA commented on YARN-3480:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12727636/YARN-3480.01.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 189a63a |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7473/console |


This message was automatically generated.
