[jira] [Commented] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal
[ https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129611#comment-14129611 ] Hadoop QA commented on YARN-2534: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667934/YARN-2534.000.patch against trunk revision 4be9517. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4882//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4882//console This message is automatically generated. > FairScheduler: totalMaxShare is not calculated correctly in > computeSharesInternal > - > > Key: YARN-2534 > URL: https://issues.apache.org/jira/browse/YARN-2534 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2534.000.patch > > > FairScheduler: totalMaxShare is not calculated correctly in > computeSharesInternal for some cases. > If the sum of MAX share of all Schedulables is more than Integer.MAX_VALUE > ,but each individual MAX share is not equal to Integer.MAX_VALUE. then > totalMaxShare will be a negative value, which will cause all fairShare are > wrongly calculated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-611: --- Attachment: YARN-611.9.rebase.patch > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, > YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, > YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch, > YARN-611.9.rebase.patch > > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129610#comment-14129610 ] Jian He commented on YARN-2496: --- Briefly looked at the patch: - CSQueueUtils.java has format changes only; we can revert it. - Why check {{labelManager != null}} everywhere? We only need to check where it's needed. - We may not need to change the method signature to add one more parameter; just pass the queues map into NodeLabelManager#reinitializeQueueLabels, to avoid a number of test changes. {code} parseQueue(this, conf, null, CapacitySchedulerConfiguration.ROOT, queues, queues, noop, queueToLabels); {code} - Label initialization code is duplicated between ParentQueue and LeafQueue; how about creating an AbstractCSQueue and putting common initialization methods there? > [YARN-796] Changes for capacity scheduler to support allocate resource > respect labels > - > > Key: YARN-2496 > URL: https://issues.apache.org/jira/browse/YARN-2496 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-2496.patch > > > This JIRA Includes: > - Add/parse labels option to {{capacity-scheduler.xml}} similar to other > options of queue like capacity/maximum-capacity, etc. > - Include a "default-label-expression" option in queue config, if an app > doesn't specify label-expression, "default-label-expression" of queue will be > used. > - Check if labels can be accessed by the queue when submit an app with > labels-expression to queue or update ResourceRequest with label-expression > - Check labels on NM when trying to allocate ResourceRequest on the NM with > label-expression > - Respect labels when calculate headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
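A rough sketch of the AbstractCSQueue refactoring suggested in the comment above; the class skeleton and method names here are illustrative assumptions, not the actual change:
{code}
// Illustrative only: hoist the duplicated label setup out of ParentQueue and LeafQueue.
import java.util.Set;

abstract class AbstractCSQueue {
  protected Set<String> accessibleLabels;
  protected String defaultLabelExpression;

  // Common label initialization shared by ParentQueue and LeafQueue,
  // instead of duplicating it in both subclasses.
  protected void setupQueueLabels(Set<String> labelsFromConf,
      String defaultExpressionFromConf) {
    this.accessibleLabels = labelsFromConf;
    this.defaultLabelExpression = defaultExpressionFromConf;
  }
}
{code}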
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129609#comment-14129609 ] Hadoop QA commented on YARN-611: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667940/YARN-611.9.patch against trunk revision 4be9517. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4883//console This message is automatically generated. > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, > YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, > YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch > > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129606#comment-14129606 ] Xuan Gong commented on YARN-611: bq. The name sliding window: Sliding window of what? We should make this clear in the API. How about attempt_failures_sliding_window_size? Or should we call it attempt_failures_validity_interval? Any other ideas? Zhijie Shen? Changed to attempt_failures_validity_interval bq. Either ways You will have to change all of the following yarn_protos.proto: sliding_window_size ApplicationSubmissionContext: Rename slidingWindowSize, setters and getters RMAppImpl.slidingWindowSize DONE bq. It is not clear what units the window-size is measured in from the API. Secs? Millis? We should javadoc this everywhere. ADDED bq. RMAppImpl.isAttemptFailureExceedMaxAttempt -> isNumAttemptsBeyondThreshold. Changed bq. TestAMRestart: The tests are very brittle because of the sleeps. Can we instead use a Clock and use it everywhere? That way you can inject manual clock-advance and test deterministically. See SystemClock for main line code usage and the ControlledClock for tests. FIXED > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, > YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, > YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch > > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. 
> I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
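A minimal sketch of the sliding-window failure check being discussed, assuming a millisecond attempt_failures_validity_interval and an injected clock value so tests can advance time deterministically; the names are illustrative and do not come from the patch:
{code}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: count AM attempt failures that happened within the
// validity interval; failures older than the window no longer count
// toward maxAttempts.
class AttemptFailureWindow {
  private final long attemptFailuresValidityIntervalMs;
  private final int maxAttempts;
  private final Deque<Long> failureTimestamps = new ArrayDeque<Long>();

  AttemptFailureWindow(long validityIntervalMs, int maxAttempts) {
    this.attemptFailuresValidityIntervalMs = validityIntervalMs;
    this.maxAttempts = maxAttempts;
  }

  void recordFailure(long now) {
    failureTimestamps.addLast(now);
  }

  // 'now' is injected (e.g. from a ControlledClock in tests) rather than
  // read from System.currentTimeMillis(), so tests need no sleeps.
  boolean isNumAttemptsBeyondThreshold(long now) {
    while (!failureTimestamps.isEmpty()
        && now - failureTimestamps.peekFirst() > attemptFailuresValidityIntervalMs) {
      failureTimestamps.removeFirst();   // expire failures outside the window
    }
    return failureTimestamps.size() >= maxAttempts;
  }
}
{code}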
[jira] [Updated] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-611: --- Attachment: YARN-611.9.patch > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, > YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, > YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch, YARN-611.9.patch > > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal
[ https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2534: Fix Version/s: (was: 2.6.0) > FairScheduler: totalMaxShare is not calculated correctly in > computeSharesInternal > - > > Key: YARN-2534 > URL: https://issues.apache.org/jira/browse/YARN-2534 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Attachments: YARN-2534.000.patch > > > FairScheduler: totalMaxShare is not calculated correctly in > computeSharesInternal for some cases. > If the sum of MAX share of all Schedulables is more than Integer.MAX_VALUE > ,but each individual MAX share is not equal to Integer.MAX_VALUE. then > totalMaxShare will be a negative value, which will cause all fairShare are > wrongly calculated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal
[ https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129582#comment-14129582 ] zhihai xu commented on YARN-2534: - I uploaded a patch YARN-2534.000.patch for review. I added a test case in this patch to prove this issue exists: two queues, where QueueA's maxShare is 1073741824 and QueueB's maxShare is 1073741824, so the sum of the two maxShares is more than Integer.MAX_VALUE. Without the fix, the test will fail. > FairScheduler: totalMaxShare is not calculated correctly in > computeSharesInternal > - > > Key: YARN-2534 > URL: https://issues.apache.org/jira/browse/YARN-2534 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Fix For: 2.6.0 > > Attachments: YARN-2534.000.patch > > > FairScheduler: totalMaxShare is not calculated correctly in > computeSharesInternal for some cases. > If the sum of MAX share of all Schedulables is more than Integer.MAX_VALUE > ,but each individual MAX share is not equal to Integer.MAX_VALUE. then > totalMaxShare will be a negative value, which will cause all fairShare are > wrongly calculated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal
[ https://issues.apache.org/jira/browse/YARN-2534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2534: Attachment: YARN-2534.000.patch > FairScheduler: totalMaxShare is not calculated correctly in > computeSharesInternal > - > > Key: YARN-2534 > URL: https://issues.apache.org/jira/browse/YARN-2534 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.5.0 >Reporter: zhihai xu >Assignee: zhihai xu > Fix For: 2.6.0 > > Attachments: YARN-2534.000.patch > > > FairScheduler: totalMaxShare is not calculated correctly in > computeSharesInternal for some cases. > If the sum of MAX share of all Schedulables is more than Integer.MAX_VALUE > ,but each individual MAX share is not equal to Integer.MAX_VALUE. then > totalMaxShare will be a negative value, which will cause all fairShare are > wrongly calculated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2534) FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal
zhihai xu created YARN-2534: --- Summary: FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal Key: YARN-2534 URL: https://issues.apache.org/jira/browse/YARN-2534 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Fix For: 2.6.0 FairScheduler: totalMaxShare is not calculated correctly in computeSharesInternal for some cases. If the sum of the MAX shares of all Schedulables is more than Integer.MAX_VALUE, but each individual MAX share is not equal to Integer.MAX_VALUE, then totalMaxShare will be a negative value, which causes all fair shares to be wrongly calculated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
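A worked example of the overflow described above, together with the kind of saturating sum that avoids it (a sketch only; the actual fix in YARN-2534.000.patch may differ):
{code}
// Two Schedulables each with maxShare = 1073741824 (2^30): the int sum wraps
// to a negative number, which corrupts every computed fair share.
public class TotalMaxShareOverflow {
  public static void main(String[] args) {
    int[] maxShares = {1073741824, 1073741824};

    int naiveTotal = 0;
    for (int m : maxShares) {
      naiveTotal += m;                        // overflows: 2^31 wraps to -2147483648
    }

    int cappedTotal = 0;
    for (int m : maxShares) {
      long sum = (long) cappedTotal + m;      // widen to long before adding
      cappedTotal = sum > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) sum;
    }

    System.out.println("naive total:  " + naiveTotal);   // -2147483648
    System.out.println("capped total: " + cappedTotal);  // 2147483647
  }
}
{code}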
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129557#comment-14129557 ] Tsuyoshi OZAWA commented on YARN-2229: -- The latest v16 patch is ready for review. [~jianhe], could you check it? > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
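For context, a small sketch of the 10-bit-epoch / 22-bit-sequence packing described in this issue and why the epoch overflows after 1024 RM restarts; the exact long-based layout is defined by the patch, not shown here:
{code}
// Sketch of the YARN-2052 int layout: upper 10 bits hold the epoch,
// lower 22 bits hold the per-epoch container sequence number.
public class ContainerIdLayout {
  static final int ID_BITS = 22;

  static int pack(int epoch, int id) {
    return (epoch << ID_BITS) | (id & ((1 << ID_BITS) - 1));
  }

  public static void main(String[] args) {
    System.out.println(pack(1, 5));            // epoch 1, container 5 -> 4194309
    // Only 2^10 = 1024 epochs fit in the remaining 10 bits, so the epoch
    // overflows once the RM has restarted 1024 times; widening the id to a
    // 64-bit long leaves ample room for both epoch and sequence number.
    System.out.println(1 << (32 - ID_BITS));   // 1024
  }
}
{code}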
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129541#comment-14129541 ] Hadoop QA commented on YARN-2229: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667919/YARN-2229.16.patch against trunk revision 83be3ad. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4881//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4881//console This message is automatically generated. > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129533#comment-14129533 ] Vinod Kumar Vavilapalli commented on YARN-2440: --- This looks good, +1. Checking this in.. > Cgroups should allow YARN containers to be limited to allocated cores > - > > Key: YARN-2440 > URL: https://issues.apache.org/jira/browse/YARN-2440 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, > apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, > apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, > screenshot-current-implementation.jpg > > > The current cgroups implementation does not limit YARN containers to the > cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129528#comment-14129528 ] Hadoop QA commented on YARN-2229: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667915/YARN-2229.16.patch against trunk revision 5ec7fcd. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestWorkPreservingRMRestart org.apache.hadoop.yarn.server.resourcemanager.scheduler.TestSchedulerApplicationAttempt {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4880//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4880//console This message is automatically generated. > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.2.patch Attached an updated consolidated patch, named "YARN-796.node-label.consolidate.2.patch". It contains several bug fixes and adds support for admins changing node labels when the RM is not running. Please feel free to try and review. Thanks, Wangda > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, > YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2499) [YARN-796] Respect labels in preemption policy of fair scheduler
[ https://issues.apache.org/jira/browse/YARN-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-2499: --- Assignee: Naganarasimha G R > [YARN-796] Respect labels in preemption policy of fair scheduler > > > Key: YARN-2499 > URL: https://issues.apache.org/jira/browse/YARN-2499 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2495) [YARN-796] Allow admin specify labels in each NM (Distributed configuration)
[ https://issues.apache.org/jira/browse/YARN-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Naganarasimha G R reassigned YARN-2495: --- Assignee: Naganarasimha G R > [YARN-796] Allow admin specify labels in each NM (Distributed configuration) > > > Key: YARN-2495 > URL: https://issues.apache.org/jira/browse/YARN-2495 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Naganarasimha G R > > Target of this JIRA is to allow admin specify labels in each NM, this covers > - User can set labels in each NM (by setting yarn-site.xml or using script > suggested by [~aw]) > - NM will send labels to RM via ResourceTracker API > - RM will set labels in NodeLabelManager when NM register/update labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129480#comment-14129480 ] Craig Welch commented on YARN-796: -- Good, what you describe wrt the cli is what I was trying to describe, I just might not have been very clear about it. I'm going to go ahead then and make the changes for the service side to match what we've described. > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: YARN-2229.16.patch > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: (was: YARN-2229.16.patch) > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, > YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, > YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: YARN-2229.16.patch > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129464#comment-14129464 ] Wangda Tan commented on YARN-796: - Hi Craig, I think when the RM is running, the solution should be exactly as you described: we should only check whether the caller is a user on the admin list, and the RM will write the file itself, by default as the "yarn" user. But when the RM is not running and we need to execute a tool to directly modify data in the store, we cannot use this approach. Because the ACL is retrieved from the local configuration file, a malicious user could create a configuration that marks itself as an admin user and use that configuration to launch the tool. IMHO, we don't need to check the ACL when running a standalone tool; it will modify the file, and the file's directory already has the right permissions (e.g., it belongs to the yarn user), so HDFS will do the check for us. But such a standalone command should only be run as the same user that launches the RM. Thanks, Wangda > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: (was: YARN-2229.16.patch) > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, > YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, > YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129459#comment-14129459 ] Jian He commented on YARN-1372: --- Thanks for updating the patch. Some comments and naming suggestions: - NodeStatusUpdater has import changes only; we can revert it. - Fix the indentation format of the second line. {code} public void removeCompletedContainersFromContext(List containerIds) throws public RMNodeCleanedupContainerNotifiedEvent(NodeId nodeId, ContainerId contId) { {code} - Why add {{context.getContainers().remove(cid);}} in the removeVeryOldStoppedContainersFromContext method? Won't this remove the containers from the context immediately when we send the container statuses across, which contradicts the rest of the changes? - In NodeStatusUpdaterImpl, the previousCompletedContainers cache is not needed any more, as we make the NM remove containers from its context only after it gets the notification. We can remove this; instead, in NodeStatusUpdater#getContainerStatuses, while looping over all the containers, we can check whether the corresponding application exists and, if not, remove the container from the context. - Make sure {{context.getNMStateStore().removeContainer(cid);}} is called after receiving the notification from the RM as well. - {{RMNodeEventType#CLEANEDUP_CONTAINER_NOTIFIED}}: put it in a new section where the source is RMAppAttempt. How about renaming it to FINISHED_CONTAINERS_PULLED_BY_AM, and similarly RMNodeCleanedupContainerNotifiedEvent -> RMNodeFinishedContainersPulledByAMEvent? - In RMAppAttemptImpl#BaseFinalTransition, we can clear finishedContainersSentToAM, in case the AM unexpectedly crashes. - I think Map> is more space efficient than: {code} private Map finishedContainersSentToAM = new HashMap(); {code} - Format convention is to have the method body on a different line from the method header. {code} public NodeId getNodeId() { return this.nodeId; } {code} - RMNodeImpl#cleanupContainersNotified: maybe rename it to finishedContainersPulledByAM; similarly, CleanedupContainerNotifiedTransition to FinishedContainersPulledByAMTransition. - NodeHeartbeatResponse#addCleanedupContainersNotified: how about addFinishedContainersPulledByAM? Similarly for the getter NodeHeartbeatResponse#getCleanedupContainersNotified and the proto file. Also add some code comments to explain why this new API is being added. > Ensure all completed containers are reported to the AMs across RM restart > - > > Key: YARN-1372 > URL: https://issues.apache.org/jira/browse/YARN-1372 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1372.001.patch, YARN-1372.001.patch, > YARN-1372.002_NMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, > YARN-1372.prelim.patch, YARN-1372.prelim2.patch > > > Currently the NM informs the RM about completed containers and then removes > those containers from the RM notification list. The RM passes on that > completed container information to the AM and the AM pulls this data. If the > RM dies before the AM pulls this data then the AM may not be able to get this > information again. To fix this, NM should maintain a separate list of such > completed container notifications sent to the RM. After the AM has pulled the > containers from the RM then the RM will inform the NM about it and the NM can > remove the completed container from the new list.
Upon re-register with the > RM (after RM restart) the NM should send the entire list of completed > containers to the RM along with any other containers that completed while the > RM was dead. This ensures that the RM can inform the AM's about all completed > containers. Some container completions may be reported more than once since > the AM may have pulled the container but the RM may die before notifying the > NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
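A simplified sketch of the acknowledgement flow under review: the NM keeps completed containers in its context until the RM's heartbeat response reports that the AM has pulled them. Names follow the renames suggested above and are assumptions, not the actual patch:
{code}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: NM-side bookkeeping for completed containers that must
// survive (in context and state store) until the AM has pulled them from the RM.
class CompletedContainerTracker {
  // containerId -> completed container status, still held in the NM context
  private final Map<String, String> completedInContext = new HashMap<String, String>();

  void containerCompleted(String containerId, String status) {
    completedInContext.put(containerId, status);
  }

  // Reported to the RM on every heartbeat until acknowledged.
  List<String> getContainerStatuses() {
    return new ArrayList<String>(completedInContext.values());
  }

  // Called when the heartbeat response carries finishedContainersPulledByAM:
  // only now is it safe to drop them from the context (and the NM state store).
  void onFinishedContainersPulledByAM(Set<String> pulledByAM) {
    for (String containerId : pulledByAM) {
      completedInContext.remove(containerId);
      // context.getNMStateStore().removeContainer(containerId) would go here
    }
  }
}
{code}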
[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: YARN-2229.16.patch Fixed to pass tests. > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.16.patch, YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, > YARN-2229.4.patch, YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, > YARN-2229.8.patch, YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2529) Generic history service RPC interface doesn't work when service authorization is enabled
[ https://issues.apache.org/jira/browse/YARN-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129430#comment-14129430 ] Hadoop QA commented on YARN-2529: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667892/YARN-2529.1.patch against trunk revision 5ec7fcd. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4879//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4879//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-applicationhistoryservice.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4879//console This message is automatically generated. > Generic history service RPC interface doesn't work when service authorization > is enabled > > > Key: YARN-2529 > URL: https://issues.apache.org/jira/browse/YARN-2529 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2529.1.patch > > > Here's the problem shown in the log: > {code} > 14/09/10 10:42:44 INFO ipc.Server: Connection from 10.22.2.109:55439 for > protocol org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is > unauthorized for user zshen (auth:SIMPLE) > 14/09/10 10:42:44 INFO ipc.Server: Socket Reader #1 for port 10200: > readAndProcess from client 10.22.2.109 threw exception > [org.apache.hadoop.security.authorize.AuthorizationException: Protocol > interface org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is not > known.] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
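The "protocol ... is not known" error quoted above generally means the protocol is not registered with the service-level authorization policy provider. A hedged sketch of what such a registration looks like; the class name and config key here are assumptions, not necessarily what the attached patch does:
{code}
import org.apache.hadoop.security.authorize.PolicyProvider;
import org.apache.hadoop.security.authorize.Service;
import org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB;

// Sketch: make ApplicationHistoryProtocolPB known to service authorization.
public class TimelinePolicyProvider extends PolicyProvider {
  @Override
  public Service[] getServices() {
    return new Service[] {
        // The ACL key name is an assumption for illustration purposes.
        new Service("security.applicationhistory.protocol.acl",
            ApplicationHistoryProtocolPB.class)
    };
  }
}
{code}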
[jira] [Commented] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129429#comment-14129429 ] Hadoop QA commented on YARN-2229: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667886/YARN-2229.15.patch against trunk revision 7f80e14. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.util.TestConverterUtils {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4878//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/4878//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-api.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4878//console This message is automatically generated. > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, > YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, > YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is epoch can overflow after RM > restarts 1024 times. > To avoid the problem, its better to make containerId long. We need to define > the new format of container Id with preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129384#comment-14129384 ] Hadoop QA commented on YARN-415: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667877/YARN-415.201409102216.txt against trunk revision 7f80e14. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4877//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4877//console This message is automatically generated. > Capture aggregate memory allocation at the app-level for chargeback > --- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.201407281816.txt, > YARN-415.201408062232.txt, YARN-415.201408080204.txt, > YARN-415.201408092006.txt, YARN-415.201408132109.txt, > YARN-415.201408150030.txt, YARN-415.201408181938.txt, > YARN-415.201408181938.txt, YARN-415.201408212033.txt, > YARN-415.201409040036.txt, YARN-415.201409092204.txt, > YARN-415.201409102216.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. 
We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
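For illustration, the memory-seconds formula quoted in the description above can be computed directly from per-container reservations. A minimal sketch, assuming a hypothetical FinishedContainer record (reserved MB plus start/finish timestamps); this is illustrative only and not code from any of the attached patches:
{code}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical per-container record used only for this illustration.
final class FinishedContainer {
  final long reservedMB;    // memory reserved for the container, in MB
  final long startTimeMs;   // container start time (epoch millis)
  final long finishTimeMs;  // container finish time (epoch millis)

  FinishedContainer(long reservedMB, long startTimeMs, long finishTimeMs) {
    this.reservedMB = reservedMB;
    this.startTimeMs = startTimeMs;
    this.finishTimeMs = finishTimeMs;
  }
}

final class ChargebackSketch {
  // Sum of (reserved MB * lifetime in seconds) over all containers of an app.
  static long aggregateMemorySeconds(List<FinishedContainer> containers) {
    long mbSeconds = 0;
    for (FinishedContainer c : containers) {
      long lifetimeSec = TimeUnit.MILLISECONDS.toSeconds(c.finishTimeMs - c.startTimeMs);
      mbSeconds += c.reservedMB * lifetimeSec;
    }
    return mbSeconds;
  }

  public static void main(String[] args) {
    // Two containers, each reserving 2048 MB for 60 seconds: 245760 MB-seconds.
    List<FinishedContainer> app = Arrays.asList(
        new FinishedContainer(2048, 0, 60_000),
        new FinishedContainer(2048, 10_000, 70_000));
    System.out.println(aggregateMemorySeconds(app));
  }
}
{code}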
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129378#comment-14129378 ] Karthik Kambatla commented on YARN-1492: I am good with fixing the in-memory store so store-specific details don't creep into the code elsewhere. Personally, I am okay with working on leveldb and zk stores post merge. My main concern is with providing a way to initialize the store, as we don't have a good answer for long-running apps and it will not be required when using leveldb and zk implementations for non-HA and HA cases. I would rather avoid that piece completely. I am okay with having an in-memory store that the tests exercise and has a trivial recovery. Having a "real" store though would definitely boost people's confidence at merge time :) > truly shared cache for jars (jobjar/libjar) > --- > > Key: YARN-1492 > URL: https://issues.apache.org/jira/browse/YARN-1492 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.0.4-alpha >Reporter: Sangjin Lee >Assignee: Chris Trezzo > Attachments: YARN-1492-all-trunk-v1.patch, > YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, > YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, > shared_cache_design.pdf, shared_cache_design_v2.pdf, > shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, > shared_cache_design_v5.pdf, shared_cache_design_v6.pdf > > > Currently there is the distributed cache that enables you to cache jars and > files so that attempts from the same job can reuse them. However, sharing is > limited with the distributed cache because it is normally on a per-job basis. > On a large cluster, sometimes copying of jobjars and libjars becomes so > prevalent that it consumes a large portion of the network bandwidth, not to > speak of defeating the purpose of "bringing compute to where data is". This > is wasteful because in most cases code doesn't change much across many jobs. > I'd like to propose and discuss feasibility of introducing a truly shared > cache so that multiple jobs from multiple users can share and cache jars. > This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2464) Provide Hadoop as a local resource (on HDFS) which can be used by other projects
[ https://issues.apache.org/jira/browse/YARN-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129379#comment-14129379 ] Junping Du commented on YARN-2464: -- [~sseth], I will assign it to myself and work on it if you haven't started working on it. > Provide Hadoop as a local resource (on HDFS) which can be used by other > projects > > > Key: YARN-2464 > URL: https://issues.apache.org/jira/browse/YARN-2464 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Siddharth Seth > > DEFAULT_YARN_APPLICATION_CLASSPATH are used by YARN projects to setup their > AM / task classpaths if they have a dependency on Hadoop libraries. > It'll be useful to provide similar access to a Hadoop tarball (Hadoop libs, > native libraries) etc, which could be used instead - for applications which > do not want to rely upon Hadoop versions from a cluster node. This would also > require functionality to update the classpath/env for the apps based on the > structure of the tar. > As an example, MR has support for a full tar (for rolling upgrades). > Similarly, Tez ships hadoop libraries along with its build. I'm not sure > about the Spark / Storm / HBase model for this - but using a common copy > instead of everyone localizing Hadoop libraries would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129359#comment-14129359 ] Hadoop QA commented on YARN-1372: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667876/YARN-1372.003.patch against trunk revision 7f80e14. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4876//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4876//console This message is automatically generated. > Ensure all completed containers are reported to the AMs across RM restart > - > > Key: YARN-1372 > URL: https://issues.apache.org/jira/browse/YARN-1372 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1372.001.patch, YARN-1372.001.patch, > YARN-1372.002_NMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, > YARN-1372.prelim.patch, YARN-1372.prelim2.patch > > > Currently the NM informs the RM about completed containers and then removes > those containers from the RM notification list. The RM passes on that > completed container information to the AM and the AM pulls this data. If the > RM dies before the AM pulls this data then the AM may not be able to get this > information again. To fix this, NM should maintain a separate list of such > completed container notifications sent to the RM. After the AM has pulled the > containers from the RM then the RM will inform the NM about it and the NM can > remove the completed container from the new list. Upon re-register with the > RM (after RM restart) the NM should send the entire list of completed > containers to the RM along with any other containers that completed while the > RM was dead. This ensures that the RM can inform the AM's about all completed > containers. Some container completions may be reported more than once since > the AM may have pulled the container but the RM may die before notifying the > NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129354#comment-14129354 ] Junping Du commented on YARN-2033: -- Sure. Will commit it soon. Thanks [~zjshen] for the patch and [~vinodkv] for review! > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, > YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, > YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, > YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129342#comment-14129342 ] Chris Trezzo commented on YARN-1492: Thanks [~kasha]! A couple of questions: bq. 2. The choice of SCM store should be transparent to the rest of SCM code. It would be better to define an interface for the SCMStore similar to the RMStateStore today. To clarify the above point. An interface does exist in the current implementation (see SCMStore.java in YARN-2180), and all SCMStore implementations should be based off of that. Unfortunately some implementation details from the in-memory store have leaked through via the SCMContext object. I am working on an update to improve the interface so that an SCMContext object is no longer needed and all implementation details are hidden behind SCMStore.java. Does your above point mean that you are looking for a state machine-based interface like RMStateStore, or do you see additional issues with the SCMStore interface outside of the SCMContext fix? bq. 3. Defaulting to the in-memory store requires providing a way to initialize the store with currently running applications and cached jars, which is quite involved and not so elegant either. I propose implementing leveldb and zk stores. We could default to leveldb on non-HA clusters, and ZK store for HA clusters if we choose to embed the SCM in the RM. Do you see the leveldb and zk stores as blockers to merging into trunk/2.6 or would an in-memory store with the interface fixes mentioned above be enough initially? Leveldb and ZK stores could be easily added post-merge in an incremental way as additional SCMStore implementations. > truly shared cache for jars (jobjar/libjar) > --- > > Key: YARN-1492 > URL: https://issues.apache.org/jira/browse/YARN-1492 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.0.4-alpha >Reporter: Sangjin Lee >Assignee: Chris Trezzo > Attachments: YARN-1492-all-trunk-v1.patch, > YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, > YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, > shared_cache_design.pdf, shared_cache_design_v2.pdf, > shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, > shared_cache_design_v5.pdf, shared_cache_design_v6.pdf > > > Currently there is the distributed cache that enables you to cache jars and > files so that attempts from the same job can reuse them. However, sharing is > limited with the distributed cache because it is normally on a per-job basis. > On a large cluster, sometimes copying of jobjars and libjars becomes so > prevalent that it consumes a large portion of the network bandwidth, not to > speak of defeating the purpose of "bringing compute to where data is". This > is wasteful because in most cases code doesn't change much across many jobs. > I'd like to propose and discuss feasibility of introducing a truly shared > cache so that multiple jobs from multiple users can share and cache jars. > This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2529) Generic history service RPC interface doesn't work when service authorization is enabled
[ https://issues.apache.org/jira/browse/YARN-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2529: -- Attachment: YARN-2529.1.patch I created a patch to make the application history protocol use the timeline policy when service authorization is enabled. It's not straightforward to add the test cases on top of TestApplicationHistoryClientService, but I've manually verified it on my local single-node cluster. > Generic history service RPC interface doesn't work when service authorization > is enabled > > > Key: YARN-2529 > URL: https://issues.apache.org/jira/browse/YARN-2529 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2529.1.patch > > > Here's the problem shown in the log: > {code} > 14/09/10 10:42:44 INFO ipc.Server: Connection from 10.22.2.109:55439 for > protocol org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is > unauthorized for user zshen (auth:SIMPLE) > 14/09/10 10:42:44 INFO ipc.Server: Socket Reader #1 for port 10200: > readAndProcess from client 10.22.2.109 threw exception > [org.apache.hadoop.security.authorize.AuthorizationException: Protocol > interface org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is not > known.] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-121) Yarn services to throw a YarnException on invalid state changes
[ https://issues.apache.org/jira/browse/YARN-121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-121: -- Fix Version/s: (was: 3.0.0) > Yarn services to throw a YarnException on invalid state changes > -- > > Key: YARN-121 > URL: https://issues.apache.org/jira/browse/YARN-121 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The {{EnsureCurrentState()}} checks of services throw an > {{IllegalStateException}} if the state is wrong. If this were changed to > {{YarnException}}, wrapper services such as CompositeService could relay this > directly, instead of wrapping it in their own. > The time to implement is mainly in changing the lifecycle test cases of > MAPREDUCE-3939 subtasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-120) Make yarn-common services robust
[ https://issues.apache.org/jira/browse/YARN-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer updated YARN-120: -- Fix Version/s: (was: 3.0.0) > Make yarn-common services robust > > > Key: YARN-120 > URL: https://issues.apache.org/jira/browse/YARN-120 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Steve Loughran >Assignee: Steve Loughran > Labels: yarn > Attachments: MAPREDUCE-4014.patch > > > Review the yarn common services ({{CompositeService}}, > {{AbstractLivelinessMonitor}}) and make their service startup _and especially > shutdown_ more robust against out-of-lifecycle invocation and partially > complete initialization. > Write tests for these where possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2229) ContainerId can overflow with RM restart
[ https://issues.apache.org/jira/browse/YARN-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2229: - Attachment: YARN-2229.15.patch Talked with Jian offline. Updated the patch to reduce the epoch field from 32 bits to 24 bits and increase the id field from 32 bits to 40 bits, since 32 bits for the epoch is more than needed. The protobuf spec allows int32/int64 values to be truncated: https://developers.google.com/protocol-buffers/docs/proto > ContainerId can overflow with RM restart > > > Key: YARN-2229 > URL: https://issues.apache.org/jira/browse/YARN-2229 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Tsuyoshi OZAWA >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2229.1.patch, YARN-2229.10.patch, > YARN-2229.10.patch, YARN-2229.11.patch, YARN-2229.12.patch, > YARN-2229.13.patch, YARN-2229.14.patch, YARN-2229.15.patch, > YARN-2229.2.patch, YARN-2229.2.patch, YARN-2229.3.patch, YARN-2229.4.patch, > YARN-2229.5.patch, YARN-2229.6.patch, YARN-2229.7.patch, YARN-2229.8.patch, > YARN-2229.9.patch > > > On YARN-2052, we changed containerId format: upper 10 bits are for epoch, > lower 22 bits are for sequence number of Ids. This is for preserving > semantics of {{ContainerId#getId()}}, {{ContainerId#toString()}}, > {{ContainerId#compareTo()}}, {{ContainerId#equals}}, and > {{ConverterUtils#toContainerId}}. One concern is that the epoch can overflow after RM > restarts 1024 times. > To avoid the problem, it's better to make containerId long. We need to define > the new format of container Id while preserving backward compatibility on this > JIRA. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
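For reference, the 24/40-bit split discussed above can be visualized with a small bit-packing sketch: the RM epoch occupies the upper 24 bits of a 64-bit container id and the per-epoch sequence number the lower 40 bits. This is only an illustration of the layout under discussion, not the code in YARN-2229.15.patch:
{code}
// Illustration of the 64-bit container id layout discussed above:
// upper 24 bits = RM epoch (incremented on RM restart), lower 40 bits = sequence number.
final class ContainerIdLayoutSketch {
  private static final int EPOCH_BITS = 24;
  private static final int SEQUENCE_BITS = 40;
  private static final long EPOCH_MASK = (1L << EPOCH_BITS) - 1;
  private static final long SEQUENCE_MASK = (1L << SEQUENCE_BITS) - 1;

  // Pack an epoch and a sequence number into a single 64-bit id.
  static long pack(long epoch, long sequence) {
    return ((epoch & EPOCH_MASK) << SEQUENCE_BITS) | (sequence & SEQUENCE_MASK);
  }

  static long epochOf(long containerId) {
    return containerId >>> SEQUENCE_BITS;
  }

  static long sequenceOf(long containerId) {
    return containerId & SEQUENCE_MASK;
  }
}
{code}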
[jira] [Created] (YARN-2533) Redirect stdout and stderr to a file for all applications/frameworks
Kannan Rajah created YARN-2533: -- Summary: Redirect stdout and stderr to a file for all applications/frameworks Key: YARN-2533 URL: https://issues.apache.org/jira/browse/YARN-2533 Project: Hadoop YARN Issue Type: Improvement Components: log-aggregation Affects Versions: 2.4.1 Reporter: Kannan Rajah Priority: Minor Today, we have the capability to redirect stdout and stderr of shell commands (launched tasks) to a file and also apply a tail length. This logic exists in TaskLog and YARNRunner, but these reside in MapReduce-specific packages, so every framework has to duplicate this logic. It would be nice to abstract this at the YARN level and apply it to shell commands launched by any framework. The ContainerLaunch.call method looks like a good candidate. Does anyone have suggestions or guidelines? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201409102216.txt Thanks a lot, [~jianhe]. I have added comment headers for the new APIs in ApplicationResourceUsageReport. > Capture aggregate memory allocation at the app-level for chargeback > --- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.201407281816.txt, > YARN-415.201408062232.txt, YARN-415.201408080204.txt, > YARN-415.201408092006.txt, YARN-415.201408132109.txt, > YARN-415.201408150030.txt, YARN-415.201408181938.txt, > YARN-415.201408181938.txt, YARN-415.201408212033.txt, > YARN-415.201409040036.txt, YARN-415.201409092204.txt, > YARN-415.201409102216.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1372: Attachment: YARN-1372.003.patch As per feedback, removed containers when the corresponding application does not exist; that simplified a lot of code from the second iteration. Also added unit tests, and renamed previousJustFinishedContainers to finishedContainersSentToAM to clarify the difference. As discussed earlier, this avoids the problem of a failure occurring between the RM acking this set to the NM and the AM successfully processing it: by waiting for the next allocate call before acking to the NM, we guarantee the AM has successfully received this list. > Ensure all completed containers are reported to the AMs across RM restart > - > > Key: YARN-1372 > URL: https://issues.apache.org/jira/browse/YARN-1372 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Bikas Saha >Assignee: Anubhav Dhoot > Attachments: YARN-1372.001.patch, YARN-1372.001.patch, > YARN-1372.002_NMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, > YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, > YARN-1372.prelim.patch, YARN-1372.prelim2.patch > > > Currently the NM informs the RM about completed containers and then removes > those containers from the RM notification list. The RM passes on that > completed container information to the AM and the AM pulls this data. If the > RM dies before the AM pulls this data then the AM may not be able to get this > information again. To fix this, NM should maintain a separate list of such > completed container notifications sent to the RM. After the AM has pulled the > containers from the RM then the RM will inform the NM about it and the NM can > remove the completed container from the new list. Upon re-register with the > RM (after RM restart) the NM should send the entire list of completed > containers to the RM along with any other containers that completed while the > RM was dead. This ensures that the RM can inform the AM's about all completed > containers. Some container completions may be reported more than once since > the AM may have pulled the container but the RM may die before notifying the > NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
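A much-simplified sketch of the acknowledgement flow described in the comment above. Only the name finishedContainersSentToAM is taken from the comment; the class and other names are hypothetical, and container ids are reduced to strings for brevity:
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative only; not the RM implementation.
final class FinishedContainerTracking {
  // Reported by NMs but not yet handed to the AM.
  private final List<String> justFinishedContainers = new ArrayList<>();
  // Handed to the AM on the last allocate call but not yet acked back to the NMs.
  private final List<String> finishedContainersSentToAM = new ArrayList<>();

  // Called when an NM reports completed containers.
  synchronized void onContainersCompleted(List<String> completed) {
    justFinishedContainers.addAll(completed);
  }

  // Called on each AM allocate heartbeat.
  synchronized List<String> onAllocate() {
    // The AM issued another allocate, which proves it received the previous
    // batch, so the NMs can now be told to forget those containers.
    ackToNodeManagers(finishedContainersSentToAM);
    finishedContainersSentToAM.clear();

    // Hand the newly finished containers to the AM and remember them until
    // the next heartbeat confirms delivery.
    List<String> toSend = new ArrayList<>(justFinishedContainers);
    finishedContainersSentToAM.addAll(toSend);
    justFinishedContainers.clear();
    return toSend;
  }

  private void ackToNodeManagers(List<String> containerIds) {
    // In the real RM this would notify the owning NMs so they can prune their lists.
  }
}
{code}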
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129181#comment-14129181 ] Jason Lowe commented on YARN-2056: -- Sorry for coming in late. I think there's an issue with this part of the patch:
{code}
// The per-queue disablePreemption defaults to false (preemption enabled).
// Inherit parent's per-queue disablePreemption value.
boolean parentQueueDisablePreemption = false;
boolean queueDisablePreemption = false;
if (root.getParent() != null) {
  String parentQueuePropName = BASE_YARN_RM_PREEMPTION
      + root.getParent().getQueuePath() + SUFFIX_DISABLE_PREEMPTION;
  parentQueueDisablePreemption = this.conf.getBoolean(parentQueuePropName, false);
}
String queuePropName = BASE_YARN_RM_PREEMPTION + root.getQueuePath()
    + SUFFIX_DISABLE_PREEMPTION;
queueDisablePreemption = this.conf.getBoolean(queuePropName, parentQueueDisablePreemption);
{code}
I think it only handles examining the immediate parent for a default value. If preemption is disabled at a parent two levels or more removed from the leaf queue then it appears we won't honor that. > Disable preemption at Queue level > - > > Key: YARN-2056 > URL: https://issues.apache.org/jira/browse/YARN-2056 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Mayank Bansal >Assignee: Eric Payne > Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, > YARN-2056.201408310117.txt, YARN-2056.201409022208.txt > > > We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
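To illustrate the concern, here is a hedged sketch of a lookup that walks the whole ancestor chain instead of only the immediate parent. It is written as a fragment that would sit in the same class as the snippet above (reusing BASE_YARN_RM_PREEMPTION and SUFFIX_DISABLE_PREEMPTION) and is not the actual fix from the attached patches:
{code}
// Fragment only: walk from the root of the queue hierarchy down to this queue
// so a disablePreemption value set on any ancestor is inherited unless a more
// specific queue overrides it.
private boolean isPreemptionDisabled(CSQueue queue, Configuration conf) {
  boolean inherited = false;  // cluster-wide default: preemption enabled
  if (queue.getParent() != null) {
    inherited = isPreemptionDisabled(queue.getParent(), conf);
  }
  String propName = BASE_YARN_RM_PREEMPTION + queue.getQueuePath()
      + SUFFIX_DISABLE_PREEMPTION;
  return conf.getBoolean(propName, inherited);
}
{code}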
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129141#comment-14129141 ] Jian He commented on YARN-415: -- Eric, thanks for your explanation. Sounds good to me. One nit: I found the new APIs added in ApplicationResourceUsageReport don't have code comments. Could you add those too? I'd like to commit this once this is fixed. Thanks for all your patience! > Capture aggregate memory allocation at the app-level for chargeback > --- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.201407281816.txt, > YARN-415.201408062232.txt, YARN-415.201408080204.txt, > YARN-415.201408092006.txt, YARN-415.201408132109.txt, > YARN-415.201408150030.txt, YARN-415.201408181938.txt, > YARN-415.201408181938.txt, YARN-415.201408212033.txt, > YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129133#comment-14129133 ] Zhijie Shen commented on YARN-611: -- bq. How about attempt_failures_sliding_window_size? Or should we call it attempt_failures_validity_interval? Any other ideas? attempt_failures_validity_interval sounds good to me. > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, > YARN-611.4.patch, YARN-611.4.rebase.patch, YARN-611.5.patch, > YARN-611.6.patch, YARN-611.7.patch, YARN-611.8.patch > > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129119#comment-14129119 ] Eric Payne commented on YARN-415: - Thanks for clarifying [~jianhe]. {quote} is this {{currentAttempt.getAppAttemptId().equals(attemptId)}} still necessary ? since the return value of {{scheduler#getAppResourceUsageReport}} for non-active attempt is anyways empty/null. {quote} I believe that the check is necessary. Here are a couple of points. - First, {{RMAppAttemptMetrics#getAggregateAppResourceUsage}} is called from multiple places, including {{RMAppImpl#getRMAppMetrics}}, which loops through all attempts for any given app. If the app is running and has multiple attempts, we want to charge the current attempt for both the running container stats and those that finished for that attempt. But, in this scenario, when {{RMAppImpl#getRMAppMetrics}} loops through and calls {{RMAppAttemptMetrics#getAggregateAppResourceUsage}} for the finished attempts, {{RMAppAttemptMetrics#getAggregateAppResourceUsage}} needs to know that the attempt ID is not the current attempt so that it doesn't count the running container stats again. - Second, from my tests and my reading of the code, I'm pretty sure that {{scheduler#getAppResourceUsageReport}} always returns the {{ApplicationResourceUsageReport}} for the current attempt, even if you give it a finished attempt. It uses the attemptId to get the app object, and then uses that to get the current attempt. I've tested this, and by taking a look at {{AbstractYarnScheduler#getApplicationAttempt}} (which is called by {{getAppResourceUsageReport}} for both CapacityScheduler and FairScheduler), we can see that it only uses the attemptId to get the app, and then calls app.getCurrentAttempt(). I hope that helps to clarify this. Thank you > Capture aggregate memory allocation at the app-level for chargeback > --- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.201407281816.txt, > YARN-415.201408062232.txt, YARN-415.201408080204.txt, > YARN-415.201408092006.txt, YARN-415.201408132109.txt, > YARN-415.201408150030.txt, YARN-415.201408181938.txt, > YARN-415.201408181938.txt, YARN-415.201408212033.txt, > YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... 
+ (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
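A condensed fragment in the spirit of the logic Eric describes above, showing why the current-attempt check matters: the scheduler report always describes the current attempt, so its running-container usage is added only when this metrics object belongs to that attempt. The finishedMemorySeconds/finishedVcoreSeconds fields and the surrounding class are assumptions for illustration; this is not the attached patch:
{code}
// Fragment only: aggregate resource usage for one attempt.
AggregateAppResourceUsage getAggregateAppResourceUsage() {
  // Usage finalized as this attempt's containers completed (assumed fields).
  long memorySeconds = finishedMemorySeconds;
  long vcoreSeconds = finishedVcoreSeconds;

  RMAppAttempt currentAttempt = rmContext.getRMApps()
      .get(attemptId.getApplicationId()).getCurrentAppAttempt();
  // The scheduler report always describes the current attempt, so only add it
  // when this metrics object *is* the current attempt; otherwise running-container
  // usage would be double counted when RMAppImpl#getRMAppMetrics loops over attempts.
  if (currentAttempt != null
      && currentAttempt.getAppAttemptId().equals(attemptId)) {
    ApplicationResourceUsageReport report =
        scheduler.getAppResourceUsageReport(attemptId);
    if (report != null) {
      memorySeconds += report.getMemorySeconds();
      vcoreSeconds += report.getVcoreSeconds();
    }
  }
  return new AggregateAppResourceUsage(memorySeconds, vcoreSeconds);
}
{code}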
[jira] [Updated] (YARN-2456) Possible lovelock in CapacityScheduler when RM is recovering apps
[ https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2456: -- Summary: Possible lovelock in CapacityScheduler when RM is recovering apps (was: Possible deadlock in CapacityScheduler when RM is recovering apps) > Possible lovelock in CapacityScheduler when RM is recovering apps > - > > Key: YARN-2456 > URL: https://issues.apache.org/jira/browse/YARN-2456 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2456.1.patch > > > Consider this scenario: > 1. RM is configured with a single queue and only one application can be > active at a time. > 2. Submit App1 which uses up the queue's whole capacity > 3. Submit App2 which remains pending. > 4. Restart RM. > 5. App2 is recovered before App1, so App2 is added to the activeApplications > list. Now App1 remains pending (because of max-active-app limit) > 6. All containers of App1 are now recovered when NM registers, and use up the > whole queue capacity again. > 7. Since the queue is full, App2 cannot proceed to allocate AM container. > 8. In the meanwhile, App1 cannot proceed to become active because of the > max-active-app limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2456) Possible livelock in CapacityScheduler when RM is recovering apps
[ https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2456: -- Summary: Possible livelock in CapacityScheduler when RM is recovering apps (was: Possible lovelock in CapacityScheduler when RM is recovering apps) > Possible livelock in CapacityScheduler when RM is recovering apps > - > > Key: YARN-2456 > URL: https://issues.apache.org/jira/browse/YARN-2456 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2456.1.patch > > > Consider this scenario: > 1. RM is configured with a single queue and only one application can be > active at a time. > 2. Submit App1 which uses up the queue's whole capacity > 3. Submit App2 which remains pending. > 4. Restart RM. > 5. App2 is recovered before App1, so App2 is added to the activeApplications > list. Now App1 remains pending (because of max-active-app limit) > 6. All containers of App1 are now recovered when NM registers, and use up the > whole queue capacity again. > 7. Since the queue is full, App2 cannot proceed to allocate AM container. > 8. In the meanwhile, App1 cannot proceed to become active because of the > max-active-app limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2456) Possible deadlock in CapacityScheduler when RM is recovering apps
[ https://issues.apache.org/jira/browse/YARN-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129066#comment-14129066 ] Jian He commented on YARN-2456: --- Folks, thanks for the comments. Renamed the title as suggested by Wangda. I agree that too many other factors may affect this issue, e.g. NM resync time. This patch really just mitigates the issue, not solving the issue completely. > Possible deadlock in CapacityScheduler when RM is recovering apps > - > > Key: YARN-2456 > URL: https://issues.apache.org/jira/browse/YARN-2456 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2456.1.patch > > > Consider this scenario: > 1. RM is configured with a single queue and only one application can be > active at a time. > 2. Submit App1 which uses up the queue's whole capacity > 3. Submit App2 which remains pending. > 4. Restart RM. > 5. App2 is recovered before App1, so App2 is added to the activeApplications > list. Now App1 remains pending (because of max-active-app limit) > 6. All containers of App1 are now recovered when NM registers, and use up the > whole queue capacity again. > 7. Since the queue is full, App2 cannot proceed to allocate AM container. > 8. In the meanwhile, App1 cannot proceed to become active because of the > max-active-app limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2459: Fix Version/s: 2.6.0 > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Fix For: 2.6.0 > > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. > If for any reason Any app gets rejected and directly goes to NEW to FAILED > then final transition makes that to RMApps and Completed Apps memory > structure but that doesn't make it to State store. > Now when RMApps default limit reaches it starts deleting apps from memory and > store. In that case it try to delete this app from store and fails which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2458) Add file handling features to the Windows Secure Container Executor LRPC service
[ https://issues.apache.org/jira/browse/YARN-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129061#comment-14129061 ] Remus Rusanu commented on YARN-2458: The solution proposed here is to have the Windows Secure Container Executor use its own FileContext and FileSystem. The WSCE filesystem is derived from RawLocalFileSystem and overrides the actual creation of directories, setPermissions, setOwner and createOutputStream operations. These operations are executed via JNI/LRPC by calling corresponding remote methods offered by the hadoopwinutilsvc service. This service runs as a privileged user (LocalSystem) and thus can execute certain operations forbidden to the NM, like writing into the container dirs (owned by the container user). The actual implementation of methods like setOwner/setPermissions is the same as before: whether it is invoked via winutils chown/chmod or via the native hadoop.dll JNI, the code is exactly the same and is shared via libwinutils. This change simply offers a mechanism to execute this code in an elevated process. The patches also contain some changes around classpath jar creation: previously the jar was created directly in the destination dir (the container private dirs). This is now forbidden because the NM does not have the right to do it. Instead, the classpath jars are created in the nmPrivate folder and then moved into the container dirs (via a copy/move API offered by hadoopwinutilsvc). > Add file handling features to the Windows Secure Container Executor LRPC > service > > > Key: YARN-2458 > URL: https://issues.apache.org/jira/browse/YARN-2458 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Remus Rusanu >Assignee: Remus Rusanu > Labels: security, windows > Attachments: YARN-2458.1.patch, YARN-2458.2.patch > > > In the WSCE design the nodemanager needs to do certain privileged operations > like changing file ownership to arbitrary users or deleting files owned by the > task container user after completion of the task. As we want to remove the > Administrator privilege requirement from the nodemanager service, we have to > move these operations into the privileged LRPC helper service. > Extend the RPC interface to contain methods to change file ownership and > manipulate files, add the JNI client side and implement the server side. This > will piggyback on the existing LRPC service, so there is not much infrastructure to > add (run as service, RPC init, authentication and authorization are already > solved). It just needs to be implemented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
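A heavily hedged sketch of the delegation pattern described above: a RawLocalFileSystem subclass whose privileged operations call out to an elevated helper instead of acting directly. ElevatedSvcClient and its methods are hypothetical stand-ins for the JNI/LRPC bridge to hadoopwinutilsvc, not classes from the attached patches:
{code}
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;
import org.apache.hadoop.fs.permission.FsPermission;

// Illustrative subclass: privileged operations are forwarded to a helper
// service running as a privileged user, so the NM itself needs no admin rights.
public class ElevatedLocalFileSystem extends RawLocalFileSystem {
  private final ElevatedSvcClient svc;  // hypothetical LRPC client

  public ElevatedLocalFileSystem(ElevatedSvcClient svc) {
    this.svc = svc;
  }

  @Override
  public void setOwner(Path p, String user, String group) throws IOException {
    svc.chown(p.toUri().getPath(), user, group);
  }

  @Override
  public void setPermission(Path p, FsPermission perm) throws IOException {
    svc.chmod(p.toUri().getPath(), perm.toShort());
  }
}

// Hypothetical stand-in for the JNI/LRPC calls into hadoopwinutilsvc.
interface ElevatedSvcClient {
  void chown(String path, String user, String group) throws IOException;
  void chmod(String path, int mode) throws IOException;
}
{code}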
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129058#comment-14129058 ] Hadoop QA commented on YARN-2440: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667797/apache-yarn-2440.6.patch against trunk revision cbfe263. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4875//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4875//console This message is automatically generated. > Cgroups should allow YARN containers to be limited to allocated cores > - > > Key: YARN-2440 > URL: https://issues.apache.org/jira/browse/YARN-2440 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, > apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, > apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, > screenshot-current-implementation.jpg > > > The current cgroups implementation does not limit YARN containers to the > cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129052#comment-14129052 ] Vinod Kumar Vavilapalli commented on YARN-2033: --- +1, looks good to me. [~djp], can you please do the honours given you did early reviews? > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, > YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, > YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, > YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129042#comment-14129042 ] Hadoop QA commented on YARN-2033: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667818/YARN-2033.12.patch against trunk revision 47bdfa0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4874//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4874//console This message is automatically generated. > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, > YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, > YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, > YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2475) ReservationSystem: replan upon capacity reduction
[ https://issues.apache.org/jira/browse/YARN-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129041#comment-14129041 ] Chris Douglas commented on YARN-2475: - {{SimpleCapacityReplanner}} * The Clock can be initialized in the constructor, declared private and final * The exception refers to an InventorySizeAdjusmentPolicy * nit: redundant parenthesis in the main loop, exceeds 80 char * {{curSessions}} cannot be null; prefer {{!isEmpty()}} to {{size() > 0}} ** Is this check even necessary? {{sort}} and the following loop should be noops * A brief comment about the natural order of {{ReservationAllocations}} would help readability of this loop. It's in the class doc, but something inline would be helpful * An internal {{Resource(0,0)}} could be reused, instead of creating it in the loop * Could the inner loop be more readable? The embedded function calls in the {{Resource}} arithmetic are hard to read (pseudo): {code} ArrayList<> curSessions = new ArrayList<>(plan.getResourcesAtTime(t)); Collections.sort(curSessions); for (Iterator<> i = curSessions.iterator(); i.hasNext() && excessCap > 0;) { InMemoryReservationAllocation a = (InMemoryReservationAllocation) i.next(); plan.deleteReservation(a.getReservationId()); excessCap -= a.getResourcesAtTime(t); } {code} * Why is the enforcement window tied to {{CapacitySchedulerConfiguration}}? {{TestSimpleCapacityReplanner}} * Tests should not call {{Thread.sleep}}; instead update the mock * Passing in a mocked {{Clock}} to the cstr rather than assigning it in the test is cleaner * Instead of {{assertTrue(cond != null)}} use {{assertNotNull(cond)}} (same for positive null check) * The test should not catch and discard {{PlanningException}} > ReservationSystem: replan upon capacity reduction > - > > Key: YARN-2475 > URL: https://issues.apache.org/jira/browse/YARN-2475 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-2475.patch > > > In the context of YARN-1051, if capacity of the cluster drops significantly > upon machine failures we need to trigger a reorganization of the planned > reservations. As reservations are "absolute" it is possible that they will > not all fit, and some need to be rejected a-posteriori. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1710) Admission Control: agents to allocate reservation
[ https://issues.apache.org/jira/browse/YARN-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129038#comment-14129038 ] Chris Douglas commented on YARN-1710: - {{GreedyReservationAgent}} * Consider {{@link}} for {{ReservationRequest}} in class javadoc * An inline comment could replace the {{adjustContract()}} method * Most of the javadoc on private methods can be cut * {{currentReservationStage}} does not need to be declared outside the loop * {{allocations}} cannot be null * An internal {{Resource(0, 0)}} could be reused * {{li}} should be part of the loop ({{for}} not {{while}}). Its initialization is unreadable; please use temp vars. * Generally, embedded calls are difficult to read: {code} if (findEarliestTime(allocations.keySet()) > earliestStart) { allocations.put(new ReservationInterval(earliestStart, findEarliestTime(allocations.keySet())), ReservationRequest .newInstance(Resource.newInstance(0, 0), 0)); // consider to add trailing zeros at the end for simmetry } {code} Assuming the {{ReservationRequest}} is never modified by the plan: {code} private final ZERO_RSRC = ReservationRequest.newInstance(Resource.newInstance(0, 0), 0); // ... long allocStart = findEarliestTime(allocations.keySet()); if (allocStart > earliestStart) { ReservationInterval preAlloc = new ReservationInterval(earliestStart, allocStart); allocations.put(preAlloc, ZERO_RSRC); } {code} * {{findEarliestTime(allocations.keySet())}} is called several times and should be memoized ** Would a {{TreeSet}} be more appropriate, given this access pattern? * Instead of: {code} boolean result = false; if (oldReservation != null) { result = plan.updateReservation(capReservation); } else { result = plan.addReservation(capReservation); } return result; {code} Consider: {code} if (oldReservation != null) { return plan.updateReservation(capReservation); } return plan.addReservation(capReservation); {code} * A comment unpacking the arithmetic for calculating {{curMaxGang}} would help readability {{TestGreedyReservationAgent}} * Instead of fixing the seed, consider setting and logging it for each run. * {{testStress}} is brittle, as it verifies only the timeout; {{testBig}} and {{testSmall}} don't verify anything. Both tests are useful, but probably not as part of the build. Dropping the annotation and adding a {{main()}} that calls each of them would be one alternative. > Admission Control: agents to allocate reservation > - > > Key: YARN-1710 > URL: https://issues.apache.org/jira/browse/YARN-1710 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Carlo Curino >Assignee: Carlo Curino > Attachments: YARN-1710.1.patch, YARN-1710.patch > > > This JIRA tracks the algorithms used to allocate a user ReservationRequest > coming in from the new reservation API (YARN-1708), in the inventory > subsystem (YARN-1709) maintaining the current plan for the cluster. The focus > of these "agents" is to quickly find a solution for the set of constraints > provided by the user, and the physical constraints of the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129024#comment-14129024 ] Craig Welch commented on YARN-796: -- So, I'm adding code to check whether a user should be able to modify labels (is an admin), and I think we should check the UserGroupInformation but not execute the operation using "doAs". This is because, ultimately, the process is writing data into HDFS, and for permissions reasons I think it should always be written as the same user - the user yarn runs as. If we do the doAs there will be a mishmash of users there, and to keep the directory secure there would need to be a group with rights that contains all the admin users, which is extra overhead (otherwise it has to be world writable, which tends to compromise the security model...). I think the same is true if we use other datastores down the line for holding the label info - really, our interest in the user is to verify access; we don't need or want to perform actions on their behalf (like you would when launching a job, etc.), as this is not one of those cases. So, I propose enforcing the check but executing whatever changes as the user the process is running under (the resourcemanager/yarn user, basically, just dropping the doAs). This means that entry points will need to do the verification, but that's not really an issue; they already have to gather the info about who the user is and be aware of the need for doAs today. It means that a user executing a tool which directly modifies the data in HDFS will need to do so as an appropriate user, but they already have to do that; it's not a new issue created by this approach (it doesn't really make that any better or worse, imho). Thoughts? > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, > Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, > YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.demo.patch.1, > YARN-796.patch, YARN-796.patch4 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
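A minimal sketch of the check-then-write-as-daemon approach proposed above, assuming the standard yarn.admin.acl setting. The LabelStore interface and updateNodeLabels call are hypothetical placeholders, not APIs from the YARN-796 patches:
{code}
import java.io.IOException;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.AccessControlException;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

final class LabelAdminCheckSketch {

  // Hypothetical store abstraction used only for this sketch.
  interface LabelStore {
    void updateNodeLabels(Map<String, Set<String>> nodeToLabels) throws IOException;
  }

  // Verify the caller against yarn.admin.acl, then write as the RM's own user:
  // no doAs, so the HDFS label directory only needs to be writable by the yarn user.
  static void updateLabelsAsDaemon(Configuration conf, LabelStore store,
      Map<String, Set<String>> nodeToLabels) throws IOException {
    UserGroupInformation caller = UserGroupInformation.getCurrentUser();
    AccessControlList adminAcl = new AccessControlList(
        conf.get(YarnConfiguration.YARN_ADMIN_ACL,
            YarnConfiguration.DEFAULT_YARN_ADMIN_ACL));
    if (!adminAcl.isUserAllowed(caller)) {
      throw new AccessControlException("User " + caller.getShortUserName()
          + " is not an admin; cannot modify node labels");
    }
    store.updateNodeLabels(nodeToLabels);  // runs under the daemon's credentials
  }
}
{code}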
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129022#comment-14129022 ] Jian He commented on YARN-415: -- Hi [~eepayne], sorry for being unclear. my main question was, is this {{currentAttempt.getAppAttemptId().equals(attemptId)}} still necessary ? since the return value of scheduler#getAppResourceUsageReport for non-active attempt is anyways empty/null. > Capture aggregate memory allocation at the app-level for chargeback > --- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.201407281816.txt, > YARN-415.201408062232.txt, YARN-415.201408080204.txt, > YARN-415.201408092006.txt, YARN-415.201408132109.txt, > YARN-415.201408150030.txt, YARN-415.201408181938.txt, > YARN-415.201408181938.txt, YARN-415.201408212033.txt, > YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129020#comment-14129020 ] Vinod Kumar Vavilapalli commented on YARN-2440: --- The build machine ran into an issue which [~gkesavan] helped fixing on my offline request. Rekicked the build manually.. > Cgroups should allow YARN containers to be limited to allocated cores > - > > Key: YARN-2440 > URL: https://issues.apache.org/jira/browse/YARN-2440 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, > apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, > apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, > screenshot-current-implementation.jpg > > > The current cgroups implementation does not limit YARN containers to the > cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2532) Track pending resources at the application level
[ https://issues.apache.org/jira/browse/YARN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129013#comment-14129013 ] Karthik Kambatla commented on YARN-2532: bq. For the FS at least, this is just FSAppAttempt.getDemand() - FSAppAttempt.getResourceUsage() Yes, it is. Tracking pending resources separately is not necessary for YARN-2353. However, demand for a queue or an app-attempt changes when the app requests more resources (increase in pending resources) or containers complete (consumption goes down). Since we want to track the pending resources information for YARN-2333, I thought we might as well do that first and use that as a trigger to update the demand in YARN-2353. > Track pending resources at the application level > - > > Key: YARN-2532 > URL: https://issues.apache.org/jira/browse/YARN-2532 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.5.1 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > SchedulerApplicationAttempt keeps track of current consumption of an app. It > would be nice to have a similar value tracked for pending requests. > The immediate uses I see are: (1) Showing this on the Web UI (YARN-2333) and > (2) updating demand in FS in an event-driven style (YARN-2353) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129010#comment-14129010 ] Hadoop QA commented on YARN-2033: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667808/YARN-2033.11.patch against trunk revision 47bdfa0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4873//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4873//console This message is automatically generated. > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, > YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, > YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, > YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2532) Track pending resources at the application level
[ https://issues.apache.org/jira/browse/YARN-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128997#comment-14128997 ] Sandy Ryza commented on YARN-2532: -- For the FS at least, this is just FSAppAttempt.getDemand() - FSAppAttempt.getResourceUsage(), no? > Track pending resources at the application level > - > > Key: YARN-2532 > URL: https://issues.apache.org/jira/browse/YARN-2532 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Affects Versions: 2.5.1 >Reporter: Karthik Kambatla >Assignee: Karthik Kambatla > > SchedulerApplicationAttempt keeps track of current consumption of an app. It > would be nice to have a similar value tracked for pending requests. > The immediate uses I see are: (1) Showing this on the Web UI (YARN-2333) and > (2) updating demand in FS in an event-driven style (YARN-2353) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
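For reference, the Fair Scheduler arithmetic Sandy points to can be sketched as below. {{Resources.subtract}} and the {{FSAppAttempt}} accessors exist in this code base, but treat the snippet as an illustration rather than the eventual patch.
{code}
// Pending resources for an FS app attempt, derived from existing accounting:
// everything the app has asked for minus everything it currently holds.
Resource pending = Resources.subtract(
    appAttempt.getDemand(),          // total requested
    appAttempt.getResourceUsage());  // currently allocated
{code}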
[jira] [Commented] (YARN-2158) TestRMWebServicesAppsModification sometimes fails in trunk
[ https://issues.apache.org/jira/browse/YARN-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128985#comment-14128985 ] Jian He commented on YARN-2158: --- looks good, committing > TestRMWebServicesAppsModification sometimes fails in trunk > -- > > Key: YARN-2158 > URL: https://issues.apache.org/jira/browse/YARN-2158 > Project: Hadoop YARN > Issue Type: Test >Reporter: Ted Yu >Assignee: Varun Vasudev >Priority: Minor > Attachments: apache-yarn-2158.0.patch, apache-yarn-2158.1.patch > > > From https://builds.apache.org/job/Hadoop-Yarn-trunk/582/console : > {code} > Tests run: 10, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 66.144 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification > testSingleAppKill[1](org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification) > Time elapsed: 2.297 sec <<< FAILURE! > java.lang.AssertionError: app state incorrect > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.verifyAppStateJson(TestRMWebServicesAppsModification.java:398) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification.testSingleAppKill(TestRMWebServicesAppsModification.java:289) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128984#comment-14128984 ] Karthik Kambatla commented on YARN-1492: Thanks for updating the design, Chris. Chris and I discussed the design and current implementation offline. A couple of comments in that discussion: # I like the idea of having a separate daemon for SCM, but if it is not very resource (memory) intensive, it might make sense to embed it in the RM by default. This takes care of HA etc. for free. We can do this at the end. # The choice of SCM store should be transparent to the rest of SCM code. It would be better to define an interface for the SCMStore similar to the RMStateStore today. # Defaulting to the in-memory store requires providing a way to initialize the store with currently running applications and cached jars, which is quite involved and not so elegant either. I propose implementing leveldb and zk stores. We could default to leveldb on non-HA clusters, and ZK store for HA clusters if we choose to embed the SCM in the RM. > truly shared cache for jars (jobjar/libjar) > --- > > Key: YARN-1492 > URL: https://issues.apache.org/jira/browse/YARN-1492 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.0.4-alpha >Reporter: Sangjin Lee >Assignee: Chris Trezzo > Attachments: YARN-1492-all-trunk-v1.patch, > YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, > YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, > shared_cache_design.pdf, shared_cache_design_v2.pdf, > shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, > shared_cache_design_v5.pdf, shared_cache_design_v6.pdf > > > Currently there is the distributed cache that enables you to cache jars and > files so that attempts from the same job can reuse them. However, sharing is > limited with the distributed cache because it is normally on a per-job basis. > On a large cluster, sometimes copying of jobjars and libjars becomes so > prevalent that it consumes a large portion of the network bandwidth, not to > speak of defeating the purpose of "bringing compute to where data is". This > is wasteful because in most cases code doesn't change much across many jobs. > I'd like to propose and discuss feasibility of introducing a truly shared > cache so that multiple jobs from multiple users can share and cache jars. > This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
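A rough, hypothetical shape for the SCMStore abstraction mentioned in point 2 above; all method names are invented for illustration, and the real interface would be defined by the patch.
{code}
// Hypothetical store interface, analogous in spirit to RMStateStore, so that
// in-memory, leveldb and ZK implementations can be swapped behind it.
public interface SCMStore extends Service {
  void addResource(String checksum, String fileName) throws IOException;
  void removeResource(String checksum) throws IOException;
  void addResourceReference(String checksum, ApplicationId appId) throws IOException;
  void removeResourceReferences(ApplicationId appId) throws IOException;
}
{code}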
[jira] [Created] (YARN-2532) Track pending resources at the application level
Karthik Kambatla created YARN-2532: -- Summary: Track pending resources at the application level Key: YARN-2532 URL: https://issues.apache.org/jira/browse/YARN-2532 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Affects Versions: 2.5.1 Reporter: Karthik Kambatla Assignee: Karthik Kambatla SchedulerApplicationAttempt keeps track of current consumption of an app. It would be nice to have a similar value tracked for pending requests. The immediate uses I see are: (1) Showing this on the Web UI (YARN-2333) and (2) updating demand in FS in an event-driven style (YARN-2353) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.12.patch Missed the changes in YarnConfiguration in the last patch, added them in the newer one > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, > YARN-2033.12.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, > YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, YARN-2033.8.patch, > YARN-2033.9.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, > YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128929#comment-14128929 ] Xuan Gong commented on YARN-2459: - Also, Thanks Mayank for the initial patch. > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. > If for any reason Any app gets rejected and directly goes to NEW to FAILED > then final transition makes that to RMApps and Completed Apps memory > structure but that doesn't make it to State store. > Now when RMApps default limit reaches it starts deleting apps from memory and > store. In that case it try to delete this app from store and fails which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2459) RM crashes if App gets rejected for any reason and HA is enabled
[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128927#comment-14128927 ] Xuan Gong commented on YARN-2459: - Committed into trunk and branch-2. Thanks, Jian. > RM crashes if App gets rejected for any reason and HA is enabled > > > Key: YARN-2459 > URL: https://issues.apache.org/jira/browse/YARN-2459 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.1 >Reporter: Mayank Bansal >Assignee: Mayank Bansal > Attachments: YARN-2459-1.patch, YARN-2459-2.patch, YARN-2459.3.patch, > YARN-2459.4.patch, YARN-2459.5.patch, YARN-2459.6.patch > > > If RM HA is enabled and used Zookeeper store for RM State Store. > If for any reason Any app gets rejected and directly goes to NEW to FAILED > then final transition makes that to RMApps and Completed Apps memory > structure but that doesn't make it to State store. > Now when RMApps default limit reaches it starts deleting apps from memory and > store. In that case it try to delete this app from store and fails which > causes RM to crash. > Thanks, > Mayank -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2531) CGroups - Admins should be allowed to enforce strict cpu limits
Varun Vasudev created YARN-2531: --- Summary: CGroups - Admins should be allowed to enforce strict cpu limits Key: YARN-2531 URL: https://issues.apache.org/jira/browse/YARN-2531 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Vasudev Assignee: Varun Vasudev From YARN-2440 - {quote} The other dimension to this is determinism w.r.t performance. Limiting to allocated cores overall (as well as per container later) helps orgs run workloads and reason about them deterministically. One of the examples is benchmarking apps, but deterministic execution is a desired option beyond benchmarks too. {quote} It would be nice to have an option to let admins enforce strict cpu limits for apps for things like benchmarking, etc. By default this flag should be off so that containers can use available cpu but admin can turn the flag on to determine worst case performance, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.11.patch Good catch! I uploaded a new patch with the updated config names. > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.11.patch, YARN-2033.2.patch, > YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, > YARN-2033.7.patch, YARN-2033.8.patch, YARN-2033.9.patch, > YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, > YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2526) SLS can deadlock when all the threads are taken by AMSimulators
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128911#comment-14128911 ] Hudson commented on YARN-2526: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1892 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1892/]) YARN-2526. SLS can deadlock when all the threads are taken by AMSimulators. (Wei Yan via kasha) (kasha: rev 28d99db99236ff2a6e4a605802820e2b512225f9) * hadoop-yarn-project/CHANGES.txt * hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java > SLS can deadlock when all the threads are taken by AMSimulators > --- > > Key: YARN-2526 > URL: https://issues.apache.org/jira/browse/YARN-2526 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 2.5.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Fix For: 2.6.0 > > Attachments: YARN-2526-1.patch > > > The simulation may enter deadlock if all application simulators hold all > threads provided by the thread pool, and all wait for AM container > allocation. In that case, all AM simulators wait for NM simulators to do > heartbeat to allocate resource, and all NM simulators wait for AM simulators > to release some threads. The simulator is deadlocked. > To solve this deadlock, need to remove the while() loop in the MRAMSimulator. > {code} > // waiting until the AM container is allocated > while (true) { > if (response != null && ! response.getAllocatedContainers().isEmpty()) { > // get AM container > . > break; > } > // this sleep time is different from HeartBeat > Thread.sleep(1000); > // send out empty request > sendContainerRequest(); > response = responseQueue.take(); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
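A rough sketch of the non-blocking direction described in the issue above; field and method names are guesses for illustration. The AM simulator checks for its container once per scheduled heartbeat and returns, instead of looping on a pooled thread.
{code}
// Instead of while(true) + sleep on a shared pool thread, handle the AM
// container allocation as part of the normal heartbeat and return promptly.
if (!isAMContainerRunning) {
  if (response != null && !response.getAllocatedContainers().isEmpty()) {
    amContainer = response.getAllocatedContainers().get(0);
    isAMContainerRunning = true;     // AM will be launched on a later beat
  } else {
    sendContainerRequest();          // keep the request pending; no blocking wait
  }
  return;                            // frees the thread for NM simulators
}
{code}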
[jira] [Commented] (YARN-1471) The SLS simulator is not running the preemption policy for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128915#comment-14128915 ] Hudson commented on YARN-1471: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1892 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1892/]) Add missing YARN-1471 to the CHANGES.txt (aw: rev 9b8104575444ed2de9b44fe902f86f7395f249ed) * hadoop-yarn-project/CHANGES.txt > The SLS simulator is not running the preemption policy for CapacityScheduler > > > Key: YARN-1471 > URL: https://issues.apache.org/jira/browse/YARN-1471 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Carlo Curino >Assignee: Carlo Curino >Priority: Minor > Fix For: 3.0.0 > > Attachments: SLSCapacityScheduler.java, YARN-1471.2.patch, > YARN-1471.patch, YARN-1471.patch > > > The simulator does not run the ProportionalCapacityPreemptionPolicy monitor. > This is because the policy needs to interact with a CapacityScheduler, and > the wrapping done by the simulator breaks this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128877#comment-14128877 ] Vinod Kumar Vavilapalli commented on YARN-2033: --- Looks mostly good. Rename yarn.resourcemanager.metrics-publisher.enabled -> also to say system-metrics-publisher too? Similarly rename yarn.resourcemanager.metrics-publisher.dispatcher.pool-size? > Investigate merging generic-history into the Timeline Store > --- > > Key: YARN-2033 > URL: https://issues.apache.org/jira/browse/YARN-2033 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen > Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, > YARN-2033.1.patch, YARN-2033.10.patch, YARN-2033.2.patch, YARN-2033.3.patch, > YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.7.patch, > YARN-2033.8.patch, YARN-2033.9.patch, YARN-2033.Prototype.patch, > YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, > YARN-2033_ALL.4.patch > > > Having two different stores isn't amicable to generic insights on what's > happening with applications. This is to investigate porting generic-history > into the Timeline Store. > One goal is to try and retain most of the client side interfaces as close to > what we have today. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128865#comment-14128865 ] Hadoop QA commented on YARN-1492: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667798/shared_cache_design_v6.pdf against trunk revision b67d5ba. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4872//console This message is automatically generated. > truly shared cache for jars (jobjar/libjar) > --- > > Key: YARN-1492 > URL: https://issues.apache.org/jira/browse/YARN-1492 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.0.4-alpha >Reporter: Sangjin Lee >Assignee: Chris Trezzo > Attachments: YARN-1492-all-trunk-v1.patch, > YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, > YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, > shared_cache_design.pdf, shared_cache_design_v2.pdf, > shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, > shared_cache_design_v5.pdf, shared_cache_design_v6.pdf > > > Currently there is the distributed cache that enables you to cache jars and > files so that attempts from the same job can reuse them. However, sharing is > limited with the distributed cache because it is normally on a per-job basis. > On a large cluster, sometimes copying of jobjars and libjars becomes so > prevalent that it consumes a large portion of the network bandwidth, not to > speak of defeating the purpose of "bringing compute to where data is". This > is wasteful because in most cases code doesn't change much across many jobs. > I'd like to propose and discuss feasibility of introducing a truly shared > cache so that multiple jobs from multiple users can share and cache jars. > This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2530) MapReduce should take cpu into account when doing headroom calculations
Varun Vasudev created YARN-2530: --- Summary: MapReduce should take cpu into account when doing headroom calculations Key: YARN-2530 URL: https://issues.apache.org/jira/browse/YARN-2530 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Vasudev Assignee: Varun Vasudev Currently the MapReduce AM only uses memory when doing headroom calculation as well calculations about launching reducers. It would be preferable to account for CPU as well if the scheduler on the YARN side is using CPU when scheduling. YARN-2448 lets AMs know what resources are being considered when scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128843#comment-14128843 ] Vinod Kumar Vavilapalli edited comment on YARN-2440 at 9/10/14 6:16 PM: bq. As I mentioned before, I think most users would rather not use the functionality proposed by this JIRA but instead setup peer cgroups for other systems and set their relative cgroup shares appropriately. With this JIRA the CPUs could sit idle despite demand from YARN containers, while a peer cgroup setup allows CPU guarantees without idle CPUs if the demand is there. [~jlowe], agree with the general philosophy. Though we are not yet there in practice - datanodes, region servers don't yet live in cgroups in many sites. Looking back at this JIRA, I see a good use for this. Having the overall YARN limit will help ensure that apps' containers don't thrash cpu once we start enabling cgroups support. The other dimension to this is determinism w.r.t performance. Limiting to allocated cores overall (as well as per container later) helps orgs run workloads and reason about them deterministically. One of the examples is benchmarking apps, but deterministic execution is a desired option beyond benchmarks too. was (Author: vinodkv): bq. As I mentioned before, I think most users would rather not use the functionality proposed by this JIRA but instead setup peer cgroups for other systems and set their relative cgroup shares appropriately. With this JIRA the CPUs could sit idle despite demand from YARN containers, while a peer cgroup setup allows CPU guarantees without idle CPUs if the demand is there. [~jlowe], agree with the general philosophy. Though we are not yet there in practice - datanodes, region servers don't yet live in cgroups in many sites. Looking back at this JIRA, I see a good use for this. Having the overall YARN limit will help ensure that apps' containers don't thrash cpu once we start enabling support. The other dimension to this is determinism w.r.t performance. Limiting to allocated cores overall (as well as per container later) helps orgs run workloads and reason about them deterministically. One of the examples is benchmarking apps, but deterministic execution is a desired option beyond benchmarks too. > Cgroups should allow YARN containers to be limited to allocated cores > - > > Key: YARN-2440 > URL: https://issues.apache.org/jira/browse/YARN-2440 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, > apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, > apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, > screenshot-current-implementation.jpg > > > The current cgroups implementation does not limit YARN containers to the > cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1492) truly shared cache for jars (jobjar/libjar)
[ https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Trezzo updated YARN-1492: --- Attachment: shared_cache_design_v6.pdf Attached v6 design doc to reflect the current implementation. > truly shared cache for jars (jobjar/libjar) > --- > > Key: YARN-1492 > URL: https://issues.apache.org/jira/browse/YARN-1492 > Project: Hadoop YARN > Issue Type: New Feature >Affects Versions: 2.0.4-alpha >Reporter: Sangjin Lee >Assignee: Chris Trezzo > Attachments: YARN-1492-all-trunk-v1.patch, > YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, > YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, > shared_cache_design.pdf, shared_cache_design_v2.pdf, > shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, > shared_cache_design_v5.pdf, shared_cache_design_v6.pdf > > > Currently there is the distributed cache that enables you to cache jars and > files so that attempts from the same job can reuse them. However, sharing is > limited with the distributed cache because it is normally on a per-job basis. > On a large cluster, sometimes copying of jobjars and libjars becomes so > prevalent that it consumes a large portion of the network bandwidth, not to > speak of defeating the purpose of "bringing compute to where data is". This > is wasteful because in most cases code doesn't change much across many jobs. > I'd like to propose and discuss feasibility of introducing a truly shared > cache so that multiple jobs from multiple users can share and cache jars. > This JIRA is to open the discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128843#comment-14128843 ] Vinod Kumar Vavilapalli commented on YARN-2440: --- bq. As I mentioned before, I think most users would rather not use the functionality proposed by this JIRA but instead setup peer cgroups for other systems and set their relative cgroup shares appropriately. With this JIRA the CPUs could sit idle despite demand from YARN containers, while a peer cgroup setup allows CPU guarantees without idle CPUs if the demand is there. [~jlowe], agree with the general philosophy. Though we are not yet there in practice - datanodes, region servers don't yet live in cgroups in many sites. Looking back at this JIRA, I see a good use for this. Having the overall YARN limit will help ensure that apps' containers don't thrash cpu once we start enabling support. The other dimension to this is determinism w.r.t performance. Limiting to allocated cores overall (as well as per container later) helps orgs run workloads and reason about them deterministically. One of the examples is benchmarking apps, but deterministic execution is a desired option beyond benchmarks too. > Cgroups should allow YARN containers to be limited to allocated cores > - > > Key: YARN-2440 > URL: https://issues.apache.org/jira/browse/YARN-2440 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, > apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, > apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, > screenshot-current-implementation.jpg > > > The current cgroups implementation does not limit YARN containers to the > cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128842#comment-14128842 ] Hadoop QA commented on YARN-2440: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667797/apache-yarn-2440.6.patch against trunk revision b67d5ba. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4871//console This message is automatically generated. > Cgroups should allow YARN containers to be limited to allocated cores > - > > Key: YARN-2440 > URL: https://issues.apache.org/jira/browse/YARN-2440 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, > apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, > apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, > screenshot-current-implementation.jpg > > > The current cgroups implementation does not limit YARN containers to the > cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2440) Cgroups should allow YARN containers to be limited to allocated cores
[ https://issues.apache.org/jira/browse/YARN-2440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev updated YARN-2440: Attachment: apache-yarn-2440.6.patch Uploaded new patch to address Vinod's comments. {quote} {noformat} +Percentage of CPU that can be allocated +for containers. This setting allows users to limit the number of +physical cores that YARN containers use. Currently functional only +on Linux using cgroups. The default is to use 100% of CPU. + +yarn.nodemanager.resource.percentage-physical-cpu-limit +100 + {noformat} "the number of physical cores" part isn't really right. It actually is 75% across all cores, for e.g. We have this sort of "number of physical cores" description in multiple places, let's fix that? For instance, in NodeManagerHardwareUtils, yarn-default.xml etc. {quote} Fixed. {quote} Also, NM_CONTAINERS_CPU_PERC -> NM_RESOURCE_PHYSICAL_CPU_LIMIT Similarly rename DEFAULT_NM_CONTAINERS_CPU_PERC {quote} Done, I'd prefer to have percentage as part of the name. I've changed it to NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT and DEFAULT_NM_RESOURCE_PERCENTAGE_PHYSICAL_CPU_LIMIT. > Cgroups should allow YARN containers to be limited to allocated cores > - > > Key: YARN-2440 > URL: https://issues.apache.org/jira/browse/YARN-2440 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2440.0.patch, apache-yarn-2440.1.patch, > apache-yarn-2440.2.patch, apache-yarn-2440.3.patch, apache-yarn-2440.4.patch, > apache-yarn-2440.5.patch, apache-yarn-2440.6.patch, > screenshot-current-implementation.jpg > > > The current cgroups implementation does not limit YARN containers to the > cores allocated in yarn-site.xml. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
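As an aside, a node-wide percentage limit maps onto the cgroup CFS bandwidth controls roughly as follows. This is illustrative arithmetic only, not the patch code; the config key is the one named in the comment above.
{code}
// With 8 cores and a 75% limit, containers collectively get ~6 cores' worth
// of CPU time: quota / period = cores * limit / 100.
int cores = Runtime.getRuntime().availableProcessors();
int limitPercent = conf.getInt(
    "yarn.nodemanager.resource.percentage-physical-cpu-limit", 100);
int cfsPeriodUs = 1000 * 1000;                                      // cpu.cfs_period_us
long cfsQuotaUs = (long) cfsPeriodUs * cores * limitPercent / 100;  // cpu.cfs_quota_us
{code}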
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128819#comment-14128819 ] Vinod Kumar Vavilapalli commented on YARN-2517: --- I am +1 about a client that makes async calls. The question is whether we need a new client class (and thus a public interface) or not. Clearly, async calls need call-back handlers _just_ for errors. As of today, there are no APIs that really need to send back *results* (not error) asynchronously. The way you usually handle it is through one of the following {code} // Sync call Result call(Input); // Async call - Type (1) void asyncCall(Input, CallBackHandler); // Async call - Type (2) Future asyncCall(Input); {code} You can do type (1). Having an entire separate client side interface isn't mandatory. If you guys think there is a lot more functionality coming in an async class in the future, can we hear about some of them here? > Implement TimelineClientAsync > - > > Key: YARN-2517 > URL: https://issues.apache.org/jira/browse/YARN-2517 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2517.1.patch > > > In some scenarios, we'd like to put timeline entities in another thread not to > block the current one. > It's good to have a TimelineClientAsync like AMRMClientAsync and > NMClientAsync. It can buffer entities, put them in a separate thread, and > have callback to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
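A minimal sketch of the "Type (1)" style applied to the timeline case, with invented names ({{TimelinePutErrorHandler}}, {{putEntitiesAsync}}, {{executor}}) used purely for illustration. Only the error path needs a callback, since the put itself has no interesting result to return.
{code}
// Callback-on-error async put, wrapping the existing blocking client call.
public interface TimelinePutErrorHandler {
  void onError(TimelineEntity entity, Throwable cause);
}

public void putEntitiesAsync(final TimelineEntity entity,
    final TimelinePutErrorHandler handler) {
  executor.submit(new Runnable() {
    @Override
    public void run() {
      try {
        timelineClient.putEntities(entity);  // existing blocking TimelineClient API
      } catch (Exception e) {
        handler.onError(entity, e);
      }
    }
  });
}
{code}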
[jira] [Updated] (YARN-2529) Generic history service RPC interface doesn't work when service authorization is enabled
[ https://issues.apache.org/jira/browse/YARN-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2529: -- Summary: Generic history service RPC interface doesn't work when service authorization is enabled (was: Generic history service RPC interface doesn't work wen service authorization is enabled) > Generic history service RPC interface doesn't work when service authorization > is enabled > > > Key: YARN-2529 > URL: https://issues.apache.org/jira/browse/YARN-2529 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > Here's the problem shown in the log: > {code} > 14/09/10 10:42:44 INFO ipc.Server: Connection from 10.22.2.109:55439 for > protocol org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is > unauthorized for user zshen (auth:SIMPLE) > 14/09/10 10:42:44 INFO ipc.Server: Socket Reader #1 for port 10200: > readAndProcess from client 10.22.2.109 threw exception > [org.apache.hadoop.security.authorize.AuthorizationException: Protocol > interface org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is not > known.] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2529) Generic history service RPC interface doesn't work wen service authorization is enabled
Zhijie Shen created YARN-2529: - Summary: Generic history service RPC interface doesn't work wen service authorization is enabled Key: YARN-2529 URL: https://issues.apache.org/jira/browse/YARN-2529 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Here's the problem shown in the log: {code} 14/09/10 10:42:44 INFO ipc.Server: Connection from 10.22.2.109:55439 for protocol org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is unauthorized for user zshen (auth:SIMPLE) 14/09/10 10:42:44 INFO ipc.Server: Socket Reader #1 for port 10200: readAndProcess from client 10.22.2.109 threw exception [org.apache.hadoop.security.authorize.AuthorizationException: Protocol interface org.apache.hadoop.yarn.api.ApplicationHistoryProtocolPB is not known.] {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
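For context, when hadoop.security.authorization is enabled every RPC protocol has to be registered with a service-level ACL; the log above shows the IPC layer rejecting the history protocol as unknown. A hedged sketch of what such a registration looks like is below; the ACL key name is a guess for illustration only.
{code}
// Hypothetical PolicyProvider entry mapping the history protocol to an ACL
// key, so the IPC layer stops rejecting it as an unknown protocol.
new org.apache.hadoop.security.authorize.Service(
    "security.applicationhistory.protocol.acl",   // key name assumed
    ApplicationHistoryProtocolPB.class)
{code}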
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128687#comment-14128687 ] Hadoop QA commented on YARN-2527: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12667768/YARN-2527.patch against trunk revision 3072c83. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4870//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4870//console This message is automatically generated. > NPE in ApplicationACLsManager > - > > Key: YARN-2527 > URL: https://issues.apache.org/jira/browse/YARN-2527 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Benoy Antony >Assignee: Benoy Antony > Attachments: YARN-2527.patch, YARN-2527.patch > > > NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. > The relevant stacktrace snippet from the ResourceManager logs is as below > {code} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {code} > This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benoy Antony updated YARN-2527: --- Attachment: YARN-2527.patch Attaching a new patch. Added one more test case to test the case of partial set of ACLS. > NPE in ApplicationACLsManager > - > > Key: YARN-2527 > URL: https://issues.apache.org/jira/browse/YARN-2527 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Benoy Antony >Assignee: Benoy Antony > Attachments: YARN-2527.patch, YARN-2527.patch > > > NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. > The relevant stacktrace snippet from the ResourceManager logs is as below > {code} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {code} > This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2527) NPE in ApplicationACLsManager
[ https://issues.apache.org/jira/browse/YARN-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128622#comment-14128622 ] Benoy Antony commented on YARN-2527: Thank you [~zjshen]. I am investigating on how that happened and will probably open another jira with the root cause. But I believe, the NullPointer issue in _ApplicationACLsManager_ should be fixed regardless of that. Based on the current logic, Admin and application owner should be able to perform actions on the Application regardless of ACLS. The NullPointer issue prevents it. > NPE in ApplicationACLsManager > - > > Key: YARN-2527 > URL: https://issues.apache.org/jira/browse/YARN-2527 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.5.0 >Reporter: Benoy Antony >Assignee: Benoy Antony > Attachments: YARN-2527.patch > > > NPE in _ApplicationACLsManager_ can result in 500 Internal Server Error. > The relevant stacktrace snippet from the ResourceManager logs is as below > {code} > Caused by: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.security.ApplicationACLsManager.checkAccess(ApplicationACLsManager.java:104) > at > org.apache.hadoop.yarn.server.resourcemanager.webapp.AppBlock.render(AppBlock.java:101) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:66) > at > org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:76) > at org.apache.hadoop.yarn.webapp.View.render(View.java:235) > {code} > This issue was reported by [~miguenther]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
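To illustrate the behavior Benoy describes (owner and admin access even when no ACL entry is registered), here is a sketch of the kind of null guard involved. Treat the field and helper names as approximations of ApplicationACLsManager rather than the actual patch.
{code}
// Guard against a missing ACL entry instead of dereferencing null.
Map<ApplicationAccessType, AccessControlList> appAcls =
    this.applicationACLS.get(applicationId);
AccessControlList acl =
    (appAcls == null) ? null : appAcls.get(applicationAccessType);
if (acl == null) {
  // No ACL registered for this app: still let the owner and admins through.
  return callerUGI.getShortUserName().equals(owner)
      || adminAclsManager.isAdmin(callerUGI);
}
return callerUGI.getShortUserName().equals(owner)
    || adminAclsManager.isAdmin(callerUGI)
    || acl.isUserAllowed(callerUGI);
{code}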
[jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128619#comment-14128619 ] Karthik Kambatla commented on YARN-1458: Thanks Zhihai and [~qingwu.fu] for working on this, and Sandy for the reviews. Just committed this to trunk and branch-2. > FairScheduler: Zero weight can lead to livelock > --- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, > YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, > yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
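The livelock behind this fix boils down to a search that can never make progress when every weight is zero. Below is a toy illustration of the mechanism (not YARN code): scaling the weight-to-resource ratio never increases the computed usage, so the doubling loop spins forever while the scheduler lock is held.
{code}
// Toy model of the weight-to-resource-ratio search with all weights at zero.
double totalResource = 1024.0;
double weightSum = 0.0;   // every schedulable has weight 0 and min share 0
double ratio = 1.0;
while (weightSum * ratio < totalResource) {
  ratio *= 2.0;           // 0 * anything is still 0: never terminates
}
{code}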
[jira] [Updated] (YARN-1458) FairScheduler: Zero weight can lead to livelock
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1458: --- Summary: FairScheduler: Zero weight can lead to livelock (was: In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely) > FairScheduler: Zero weight can lead to livelock > --- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, > YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, > yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128590#comment-14128590 ] Karthik Kambatla commented on YARN-1458: +1. Committing version 8. > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, > YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, > yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when > clients submit lots jobs, it is not easy to reapear. We run the test cluster > for days to reapear it. The output of jstack command on resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
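The jstack output above boils down to a coarse-lock contention pattern: the update thread holds the scheduler monitor while recomputing shares (via getAppWeight), so the event-processor thread blocks on removeApplication. A minimal sketch of that pattern, using placeholder class and method names rather than the actual FairScheduler code, looks like this:

{code}
// Illustrative sketch of the locking pattern visible in the stack traces above;
// this is NOT the FairScheduler implementation.
public class SchedulerLockSketch {
  // Single monitor shared by both threads, analogous to the FairScheduler
  // instance (<0x00070026b6e0>) locked in the jstack output.
  private final Object schedulerMonitor = new Object();

  // FairSchedulerUpdateThread analogue: holds the monitor for the whole
  // share recomputation.
  void update() {
    synchronized (schedulerMonitor) {
      recomputeShares();
    }
  }

  // ResourceManager Event Processor analogue: blocks on the same monitor,
  // so APP_REMOVED events cannot be handled while update() is running.
  void removeApplication() {
    synchronized (schedulerMonitor) {
      // clean up per-application state
    }
  }

  private void recomputeShares() {
    // placeholder for the iterative fair-share computation
  }
}
{code}

If recomputeShares() never converges, as reported here for size-based weight, update() never releases the monitor and removeApplication() stays blocked indefinitely.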
[jira] [Updated] (YARN-1458) In Fair Scheduler, size based weight can cause update thread to hold lock indefinitely
[ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1458: --- Target Version/s: 2.6.0 (was: 2.2.0) Fix Version/s: (was: 2.2.1) > In Fair Scheduler, size based weight can cause update thread to hold lock > indefinitely > -- > > Key: YARN-1458 > URL: https://issues.apache.org/jira/browse/YARN-1458 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: Centos 2.6.18-238.19.1.el5 X86_64 > hadoop2.2.0 >Reporter: qingwu.fu >Assignee: zhihai xu > Labels: patch > Attachments: YARN-1458.001.patch, YARN-1458.002.patch, > YARN-1458.003.patch, YARN-1458.004.patch, YARN-1458.006.patch, > YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, > YARN-1458.alternative2.patch, YARN-1458.patch, yarn-1458-5.patch, > yarn-1458-7.patch, yarn-1458-8.patch > > Original Estimate: 408h > Remaining Estimate: 408h > > The ResourceManager$SchedulerEventDispatcher$EventProcessor is blocked when > clients submit lots of jobs; it is not easy to reproduce. We ran the test cluster > for days to reproduce it. The output of the jstack command on the resourcemanager pid: > {code} > "ResourceManager Event Processor" prio=10 tid=0x2aaab0c5f000 nid=0x5dd3 > waiting for monitor entry [0x43aa9000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671) > - waiting to lock <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440) > at java.lang.Thread.run(Thread.java:744) > …… > "FairSchedulerUpdateThread" daemon prio=10 tid=0x2aaab0a2c800 nid=0x5dc8 > runnable [0x433a2000] >java.lang.Thread.State: RUNNABLE > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282) > - locked <0x00070026b6e0> (a > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255) > at java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128567#comment-14128567 ] Tsuyoshi OZAWA commented on YARN-2517: -- Thanks for your review, Zhijie. I think batch optimization and persisting entities can be done in the sync client, since the async client uses the sync client. Submitting a patch again for merging. Please let me know if you have additional review comments. > Implement TimelineClientAsync > - > > Key: YARN-2517 > URL: https://issues.apache.org/jira/browse/YARN-2517 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2517.1.patch > > > In some scenarios, we'd like to put timeline entities in another thread so as not to > block the current one. > It's good to have a TimelineClientAsync like AMRMClientAsync and > NMClientAsync. It can buffer entities, put them in a separate thread, and > have a callback to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
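As a rough illustration of the buffering idea in the issue description above (not the attached YARN-2517.1.patch), an async wrapper can queue entities and drain them on a dedicated thread through the existing synchronous TimelineClient, reporting results via a caller-supplied callback. The PutCallback interface, class name, and minimal error handling below are assumptions made for this sketch; only TimelineClient, TimelineEntity, and TimelinePutResponse are existing YARN types:

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelinePutResponse;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelineClientAsyncSketch {

  /** Hypothetical callback; the real patch may define a different interface. */
  public interface PutCallback {
    void onPut(TimelinePutResponse response);
    void onError(Throwable error);
  }

  private final TimelineClient syncClient;
  private final BlockingQueue<TimelineEntity> queue = new LinkedBlockingQueue<>();
  private final PutCallback callback;
  private volatile boolean stopped = false;

  public TimelineClientAsyncSketch(PutCallback callback) {
    this.callback = callback;
    this.syncClient = TimelineClient.createTimelineClient();
    this.syncClient.init(new YarnConfiguration());
    this.syncClient.start();
    startDispatcher();
  }

  /** Non-blocking put: callers only pay the cost of an enqueue. */
  public void putEntityAsync(TimelineEntity entity) {
    queue.add(entity);
  }

  private void startDispatcher() {
    Thread dispatcher = new Thread(() -> {
      while (!stopped) {
        try {
          TimelineEntity entity = queue.take();            // buffered entity
          callback.onPut(syncClient.putEntities(entity));  // delegate to sync client
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          return;
        } catch (Exception e) {
          callback.onError(e);                             // surface put failures
        }
      }
    }, "TimelineClientAsync-dispatcher");
    dispatcher.setDaemon(true);
    dispatcher.start();
  }

  public void stop() {
    stopped = true;
    syncClient.stop();
  }
}
{code}

A caller would construct the wrapper once, call putEntityAsync() from the application thread, and handle success or failure in the callback, so the caller never blocks on the HTTP put itself.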
[jira] [Commented] (YARN-1471) The SLS simulator is not running the preemption policy for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128494#comment-14128494 ] Hudson commented on YARN-1471: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1867 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1867/]) Add missing YARN-1471 to the CHANGES.txt (aw: rev 9b8104575444ed2de9b44fe902f86f7395f249ed) * hadoop-yarn-project/CHANGES.txt > The SLS simulator is not running the preemption policy for CapacityScheduler > > > Key: YARN-1471 > URL: https://issues.apache.org/jira/browse/YARN-1471 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Carlo Curino >Assignee: Carlo Curino >Priority: Minor > Fix For: 3.0.0 > > Attachments: SLSCapacityScheduler.java, YARN-1471.2.patch, > YARN-1471.patch, YARN-1471.patch > > > The simulator does not run the ProportionalCapacityPreemptionPolicy monitor. > This is because the policy needs to interact with a CapacityScheduler, and > the wrapping done by the simulator breaks this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2526) SLS can deadlock when all the threads are taken by AMSimulators
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128491#comment-14128491 ] Hudson commented on YARN-2526: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1867 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1867/]) YARN-2526. SLS can deadlock when all the threads are taken by AMSimulators. (Wei Yan via kasha) (kasha: rev 28d99db99236ff2a6e4a605802820e2b512225f9) * hadoop-yarn-project/CHANGES.txt * hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java > SLS can deadlock when all the threads are taken by AMSimulators > --- > > Key: YARN-2526 > URL: https://issues.apache.org/jira/browse/YARN-2526 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 2.5.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Fix For: 2.6.0 > > Attachments: YARN-2526-1.patch > > > The simulation may enter a deadlock if all application simulators hold all > threads provided by the thread pool, and all wait for AM container > allocation. In that case, all AM simulators wait for NM simulators to do > heartbeats to allocate resources, and all NM simulators wait for AM simulators > to release some threads. The simulator is deadlocked. > To solve this deadlock, we need to remove the while() loop in the MRAMSimulator. > {code} > // waiting until the AM container is allocated > while (true) { > if (response != null && ! response.getAllocatedContainers().isEmpty()) { > // get AM container > . > break; > } > // this sleep time is different from HeartBeat > Thread.sleep(1000); > // send out empty request > sendContainerRequest(); > response = responseQueue.take(); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
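One way to picture the direction of the fix described above (the actual change is in YARN-2526-1.patch, which is not reproduced here) is to replace the blocking while() loop with a per-heartbeat check, so the pool thread is returned between heartbeats instead of being held until the AM container arrives. Field and method names below are illustrative, not MRAMSimulator's:

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;

public class NonBlockingAmSimulatorSketch {
  private final BlockingQueue<AllocateResponse> responseQueue =
      new LinkedBlockingQueue<>();
  private boolean amContainerAllocated = false;

  /** Invoked once per simulated heartbeat by the shared thread pool. */
  public void middleStep() {
    if (!amContainerAllocated) {
      // Poll instead of blocking on take(): if the AM container has not been
      // allocated yet, re-send the request and return the thread to the pool
      // so NM simulators can still run their heartbeats.
      AllocateResponse response = responseQueue.poll();
      if (response != null && !response.getAllocatedContainers().isEmpty()) {
        amContainerAllocated = true;   // AM container granted
      } else {
        sendContainerRequest();        // renewed request, no Thread.sleep() loop
      }
      return;
    }
    // ... normal per-heartbeat work once the AM container is running ...
  }

  private void sendContainerRequest() {
    // placeholder: issue the allocate request to the RM simulator
  }
}
{code}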
[jira] [Commented] (YARN-1471) The SLS simulator is not running the preemption policy for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128357#comment-14128357 ] Hudson commented on YARN-1471: -- FAILURE: Integrated in Hadoop-Yarn-trunk #676 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/676/]) Add missing YARN-1471 to the CHANGES.txt (aw: rev 9b8104575444ed2de9b44fe902f86f7395f249ed) * hadoop-yarn-project/CHANGES.txt > The SLS simulator is not running the preemption policy for CapacityScheduler > > > Key: YARN-1471 > URL: https://issues.apache.org/jira/browse/YARN-1471 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Carlo Curino >Assignee: Carlo Curino >Priority: Minor > Fix For: 3.0.0 > > Attachments: SLSCapacityScheduler.java, YARN-1471.2.patch, > YARN-1471.patch, YARN-1471.patch > > > The simulator does not run the ProportionalCapacityPreemptionPolicy monitor. > This is because the policy needs to interact with a CapacityScheduler, and > the wrapping done by the simulator breaks this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2526) SLS can deadlock when all the threads are taken by AMSimulators
[ https://issues.apache.org/jira/browse/YARN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128354#comment-14128354 ] Hudson commented on YARN-2526: -- FAILURE: Integrated in Hadoop-Yarn-trunk #676 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/676/]) YARN-2526. SLS can deadlock when all the threads are taken by AMSimulators. (Wei Yan via kasha) (kasha: rev 28d99db99236ff2a6e4a605802820e2b512225f9) * hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/appmaster/MRAMSimulator.java * hadoop-yarn-project/CHANGES.txt > SLS can deadlock when all the threads are taken by AMSimulators > --- > > Key: YARN-2526 > URL: https://issues.apache.org/jira/browse/YARN-2526 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 2.5.1 >Reporter: Wei Yan >Assignee: Wei Yan >Priority: Critical > Fix For: 2.6.0 > > Attachments: YARN-2526-1.patch > > > The simulation may enter a deadlock if all application simulators hold all > threads provided by the thread pool, and all wait for AM container > allocation. In that case, all AM simulators wait for NM simulators to do > heartbeats to allocate resources, and all NM simulators wait for AM simulators > to release some threads. The simulator is deadlocked. > To solve this deadlock, we need to remove the while() loop in the MRAMSimulator. > {code} > // waiting until the AM container is allocated > while (true) { > if (response != null && ! response.getAllocatedContainers().isEmpty()) { > // get AM container > . > break; > } > // this sleep time is different from HeartBeat > Thread.sleep(1000); > // send out empty request > sendContainerRequest(); > response = responseQueue.take(); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-1530) [Umbrella] Store, manage and serve per-framework application-timeline data
[ https://issues.apache.org/jira/browse/YARN-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128140#comment-14128140 ] Zhijie Shen edited comment on YARN-1530 at 9/10/14 7:36 AM: [~bcwalrus], thanks for your interest in the timeline server and for sharing your idea. Here’re some of my opinions and our previous rationales. bq. Let's have reliability before speed. I think one of the requirement of ATS is: The channel for writing events should be reliable. I agree reliability is an important requirement of the timeline server, but the other requirements such as scalability and efficiency should be orthogonal to it, such that there’s no strict order of which should come first. We can pursue both enhancements, can’t we? bq. I'm using reliable here in a strong sense, not the TCP-best-effort style reliability. HDFS is reliable. Kafka is reliable. (They are also scalable and robust.) IMHO, it may be unfair to compare the reliability of TCP with that of HDFS and Kafka, because they’re on different layers of the communication stack. HDFS and Kafka are also built on top of TCP for communication, right? In my previous [comments|https://issues.apache.org/jira/browse/YARN-1530?focusedCommentId=14125238&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14125238], I’ve mentioned that we need to clearly define *reliability*, and I’d like to highlight it here again: 1. Server is reliable: when timeline entities are passed to the timeline server, it should prevent them from being lost. After YARN-2032, we’re going to have the HBase timeline store to ensure it. 2. Client is reliable: once the timeline entities are handed over to the timeline client, and before the timeline client successfully puts them to the timeline server, it should prevent them from being lost on the client side. We may use some techniques to cache the entities locally. I opened YARN-2521 to track the discussion along this direction. Between client and server, TCP is a trustworthy protocol. If the client gets an ACK from the server, we should be confident that the server has already received the entities. bq. A normal RPC connection is not. I don't want the ATS to be able to slow down my writes, and therefore, my applications, at all. I’m not sure there's a direct relationship between reliability and nonblocking writing. For example, submitting an app via YarnClient to an HA RM is reliable, but the user is still likely to be blocked until the app submission is responded to. Whether writing events is blocking or non-blocking depends on how the user uses the client. In YARN-2033, I made the RM put the entities on a separate thread to prevent blocking the dispatcher that manages the YARN app lifecycle. And I can see that nonblocking writing is a useful optimization, so I’ve opened YARN-2517 to implement TimelineClientAsync for general usage. bq. Yes, you could make a distributed reliable scalable "ATS service" to accept writing events. But that seems a lot of work, while we can leverage existing technologies. AFAIK, the timeline server is a stateless machine; it should not be difficult to use ZooKeeper to manage a number of instances all writing to the same HBase cluster. We may need to pay attention to load balancing and concurrent writing. I’m not sure it will really be a lot of work. Please let me know if I’ve neglected some important pieces. And within the scope of YARN, we have already accumulated similar experience making the RM HA, and it has turned out to be a practical solution. 
Again, this is about scalability, which is orthogonal to reliability. Even if we pass the timeline entities via Kafka/HDFS to the timeline server, the single server is still going to be the bottleneck when processing a large volume of requests, no matter how big the Kafka/HDFS cluster is. bq. If the channel itself is pluggable, then we have lots of options. Kafka is a very good choice, for sites that already deploy Kafka and know how to operate it. Using HDFS as a channel is also a good default implementation, for people already know how to scale and manage HDFS. I don't object to having different entity publishing channels, but my concern is that the effort to maintain the timeline client is going to be multiplied by the number of channels. As the timeline server is going to be a long-term project, we cannot neglect the additional workload of evolving all the channels. This is the same concern behind removing the FS-based history store (see YARN-2320). Maybe cooperatively improving the current channel is a more cost-efficient choice. It's good to think more before opening a new channel. In addition, the default solution should be simple and self-contained. A heavy solution with complex configuration and a large dependency is likely to lengthen the learning curve, keep new adopters away, and complicate fast, small-scale deployment. was (
[jira] [Commented] (YARN-2517) Implement TimelineClientAsync
[ https://issues.apache.org/jira/browse/YARN-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128156#comment-14128156 ] Zhijie Shen commented on YARN-2517: --- Scanning through the patch, the approach is quite close to AMRMClientAsync/NMClientAsync. It looks fine to me in general. Later on, we can improve the client step by step. For example, according to the discussion on the umbrella, we can persist the queued entities for reliability. And we may want to allow multiple threads to put entities. > Implement TimelineClientAsync > - > > Key: YARN-2517 > URL: https://issues.apache.org/jira/browse/YARN-2517 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhijie Shen >Assignee: Tsuyoshi OZAWA > Attachments: YARN-2517.1.patch > > > In some scenarios, we'd like to put timeline entities in another thread so as not to > block the current one. > It's good to have a TimelineClientAsync like AMRMClientAsync and > NMClientAsync. It can buffer entities, put them in a separate thread, and > have a callback to handle the responses. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
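On the "persist the queued entities" suggestion above, one rough, purely illustrative approach (nothing like this exists in YARN-2517.1.patch) is a small write-ahead journal on the client side: append each serialized entity before its put is attempted, replay unacknowledged entries after a restart, and truncate once puts succeed. The file name, one-line-per-entity format, and truncate-on-success policy are assumptions for this sketch:

{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;
import java.util.List;

/** Local journal for pending timeline entities, stored in an application-chosen
 *  serialized form (e.g. each entity as one line of JSON). */
public class PendingEntityJournal {
  private final Path journal;

  public PendingEntityJournal(Path dir) throws IOException {
    Files.createDirectories(dir);
    this.journal = dir.resolve("pending-entities.log");
  }

  /** Record the entity before handing it to the async dispatcher thread. */
  public synchronized void append(String serializedEntity) throws IOException {
    Files.write(journal, (serializedEntity + "\n").getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
  }

  /** Replay entities that were journaled but never acknowledged (e.g. after a crash). */
  public synchronized List<String> replay() throws IOException {
    return Files.exists(journal)
        ? Files.readAllLines(journal, StandardCharsets.UTF_8)
        : Collections.<String>emptyList();
  }

  /** Called once the synchronous puts succeed; this sketch simply drops the file. */
  public synchronized void clear() throws IOException {
    Files.deleteIfExists(journal);
  }
}
{code}

How entities are serialized, and when exactly it is safe to truncate, are left open here; the point is only that client-side persistence can be layered under the async queue without changing the put path.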