[jira] [Commented] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077443#comment-14077443 ] Hadoop QA commented on YARN-611: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658363/YARN-611.4.rebase.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4466//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4466//console This message is automatically generated. > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, > YARN-611.4.patch, YARN-611.4.rebase.patch > > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. 
If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
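The sliding-window logic proposed above can be summarized in a short sketch. This is an illustration only, assuming hypothetical names (attemptEndTimes, failureWindowMs, maxAttempts) rather than the actual RMAppImpl fields:
{code}
import java.util.List;

public class AmFailureWindow {
  /**
   * Illustration only: an attempt failure counts against the limit only if it
   * ended within the configured window. A non-positive window means "count
   * every failure", i.e. today's behavior, so the new config stays
   * backward-compatible.
   */
  public static boolean shouldFailApp(List<Long> attemptEndTimes,
      long failureWindowMs, int maxAttempts, long now) {
    int failuresInWindow = 0;
    for (long endTime : attemptEndTimes) {
      if (failureWindowMs <= 0 || now - endTime < failureWindowMs) {
        failuresInWindow++;
      }
    }
    return failuresInWindow >= maxAttempts;
  }
}
{code}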
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077440#comment-14077440 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658367/YARN-796.patch.1 against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4468//console This message is automatically generated. > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077435#comment-14077435 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658367/YARN-796.patch.1 against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4467//console This message is automatically generated. > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Labels: (was: patch) > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuliya Feldman updated YARN-796: Attachment: YARN-796.patch.1 First patch based on "LabelBasedScheduling" design document > Allow for (admin) labels on nodes and resource-requests > --- > > Key: YARN-796 > URL: https://issues.apache.org/jira/browse/YARN-796 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.4.1 >Reporter: Arun C Murthy >Assignee: Wangda Tan > Attachments: LabelBasedScheduling.pdf, > Node-labels-Requirements-Design-doc-V1.pdf, YARN-796.patch, YARN-796.patch.1 > > > It will be useful for admins to specify labels for nodes. Examples of labels > are OS, processor architecture etc. > We should expose these labels and allow applications to specify labels on > resource-requests. > Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077416#comment-14077416 ] Xuan Gong commented on YARN-1994: - I think it is because connectAddress is needed for generating the nodeId. With this patch, we will bind the NM Server with the NM_BIND address. We need the real nm_address to generate the nodeId. [~cwelch] Could you confirm whether it is the reason ? > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, > YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-611) Add an AM retry count reset window to YARN RM
[ https://issues.apache.org/jira/browse/YARN-611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-611: --- Attachment: YARN-611.4.rebase.patch rebased on the latest trunk > Add an AM retry count reset window to YARN RM > - > > Key: YARN-611 > URL: https://issues.apache.org/jira/browse/YARN-611 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.0.3-alpha >Reporter: Chris Riccomini >Assignee: Xuan Gong > Attachments: YARN-611.1.patch, YARN-611.2.patch, YARN-611.3.patch, > YARN-611.4.patch, YARN-611.4.rebase.patch > > > YARN currently has the following config: > yarn.resourcemanager.am.max-retries > This config defaults to 2, and defines how many times to retry a "failed" AM > before failing the whole YARN job. YARN counts an AM as failed if the node > that it was running on dies (the NM will timeout, which counts as a failure > for the AM), or if the AM dies. > This configuration is insufficient for long running (or infinitely running) > YARN jobs, since the machine (or NM) that the AM is running on will > eventually need to be restarted (or the machine/NM will fail). In such an > event, the AM has not done anything wrong, but this is counted as a "failure" > by the RM. Since the retry count for the AM is never reset, eventually, at > some point, the number of machine/NM failures will result in the AM failure > count going above the configured value for > yarn.resourcemanager.am.max-retries. Once this happens, the RM will mark the > job as failed, and shut it down. This behavior is not ideal. > I propose that we add a second configuration: > yarn.resourcemanager.am.retry-count-window-ms > This configuration would define a window of time that would define when an AM > is "well behaved", and it's safe to reset its failure count back to zero. > Every time an AM fails the RmAppImpl would check the last time that the AM > failed. If the last failure was less than retry-count-window-ms ago, and the > new failure count is > max-retries, then the job should fail. If the AM has > never failed, the retry count is < max-retries, or if the last failure was > OUTSIDE the retry-count-window-ms, then the job should be restarted. > Additionally, if the last failure was outside the retry-count-window-ms, then > the failure count should be set back to 0. > This would give developers a way to have well-behaved AMs run forever, while > still failing mis-behaving AMs after a short period of time. > I think the work to be done here is to change the RmAppImpl to actually look > at app.attempts, and see if there have been more than max-retries failures in > the last retry-count-window-ms milliseconds. If there have, then the job > should fail, if not, then the job should go forward. Additionally, we might > also need to add an endTime in either RMAppAttemptImpl or > RMAppFailedAttemptEvent, so that the RmAppImpl can check the time of the > failure. > Thoughts? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077401#comment-14077401 ] Arpit Agarwal commented on YARN-1994: - +1 from me modulo one question. Why is the following logic only needed for ContainerManagerImpl.java? I probably knew this but can't recall now. {code} InetSocketAddress connectAddress; String connectHost = conf.getTrimmed(YarnConfiguration.NM_ADDRESS); if (connectHost == null || connectHost.isEmpty()) { // Get hostname and port from the listening endpoint. connectAddress = NetUtils.getConnectAddress(server); } else { // Combine the configured hostname with the port from the listening // endpoint. This gets the correct port number if the configuration // specifies an ephemeral port (port number 0). connectAddress = NetUtils.getConnectAddress( new InetSocketAddress(connectHost.split(":")[0], server.getListenerAddress().getPort())); } {code} Thanks. > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, > YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1979) TestDirectoryCollection fails when the umask is unusual
[ https://issues.apache.org/jira/browse/YARN-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077319#comment-14077319 ] Hadoop QA commented on YARN-1979: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658331/YARN-1979.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4465//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4465//console This message is automatically generated. > TestDirectoryCollection fails when the umask is unusual > --- > > Key: YARN-1979 > URL: https://issues.apache.org/jira/browse/YARN-1979 > Project: Hadoop YARN > Issue Type: Test >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > Attachments: YARN-1979.2.patch, YARN-1979.txt > > > I've seen this fail in Windows where the default permissions are matching up > to 700. > {code} > --- > Test set: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection > --- > Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.015 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection > testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) > Time elapsed: 0.422 sec <<< FAILURE! > java.lang.AssertionError: local dir parent > Y:\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection\dirA > not created with proper permissions expected: but was: > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.failNotEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:106) > {code} > The clash is between testDiskSpaceUtilizationLimit() and > testCreateDirectories(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2215) Add preemption info to REST/CLI
[ https://issues.apache.org/jira/browse/YARN-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077303#comment-14077303 ] Wangda Tan commented on YARN-2215: -- Hi [~kj-ki], Thanks for working on this, I've assigned this JIRA to you. I think the fields you added should be fine. Within the scope of this JIRA, I think it's better to add CLI support as well. Please submit the patch to kick off Jenkins when you have completed it. Wangda > Add preemption info to REST/CLI > --- > > Key: YARN-2215 > URL: https://issues.apache.org/jira/browse/YARN-2215 > Project: Hadoop YARN > Issue Type: Bug > Components: client, resourcemanager >Reporter: Wangda Tan >Assignee: Kenji Kikushima > Attachments: YARN-2215.patch > > > As discussed in YARN-2181, we'd better to add preemption info to RM RESTful > API/CLI to make administrator/user get more understanding about preemption > happened on app/queue, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2215) Add preemption info to REST/CLI
[ https://issues.apache.org/jira/browse/YARN-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2215: - Assignee: Kenji Kikushima > Add preemption info to REST/CLI > --- > > Key: YARN-2215 > URL: https://issues.apache.org/jira/browse/YARN-2215 > Project: Hadoop YARN > Issue Type: Bug > Components: client, resourcemanager >Reporter: Wangda Tan >Assignee: Kenji Kikushima > Attachments: YARN-2215.patch > > > As discussed in YARN-2181, we'd better to add preemption info to RM RESTful > API/CLI to make administrator/user get more understanding about preemption > happened on app/queue, etc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1979) TestDirectoryCollection fails when the umask is unusual
[ https://issues.apache.org/jira/browse/YARN-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1979: - Attachment: YARN-1979.2.patch This JIRA seems to be forgotten, so let me update the patch. Just removed the lines [~djp] mentioned. > TestDirectoryCollection fails when the umask is unusual > --- > > Key: YARN-1979 > URL: https://issues.apache.org/jira/browse/YARN-1979 > Project: Hadoop YARN > Issue Type: Test >Reporter: Vinod Kumar Vavilapalli >Assignee: Vinod Kumar Vavilapalli > Attachments: YARN-1979.2.patch, YARN-1979.txt > > > I've seen this fail in Windows where the default permissions are matching up > to 700. > {code} > --- > Test set: org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection > --- > Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.015 sec <<< > FAILURE! - in > org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection > testCreateDirectories(org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection) > Time elapsed: 0.422 sec <<< FAILURE! > java.lang.AssertionError: local dir parent > Y:\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection\dirA > not created with proper permissions expected: but was: > at org.junit.Assert.fail(Assert.java:93) > at org.junit.Assert.failNotEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.TestDirectoryCollection.testCreateDirectories(TestDirectoryCollection.java:106) > {code} > The clash is between testDiskSpaceUtilizationLimit() and > testCreateDirectories(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077296#comment-14077296 ] Wangda Tan commented on YARN-1707: -- Hi [~curino], Thanks for your reply. Regarding how the patch matches the JIRA: since I don't have other solid use cases in mind where anything besides {{ReservationSystem}} could leverage these features, I don't have a strong opinion on merging such dynamic behaviors into {{ParentQueue}} and {{LeafQueue}}. Let's wait for more feedback. I agree that we can consider queue capacity as a "weight"; it will be easier for users to configure, and it's also a backward-compatible change (except that it will no longer throw an exception when the sum of a {{ParentQueue}}'s children doesn't equal 100). bq. As I was mentioning in my previous comment, this is likely fine for the limited usage we will make of this from ReservationSystem I think moving an application across queues is not a ReservationSystem-specific change. I would suggest checking that the move will not violate restrictions in the target queue before performing it. Thanks, Wangda > Making the CapacityScheduler more dynamic > - > > Key: YARN-1707 > URL: https://issues.apache.org/jira/browse/YARN-1707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Labels: capacity-scheduler > Attachments: YARN-1707.patch > > > The CapacityScheduler is a rather static at the moment, and refreshqueue > provides a rather heavy-handed way to reconfigure it. Moving towards > long-running services (tracked in YARN-896) and to enable more advanced > admission control and resource parcelling we need to make the > CapacityScheduler more dynamic. This is instrumental to the umbrella jira > YARN-1051. > Concretely this require the following changes: > * create queues dynamically > * destroy queues dynamically > * dynamically change queue parameters (e.g., capacity) > * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% > instead of ==100% > We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1826) TestDirectoryCollection intermittent failures
[ https://issues.apache.org/jira/browse/YARN-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA resolved YARN-1826. -- Resolution: Duplicate > TestDirectoryCollection intermittent failures > - > > Key: YARN-1826 > URL: https://issues.apache.org/jira/browse/YARN-1826 > Project: Hadoop YARN > Issue Type: Test >Reporter: Tsuyoshi OZAWA > > testCreateDirectories fails intermittently. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077285#comment-14077285 ] Xuan Gong commented on YARN-1994: - +1 LGTM > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, > YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1826) TestDirectoryCollection intermittent failures
[ https://issues.apache.org/jira/browse/YARN-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077282#comment-14077282 ] Tsuyoshi OZAWA commented on YARN-1826: -- Thank you for commenting, Wangda. Vinod is fixing this problem in YARN-1979. Closing this as a duplicate. > TestDirectoryCollection intermittent failures > - > > Key: YARN-1826 > URL: https://issues.apache.org/jira/browse/YARN-1826 > Project: Hadoop YARN > Issue Type: Test >Reporter: Tsuyoshi OZAWA > > testCreateDirectories fails intermittently. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2367) Make ResourceCalculator configurable for FairScheduler and FifoScheduler like CapacityScheduler
Swapnil Daingade created YARN-2367: -- Summary: Make ResourceCalculator configurable for FairScheduler and FifoScheduler like CapacityScheduler Key: YARN-2367 URL: https://issues.apache.org/jira/browse/YARN-2367 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.4.1, 2.3.0, 2.2.0 Reporter: Swapnil Daingade Priority: Minor The ResourceCalculator used by CapacityScheduler is read from the yarn.scheduler.capacity.resource-calculator entry in capacity-scheduler.xml. This allows custom implementations of the ResourceCalculator interface to be plugged in. It would be nice to have the same functionality in FairScheduler and FifoScheduler. -- This message was sent by Atlassian JIRA (v6.2#6252)
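For reference, the CapacityScheduler-style pluggability boils down to instantiating the configured class reflectively. A minimal sketch of what the same hook could look like for another scheduler; the property name yarn.scheduler.fair.resource-calculator used here is only an assumed example, not an existing key:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;

public class ResourceCalculatorLoader {
  // Assumed example key; FairScheduler/FifoScheduler do not define this today.
  private static final String RESOURCE_CALCULATOR_CLASS =
      "yarn.scheduler.fair.resource-calculator";

  /** Instantiate the configured ResourceCalculator, defaulting to memory-only. */
  public static ResourceCalculator getResourceCalculator(Configuration conf) {
    Class<? extends ResourceCalculator> clazz = conf.getClass(
        RESOURCE_CALCULATOR_CLASS,
        DefaultResourceCalculator.class,
        ResourceCalculator.class);
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}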
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077279#comment-14077279 ] Hadoop QA commented on YARN-2354: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658299/YARN-2354-072814.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4464//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4464//console This message is automatically generated. > DistributedShell may allocate more containers than client specified after it > restarts > - > > Key: YARN-2354 > URL: https://issues.apache.org/jira/browse/YARN-2354 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jian He >Assignee: Li Lu > Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch > > > To reproduce, run distributed shell with -num_containers option, > In ApplicationMaster.java, the following code has some issue. > {code} > int numTotalContainersToRequest = > numTotalContainers - previousAMRunningContainers.size(); > for (int i = 0; i < numTotalContainersToRequest; ++i) { > ContainerRequest containerAsk = setupContainerAskForRM(); > amRMClient.addContainerRequest(containerAsk); > } > numRequestedContainers.set(numTotalContainersToRequest); > {code} > numRequestedContainers doesn't account for previous AM's requested > containers. so numRequestedContainers should be set to numTotalContainers -- This message was sent by Atlassian JIRA (v6.2#6252)
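A sketch of the fix suggested in the description, reusing the variable names from the quoted snippet: keep requesting only the delta, but record the full total so that containers recovered from the previous attempt are counted as already requested:
{code}
// Sketch of the suggested fix, following the variable names quoted above.
// Only ask the RM for the containers that are still missing ...
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
// ... but count recovered containers as already requested, so a restarted
// AM never allocates more than the client originally specified.
numRequestedContainers.set(numTotalContainers);
{code}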
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077262#comment-14077262 ] Wangda Tan commented on YARN-415: - Hi [~eepayne], Thanks for updating your patch. For the e2e test, I think we can do it this way: refer to the tests in TestRMRestart. Using MockRM/MockAM can exercise such a test; even though it's not a complete e2e test, most of the logic is covered by it. I suggest we cover the following cases: {code} * Create an app; before the AM is submitted, resource utilization should be 0 * Submit the AM; while the AM is running, we can get its resource utilization > 0 * Allocate some containers and finish them, then check total resource utilization * Finish the application attempt and check total resource utilization * Start a new application attempt and check that the resource utilization of the previous attempt is added to the total. * Check that resource utilization can be persisted/read across RM restart {code} Do you have any comments on this? Thanks, Wangda > Capture memory utilization at the app-level for chargeback > -- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 0.23.6 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
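To make the MB-seconds formula from the description concrete, here is a small illustrative helper; the types and method names are hypothetical and not part of any attached patch:
{code}
import java.util.List;
import java.util.concurrent.TimeUnit;

public class MemorySecondsExample {
  /** Reserved memory and lifetime for one container (hypothetical holder). */
  public static class ContainerUsage {
    final long reservedMB;
    final long lifetimeMs;
    public ContainerUsage(long reservedMB, long lifetimeMs) {
      this.reservedMB = reservedMB;
      this.lifetimeMs = lifetimeMs;
    }
  }

  /** memorySeconds = sum(reservedMB_i * lifetimeSeconds_i) over all containers. */
  public static long memorySeconds(List<ContainerUsage> containers) {
    long total = 0;
    for (ContainerUsage c : containers) {
      total += c.reservedMB * TimeUnit.MILLISECONDS.toSeconds(c.lifetimeMs);
    }
    return total;
  }
}
{code}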
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077256#comment-14077256 ] Carlo Curino commented on YARN-1707: Thanks again for the fast and insightful feedback. *Regarding how the patch matches the JIRA:* Our initial implementation did indeed make the changes (i.e., the dynamic behaviors) in ParentQueue and LeafQueue themselves. Previous feedback pushed us to use subclasses to, in a sense, isolate the changes in dynamic subclasses. I think we can go back to the version modifying ParentQueue and LeafQueue directly if there is consensus. #4 is required because we cannot transactionally “add Q1, resize Q2” so that the invariant “sum of children is == 100%” is maintained. As a consequence we must relax the constraint (either in ParentQueue if we remove the hierarchy, or as it is today in PlanQueue). The good news is that the percentages from the configuration are not interpreted as actual percentages, but rather used as relative "weights" (ranking queues by used_resources / guaranteed_resources). This means that even a careless admin will not leave resources unused. For example, if we set two queues to 10,40 (i.e., something that doesn't add up to 100), the behavior is equivalent to setting them to 20,80 (as they are used only for relative ranking of siblings). I think this is also ok for hierarchies (worth double-checking this part). So all in all we can pull all the dynamic behavior up to {{ParentQueue}} and {{LeafQueue}} if there is consensus that this is the right path. *Regarding move:* 1) Good catch... We will wait for feedback from Jian on this. 2) I think we had that at some point and it did not work correctly. We will try again. 3) There are a few invariants we do not check. {{MaxApplicationsPerUser}} is one of them, but also how many applications can be active in the target queue, etc... As I was mentioning in my previous comment, this is likely fine for the limited usage we will make of this from {{ReservationSystem}}, but it is worth expanding the checks we make (see {{FairScheduler.verifyMoveDoesNotViolateConstraints(..)}}) before exposing move to users via CLI. > Making the CapacityScheduler more dynamic > - > > Key: YARN-1707 > URL: https://issues.apache.org/jira/browse/YARN-1707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Labels: capacity-scheduler > Attachments: YARN-1707.patch > > > The CapacityScheduler is a rather static at the moment, and refreshqueue > provides a rather heavy-handed way to reconfigure it. Moving towards > long-running services (tracked in YARN-896) and to enable more advanced > admission control and resource parcelling we need to make the > CapacityScheduler more dynamic. This is instrumental to the umbrella jira > YARN-1051. > Concretely this require the following changes: > * create queues dynamically > * destroy queues dynamically > * dynamically change queue parameters (e.g., capacity) > * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% > instead of ==100% > We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
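A tiny sketch of the relative-weight interpretation described above (illustrative only, not CapacityScheduler code): configured capacities 10 and 40 normalize to the same 20%/80% split as 20 and 80:
{code}
// Illustration only: sibling capacities used as relative weights.
public class WeightNormalization {
  public static float[] normalize(float[] configuredCapacities) {
    float sum = 0f;
    for (float c : configuredCapacities) {
      sum += c;
    }
    float[] effective = new float[configuredCapacities.length];
    for (int i = 0; i < configuredCapacities.length; i++) {
      effective[i] = configuredCapacities[i] / sum;  // fraction of the parent
    }
    return effective;
  }
}
{code}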
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077254#comment-14077254 ] Wangda Tan commented on YARN-1707: -- Hi [~subru], Thanks for your elaboration, it is very helpful for me to understand the background. Regards, Wangda > Making the CapacityScheduler more dynamic > - > > Key: YARN-1707 > URL: https://issues.apache.org/jira/browse/YARN-1707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Labels: capacity-scheduler > Attachments: YARN-1707.patch > > > The CapacityScheduler is a rather static at the moment, and refreshqueue > provides a rather heavy-handed way to reconfigure it. Moving towards > long-running services (tracked in YARN-896) and to enable more advanced > admission control and resource parcelling we need to make the > CapacityScheduler more dynamic. This is instrumental to the umbrella jira > YARN-1051. > Concretely this require the following changes: > * create queues dynamically > * destroy queues dynamically > * dynamically change queue parameters (e.g., capacity) > * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% > instead of ==100% > We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077242#comment-14077242 ] Subramaniam Venkatraman Krishnan commented on YARN-1707: [~wangda] Thanks for the very detailed comments. I agree that understanding the context is essential & glad to help with that. Overall your understanding is spot on; please find answers to your questions below: 1) Yes, it is possible to have multiple PlanQueues (e.g., if two organizations want to dynamically allocate their resources, but not share them with each other). This is also a good way to "try" reservations on a small scale and slowly ramp up at each org's pace. 2) The extra confs are needed to automate the initialization of key parameters of the dynamic ReservationQueues (without requiring full specification of each of those). 3) Correct 4) Correct 5) First: the Plan guarantees that the sum of reservations never exceeds the available resources (replanning if needed to maintain this invariant in the face of failures). On the other hand, as happens for the normal scheduler, we can leverage "overcapacity" to guarantee high cluster utilization. More precisely, depending on the configuration (or dynamically on whether reservations have gang semantics or not) we can allow resources allocated to PlanQueue and ReservationQueue to exceed their guaranteed capacity (i.e., set the dynamic max-capacity above the guaranteed one). In this case preemption might kick in if other apps with more rights to resources have pending asks. Part of the changes in YARN-1957 was driven by this. 6) To limit the scope of changes, we agreed to have a follow-up JIRA to address HA. The intuition we have is that it is sufficient to persist the Plan alone. During recovery, the _Plan Follower_ will resync the Plan with the scheduler by creating the dynamic queues for currently active reservations. We will be happy to have your input when we work on the HA JIRA. [~curino] will answer your questions specific to this JIRA. > Making the CapacityScheduler more dynamic > - > > Key: YARN-1707 > URL: https://issues.apache.org/jira/browse/YARN-1707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Labels: capacity-scheduler > Attachments: YARN-1707.patch > > > The CapacityScheduler is a rather static at the moment, and refreshqueue > provides a rather heavy-handed way to reconfigure it. Moving towards > long-running services (tracked in YARN-896) and to enable more advanced > admission control and resource parcelling we need to make the > CapacityScheduler more dynamic. This is instrumental to the umbrella jira > YARN-1051. > Concretely this require the following changes: > * create queues dynamically > * destroy queues dynamically > * dynamically change queue parameters (e.g., capacity) > * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% > instead of ==100% > We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077235#comment-14077235 ] Junping Du commented on YARN-2209: -- bq. Previously, AM doesn't do re-register. Re-register on RM restart is a new requirement coming out from YARN-556. Was RESYNC also added in YARN-556? If so, I think this is a reasonable change, and I suggest removing RESYNC completely (not just deprecating it) before this feature gets released. > Replace AM resync/shutdown command with corresponding exceptions > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch, YARN-2209.5.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2026) Fair scheduler : Fair share for inactive queues causes unfair allocation in some scenarios
[ https://issues.apache.org/jira/browse/YARN-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077218#comment-14077218 ] Ashwin Shankar commented on YARN-2026: -- [~kasha],[~sandyr] , did you have any comments on the latest patch ? I also made UI changes and attached screenshot which shows static/dynamic fair share in YARN-2360. Can you please take a look at that also ? > Fair scheduler : Fair share for inactive queues causes unfair allocation in > some scenarios > -- > > Key: YARN-2026 > URL: https://issues.apache.org/jira/browse/YARN-2026 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Labels: scheduler > Attachments: YARN-2026-v1.txt, YARN-2026-v2.txt, YARN-2026-v3.txt > > > Problem1- While using hierarchical queues in fair scheduler,there are few > scenarios where we have seen a leaf queue with least fair share can take > majority of the cluster and starve a sibling parent queue which has greater > weight/fair share and preemption doesn’t kick in to reclaim resources. > The root cause seems to be that fair share of a parent queue is distributed > to all its children irrespective of whether its an active or an inactive(no > apps running) queue. Preemption based on fair share kicks in only if the > usage of a queue is less than 50% of its fair share and if it has demands > greater than that. When there are many queues under a parent queue(with high > fair share),the child queue’s fair share becomes really low. As a result when > only few of these child queues have apps running,they reach their *tiny* fair > share quickly and preemption doesn’t happen even if other leaf > queues(non-sibling) are hogging the cluster. > This can be solved by dividing fair share of parent queue only to active > child queues. > Here is an example describing the problem and proposed solution: > root.lowPriorityQueue is a leaf queue with weight 2 > root.HighPriorityQueue is parent queue with weight 8 > root.HighPriorityQueue has 10 child leaf queues : > root.HighPriorityQueue.childQ(1..10) > Above config,results in root.HighPriorityQueue having 80% fair share > and each of its ten child queue would have 8% fair share. Preemption would > happen only if the child queue is <4% (0.5*8=4). > Lets say at the moment no apps are running in any of the > root.HighPriorityQueue.childQ(1..10) and few apps are running in > root.lowPriorityQueue which is taking up 95% of the cluster. > Up till this point,the behavior of FS is correct. > Now,lets say root.HighPriorityQueue.childQ1 got a big job which requires 30% > of the cluster. It would get only the available 5% in the cluster and > preemption wouldn't kick in since its above 4%(half fair share).This is bad > considering childQ1 is under a highPriority parent queue which has *80% fair > share*. > Until root.lowPriorityQueue starts relinquishing containers,we would see the > following allocation on the scheduler page: > *root.lowPriorityQueue = 95%* > *root.HighPriorityQueue.childQ1=5%* > This can be solved by distributing a parent’s fair share only to active > queues. > So in the example above,since childQ1 is the only active queue > under root.HighPriorityQueue, it would get all its parent’s fair share i.e. > 80%. > This would cause preemption to reclaim the 30% needed by childQ1 from > root.lowPriorityQueue after fairSharePreemptionTimeout seconds. 
> Problem2 - Also note that similar situation can happen between > root.HighPriorityQueue.childQ1 and root.HighPriorityQueue.childQ2,if childQ2 > hogs the cluster. childQ2 can take up 95% cluster and childQ1 would be stuck > at 5%,until childQ2 starts relinquishing containers. We would like each of > childQ1 and childQ2 to get half of root.HighPriorityQueue fair share ie > 40%,which would ensure childQ1 gets upto 40% resource if needed through > preemption. -- This message was sent by Atlassian JIRA (v6.2#6252)
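A minimal sketch of the proposed behavior, distributing a parent's fair share only among its active children; the types and weights here are illustrative, not the actual FairScheduler computation:
{code}
import java.util.List;

public class ActiveFairShareExample {
  /** Hypothetical view of a child queue: its weight and whether it has apps. */
  public static class Child {
    final String name;
    final double weight;
    final boolean active;   // active == has at least one runnable app
    double fairShare;
    public Child(String name, double weight, boolean active) {
      this.name = name;
      this.weight = weight;
      this.active = active;
    }
  }

  /** Split parentFairShare across active children only, by weight. */
  public static void distribute(double parentFairShare, List<Child> children) {
    double activeWeight = 0;
    for (Child c : children) {
      if (c.active) {
        activeWeight += c.weight;
      }
    }
    for (Child c : children) {
      // Inactive queues get no share, so an active sibling (e.g. childQ1 in
      // the example above) can receive the full parent share.
      c.fairShare = (c.active && activeWeight > 0)
          ? parentFairShare * c.weight / activeWeight : 0;
    }
  }
}
{code}
In the example above, with only childQ1 active under root.HighPriorityQueue, childQ1 would receive the full 80% parent share, so preemption can reclaim resources from root.lowPriorityQueue.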
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077211#comment-14077211 ] Ashwin Shankar commented on YARN-2360: -- Expected -1 from Jenkins since patch depends on unresolved YARN-2026. > Fair Scheduler : Display dynamic fair share for queues on the scheduler page > > > Key: YARN-2360 > URL: https://issues.apache.org/jira/browse/YARN-2360 > Project: Hadoop YARN > Issue Type: New Feature > Components: fairscheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, > YARN-2360-v1.txt > > > Based on the discussion in YARN-2026, we'd like to display dynamic fair > share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077204#comment-14077204 ] Jian He commented on YARN-2209: --- bq. The customized AM code could get RESYNC from response previously (like what we original do in AMRMClient) to handle AM re-registering case. Previously, AM doesn't do re-register. Re-register on RM restart is a new requirement coming out from YARN-556 > Replace AM resync/shutdown command with corresponding exceptions > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch, YARN-2209.5.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2347) Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in yarn-server-common
[ https://issues.apache.org/jira/browse/YARN-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077198#comment-14077198 ] Junping Du commented on YARN-2347: -- [~zjshen], can you help to review it again? Thx! > Consolidate RMStateVersion and NMDBSchemaVersion into StateVersion in > yarn-server-common > > > Key: YARN-2347 > URL: https://issues.apache.org/jira/browse/YARN-2347 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Junping Du >Assignee: Junping Du > Attachments: YARN-2347-v2.patch, YARN-2347-v3.patch, > YARN-2347-v4.patch, YARN-2347-v5.patch, YARN-2347.patch > > > We have similar things for version state for RM, NM, TS (TimelineServer), > etc. I think we should consolidate them into a common object. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077192#comment-14077192 ] Junping Du commented on YARN-2209: -- bq. I think users are expected to handle two types of exceptions YarnException and IOException. In that sense, this is equivalent to throwing a new type of exception which should be fine? No. The customized AM code could get RESYNC from response previously (like what we original do in AMRMClient) to handle AM re-registering case. Now, it cannot get this RESYNC, so it could fail to re-register with the restarted RM. Do I miss anything here? > Replace AM resync/shutdown command with corresponding exceptions > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch, YARN-2209.5.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
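For a custom AM that calls the ApplicationMasterProtocol directly, the pattern implied by this change would presumably look like the sketch below. This is an illustration only; the assumed fields are noted in the comments:
{code}
// Illustration only (not the patch): how a hand-written AM might react once
// the resync command is replaced by an exception. "amsProtocol" (an
// ApplicationMasterProtocol) and "registerRequest" are assumed fields of the
// surrounding AM class.
public AllocateResponse allocateWithResync(AllocateRequest request)
    throws YarnException, IOException {
  try {
    return amsProtocol.allocate(request);
  } catch (ApplicationMasterNotRegisteredException e) {
    // The RM restarted and lost this AM's registration: register again,
    // then retry the allocate call.
    amsProtocol.registerApplicationMaster(registerRequest);
    return amsProtocol.allocate(request);
  }
}
{code}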
[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077178#comment-14077178 ] Jason Lowe commented on YARN-1354: -- Thanks for taking a look, Junping! bq. what would happen if storeApplication(), finishApplication(), removeApplication() failed with application related information get inconsistent after restart? If storeApplication fails then it will throw an IOException which will bubble up and fail the container start request on the client. As long as we're unable to store a new application, containers for that application will not start, which I believe is the desired behavior. That prevents the state store from being inconsistent in this particular scenario. If finishApplication fails then the NM will proceed as if it did succeed but the state store will still have the application present. This should be corrected when the NM restarts and registers with the RM with those applications still running. The RM should correct the situation by telling the NM that the application has finished (see YARN-1885), and the NM will proceed to perform application finish processing (e.g.: log aggregation, etc.). I think worst-case it will upload all of the app container logs again, but when it goes to rename to the final destination name that will fail because the name already exists. Thus there could be some wasted work, but it should sort itself out and not do something catastrophic. If removeApplication fails then the NM will proceed as if it did succeed but the state store will still have the application present. This should be corrected when the NM finishes application processing (per above or if it was already recorded as finished) and it will again try to remove it from the state store. As above I think there could be some unnecessary work performed, but I think in the end the application should eventually be removed from the NM on restart. It could still remain in the state store if the second removal also fails, but a subsequent restart should behave the same. bq. Do we need special warning if get failed on deserializing credential here? I'm not sure how credential processing is fundamentally all that different from protocol buffer parsing which could also fail. If the credentials can't be read then we can't recover the application. Currently recovery errors are fatal to NM startup. Do you have something specific in mind for handling the credentials if the writable changes (e.g.: some pseudo code to show the approach)? > Recover applications upon nodemanager restart > - > > Key: YARN-1354 > URL: https://issues.apache.org/jira/browse/YARN-1354 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1354-v1.patch, > YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, > YARN-1354-v4.patch, YARN-1354-v5.patch > > > The set of active applications in the nodemanager context need to be > recovered for work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
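A condensed sketch of the failure semantics described in the comment above; the method names follow the comment, while stateStore, launchContainer, and LOG are assumed placeholders rather than actual NM code:
{code}
// Sketch only: how the state-store operations are expected to fail.
void startContainerForApp(Application app) throws IOException {
  // Must succeed before any container of a new application is started;
  // an IOException propagates and fails the startContainer request.
  stateStore.storeApplication(app.getAppId(), app.getProto());
  launchContainer(app);
}

void onApplicationFinished(Application app) {
  try {
    stateStore.removeApplication(app.getAppId());
  } catch (IOException e) {
    // Non-fatal: the app stays in the store and is cleaned up again after the
    // next NM restart, once the RM reports the application as finished.
    LOG.warn("Failed to remove application " + app.getAppId(), e);
  }
}
{code}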
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077144#comment-14077144 ] Hadoop QA commented on YARN-2360: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658291/YARN-2360-v1.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4463//console This message is automatically generated. > Fair Scheduler : Display dynamic fair share for queues on the scheduler page > > > Key: YARN-2360 > URL: https://issues.apache.org/jira/browse/YARN-2360 > Project: Hadoop YARN > Issue Type: New Feature > Components: fairscheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, > YARN-2360-v1.txt > > > Based on the discussion in YARN-2026, we'd like to display dynamic fair > share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077125#comment-14077125 ] Ashwin Shankar commented on YARN-2360: -- Attached a screenshot and patch for the UI changes to display dynamic fair share. Some comments on the UI changes: 1. I'm calling dynamic fair share "Current Fair Share" and static fair share "Guaranteed Fair Share". 2. Since dynamic fair share is a "temporary fair share", I've represented it with a "dashed" border. 3. Changed the static fair share border to be "solid" rather than "dashed". 4. Added Dynamic Fair Share/Current Fair Share to the tooltip. 5. Usage changes to orange when it goes above the dynamic/current fair share rather than the static fair share. > Fair Scheduler : Display dynamic fair share for queues on the scheduler page > > > Key: YARN-2360 > URL: https://issues.apache.org/jira/browse/YARN-2360 > Project: Hadoop YARN > Issue Type: New Feature > Components: fairscheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, > YARN-2360-v1.txt > > > Based on the discussion in YARN-2026, we'd like to display dynamic fair > share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Lu updated YARN-2354: Attachment: YARN-2354-072814.patch New patch, added log information. > DistributedShell may allocate more containers than client specified after it > restarts > - > > Key: YARN-2354 > URL: https://issues.apache.org/jira/browse/YARN-2354 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jian He >Assignee: Li Lu > Attachments: YARN-2354-072514.patch, YARN-2354-072814.patch > > > To reproduce, run distributed shell with -num_containers option, > In ApplicationMaster.java, the following code has some issue. > {code} > int numTotalContainersToRequest = > numTotalContainers - previousAMRunningContainers.size(); > for (int i = 0; i < numTotalContainersToRequest; ++i) { > ContainerRequest containerAsk = setupContainerAskForRM(); > amRMClient.addContainerRequest(containerAsk); > } > numRequestedContainers.set(numTotalContainersToRequest); > {code} > numRequestedContainers doesn't account for previous AM's requested > containers. so numRequestedContainers should be set to numTotalContainers -- This message was sent by Atlassian JIRA (v6.2#6252)
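For reference, a hedged sketch of the accounting change implied by the issue description: request only the delta of containers, but record the full target in numRequestedContainers so a restarted AM does not over-allocate. This fragment reuses the field names from the snippet quoted in the description and is not the attached patch itself.
{code}
// Sketch of the intended accounting (based on the description above, not the actual patch):
int numTotalContainersToRequest =
    numTotalContainers - previousAMRunningContainers.size();
for (int i = 0; i < numTotalContainersToRequest; ++i) {
  ContainerRequest containerAsk = setupContainerAskForRM();
  amRMClient.addContainerRequest(containerAsk);
}
// Count the previous attempt's still-running containers as already requested,
// so the restarted AM never asks for more than numTotalContainers overall.
numRequestedContainers.set(numTotalContainers);
{code}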
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2360: - Attachment: Screen Shot 2014-07-28 at 1.12.19 PM.png > Fair Scheduler : Display dynamic fair share for queues on the scheduler page > > > Key: YARN-2360 > URL: https://issues.apache.org/jira/browse/YARN-2360 > Project: Hadoop YARN > Issue Type: New Feature > Components: fairscheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, > YARN-2360-v1.txt > > > Based on the discussion in YARN-2026, we'd like to display dynamic fair > share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar updated YARN-2360: - Attachment: YARN-2360-v1.txt > Fair Scheduler : Display dynamic fair share for queues on the scheduler page > > > Key: YARN-2360 > URL: https://issues.apache.org/jira/browse/YARN-2360 > Project: Hadoop YARN > Issue Type: New Feature > Components: fairscheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > Attachments: Screen Shot 2014-07-28 at 1.12.19 PM.png, > YARN-2360-v1.txt > > > Based on the discussion in YARN-2026, we'd like to display dynamic fair > share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL
[ https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077087#comment-14077087 ] Hadoop QA commented on YARN-2363: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658270/YARN-2363.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4462//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4462//console This message is automatically generated. > Submitted applications occasionally lack a tracking URL > --- > > Key: YARN-2363 > URL: https://issues.apache.org/jira/browse/YARN-2363 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-2363.patch > > > Sometimes when an application is submitted the client receives no tracking > URL. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2360) Fair Scheduler : Display dynamic fair share for queues on the scheduler page
[ https://issues.apache.org/jira/browse/YARN-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashwin Shankar reassigned YARN-2360: Assignee: Ashwin Shankar > Fair Scheduler : Display dynamic fair share for queues on the scheduler page > > > Key: YARN-2360 > URL: https://issues.apache.org/jira/browse/YARN-2360 > Project: Hadoop YARN > Issue Type: New Feature > Components: fairscheduler >Reporter: Ashwin Shankar >Assignee: Ashwin Shankar > > Based on the discussion in YARN-2026, we'd like to display dynamic fair > share for queues on the scheduler page. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077038#comment-14077038 ] Jian He commented on YARN-2209: --- Hi Zhijie, thanks for the review. Here are some responses: bq. Why is it necessary to use the exception instead of the flag to indicate the RM restarting? Because, as you can see, it's not just the allocate API; unregisterResponse would also need to carry an AMCommand otherwise. Basically, every AMS API other than register would require adding a new field otherwise. Throwing an exception is a much cleaner way. bq. For example, MR of prior versions will no longer work properly with a YARN cluster after this patch during RM restarting. No matter how the application reacts to the shutdown command, the NM will shoot down the AM container during RM restart. So prior applications (including MR) should still work. Even today the earlier MR AM container is possibly killed by the NM before it actually performs any shutdown logic. bq. Deprecate the enum type instead of each enum value? Maybe we shouldn't deprecate AMCommand itself, as we may add other commands later on as needed. bq. Why not throwing ApplicationAttemptNotFoundException instead? It sounds more reasonable here, doesn’t it? Do you mean creating a new ApplicationAttemptNotFoundException? I think it's fine to just reuse ApplicationNotFoundException, as they are quite similar. The internal exception message shows the attemptId. bq. Is this change necessary? It is, because the finally block (i.e. the "if (allocateResponse == null)" check) would be executed otherwise. bq. shall we split the patch into two pieces: one for YARN and the other for MR, Will split once the review is done. I think it'll be easier to review with both sides' changes for more context. bq. No need to break it into two lines, right? Will fix it. > Replace AM resync/shutdown command with corresponding exceptions > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch, YARN-2209.5.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
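To make the point about the finally block concrete, here is a simplified sketch (not the actual AMRMClientImpl code; the allocate call stands in for the method quoted in the review) of why the response must be assigned to the local variable before returning.
{code}
// Simplified sketch, not the actual AMRMClientImpl code.
public AllocateResponse allocateOnce(float progressIndicator)
    throws YarnException, IOException {
  AllocateResponse allocateResponse = null;
  try {
    // If this were written as "return allocate(progressIndicator);", the local
    // variable would still be null when the finally block runs, so the cleanup
    // below would fire even after a successful allocate call.
    allocateResponse = allocate(progressIndicator);
    return allocateResponse;
  } finally {
    if (allocateResponse == null) {
      // cleanup path for a failed allocate, e.g. re-adding pending requests
    }
  }
}
{code}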
[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL
[ https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076984#comment-14076984 ] Mit Desai commented on YARN-2363: - patch looks good to me. +1 (non-binding) > Submitted applications occasionally lack a tracking URL > --- > > Key: YARN-2363 > URL: https://issues.apache.org/jira/browse/YARN-2363 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-2363.patch > > > Sometimes when an application is submitted the client receives no tracking > URL. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2363) Submitted applications occasionally lack a tracking URL
[ https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2363: - Attachment: YARN-2363.patch Quick patch that generates a default proxy URL if the user has access to the app but there isn't a current attempt. > Submitted applications occasionally lack a tracking URL > --- > > Key: YARN-2363 > URL: https://issues.apache.org/jira/browse/YARN-2363 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Jason Lowe > Attachments: YARN-2363.patch > > > Sometimes when an application is submitted the client receives no tracking > URL. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2366) Speed up history server startup time
[ https://issues.apache.org/jira/browse/YARN-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076933#comment-14076933 ] Hadoop QA commented on YARN-2366: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658247/YARN-2366.v1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4461//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4461//console This message is automatically generated. > Speed up history server startup time > > > Key: YARN-2366 > URL: https://issues.apache.org/jira/browse/YARN-2366 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: YARN-2366.v1.patch > > > When history server starts up, It scans every history directories and put all > history files into a cache, whereas this cache only stores 20K recent history > files. Therefore, it is wasting a large portion of time loading old history > files into the cache, and the startup time will keep increasing if we don't > trim the number of history files. For example, when history server starts up > with 2.5M history files in HDFS, it took ~5 minutes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2366) Speed up history server startup time
[ https://issues.apache.org/jira/browse/YARN-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2366: -- Attachment: YARN-2366.v1.patch > Speed up history server startup time > > > Key: YARN-2366 > URL: https://issues.apache.org/jira/browse/YARN-2366 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: YARN-2366.v1.patch > > > When history server starts up, It scans every history directories and put all > history files into a cache, whereas this cache only stores 20K recent history > files. Therefore, it is wasting a large portion of time loading old history > files into the cache, and the startup time will keep increasing if we don't > trim the number of history files. For example, when history server starts up > with 2.5M history files in HDFS, it took ~5 minutes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2366) Speed up history server startup time
[ https://issues.apache.org/jira/browse/YARN-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li reassigned YARN-2366: - Assignee: Siqi Li > Speed up history server startup time > > > Key: YARN-2366 > URL: https://issues.apache.org/jira/browse/YARN-2366 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Siqi Li >Assignee: Siqi Li > Attachments: YARN-2366.v1.patch > > > When history server starts up, It scans every history directories and put all > history files into a cache, whereas this cache only stores 20K recent history > files. Therefore, it is wasting a large portion of time loading old history > files into the cache, and the startup time will keep increasing if we don't > trim the number of history files. For example, when history server starts up > with 2.5M history files in HDFS, it took ~5 minutes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2366) Speed up history server startup time
[ https://issues.apache.org/jira/browse/YARN-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li updated YARN-2366: -- Description: When history server starts up, It scans every history directories and put all history files into a cache, whereas this cache only stores 20K recent history files. Therefore, it is wasting a large portion of time loading old history files into the cache, and the startup time will keep increasing if we don't trim the number of history files. For example, when history server starts up with 2.5M history files in HDFS, it took ~5 minutes. > Speed up history server startup time > > > Key: YARN-2366 > URL: https://issues.apache.org/jira/browse/YARN-2366 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Siqi Li > > When history server starts up, It scans every history directories and put all > history files into a cache, whereas this cache only stores 20K recent history > files. Therefore, it is wasting a large portion of time loading old history > files into the cache, and the startup time will keep increasing if we don't > trim the number of history files. For example, when history server starts up > with 2.5M history files in HDFS, it took ~5 minutes. -- This message was sent by Atlassian JIRA (v6.2#6252)
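A rough sketch of one way to avoid loading every old history file at startup, given the 20K cache limit mentioned in the description: sort directory entries by modification time and stop once the cache limit is reached. This is illustrative only, not the attached patch; the class name and the addToCache call are hypothetical.
{code}
// Illustrative sketch only: load at most maxCacheSize of the newest history files.
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class HistoryScanSketch {
  void loadRecent(FileSystem fs, Path historyDir, int maxCacheSize) throws IOException {
    FileStatus[] entries = fs.listStatus(historyDir);
    // Newest first, so we can stop as soon as the cache is full.
    Arrays.sort(entries, Comparator.comparingLong(FileStatus::getModificationTime).reversed());
    int loaded = 0;
    for (FileStatus entry : entries) {
      if (loaded >= maxCacheSize) {
        break;              // skip older files instead of scanning them all
      }
      // addToCache(entry); // hypothetical cache-population call
      loaded++;
    }
  }
}
{code}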
[jira] [Created] (YARN-2366) Speed up history server startup time
Siqi Li created YARN-2366: - Summary: Speed up history server startup time Key: YARN-2366 URL: https://issues.apache.org/jira/browse/YARN-2366 Project: Hadoop YARN Issue Type: Bug Reporter: Siqi Li -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2365) TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry fails on branch-2
Mit Desai created YARN-2365: --- Summary: TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry fails on branch-2 Key: YARN-2365 URL: https://issues.apache.org/jira/browse/YARN-2365 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Mit Desai TestAMRestart#testShouldNotCountFailureToMaxAttemptRetry fails on branch with the following errror {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 46.471 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart testShouldNotCountFailureToMaxAttemptRetry(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart) Time elapsed: 46.354 sec <<< FAILURE! java.lang.AssertionError: AppAttempt state is not correct (timedout) expected: but was: at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:414) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAM(MockRM.java:569) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.launchAndRegisterAM(MockRM.java:576) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart.testShouldNotCountFailureToMaxAttemptRetry(TestAMRestart.java:389) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076769#comment-14076769 ] Hadoop QA commented on YARN-1769: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658215/YARN-1769.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4460//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4460//console This message is automatically generated. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit of number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fullfill the request. > The other place for improvement is currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to gets it resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
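A very rough, purely illustrative sketch of the "swap a reservation for a real allocation" idea in the description above; the real CapacityScheduler/LeafQueue logic is considerably more involved, and all type and method names here are made up.
{code}
// Illustrative only; names do not match the real scheduler classes.
interface SchedulerNodeSketch {
  long getAvailableMemoryMB();
  void allocate(SchedulerAppSketch app, ResourceRequestSketch request);
}

interface SchedulerAppSketch {
  boolean hasReservationFor(ResourceRequestSketch request);
  void unreserve(ResourceRequestSketch request);
}

class ResourceRequestSketch {
  long memoryMB;
}

class ReservationSwapSketch {
  // Keep looking at incoming node heartbeats; if a node can satisfy the request
  // outright, drop the existing reservation elsewhere and allocate here.
  boolean tryAllocateOnIncomingNode(SchedulerNodeSketch node, SchedulerAppSketch app,
      ResourceRequestSketch request) {
    if (node.getAvailableMemoryMB() >= request.memoryMB) {
      if (app.hasReservationFor(request)) {
        app.unreserve(request);   // free the reserved space on the other node
      }
      node.allocate(app, request);
      return true;
    }
    return false;
  }
}
{code}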
[jira] [Created] (YARN-2364) TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy
Mit Desai created YARN-2364: --- Summary: TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy Key: YARN-2364 URL: https://issues.apache.org/jira/browse/YARN-2364 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.5.0 Reporter: Mit Desai TestRMRestart#testRMRestartWaitForPreviousAMToFinish is racy. It fails intermittently on branch-2 with the following errors. Fails with any of these {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 26.836 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart testRMRestartWaitForPreviousAMToFinish(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 26.687 sec <<< FAILURE! java.lang.AssertionError: expected:<4> but was:<3> at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.junit.Assert.assertEquals(Assert.java:542) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:557) {noformat} or {noformat} Running org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 51.326 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart testRMRestartWaitForPreviousAMToFinish(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart) Time elapsed: 51.055 sec <<< FAILURE! java.lang.AssertionError: AppAttempt state is not correct (timedout) expected: but was: at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:414) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.launchAM(TestRMRestart.java:949) at org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:519) {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076679#comment-14076679 ] Hadoop QA commented on YARN-415: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658211/YARN-415.201407281816.txt against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA org.apache.hadoop.yarn.client.TestRMFailover org.apache.hadoop.yarn.client.api.impl.TestAMRMClient org.apache.hadoop.yarn.client.api.impl.TestNMClient org.apache.hadoop.yarn.client.TestGetGroups org.apache.hadoop.yarn.client.TestResourceManagerAdministrationProtocolPBClientImpl org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.client.cli.TestYarnCLI org.apache.hadoop.yarn.client.api.impl.TestYarnClient org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestRMAppTransitions org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebappAuthentication org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.TestClientRMTokens org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairSchedulerQueueACLs org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesApps org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart org.apache.hadoop.yarn.server.resourcemanager.TestRMAdminService org.apache.hadoop.yarn.server.resourcemanager.TestRMHA org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4459//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4459//console This message is automatically generated. 
> Capture memory utilization at the app-level for chargeback > -- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 0.23.6 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the
[jira] [Commented] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076607#comment-14076607 ] Hadoop QA commented on YARN-1769: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658198/YARN-1769.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4458//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4458//console This message is automatically generated. > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit of number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fullfill the request. > The other place for improvement is currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to gets it resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1769: -- Attachment: YARN-1769.patch > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit of number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fullfill the request. > The other place for improvement is currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to gets it resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201407281816.txt [~leftnoteasy] Thanks for all of your help. How were you thinking an end-to-end test would work in the UT environment? In order to set a baseline and test that the containers ran for some predetermined and expected amount of time, wouldn't I need to somehow control the clock? Do you have any ideas on how to implement that? In the meantime, I have made the additional changes you suggested. Please see below: {quote} bq. I was able to remove the rmApps variable, but I had to leave the check for app != null because if I try to take that out, several unit tests would fail with NullPointerException. Even with removing the rmApps variable, I needed to change TestRMContainerImpl.java to mock rmContext.getRMApps(). I would like to suggest to fix such UTs instead of inserting some kernel code to make UT pass. I'm not sure about the effort of doing this, if the effort is still reasonable, we should do it. {quote} After some spy and mock magic, I was able to fix the unit tests so that the checks for "if != null" were not necessary. {quote} {code} ApplicationCLI.java + appReportStr.print("\tResources used : "); {code} We need change it to Resource Utilization as well? {quote} Yes. I changed it to that. > Capture memory utilization at the app-level for chargeback > -- > > Key: YARN-415 > URL: https://issues.apache.org/jira/browse/YARN-415 > Project: Hadoop YARN > Issue Type: New Feature > Components: resourcemanager >Affects Versions: 0.23.6 >Reporter: Kendall Thrapp >Assignee: Andrey Klochkov > Attachments: YARN-415--n10.patch, YARN-415--n2.patch, > YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, > YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, > YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, > YARN-415.201406262136.txt, YARN-415.201407042037.txt, > YARN-415.201407071542.txt, YARN-415.201407171553.txt, > YARN-415.201407172144.txt, YARN-415.201407232237.txt, > YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.patch > > > For the purpose of chargeback, I'd like to be able to compute the cost of an > application in terms of cluster resource usage. To start out, I'd like to > get the memory utilization of an application. The unit should be MB-seconds > or something similar and, from a chargeback perspective, the memory amount > should be the memory reserved for the application, as even if the app didn't > use all that memory, no one else was able to use it. > (reserved ram for container 1 * lifetime of container 1) + (reserved ram for > container 2 * lifetime of container 2) + ... + (reserved ram for container n > * lifetime of container n) > It'd be nice to have this at the app level instead of the job level because: > 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't > appear on the job history server). > 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). > This new metric should be available both through the RM UI and RM Web > Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
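For reference, a small standalone sketch of the MB-seconds aggregation described in this JIRA (sum over containers of reserved memory times container lifetime). It is an illustration of the metric only, not code from the patch.
{code}
// Standalone illustration of the chargeback metric described above.
class MemorySecondsSketch {
  static long memoryMbSeconds(long[] reservedMb, long[] startMillis, long[] finishMillis) {
    long total = 0;
    for (int i = 0; i < reservedMb.length; i++) {
      long lifetimeSeconds = (finishMillis[i] - startMillis[i]) / 1000;
      // reserved MB * seconds the container held that reservation
      total += reservedMb[i] * lifetimeSeconds;
    }
    return total;
  }
}
{code}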
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1769: -- Attachment: (was: YARN-1769.patch) > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit of number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fullfill the request. > The other place for improvement is currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to gets it resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076498#comment-14076498 ] Zhijie Shen commented on YARN-2209: --- [~jianhe], thanks for the patch. Below are some meta comments on this issue. Why is it necessary to use the exception instead of the flag to indicate the RM restarting? In general, I'm afraid the changes here break backward compatibility between YARN and MR in both directions. On the one side, any YARN applications that used to have the logic to deal with RM restarting need to be updated after this patch. For example, MR of prior versions will no longer work properly with a YARN cluster after this patch during RM restarting. The MR job won’t recognize the not found exception and take the necessary restarting treatment, but will just record the error and move on. On the other side, if we assume it is possible the new version MR job after this patch is going to be run on an old YARN cluster, the MR job will then not recognize the old flag-style restarting signal, and thus will not execute the MR-side logic to deal with RM restarting. IMHO, at least, the switch block to check the AMCommand cannot be removed but should be deprecated for compatibility considerations. In case we want to proceed with this change, here are some comments on the patch: 1. The MR-side change is not trivial. Per our convention, shall we split the patch into two pieces: one for YARN and the other for MR, so that we can easily track the changes for the different projects? 2. Why not throwing ApplicationAttemptNotFoundException instead? It sounds more reasonable here, doesn’t it? 3. Deprecate the enum type instead of each enum value? {code} @Public @Unstable public enum AMCommand { {code} 4. The description doesn't sound accurate enough. It doesn’t just request containers. “App Master heartbeat”? {code} +public static final String AM_ALLOCATE = "App Master request containers"; {code} 5. No need to break it into two lines, right? {code} AllocateResponse allocateResponse; … +allocateResponse = scheduler.allocate(allocateRequest); {code} 6. Is this change necessary? {code} -return allocate(progressIndicator); +allocateResponse = allocate(progressIndicator); +return allocateResponse; {code} > Replace AM resync/shutdown command with corresponding exceptions > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch, YARN-2209.5.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
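On review comment 3 above, a minimal generic illustration of the two deprecation options. The enum names here are made up and do not reproduce the real AMCommand declaration.
{code}
// Option one: deprecate the whole enum type...
@Deprecated
enum AmCommandExampleA {
  AM_RESYNC,
  AM_SHUTDOWN
}

// ...versus option two: deprecate each value individually while keeping the type itself.
enum AmCommandExampleB {
  @Deprecated AM_RESYNC,
  @Deprecated AM_SHUTDOWN
}
{code}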
[jira] [Updated] (YARN-1769) CapacityScheduler: Improve reservations
[ https://issues.apache.org/jira/browse/YARN-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-1769: -- Attachment: YARN-1769.patch reduce log output when LeafQueue need to unreserve resource frequently. if (needToUnreserve) { + if(LOG.isDebugEnabled()){ LOG.info("we needed to unreserve to be able to allocate"); + } return false; } > CapacityScheduler: Improve reservations > > > Key: YARN-1769 > URL: https://issues.apache.org/jira/browse/YARN-1769 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Attachments: YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, > YARN-1769.patch, YARN-1769.patch, YARN-1769.patch, YARN-1769.patch > > > Currently the CapacityScheduler uses reservations in order to handle requests > for large containers and the fact there might not currently be enough space > available on a single host. > The current algorithm for reservations is to reserve as many containers as > currently required and then it will start to reserve more above that after a > certain number of re-reservations (currently biased against larger > containers). Anytime it hits the limit of number reserved it stops looking > at any other nodes. This results in potentially missing nodes that have > enough space to fullfill the request. > The other place for improvement is currently reservations count against your > queue capacity. If you have reservations you could hit the various limits > which would then stop you from looking further at that node. > The above 2 cases can cause an application requesting a larger container to > take a long time to gets it resources. > We could improve upon both of those by simply continuing to look at incoming > nodes to see if we could potentially swap out a reservation for an actual > allocation. -- This message was sent by Atlassian JIRA (v6.2#6252)
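The snippet above is about cutting log volume on a hot path; for reference, the usual guarded-logging idiom with commons-logging looks like the sketch below. This is illustrative only and is not the patch itself.
{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class GuardedLoggingSketch {
  private static final Log LOG = LogFactory.getLog(GuardedLoggingSketch.class);

  void onNeedToUnreserve() {
    // Guard the call so the message (and any string building) is skipped
    // entirely unless debug logging is enabled.
    if (LOG.isDebugEnabled()) {
      LOG.debug("we needed to unreserve to be able to allocate");
    }
  }
}
{code}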
[jira] [Commented] (YARN-2354) DistributedShell may allocate more containers than client specified after it restarts
[ https://issues.apache.org/jira/browse/YARN-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076439#comment-14076439 ] Li Lu commented on YARN-2354: - Same error message as YARN-2295, and I could not reproduce it locally. It seems this is connected with the network settings of the server, causing the following lines to fail: {code} if (appReport.getHost().startsWith(hostName) && appReport.getRpcPort() == -1) { verified = true; } {code} If this check fails, verified will never be set to true, and hence the test will fail. This failure appears to be unrelated to the problem fixed by this patch. > DistributedShell may allocate more containers than client specified after it > restarts > - > > Key: YARN-2354 > URL: https://issues.apache.org/jira/browse/YARN-2354 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jian He >Assignee: Li Lu > Attachments: YARN-2354-072514.patch > > > To reproduce, run distributed shell with -num_containers option, > In ApplicationMaster.java, the following code has some issue. > {code} > int numTotalContainersToRequest = > numTotalContainers - previousAMRunningContainers.size(); > for (int i = 0; i < numTotalContainersToRequest; ++i) { > ContainerRequest containerAsk = setupContainerAskForRM(); > amRMClient.addContainerRequest(containerAsk); > } > numRequestedContainers.set(numTotalContainersToRequest); > {code} > numRequestedContainers doesn't account for previous AM's requested > containers. so numRequestedContainers should be set to numTotalContainers -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076405#comment-14076405 ] Hadoop QA commented on YARN-1994: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658153/YARN-1994.11.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4457//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4457//console This message is automatically generated. > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, > YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2357) Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 changes to branch-2
[ https://issues.apache.org/jira/browse/YARN-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Foley updated YARN-2357: - Target Version/s: 2.6.0 > Port Windows Secure Container Executor YARN-1063, YARN-1972, YARN-2198 > changes to branch-2 > -- > > Key: YARN-2357 > URL: https://issues.apache.org/jira/browse/YARN-2357 > Project: Hadoop YARN > Issue Type: Task > Components: nodemanager >Affects Versions: 2.4.0 >Reporter: Remus Rusanu >Assignee: Remus Rusanu >Priority: Critical > Labels: security, windows > Attachments: YARN-2357.1.patch > > > As title says. Once YARN-1063, YARN-1972 and YARN-2198 are committed to > trunk, they need to be backported to branch-2 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076369#comment-14076369 ] Jian He commented on YARN-2209: --- Hi [~djp], thanks for the comment. I think users are expected to handle two types of exceptions: YarnException and IOException. In that sense, this is equivalent to throwing a new type of exception, which should be fine, right? > Replace AM resync/shutdown command with corresponding exceptions > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch, YARN-2209.5.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
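A small hedged sketch of what handling the restart signal as an exception could look like on the application side, assuming the ApplicationMasterNotRegisteredException introduced by YARN-1365. This is illustrative only; the real AM and AMRMClient logic is more involved, and the allocate/reregisterWithRM helpers here are hypothetical placeholders.
{code}
// Illustrative AM-side handling; not taken from the patch.
import java.io.IOException;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.exceptions.ApplicationMasterNotRegisteredException;
import org.apache.hadoop.yarn.exceptions.YarnException;

class AllocateLoopSketch {
  AllocateResponse heartbeat(float progress) throws YarnException, IOException {
    try {
      return allocate(progress);
    } catch (ApplicationMasterNotRegisteredException e) {
      // RM restarted: re-register and retry the heartbeat.
      reregisterWithRM();
      return allocate(progress);
    }
  }

  // Hypothetical helpers standing in for the real AMRMClient calls.
  AllocateResponse allocate(float progress) throws YarnException, IOException { return null; }
  void reregisterWithRM() throws YarnException, IOException { }
}
{code}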
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076319#comment-14076319 ] Craig Welch commented on YARN-1994: --- TestAMRestart passes on my box, reattached patch to try again on jenkins > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, > YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1994: -- Attachment: YARN-1994.11.patch > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, > YARN-1994.4.patch, YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076296#comment-14076296 ] Junping Du commented on YARN-1354: -- Thanks [~jlowe] for updating the patch! A few quick comments so far: {code} +try { + this.context.getNMStateStore().finishApplication(appID); +} catch (IOException e) { + LOG.error("Unable to update application state in store", e); +} {code} Looks like we only log when persistent effort get failed as we did for other components before. In this case, what would happen if storeApplication(), finishApplication(), removeApplication() failed with application related information get inconsistent after restart? In ContainerManagerImpl.java {code} + private void recoverApplication(ContainerManagerApplicationProto p) + throws IOException { +ApplicationId appId = new ApplicationIdPBImpl(p.getId()); +Credentials creds = new Credentials(); +creds.readTokenStorageStream( +new DataInputStream(p.getCredentials().newInput())); ... {code} Do we need special warning if get failed on deserializing credential here? i.e. adding something like version mismatch, etc. It could happen when any changes happen in future on credentials object which is a writable object. More comments will come later. > Recover applications upon nodemanager restart > - > > Key: YARN-1354 > URL: https://issues.apache.org/jira/browse/YARN-1354 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Affects Versions: 2.3.0 >Reporter: Jason Lowe >Assignee: Jason Lowe > Attachments: YARN-1354-v1.patch, > YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, > YARN-1354-v4.patch, YARN-1354-v5.patch > > > The set of active applications in the nodemanager context need to be > recovered for work-preserving nodemanager restart -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL
[ https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076292#comment-14076292 ] Jason Lowe commented on YARN-2363: -- Most application submits result in a proxy tracking URL, but occasionally the client sees a transient "N/A" URL. Here's a snippet of Pig client output where a MapReduce job was submitted with no tracking URL received: {noformat} 2014-07-23 19:19:16,658 [JobControl] INFO org.apache.hadoop.mapred.ResourceMgrDelegate - Submitted application application_1403199204249_357708 to ResourceManager at xx/xx:xx 2014-07-23 19:19:16,660 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: N/A {noformat} I believe this can occur if the client tries to get an application report just as the app is submitted. YarnClientImpl.submitApplication won't return until the app is past the NEW_SAVING state, but if the client slips in while the app is in the SUBMITTED state then I think we could end up with no tracking URL due to the lack of a current attempt. From RMAppImpl.createAndGetApplicationReport: {code} String trackingUrl = UNAVAILABLE; String host = UNAVAILABLE; String origTrackingUrl = UNAVAILABLE; [...] if (allowAccess) { if (this.currentAttempt != null) { currentApplicationAttemptId = this.currentAttempt.getAppAttemptId(); trackingUrl = this.currentAttempt.getTrackingUrl(); origTrackingUrl = this.currentAttempt.getOriginalTrackingUrl(); {code} So if we don't have a current attempt we'll return "N/A" as the tracking URL. Arguably we should return the proxied URL which will redirect to the RM app page if there is no tracking URL set yet so at least the client/user has a URL that can be used to track the application. > Submitted applications occasionally lack a tracking URL > --- > > Key: YARN-2363 > URL: https://issues.apache.org/jira/browse/YARN-2363 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Jason Lowe > > Sometimes when an application is submitted the client receives no tracking > URL. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
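A hedged sketch of the fallback argued for above: when no current attempt exists yet, hand back a proxy URL built from the proxy address and the application ID rather than "N/A", so the client always has a URL that redirects to the RM app page. The names here are illustrative and do not reproduce the actual RMAppImpl/ProxyUriUtils code.
{code}
// Illustrative fallback, not the actual patch.
class TrackingUrlSketch {
  static String buildTrackingUrl(String proxyHostAndPort, String appId,
      String currentAttemptUrl) {
    if (currentAttemptUrl != null && !currentAttemptUrl.equals("N/A")) {
      return currentAttemptUrl;
    }
    // No current attempt yet: return the proxy URL, which redirects to the RM
    // app page until the attempt registers a real tracking URL.
    return "http://" + proxyHostAndPort + "/proxy/" + appId + "/";
  }
}
{code}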
[jira] [Created] (YARN-2363) Submitted applications occasionally lack a tracking URL
Jason Lowe created YARN-2363: Summary: Submitted applications occasionally lack a tracking URL Key: YARN-2363 URL: https://issues.apache.org/jira/browse/YARN-2363 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Sometimes when an application is submitted the client receives no tracking URL. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (YARN-321) Generic application history service
[ https://issues.apache.org/jira/browse/YARN-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Farrell updated YARN-321: -- Comment: was deleted (was: Compared to wrists well is less available status amphetamines but higher investigations of withdrawal. adderall 20 mg http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851520/7787428-29851520-stopadd3.html Areas also document any reasons they have surprisingly been using in the information.) > Generic application history service > --- > > Key: YARN-321 > URL: https://issues.apache.org/jira/browse/YARN-321 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Luke Lu > Attachments: AHS Diagram.pdf, ApplicationHistoryServiceHighLevel.pdf, > Generic Application History - Design-20131219.pdf, HistoryStorageDemo.java > > > The mapreduce job history server currently needs to be deployed as a trusted > server in sync with the mapreduce runtime. Every new application would need a > similar application history server. Having to deploy O(T*V) (where T is > number of type of application, V is number of version of application) trusted > servers is clearly not scalable. > Job history storage handling itself is pretty generic: move the logs and > history data into a particular directory for later serving. Job history data > is already stored as json (or binary avro). I propose that we create only one > trusted application history server, which can have a generic UI (display json > as a tree of strings) as well. Specific application/version can deploy > untrusted webapps (a la AMs) to query the application history server and > interpret the json for its specific UI and/or analytics. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076226#comment-14076226 ] Hudson commented on YARN-2247: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1818 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1818/]) YARN-2247. Made RM web services authenticate users via kerberos and delegation token. Contributed by Varun Vasudev. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613821) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMAuthenticationHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebappAuthentication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm > Allow RM web services users to authenticate using delegation tokens > --- > > Key: YARN-2247 > URL: https://issues.apache.org/jira/browse/YARN-2247 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Fix For: 2.5.0 > > Attachments: YARN-2247.6.patch, apache-yarn-2247.0.patch, > apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, > apache-yarn-2247.4.patch, apache-yarn-2247.5.patch > > > The RM webapp should allow users to authenticate using delegation tokens to > maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076215#comment-14076215 ] Hadoop QA commented on YARN-1994: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658123/YARN-1994.11.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4456//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4456//console This message is automatically generated. > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, > YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076206#comment-14076206 ] Hudson commented on YARN-2247: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1845 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1845/]) YARN-2247. Made RM web services authenticate users via kerberos and delegation token. Contributed by Varun Vasudev. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613821) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMAuthenticationHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebappAuthentication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm > Allow RM web services users to authenticate using delegation tokens > --- > > Key: YARN-2247 > URL: https://issues.apache.org/jira/browse/YARN-2247 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Fix For: 2.5.0 > > Attachments: YARN-2247.6.patch, apache-yarn-2247.0.patch, > apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, > apache-yarn-2247.4.patch, apache-yarn-2247.5.patch > > > The RM webapp should allow users to authenticate using delegation tokens to > maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1994) Expose YARN/MR endpoints on multiple interfaces
[ https://issues.apache.org/jira/browse/YARN-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1994: -- Attachment: YARN-1994.11.patch Fixed a bug: YarnConfiguration.getSocketAddr checks, in HA cases, which RM it is running on, and this check was no longer active in earlier versions of the patch. Simplified the logic, removed many unnecessary changes from earlier patch versions, and added some tests. With this patch the behavior is unchanged in the absence of any bind-host; when a bind-host is present, and only for the listening process, the port is retrieved from the configured address and combined with the bind-host for binding. All other address/configuration paths should be unchanged by the patch. > Expose YARN/MR endpoints on multiple interfaces > --- > > Key: YARN-1994 > URL: https://issues.apache.org/jira/browse/YARN-1994 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager, webapp >Affects Versions: 2.4.0 >Reporter: Arpit Agarwal >Assignee: Craig Welch > Attachments: YARN-1994.0.patch, YARN-1994.1.patch, > YARN-1994.11.patch, YARN-1994.2.patch, YARN-1994.3.patch, YARN-1994.4.patch, > YARN-1994.5.patch, YARN-1994.6.patch, YARN-1994.7.patch > > > YARN and MapReduce daemons currently do not support specifying a wildcard > address for the server endpoints. This prevents the endpoints from being > accessible from all interfaces on a multihomed machine. > Note that if we do specify INADDR_ANY for any of the options, it will break > clients as they will attempt to connect to 0.0.0.0. We need a solution that > allows specifying a hostname or IP-address for clients while requesting > wildcard bind for the servers. > (List of endpoints is in a comment below) -- This message was sent by Atlassian JIRA (v6.2#6252)
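A minimal sketch of the bind-host behavior described in the comment above, with assumed config-key parameters; the actual patch works through YarnConfiguration.getSocketAddr and HA handling, so treat this only as an illustration of the intent.
{code}
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;

// Illustration only (not the patch): clients keep resolving the configured
// address, while the listening process binds to bind-host plus the port
// taken from that configured address.
public final class BindHostSketch {
  static InetSocketAddress getBindAddress(Configuration conf, String addressKey,
      String bindHostKey, String defaultAddress, int defaultPort) {
    InetSocketAddress addr = conf.getSocketAddr(addressKey, defaultAddress, defaultPort);
    String bindHost = conf.getTrimmed(bindHostKey);
    if (bindHost == null || bindHost.isEmpty()) {
      return addr;                                   // no bind-host: behave as before
    }
    return new InetSocketAddress(bindHost, addr.getPort());  // listen on bind-host
  }
}
{code}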
[jira] [Commented] (YARN-2247) Allow RM web services users to authenticate using delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076142#comment-14076142 ] Hudson commented on YARN-2247: -- FAILURE: Integrated in Hadoop-Yarn-trunk #626 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/626/]) YARN-2247. Made RM web services authenticate users via kerberos and delegation token. Contributed by Varun Vasudev. (zjshen: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1613821) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilter.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/http/RMAuthenticationFilterInitializer.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/RMAuthenticationHandler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesDelegationTokenAuthentication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebappAuthentication.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRest.apt.vm > Allow RM web services users to authenticate using delegation tokens > --- > > Key: YARN-2247 > URL: https://issues.apache.org/jira/browse/YARN-2247 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Varun Vasudev >Assignee: Varun Vasudev >Priority: Blocker > Fix For: 2.5.0 > > Attachments: YARN-2247.6.patch, apache-yarn-2247.0.patch, > apache-yarn-2247.1.patch, apache-yarn-2247.2.patch, apache-yarn-2247.3.patch, > apache-yarn-2247.4.patch, apache-yarn-2247.5.patch > > > The RM webapp should allow users to authenticate using delegation tokens to > maintain parity with RPC. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2008) CapacityScheduler may report incorrect queueMaxCap if there is hierarchy queue structure
[ https://issues.apache.org/jira/browse/YARN-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076062#comment-14076062 ] Wangda Tan commented on YARN-2008: -- Hi Craig, As we discussed in YARN-1198, I think we should consider the resources used by a queue's siblings when computing headroom. I took a look at your patch again; some comments: We first need to think about how to calculate headroom in general. I think headroom is (concluded from the sub-JIRAs of YARN-1198): {code} queue_available = min(clusterResource - used_by_sibling_of_parents - used_by_this_queue, queue_max_resource) headroom = min(queue_available - available_resource_in_blacklisted_nodes, user_limit) {code} So I think this JIRA is focused on computing {{used_by_sibling_of_parents}}, is it? I think the general approach looks good to me, except in CSQueueUtils.java (will include a review of the tests in the next iteration): 1) {code} //sibling used is parent used - my used... float siblingUsedCapacity = Resources.ratio( resourceCalculator, Resources.subtract(parent.getUsedResources(), queue.getUsedResources()), parentResource); {code} It seems to me this computation is not robust enough when the parent resource is empty, whether it is a zero-capacity queue or its sibling has used 100% of the cluster. It's better to add an edge test case to guard against such zero-division as well. 2) It's better to explicitly cap {{return absoluteMaxAvail}} to the range \[0~1\] to prevent float computation errors. Thanks, Wangda > CapacityScheduler may report incorrect queueMaxCap if there is hierarchy > queue structure > - > > Key: YARN-2008 > URL: https://issues.apache.org/jira/browse/YARN-2008 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.3.0 >Reporter: Chen He >Assignee: Craig Welch > Attachments: YARN-2008.1.patch, YARN-2008.2.patch > > > Suppose there are two queues, both allowed to use 100% of the actual resources in > the cluster. Q1 and Q2 each currently use 50% of the actual cluster's resources, so > there is no actual space available. If we use the current method to get > headroom, CapacityScheduler thinks there are still resources available for > users in Q1, but they have already been used by Q2. > If the CapacityScheduler has a hierarchical queue structure, it may report > an incorrect queueMaxCap. Here is an example: rootQueue has two children, L1ParentQueue1 (allowed to use up to 80% of its parent) and L1ParentQueue2 (allowed to use 20% of its parent in minimum); L1ParentQueue1 in turn has two children, L2LeafQueue1 (50% of its parent) and L2LeafQueue2 (50% of its parent in minimum). > When we calculate the headroom of a user in L2LeafQueue2, the current method will > think L2LeafQueue2 can use 40% (80%*50%) of the actual rootQueue resources. > However, without checking L1ParentQueue1, we are not sure. It is possible > that L1ParentQueue2 has used 40% of the rootQueue resources right now. Actually, > L2LeafQueue2 can only use 30% (60%*50%). -- This message was sent by Atlassian JIRA (v6.2#6252)
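A hedged sketch of the two guards suggested in 1) and 2) above, with assumed method and parameter names (not the patch itself): guard the sibling-used computation against an empty parent and clamp the resulting ratio into the \[0~1\] range.
{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

// Illustration only: avoid the zero-division when the parent has no resources
// and absorb float error by capping the final ratio to [0, 1].
class SiblingUsedSketch {
  static float siblingUsedCapacity(ResourceCalculator rc, Resource clusterResource,
      Resource parentUsed, Resource queueUsed, Resource parentResource) {
    if (Resources.lessThanOrEqual(rc, clusterResource, parentResource, Resources.none())) {
      return 0f;  // empty parent queue: siblings cannot have used anything
    }
    float ratio = Resources.ratio(rc,
        Resources.subtract(parentUsed, queueUsed), parentResource);
    return Math.min(1f, Math.max(0f, ratio));
  }
}
{code}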
[jira] [Commented] (YARN-1291) RM INFO logs limit scheduling speed
[ https://issues.apache.org/jira/browse/YARN-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076059#comment-14076059 ] Varun Saxena commented on YARN-1291: Hi [~sandyr], I had raised YARN-2287, which is also about too many RM audit logs being printed in the critical flow. For this, in the patch, I added support for printing audit logs at different log levels and changed the container logs in the RM and NM to DEBUG. I didn't remove the audit logs, as I wasn't sure whether these audit logs are really required or not. > RM INFO logs limit scheduling speed > --- > > Key: YARN-1291 > URL: https://issues.apache.org/jira/browse/YARN-1291 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.2.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > > I've been running some microbenchmarks to see how fast the Fair Scheduler can > fill up a cluster and found its performance is significantly hampered by > logging. > I tested with 500 (mock) nodes, and found that: > * Taking out fair scheduler INFO logs on the critical path brought down the > latency from 14000 ms to 6000 ms > * Taking out the INFO that RMContainerImpl logs when a container transitions > brought it down from 6000 ms to 4000 ms > * Taking out RMAuditLogger logs brought it down from 4000 ms to 1700 ms -- This message was sent by Atlassian JIRA (v6.2#6252)
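A hedged sketch of the log-level idea mentioned above (the variables user, operation, appId and containerId are assumed; the actual YARN-2287 patch may structure this differently): guard the chatty per-container audit line so it can be suppressed through log4j configuration.
{code}
// Illustration only, not the attached patch: emit the container-level audit
// entry at DEBUG so deployments can silence it without losing app-level audits.
if (LOG.isDebugEnabled()) {
  LOG.debug("USER=" + user + "\tOPERATION=" + operation
      + "\tTARGET=SchedulerApp\tRESULT=SUCCESS"
      + "\tAPPID=" + appId + "\tCONTAINERID=" + containerId);
}
{code}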
[jira] [Updated] (YARN-2287) Add audit log levels for NM and RM
[ https://issues.apache.org/jira/browse/YARN-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-2287: --- Attachment: YARN-2287-patch-1.patch > Add audit log levels for NM and RM > -- > > Key: YARN-2287 > URL: https://issues.apache.org/jira/browse/YARN-2287 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Affects Versions: 2.4.1 >Reporter: Varun Saxena > Attachments: YARN-2287-patch-1.patch, YARN-2287.patch > > > NM and RM audit logging can be done based on log level as some of the audit > logs, especially the container audit logs appear too many times. By > introducing log level, certain audit logs can be suppressed, if not required > in deployment. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076048#comment-14076048 ] Junping Du commented on YARN-2209: -- Thanks [~jianhe] for the patch and [~rohithsharma] for the review! I think this is a reasonable change and the patch itself looks good to me. However, I am concerned that it could break existing YARN applications built against the old version of ApplicationMasterProtocol, which expects a RESYNC command rather than an exception in the response. Broader discussion with the community is needed, I think. > Replace AM resync/shutdown command with corresponding exceptions > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch, YARN-2209.5.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076042#comment-14076042 ] Wangda Tan commented on YARN-1707: -- Thanks for uploading the patch [~curino], [~subru]. They're great additions to the current CapacityScheduler. I took a look at your patch. *First I have a couple of questions about its background, especially {{PlanQueue}}/{{ReservationQueue}} in this patch. I think understanding the background is important for me to get the whole picture of this patch. What I can understand is:* # {{PlanQueue}} can have a normal {{ParentQueue}} as its parent, but all children of {{PlanQueue}} can only be {{ReservationQueue}}. Is it possible that multiple {{PlanQueue}}s exist in the cluster? # {{PlanQueue}} is initially set up in configuration; like {{ParentQueue}} it has absolute capacity, etc., but unlike {{ParentQueue}} it also has user-limit/user-limit-factor, etc. # {{ReservationQueue}} is dynamically initialized by PlanFollower: when a new reservationId is acquired, it will create a new {{ReservationQueue}} accordingly # {{PlanFollower}} can dynamically adjust the queue size of {{ReservationQueue}}s so that resource reservations can be satisfied. # Is it possible that the sum of reserved resources exceeds the limit of {{PlanQueue}}/{{ReservationQueue}} and preemption is triggered? # How do we deal with RM restart? It is possible that the RM restarts during resource reservation; we may need to consider how to persist such queues Hope you could share your ideas about them. *For the requirements of this ticket (copied from JIRA),* # create queues dynamically # destroy queues dynamically # dynamically change queue parameters (e.g., capacity) # modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% instead of ==100% # move app across queues I found #1-#3 are used only by {{PlanQueue}}, {{Reservation}}. IMHO, it would be better to add them to CapacityScheduler without coupling them to the ReservationSystem, but I cannot think of other solid scenarios that can leverage them. I hope to get feedback from the community before we couple them with the ReservationSystem. And as mentioned by [~acmurthy], can we merge the dynamic add-queue with the existing add-new-queue mechanism? #4 should only be valid in {{PlanQueue}}, because if we change this behavior in {{ParentQueue}}, it is possible that a careless admin will mis-set the capacities of queues under a parent queue; if the sum of their capacities doesn't equal 100%, some resources may not be usable by applications. *Some other comments (mainly about moving apps, because we may need to settle the scope of creating/destroying queues first):* 1) I think we need to consider how moving apps across queues works with YARN-1368. We can change the queue of containers from queueA to queueB, but with YARN-1368, during RM restart a container will report that it is in queueA (we don't sync this to the NM during the moveApp operation). I hope [~jianhe] could share some thoughts about this as well. 2) Moving an application in CapacityScheduler needs to call finishApplication in the source queue and submitApplication in the target queue to keep QueueMetrics correct. And submitApplication will check the ACL of the target queue as well. 3) Should we respect MaxApplicationsPerUser in the target queue when trying to move an app? IMHO, we can stop the move if MaxApplicationsPerUser is reached in the target queue.
Thanks, Wangda > Making the CapacityScheduler more dynamic > - > > Key: YARN-1707 > URL: https://issues.apache.org/jira/browse/YARN-1707 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Carlo Curino >Assignee: Carlo Curino > Labels: capacity-scheduler > Attachments: YARN-1707.patch > > > The CapacityScheduler is rather static at the moment, and refreshqueue > provides a rather heavy-handed way to reconfigure it. Moving towards > long-running services (tracked in YARN-896) and to enable more advanced > admission control and resource parcelling we need to make the > CapacityScheduler more dynamic. This is instrumental to the umbrella jira > YARN-1051. > Concretely this requires the following changes: > * create queues dynamically > * destroy queues dynamically > * dynamically change queue parameters (e.g., capacity) > * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% > instead of ==100% > We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1631) Container allocation issue in Leafqueue assignContainers()
[ https://issues.apache.org/jira/browse/YARN-1631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076006#comment-14076006 ] Hadoop QA commented on YARN-1631: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12625843/Yarn-1631.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4455//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4455//console This message is automatically generated. > Container allocation issue in Leafqueue assignContainers() > -- > > Key: YARN-1631 > URL: https://issues.apache.org/jira/browse/YARN-1631 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler >Affects Versions: 2.2.0 > Environment: SuSe 11 Linux >Reporter: Sunil G >Assignee: Sunil G > Attachments: Yarn-1631.1.patch, Yarn-1631.2.patch > > > Application1 has a demand of 8GB[Map Task Size as 8GB] which is more than > Node_1 can handle. > Node_1 has a size of 8GB and 2GB is used by Application1's AM. > Hence reservation happened for remaining 6GB in Node_1 by Application1. > A new job is submitted with 2GB AM size and 2GB task size with only 2 Maps to > run. > Node_2 also has 8GB capability. > But Application2's AM cannot be launched in Node_2. And Application2 waits > longer as only 2 Nodes are available in cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075984#comment-14075984 ] Hadoop QA commented on YARN-2209: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12658088/YARN-2209.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4454//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4454//console This message is automatically generated. > Replace AM resync/shutdown command with corresponding exceptions > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch, YARN-2209.5.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2209) Replace AM resync/shutdown command with corresponding exceptions
[ https://issues.apache.org/jira/browse/YARN-2209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075973#comment-14075973 ] Rohith commented on YARN-2209: -- +1 patch looks good to me > Replace AM resync/shutdown command with corresponding exceptions > > > Key: YARN-2209 > URL: https://issues.apache.org/jira/browse/YARN-2209 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2209.1.patch, YARN-2209.2.patch, YARN-2209.3.patch, > YARN-2209.4.patch, YARN-2209.5.patch > > > YARN-1365 introduced an ApplicationMasterNotRegisteredException to indicate > application to re-register on RM restart. we should do the same for > AMS#allocate call also. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2362) Capacity Scheduler: apps with requests that exceed current capacity can starve pending apps
[ https://issues.apache.org/jira/browse/YARN-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075970#comment-14075970 ] Wangda Tan commented on YARN-2362: -- I think we should fix this: {code} if (!assignToQueue(clusterResource, required)) { -return NULL_ASSIGNMENT; +break; } {code} The {{return NULL_ASSIGNMENT}} statement means that if an app submitted earlier cannot allocate resources in the queue, then the rest of the apps in the queue cannot allocate resources either. The {{break}} looks better to me. And I agree this should be a duplicate of YARN-1631. > Capacity Scheduler: apps with requests that exceed current capacity can > starve pending apps > --- > > Key: YARN-2362 > URL: https://issues.apache.org/jira/browse/YARN-2362 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 2.4.1 >Reporter: Ram Venkatesh > > Cluster configuration: > Total memory: 8GB > yarn.scheduler.minimum-allocation-mb 256 > yarn.scheduler.capacity.maximum-am-resource-percent 1 (100%, test only config) > App 1 makes a request for 4.6 GB, succeeds, and the app transitions to the RUNNING state. > It subsequently makes a request for 4.6 GB, which cannot be granted, and it > waits. > App 2 makes a request for 1 GB - never receives it, so the app stays in the > ACCEPTED state forever. > I think this can happen in leaf queues that are near capacity. > The fix is likely in LeafQueue.java assignContainers near line 861, where it > returns if the assignment would exceed queue capacity, instead of checking if > requests for other active applications can be met. > {code:title=LeafQueue.java|borderStyle=solid} >// Check queue max-capacity limit >if (!assignToQueue(clusterResource, required)) { > -return NULL_ASSIGNMENT; > +break; >} > {code} > With this change, the scenario above allows App 2 to start and finish while > App 1 continues to wait. > I have a patch available, but I am wondering if the current behavior is by design. -- This message was sent by Atlassian JIRA (v6.2#6252)