[jira] [Comment Edited] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539595#comment-16539595 ]

Weiwei Yang edited comment on YARN-8511 at 7/11/18 5:56 AM:
------------------------------------------------------------
Fix UT failure in v2 patch.

was (Author: cheersyang):
Fix UT failure in v2 patch...

> When AM releases a container, RM removes allocation tags before it is
> released by NM
> ---------------------------------------------------------------------
>
>                 Key: YARN-8511
>                 URL: https://issues.apache.org/jira/browse/YARN-8511
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.1.0
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Major
>         Attachments: YARN-8511.001.patch, YARN-8511.002.patch
>
> Users leverage PC (placement constraints) with allocation tags to avoid port
> conflicts between apps, but we found they sometimes still get port conflicts.
> This is an issue similar to YARN-4148: the RM immediately removes allocation
> tags once AM#allocate asks to release a container, but the container on the
> NM has some delay until it is actually killed and releases the port. The RM
> should remove allocation tags only AFTER the NM confirms the containers are
> released.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
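The fix direction described in YARN-8511 — keep the allocation tags until the NM confirms the container is gone, rather than dropping them on AM#allocate — can be sketched roughly as below. This is an illustrative sketch only; the class and method names are hypothetical and not the actual patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: defer allocation-tag removal until the NM confirms
// the container has exited, so anti-affinity/port constraints stay visible
// to the scheduler during the NM-side kill delay.
public class DeferredTagRemoval {
    private final Map<String, Set<String>> tagsByContainer = new HashMap<>();
    private final Set<String> pendingRelease = new HashSet<>();

    public void addContainer(String containerId, Set<String> tags) {
        tagsByContainer.put(containerId, tags);
    }

    // AM asked to release: only mark the container; keep its tags so the
    // scheduler still treats the port/constraint as occupied.
    public void onAmRelease(String containerId) {
        pendingRelease.add(containerId);
    }

    // NM heartbeat confirms the container is actually dead: remove tags now.
    public void onNmConfirmedRelease(String containerId) {
        if (pendingRelease.remove(containerId)) {
            tagsByContainer.remove(containerId);
        }
    }

    public boolean hasTags(String containerId) {
        return tagsByContainer.containsKey(containerId);
    }
}
```

With this ordering, a new allocation requesting the same port tag cannot be placed on the node until the old container has verifiably exited.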
[jira] [Updated] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang updated YARN-8511:
------------------------------
    Attachment: YARN-8511.002.patch
[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539595#comment-16539595 ]

Weiwei Yang commented on YARN-8511:
-----------------------------------
Fix UT failure in v2 patch...
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539582#comment-16539582 ]

Rohith Sharma K S commented on YARN-8383:
-----------------------------------------
+1 lgtm.. Tested in a single-node cluster. Committing shortly..

> TimelineServer 1.5 start fails with NoClassDefFoundError
> --------------------------------------------------------
>
>                 Key: YARN-8383
>                 URL: https://issues.apache.org/jira/browse/YARN-8383
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.8.4
>            Reporter: Rohith Sharma K S
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: YARN-8383.001-branch-2.8.patch
>
> TimelineServer 1.5 start fails with NoClassDefFoundError.
> {noformat}
> 2018-05-31 22:10:58,548 FATAL org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: Error starting ApplicationHistoryServer
> java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory
> 	at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174)
> 	at java.lang.Class.forName0(Native Method)
> 	at java.lang.Class.forName(Class.java:348)
> 	at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306)
> 	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271)
> 	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
> 	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
> 	at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239)
> 	at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> 	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180)
> 	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190)
> Caused by: java.lang.ClassNotFoundException: com.fasterxml.jackson.core.JsonFactory
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> 	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> 	... 15 more
> {noformat}
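A `NoClassDefFoundError` like the one above means the class was not on the daemon's runtime classpath. A quick way to check before starting the server is a small probe run with the same classpath; this is a generic diagnostic sketch, not part of the YARN-8383 patch.

```java
// Probe whether a class is visible on the current classpath. Running this
// with the timeline server's classpath tells you whether jackson-core is
// present before starting the daemon.
public class ClassProbe {
    public static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("com.fasterxml.jackson.core.JsonFactory present: "
            + isPresent("com.fasterxml.jackson.core.JsonFactory"));
    }
}
```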
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539581#comment-16539581 ]

Rohith Sharma K S commented on YARN-8473:
-----------------------------------------
+1, compiled on branch-2.8 and succeeded. The QA report shows a pre-patch failure because of the compilation issue; after this patch, compilation succeeds. The javac error is unrelated to this patch. Committing shortly.

> Containers being launched as app tears down can leave containers in NEW state
> -----------------------------------------------------------------------------
>
>                 Key: YARN-8473
>                 URL: https://issues.apache.org/jira/browse/YARN-8473
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.4
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Major
>             Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 2.8.5, 3.0.4
>
>         Attachments: YARN-8473-branch-2.8.addendum.001.patch,
> YARN-8473.001.patch, YARN-8473.002.patch, YARN-8473.003.patch
>
> I saw a case where containers were stuck on a nodemanager in the NEW state
> because they tried to launch just as an application was tearing down. The
> container sent an INIT_CONTAINER event to the ApplicationImpl which then
> executed an invalid transition since that event is not handled/expected when
> the application is in the process of tearing down.
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Qingcha updated YARN-7481:
-------------------------------
    Attachment: hadoop-2.7.2.gpu-port-20180711.patch

> Gpu locality support for Better AI scheduling
> ---------------------------------------------
>
>                 Key: YARN-7481
>                 URL: https://issues.apache.org/jira/browse/YARN-7481
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: api, RM, yarn
>    Affects Versions: 2.7.2
>            Reporter: Chen Qingcha
>            Priority: Major
>             Fix For: 2.7.2
>
>         Attachments: GPU locality support for Job scheduling.pdf,
> hadoop-2.7.2.gpu-port-20180710.patch,
> hadoop-2.7.2.gpu-port-20180710_old.patch,
> hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch,
> hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> We enhance Hadoop with GPU support for better AI job scheduling. Currently,
> YARN-3926 also supports GPU scheduling, which treats GPU as a countable
> resource. However, GPU placement is also very important to deep learning
> jobs for better efficiency. For example, a 2-GPU job running on GPUs {0,1}
> could be faster than running on GPUs {0,7}, if GPU 0 and 1 are under the
> same PCI-E switch while 0 and 7 are not. We add support to Hadoop 2.7.2 to
> enable GPU locality scheduling with fine-grained GPU placement: a 64-bit
> bitmap is added to the YARN Resource, which indicates both GPU usage and
> locality information in a node (up to 64 GPUs per node). '1' means
> available and '0' otherwise in the corresponding bit position.
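The 64-bit bitmap described in the YARN-7481 summary can be sketched as plain bit operations on a `long`. This is an illustrative sketch only: the class, the method names, and the 4-GPUs-per-switch topology are assumptions for demonstration, not the actual patch.

```java
// Sketch of a 64-bit GPU availability bitmap: bit i set means GPU i is free.
// Locality here assumes 4 GPUs per PCI-E switch (hypothetical topology).
public class GpuBitmap {
    private static final int GPUS_PER_SWITCH = 4;

    public static boolean isAvailable(long bitmap, int gpu) {
        return ((bitmap >>> gpu) & 1L) == 1L;
    }

    // Mark a GPU as in use by clearing its bit.
    public static long allocate(long bitmap, int gpu) {
        return bitmap & ~(1L << gpu);
    }

    // Find a pair of free adjacent GPUs under the same switch, so a 2-GPU
    // job lands on e.g. {0,1} rather than {0,7}; returns -1 if none exists.
    public static int findLocalPair(long bitmap) {
        for (int gpu = 0; gpu < 63; gpu++) {
            boolean sameSwitch =
                (gpu / GPUS_PER_SWITCH) == ((gpu + 1) / GPUS_PER_SWITCH);
            if (sameSwitch && isAvailable(bitmap, gpu)
                    && isAvailable(bitmap, gpu + 1)) {
                return gpu;
            }
        }
        return -1;
    }
}
```

For instance, a node with only GPUs 0 and 7 free has no switch-local pair, while a node with GPUs 0 and 1 free does — matching the {0,1}-vs-{0,7} example in the issue description.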
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Qingcha updated YARN-7481:
-------------------------------
    Attachment:     (was: hadoop-2.7.2.gpu-port-20180711.patch)
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Qingcha updated YARN-7481:
-------------------------------
    Attachment:     (was: hadoop-2.7.2.gpu-port-20180711.patch)
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539570#comment-16539570 ]

genericqa commented on YARN-8473:
---------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 18m 15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-2.8 Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 7m 10s{color} | {color:red} root in branch-2.8 failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 21s{color} | {color:red} hadoop-yarn-server-nodemanager in branch-2.8 failed. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} branch-2.8 passed {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 23s{color} | {color:red} hadoop-yarn-server-nodemanager in branch-2.8 failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} branch-2.8 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 30s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager generated 3 new + 1 unchanged - 1 fixed = 4 total (was 2) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 59s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 39m 55s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:749e106 |
| JIRA Issue | YARN-8473 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931110/YARN-8473-branch-2.8.addendum.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux f06dafd7d081 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-2.8 / c22e6c8 |
| maven | version: Apache Maven 3.0.5 |
| Default Java | 1.7.0_181 |
| mvninstall | https://builds.apache.org/job/PreCommit-YARN-Build/21209/artifact/out/branch-mvninstall-root.txt |
| compile | https://builds.apache.org/job/PreCommit-YARN-Build/21209/artifact/out/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| mvnsite | https://builds.apache.org/job/PreCommit-YARN-Build/21209/artifact/out/branch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| javac | https://builds.apache.org/job/PreCommit-YARN-Build/21209/artifact/out/diff-compile-javac-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results |
[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation
[ https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539558#comment-16539558 ]

Bibin A Chundatt commented on YARN-8434:
----------------------------------------
Thanks [~subru] for the clarification. Will try removing the configuration too. Sure.. no issues in fixing the doc as part of this jira.

> Nodemanager not registering to active RM in federation
> ------------------------------------------------------
>
>                 Key: YARN-8434
>                 URL: https://issues.apache.org/jira/browse/YARN-8434
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Blocker
>         Attachments: YARN-8434.001.patch, YARN-8434.002.patch
>
> FederationRMFailoverProxyProvider doesn't handle connecting to the active RM.
[jira] [Comment Edited] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM
[ https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539550#comment-16539550 ]

Bibin A Chundatt edited comment on YARN-8505 at 7/11/18 4:36 AM:
-----------------------------------------------------------------
[~Tao Yang] Thank you for the clarification. {{numPendingApps+numActiveApps}} --> *applications submitted to the queue*. So the limitation applies only to submission of applications to the queue, not to *RUNNING* applications (applications in running state whose AM has started running). Earlier we would have been able to limit RUNNING applications based on the AM limit for unmanaged AMs, so this change is a behaviour change from the old version too. IIUC, in the case of federation, applications are submitted to subclusters as unmanaged AMs, so the impact on federation should be evaluated for this change.

was (Author: bibinchundatt):
[~Tao Yang] Thank you for the clarification. {{numPendingApps+numActiveApps}} --> *applications submitted to the queue*. So the limitation applies only to submission of applications to the queue, not to *RUNNING* applications (applications in running state whose AM has started running). Earlier we would have been able to limit RUNNING applications based on the AM limit for unmanaged AMs, so this will be a behaviour change from the old version too.

> AMLimit and userAMLimit check should be skipped for unmanaged AM
> ----------------------------------------------------------------
>
>                 Key: YARN-8505
>                 URL: https://issues.apache.org/jira/browse/YARN-8505
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.0, 2.9.2
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-8505.001.patch
>
> AMLimit and userAMLimit check in LeafQueue#activateApplications should be
> skipped for unmanaged AM whose resource is not taken from the YARN cluster.
[jira] [Commented] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM
[ https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539550#comment-16539550 ]

Bibin A Chundatt commented on YARN-8505:
----------------------------------------
[~Tao Yang] Thank you for the clarification. {{numPendingApps+numActiveApps}} --> *applications submitted to the queue*. So the limitation applies only to submission of applications to the queue, not to *RUNNING* applications (applications in running state whose AM has started running). Earlier we would have been able to limit RUNNING applications based on the AM limit for unmanaged AMs, so this will be a behaviour change from the old version too.
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539531#comment-16539531 ]

genericqa commented on YARN-8473:
---------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 9m 47s{color} | {color:red} Docker failed to build yetus/hadoop:749e106. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8473 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931110/YARN-8473-branch-2.8.addendum.001.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21208/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Qingcha updated YARN-7481:
-------------------------------
    Attachment: hadoop-2.7.2.gpu-port-20180711.patch
[jira] [Resolved] (YARN-8516) branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module
[ https://issues.apache.org/jira/browse/YARN-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunil Govindan resolved YARN-8516.
----------------------------------
    Resolution: Duplicate

Thanks [~rohithsharma]. I am handling this as an addendum patch for YARN-8473. Apologies for missing the branch-2.8 compilation.

> branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module
> -------------------------------------------------------------------------
>
>                 Key: YARN-8516
>                 URL: https://issues.apache.org/jira/browse/YARN-8516
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Rohith Sharma K S
>            Priority: Blocker
>
> branch-2.8 compilation is failing with the below error
> {noformat}
> [INFO]
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 6.142 s
> [INFO] Finished at: 2018-07-11T08:28:24+05:30
> [INFO] Final Memory: 64M/790M
> [INFO] ------------------------------------------------------------------------
> [WARNING] The requested profile "yarn-ui" could not be activated because it does not exist.
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project hadoop-yarn-server-nodemanager: Compilation failure
> [ERROR] /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2.8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[333,12] no suitable method found for warn(java.lang.String,org.apache.hadoop.yarn.api.records.ContainerId,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationState)
> [ERROR]     method org.apache.commons.logging.Log.warn(java.lang.Object) is not applicable
> [ERROR]       (actual and formal argument lists differ in length)
> [ERROR]     method org.apache.commons.logging.Log.warn(java.lang.Object,java.lang.Throwable) is not applicable
> [ERROR]       (actual and formal argument lists differ in length)
> {noformat}
[jira] [Updated] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunil Govindan updated YARN-8473:
---------------------------------
    Attachment: YARN-8473-branch-2.8.addendum.001.patch
[jira] [Reopened] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunil Govindan reopened YARN-8473:
----------------------------------

Reopening the Jira to fix the branch-2.8 compile problem with the logging call. It seems the slf4j migration is not in branch-2.8. Apologies for missing this.
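The branch-2.8 breakage above comes from a logging API difference: commons-logging's `Log.warn(Object)` takes no format arguments, while slf4j's `Logger.warn(String, Object...)` does, so a parameterized `LOG.warn("...", a, b, c)` that compiles on trunk fails on branch-2.8. A branch-2.8-compatible call has to build the message itself. The sketch below is illustrative; the message text and helper are hypothetical, not the actual addendum patch.

```java
// Illustrative only: the same warning written two ways.
public class LogCompat {
    // trunk (slf4j):
    //   LOG.warn("Couldn't handle event {} for app {} in state {}", ev, app, st);
    // branch-2.8 (commons-logging) must pre-format the message:
    public static String format(String event, String app, String state) {
        return "Couldn't handle event " + event + " for app " + app
            + " in state " + state;
    }

    public static void main(String[] args) {
        // On branch-2.8 this string would be passed to Log.warn(Object).
        System.out.println(format("INIT_CONTAINER", "app_1", "FINISHING"));
    }
}
```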
[jira] [Commented] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM
[ https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539495#comment-16539495 ] Tao Yang commented on YARN-8505: {quote} Above properties are for total application in queue, not running application IIUC {quote} There is a validation in LeafQueue#validateSubmitApplication which makes sure that (numPendingApps + numActiveApps <= min(maxApplications, maxApplicationsPerUser)); the submission of an unmanaged app will be rejected once that limit is reached. That is the limitation I mean for unmanaged AM. > AMLimit and userAMLimit check should be skipped for unmanaged AM > > > Key: YARN-8505 > URL: https://issues.apache.org/jira/browse/YARN-8505 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 2.9.2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8505.001.patch > > > AMLimit and userAMLimit check in LeafQueue#activateApplications should be > skipped for unmanaged AM whose resource is not taken from YARN cluster.
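[Editorial note] The validation Tao Yang refers to can be sketched as a stand-alone check — a simplified illustration, not the actual LeafQueue#validateSubmitApplication code; method and parameter names are condensed for the example:

```java
// Simplified form of the check described above: a submission is rejected
// once pending + active apps would exceed the smaller of the two limits.
public class LeafQueueLimitDemo {
    static boolean rejectSubmission(int numPendingApps, int numActiveApps,
                                    int maxApplications, int maxApplicationsPerUser) {
        return numPendingApps + numActiveApps
            > Math.min(maxApplications, maxApplicationsPerUser);
    }

    public static void main(String[] args) {
        // 950 pending + 60 active = 1010 > min(10000, 1000), so reject:
        System.out.println(rejectSubmission(950, 60, 10000, 1000));
        // 900 pending + 60 active = 960 <= 1000, so accept:
        System.out.println(rejectSubmission(900, 60, 10000, 1000));
    }
}
```

This is the total-applications limit (pending plus active), distinct from the AMLimit/userAMLimit resource checks the Jira proposes to skip for unmanaged AMs.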
[jira] [Commented] (YARN-7129) Application Catalog for YARN applications
[ https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539494#comment-16539494 ] genericqa commented on YARN-7129: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 38s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 16 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 3s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 30m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 4s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 7s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 27s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 29m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 29m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 6m 21s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 0s{color} | {color:red} The patch generated 8 new + 0 unchanged - 0 fixed = 8 total (was 0) {color} | | {color:orange}-0{color} | {color:orange} shelldocs {color} | {color:orange} 0m 38s{color} | {color:orange} The patch generated 158 new + 402 unchanged - 0 fixed = 560 total (was 402) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 14s{color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 24s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-docker {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 56s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 32s{color} | {color:green} hadoop-project in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 29m 58s{color} | {color:red}
[jira] [Updated] (YARN-8516) branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module
[ https://issues.apache.org/jira/browse/YARN-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8516: Summary: branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module (was: Compilation error for branch-2.8) > branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module > > > Key: YARN-8516 > URL: https://issues.apache.org/jira/browse/YARN-8516 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Blocker > > branch-2.8 compilation is failing with below error > {noformat} > INFO] > [INFO] BUILD FAILURE > [INFO] > > [INFO] Total time: 6.142 s > [INFO] Finished at: 2018-07-11T08:28:24+05:30 > [INFO] Final Memory: 64M/790M > [INFO] > > [WARNING] The requested profile "yarn-ui" could not be activated because it > does not exist. > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) > on project hadoop-yarn-server-nodemanager: Compilation failure > [ERROR] > /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2.8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[333,12] > no suitable method found for > warn(java.lang.String,org.apache.hadoop.yarn.api.records.ContainerId,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationState) > [ERROR] method org.apache.commons.logging.Log.warn(java.lang.Object) is > not applicable > [ERROR] (actual and formal argument lists differ in length) > [ERROR] method > org.apache.commons.logging.Log.warn(java.lang.Object,java.lang.Throwable) is > not applicable > [ERROR] (actual and formal argument lists differ in length) > {noformat}
[jira] [Created] (YARN-8516) Compilation error for branch-2.8
Rohith Sharma K S created YARN-8516: --- Summary: Compilation error for branch-2.8 Key: YARN-8516 URL: https://issues.apache.org/jira/browse/YARN-8516 Project: Hadoop YARN Issue Type: Bug Reporter: Rohith Sharma K S branch-2.8 compilation is failing with below error {noformat} INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 6.142 s [INFO] Finished at: 2018-07-11T08:28:24+05:30 [INFO] Final Memory: 64M/790M [INFO] [WARNING] The requested profile "yarn-ui" could not be activated because it does not exist. [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project hadoop-yarn-server-nodemanager: Compilation failure [ERROR] /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2.8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[333,12] no suitable method found for warn(java.lang.String,org.apache.hadoop.yarn.api.records.ContainerId,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationState) [ERROR] method org.apache.commons.logging.Log.warn(java.lang.Object) is not applicable [ERROR] (actual and formal argument lists differ in length) [ERROR] method org.apache.commons.logging.Log.warn(java.lang.Object,java.lang.Throwable) is not applicable [ERROR] (actual and formal argument lists differ in length) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
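[Editorial note] The compile error above is an API mismatch: trunk uses slf4j, whose Logger.warn accepts {} placeholders with varargs, while branch-2.8 still uses commons-logging, whose Log.warn takes only a single Object (plus an optional Throwable). A minimal illustration of the branch-2.8-compatible form — the message string and names here are illustrative, not the exact line from ApplicationImpl:

```java
// slf4j (trunk) would allow a parameterized call such as:
//   LOG.warn("couldn't handle event for container {} in state {}", id, state);
// commons-logging (branch-2.8) exposes only warn(Object), so the message
// must be pre-formatted into a single String first:
public class LogPortDemo {
    static String preformat(String containerId, String state) {
        return String.format(
            "couldn't handle event for container %s in state %s",
            containerId, state);
    }

    public static void main(String[] args) {
        // The resulting single String is what Log.warn(Object) can accept:
        System.out.println(preformat("container_0001_01_000001", "FINISHING"));
    }
}
```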
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180711.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180711.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
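[Editorial note] The 64-bit availability bitmap described in the issue can be illustrated with plain long bit operations — a stand-alone sketch under the issue's stated convention ('1' = available), not code taken from the attached patches:

```java
// '1' in bit position i of the bitmap means GPU i is available on the node
// (up to 64 GPUs per node, as described in the issue).
public class GpuBitmapDemo {
    static boolean allAvailable(long bitmap, int... gpus) {
        long wanted = 0L;
        for (int g : gpus) {
            wanted |= 1L << g;   // set the bit for each requested GPU
        }
        return (bitmap & wanted) == wanted;
    }

    public static void main(String[] args) {
        long bitmap = (1L << 0) | (1L << 1) | (1L << 7); // GPUs 0, 1 and 7 free
        // A locality-aware scheduler would prefer {0,1} (same PCI-E switch
        // in the issue's example) over {0,7}:
        System.out.println(allAvailable(bitmap, 0, 1)); // both free
        System.out.println(allAvailable(bitmap, 0, 2)); // GPU 2 is busy
    }
}
```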
[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation
[ https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539415#comment-16539415 ] Subru Krishnan commented on YARN-8434: -- Thanks [~bibinchundatt] for the clarification, I understand the confusion now. That documentation is outdated and has to be fixed, as we now automatically set the {{FederationRMFailoverProxyProvider}} internally via {{FederationProxyProviderUtil}}, so the NM config overriding is not required. My bad, I apologize. If it works for you, we can re-purpose the Jira to fix the doc? > Nodemanager not registering to active RM in federation > -- > > Key: YARN-8434 > URL: https://issues.apache.org/jira/browse/YARN-8434 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8434.001.patch, YARN-8434.002.patch > > > FederationRMFailoverProxyProvider doesn't handle connecting to active RM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration
[ https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539411#comment-16539411 ] Suma Shivaprasad edited comment on YARN-8361 at 7/11/18 12:38 AM: -- [~Zian Chen] Patch 002 LGTM. +1. Can you pls check UT failures and see if they are related? was (Author: suma.shivaprasad): [~Zian Chen] Patch 002 LGTM. +1 > Change App Name Placement Rule to use App Name instead of App Id for > configuration > -- > > Key: YARN-8361 > URL: https://issues.apache.org/jira/browse/YARN-8361 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Attachments: YARN-8361.001.patch, YARN-8361.002.patch > > > 1. AppNamePlacementRule used app id while specifying queue mapping placement > rules, should change to app name > 2. Change documentation as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration
[ https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539411#comment-16539411 ] Suma Shivaprasad commented on YARN-8361: [~Zian Chen] Patch 002 LGTM. +1 > Change App Name Placement Rule to use App Name instead of App Id for > configuration > -- > > Key: YARN-8361 > URL: https://issues.apache.org/jira/browse/YARN-8361 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Attachments: YARN-8361.001.patch, YARN-8361.002.patch > > > 1. AppNamePlacementRule used app id while specifying queue mapping placement > rules, should change to app name > 2. Change documentation as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8360) Yarn service conflict between restart policy and NM configuration
[ https://issues.apache.org/jira/browse/YARN-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539405#comment-16539405 ] Gour Saha commented on YARN-8360: - Thanks [~suma.shivaprasad], patch 1 looks good to me. +1. > Yarn service conflict between restart policy and NM configuration > -- > > Key: YARN-8360 > URL: https://issues.apache.org/jira/browse/YARN-8360 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Chandni Singh >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-8360.1.patch > > > For the below spec, the service will not stop even after container failures > because of the NM auto retry properties : > * "yarn.service.container-failure.retry.max": 1, > * "yarn.service.container-failure.validity-interval-ms": 5000 > The NM will continue auto-restarting containers. > {{fail_after 20}} fails after 20 seconds. Since the validity failure > interval is 5 seconds, NM will auto restart the container. > {code:java} > { > "name": "fail-demo2", > "version": "1.0.0", > "components" : > [ > { > "name": "comp1", > "number_of_containers": 1, > "launch_command": "fail_after 20", > "restart_policy": "NEVER", > "resource": { > "cpus": 1, > "memory": "256" > }, > "configuration": { > "properties": { > "yarn.service.container-failure.retry.max": 1, > "yarn.service.container-failure.validity-interval-ms": 5000 > } > } > } > ] > } > {code} > If {{restart_policy}} is NEVER, then the service should stop after the > container fails. > Since we have introduced, the service level Restart Policies, I think we > should make the NM auto retry configurations part of the {{RetryPolicy}} and > get rid of all {{yarn.service.container-failure.**}} properties. Otherwise it > gets confusing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
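[Editorial note] The conflict described in YARN-8360 is a precedence question between the service-level restart_policy and the NM-level auto-retry properties. A hypothetical sketch of the resolution the issue argues for (the restart policy wins) — the method and names are illustrative, not the actual RetryPolicy API:

```java
// Hypothetical resolution: with restart_policy NEVER, NM auto-retry settings
// such as yarn.service.container-failure.retry.max must not apply, so the
// service stops after the container fails.
public class RestartPolicyDemo {
    enum RestartPolicy { ALWAYS, ON_FAILURE, NEVER }

    static int effectiveMaxRetries(RestartPolicy policy, int nmRetryMax) {
        return policy == RestartPolicy.NEVER ? 0 : nmRetryMax;
    }

    public static void main(String[] args) {
        // restart_policy NEVER overrides retry.max=1 from the spec above:
        System.out.println(effectiveMaxRetries(RestartPolicy.NEVER, 1));
        // Other policies keep honoring the NM retry configuration:
        System.out.println(effectiveMaxRetries(RestartPolicy.ON_FAILURE, 1));
    }
}
```

This matches the closing suggestion in the comment: folding the NM auto-retry configuration into the service-level RetryPolicy removes the ambiguity.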
[jira] [Updated] (YARN-7129) Application Catalog for YARN applications
[ https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7129: Attachment: YARN-7129.004.patch > Application Catalog for YARN applications > - > > Key: YARN-7129 > URL: https://issues.apache.org/jira/browse/YARN-7129 > Project: Hadoop YARN > Issue Type: New Feature > Components: applications >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN Appstore.pdf, YARN-7129.001.patch, > YARN-7129.002.patch, YARN-7129.003.patch, YARN-7129.004.patch > > > YARN native services provides web services API to improve usability of > application deployment on Hadoop using collection of docker images. It would > be nice to have an application catalog system which provides an editorial and > search interface for YARN applications. This improves usability of YARN for > manage the life cycle of applications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8360) Yarn service conflict between restart policy and NM configuration
[ https://issues.apache.org/jira/browse/YARN-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539355#comment-16539355 ] genericqa commented on YARN-8360: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 37s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 35s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 15s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 11m 29s{color} | {color:green} hadoop-yarn-services-core in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 65m 4s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-8360 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931072/YARN-8360.1.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 7c7b4c6dcfef 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 4e59b92 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21206/testReport/ | | Max. process+thread count | 756 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21206/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Yarn service conflict
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180711.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180711.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180711.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7129) Application Catalog for YARN applications
[ https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539309#comment-16539309 ] genericqa commented on YARN-7129: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 16 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 44s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 55s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 29m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 32s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 27s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 2m 57s{color} | {color:red} hadoop-yarn-applications in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 1m 5s{color} | {color:red} hadoop-yarn-applications-catalog in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 1m 1s{color} | {color:red} hadoop-yarn-applications-catalog-webapp in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 18s{color} | {color:red} hadoop-yarn-applications-catalog-docker in the patch failed. {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 34m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 34m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 31s{color} | {color:red} hadoop-yarn-applications-catalog-docker in the patch failed. 
{color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 1s{color} | {color:red} The patch generated 8 new + 0 unchanged - 0 fixed = 8 total (was 0) {color} | | {color:orange}-0{color} | {color:orange} shelldocs {color} | {color:orange} 0m 40s{color} | {color:orange} The patch generated 158 new + 400 unchanged - 0 fixed = 558 total (was 400) {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 16s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 9s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog
[jira] [Updated] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-8515: -- Labels: Docker (was: ) > container-executor can crash with SIGPIPE after nodemanager restart > --- > > Key: YARN-8515 > URL: https://issues.apache.org/jira/browse/YARN-8515 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > > When running with docker on large clusters, we have noticed that sometimes > docker containers are not removed - they remain in the exited state, and the > corresponding container-executor is no longer running. Upon investigation, > we noticed that this always seemed to happen after a nodemanager restart. > The sequence leading to the stranded docker containers is: > # Nodemanager restarts > # Containers are recovered and then run for a while > # Containers are killed for some (legitimate) reason > # Container-executor exits without removing the docker container. > After reproducing this on a test cluster, we found that the > container-executor was exiting due to a SIGPIPE. > What is happening is that the shell command executor that is used to start > container-executor has threads reading from c-e's stdout and stderr. When > the NM is restarted, these threads are killed. Then when the > container-executor continues executing after the container exits with error, > it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is > not handled, this crashes the container-executor before it can actually > remove the docker container. > We ran into this in branch 2.8. The way docker containers are removed has > been completely redesigned in trunk, so I don't think it will lead to this > exact failure, but after an NM restart, potentially any write to stderr or > stdout in the container-executor could cause it to crash. 
[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539290#comment-16539290 ] Jim Brennan commented on YARN-8515: --- I have been able to repro this reliably on a test cluster. Repro steps are: # Start a sleep job with a lot of mappers sleeping for 50 seconds # On one worker node, kill the NM after a set of containers starts # Restart the NM # On the gateway, kill the application (before the current containers finish) This will leave the containers on the node where the nodemanager was restarted in the exited state. container-executor is not cleaning up the docker containers. Here is a strace of one of the container-executors when the application is killed: {noformat} -bash-4.2$ sudo strace -s 4096 -f -p 7176 strace: Process 7176 attached read(3, "143\n", 4096) = 4 close(3) = 0 wait4(7566, [\{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 7566 --- SIGCHLD \{si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7566, si_uid=0, si_status=0, si_utime=1, si_stime=0} --- munmap(0x7f233bfa4000, 4096) = 0 write(2, "Docker container exit code was not zero: 143\n", 45) = -1 EPIPE (Broken pipe) --- SIGPIPE \{si_signo=SIGPIPE, si_code=SI_USER, si_pid=7176, si_uid=0} --- +++ killed by SIGPIPE +++ {noformat} The problem is that when container-executor is started by the NM using the privileged operation executor, it attaches stream readers to stdout and stderr. When we restart the NM, these threads are killed. Then when the application is killed, it kills the running containers and container-executor returns from waiting for the docker container. When it tries to write an error message to stderr, it generates a SIGPIPE signal, because the other end of the pipe has been killed. Since we are not handling that signal, container-executor crashes and we never remove the docker container. I have verified that if I change container-executor to ignore SIGPIPE, the problem does not occur. 
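The effect of the fix described in this comment can be demonstrated with a minimal, self-contained C sketch. This is not the actual container-executor source; the function name and the simulated pipe are invented for illustration. It shows the POSIX behavior the strace captures: once SIGPIPE is ignored, a write() to a pipe whose reader has gone away fails with EPIPE instead of killing the process, so cleanup code after the write can still run.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Sketch: ignore SIGPIPE, then write to a pipe with no reader.
   Returns 0 when the write fails gracefully with EPIPE. */
int write_after_reader_gone(void) {
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    signal(SIGPIPE, SIG_IGN);   /* the one-line fix: ignore SIGPIPE */
    close(fds[0]);              /* simulate the NM-side reader thread dying */

    const char *msg = "Docker container exit code was not zero: 143\n";
    ssize_t n = write(fds[1], msg, strlen(msg));
    close(fds[1]);

    /* Without SIG_IGN the process would have been killed by SIGPIPE at
       the write() above; with it, we just observe EPIPE and continue,
       which is exactly what lets cleanup (docker rm) proceed. */
    return (n == -1 && errno == EPIPE) ? 0 : -1;
}
```

Ignoring SIGPIPE (or using sigaction with SA_RESTART-style handling) is the conventional remedy for daemons whose stdout/stderr may be a pipe with a transient reader.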
> container-executor can crash with SIGPIPE after nodemanager restart > --- > > Key: YARN-8515 > URL: https://issues.apache.org/jira/browse/YARN-8515 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > When running with docker on large clusters, we have noticed that sometimes > docker containers are not removed - they remain in the exited state, and the > corresponding container-executor is no longer running. Upon investigation, > we noticed that this always seemed to happen after a nodemanager restart. > The sequence leading to the stranded docker containers is: > # Nodemanager restarts > # Containers are recovered and then run for a while > # Containers are killed for some (legitimate) reason > # Container-executor exits without removing the docker container. > After reproducing this on a test cluster, we found that the > container-executor was exiting due to a SIGPIPE. > What is happening is that the shell command executor that is used to start > container-executor has threads reading from c-e's stdout and stderr. When > the NM is restarted, these threads are killed. Then when the > container-executor continues executing after the container exits with error, > it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is > not handled, this crashes the container-executor before it can actually > remove the docker container. > We ran into this in branch 2.8. The way docker containers are removed has > been completely redesigned in trunk, so I don't think it will lead to this > exact failure, but after an NM restart, potentially any write to stderr or > stdout in the container-executor could cause it to crash. 
[jira] [Updated] (YARN-8360) Yarn service conflict between restart policy and NM configuration
[ https://issues.apache.org/jira/browse/YARN-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suma Shivaprasad updated YARN-8360: --- Attachment: YARN-8360.1.patch > Yarn service conflict between restart policy and NM configuration > -- > > Key: YARN-8360 > URL: https://issues.apache.org/jira/browse/YARN-8360 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Chandni Singh >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-8360.1.patch > > > For the below spec, the service will not stop even after container failures > because of the NM auto retry properties: > * "yarn.service.container-failure.retry.max": 1, > * "yarn.service.container-failure.validity-interval-ms": 5000 > The NM will continue auto-restarting containers. > {{fail_after 20}} fails after 20 seconds. Since the validity failure > interval is 5 seconds, the NM will auto restart the container. > {code:java} > { > "name": "fail-demo2", > "version": "1.0.0", > "components" : > [ > { > "name": "comp1", > "number_of_containers": 1, > "launch_command": "fail_after 20", > "restart_policy": "NEVER", > "resource": { > "cpus": 1, > "memory": "256" > }, > "configuration": { > "properties": { > "yarn.service.container-failure.retry.max": 1, > "yarn.service.container-failure.validity-interval-ms": 5000 > } > } > } > ] > } > {code} > If {{restart_policy}} is NEVER, then the service should stop after the > container fails. > Since we have introduced the service-level restart policies, I think we > should make the NM auto retry configurations part of the {{RetryPolicy}} and > get rid of all {{yarn.service.container-failure.**}} properties. Otherwise it > gets confusing.
[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539286#comment-16539286 ] Jim Brennan commented on YARN-8515: --- Here is an example case that we saw: Docker ps info for this container: {noformat} 968e4a1a0fca 90188f3d752e "bash /grid/4/tmp/..." 6 days ago Exited (143) 6 days ago container_e07_1528760012992_2875921_01_69 {noformat} NM Log with some added info from Docker container and journalctl to show where the docker container started/exited: {noformat} 2018-06-27 16:32:48,779 [IPC Server handler 9 on 8041] INFO containermanager.ContainerManagerImpl: Start request for container_e07_1528760012992_2875921_01_69 by user p_condor 2018-06-27 16:32:48,782 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Adding container_e07_1528760012992_2875921_01_69 to application application_1528760012992_2875921 2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from NEW to LOCALIZING 2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Initializing container container_e07_1528760012992_2875921_01_69 2018-06-27 16:32:48,786 [AsyncDispatcher event handler] INFO localizer.ResourceLocalizationService: Created localizer for container_e07_1528760012992_2875921_01_69 2018-06-27 16:32:48,786 [LocalizerRunner for container_e07_1528760012992_2875921_01_69] INFO localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /grid/4/tmp/yarn-local/nmPrivate/container_e07_1528760012992_2875921_01_69.tokens. 
Credentials list: 2018-06-27 16:32:52,654 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZING to LOCALIZED 2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZED to RUNNING 2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e07_1528760012992_2875921_01_69 2018-06-27 16:32:53.345 Docker container started 2018-06-27 16:32:54,429 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id container_e07_1528760012992_2875921_01_69: 132.5 MB of 3 GB physical memory used; 4.3 GB of 6.3 GB virtual memory used 2018-06-27 16:33:25,422 [main] INFO nodemanager.NodeManager: STARTUP_MSG: / STARTUP_MSG: Starting NodeManager STARTUP_MSG: user = mapred STARTUP_MSG: host = gsbl607n22.blue.ygrid.yahoo.com/10.213.59.232 STARTUP_MSG: args = [] STARTUP_MSG: version = 2.8.3.2.1806111934 2018-06-27 16:33:31,140 [main] INFO containermanager.ContainerManagerImpl: Recovering container_e07_1528760012992_2875921_01_69 in state LAUNCHED with exit code -1000 2018-06-27 16:33:31,140 [main] INFO application.ApplicationImpl: Adding container_e07_1528760012992_2875921_01_69 to application application_1528760012992_2875921 2018-06-27 16:33:32,771 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers: 2018-06-27 16:33:33,280 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers: 2018-06-27 16:33:33,178 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers: 2018-06-27 16:33:33,776 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from NEW to LOCALIZING 2018-06-27 16:33:34,393 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: 
Initializing container container_e07_1528760012992_2875921_01_69 2018-06-27 16:33:34,433 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZING to LOCALIZED 2018-06-27 16:33:34,461 [ContainersLauncher #23] INFO nodemanager.ContainerExecutor: Reacquiring container_e07_1528760012992_2875921_01_69 with pid 103072 2018-06-27 16:33:34,463 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZED to RUNNING 2018-06-27 16:33:34,482 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e07_1528760012992_2875921_01_69 2018-06-27 16:33:35,304 [main] INFO nodemanager.NodeStatusUpdaterImpl: Sending out 598 NM container statuses: 2018-06-27 16:33:35,356 [main] INFO nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers 2018-06-27 16:33:35,902 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id
[jira] [Commented] (YARN-8421) when moving app, activeUsers is increased, even though app does not have outstanding request
[ https://issues.apache.org/jira/browse/YARN-8421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539269#comment-16539269 ] Eric Payne commented on YARN-8421: -- [~kyungwan nam], Thank you for providing the fix for this problem. The fix looks good and the unit test is doing a good job of testing what I would expect it to test. The failed unit tests in the latest pre-commit build ({{TestAMRestart}} / {{TestQueueManagementDynamicEditPolicy}}) are not failing for me in my local build environment. The only minor problem with the latest patch is that the parameters to the assertions in the test are backwards. That is, the "expected" value should come first and the "actual" value should come second. > when moving app, activeUsers is increased, even though app does not have > outstanding request > - > > Key: YARN-8421 > URL: https://issues.apache.org/jira/browse/YARN-8421 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: kyungwan nam >Priority: Major > Attachments: YARN-8421.001.patch, YARN-8421.002.patch > > > All containers for app1 have been allocated. > Move app1 from the default Queue to the test Queue as follows. > {code} > yarn rmadmin application -movetoqueue app1 -queue test > {code} > _activeUsers_ of the test Queue is increased even though app1 does not > have an outstanding request.
[jira] [Created] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
Jim Brennan created YARN-8515: - Summary: container-executor can crash with SIGPIPE after nodemanager restart Key: YARN-8515 URL: https://issues.apache.org/jira/browse/YARN-8515 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan Assignee: Jim Brennan When running with docker on large clusters, we have noticed that sometimes docker containers are not removed - they remain in the exited state, and the corresponding container-executor is no longer running. Upon investigation, we noticed that this always seemed to happen after a nodemanager restart. The sequence leading to the stranded docker containers is: # Nodemanager restarts # Containers are recovered and then run for a while # Containers are killed for some (legitimate) reason # Container-executor exits without removing the docker container. After reproducing this on a test cluster, we found that the container-executor was exiting due to a SIGPIPE. What is happening is that the shell command executor that is used to start container-executor has threads reading from c-e's stdout and stderr. When the NM is restarted, these threads are killed. Then when the container-executor continues executing after the container exits with error, it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is not handled, this crashes the container-executor before it can actually remove the docker container. We ran into this in branch 2.8. The way docker containers are removed has been completely redesigned in trunk, so I don't think it will lead to this exact failure, but after an NM restart, potentially any write to stderr or stdout in the container-executor could cause it to crash.
[jira] [Updated] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires
[ https://issues.apache.org/jira/browse/YARN-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8514: Affects Version/s: 2.9.2 2.9.0 3.0.0 3.1.0 2.9.1 3.0.1 3.0.2 > YARN RegistryDNS throws NPE when Kerberos tgt expires > - > > Key: YARN-8514 > URL: https://issues.apache.org/jira/browse/YARN-8514 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.9.1, 3.0.1, 3.0.2, 2.9.2 >Reporter: Eric Yang >Priority: Critical > > After Kerberos ticket expires, RegistryDNS throws NPE error: > {code:java} > 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT > Renewer for rm/host1.example@example.com,5,main] threw an Exception. > java.lang.NullPointerException > at > javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) > at > org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) > at java.lang.Thread.run(Thread.java:745){code}
[jira] [Commented] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires
[ https://issues.apache.org/jira/browse/YARN-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539164#comment-16539164 ] Eric Yang commented on YARN-8514: - This NPE is introduced by YARN-4983. UgiMetrics will not be initialized in the UGI class unless there is external code that calls: {code:java} UserGroupInformation.reattachMetrics();{code} There is a possibility that other new processes will encounter the same NPE in a Kerberos-enabled environment. It would be great if the reattachMetrics call could happen automatically, without external invocation. > YARN RegistryDNS throws NPE when Kerberos tgt expires > - > > Key: YARN-8514 > URL: https://issues.apache.org/jira/browse/YARN-8514 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Yang >Priority: Critical > > After Kerberos ticket expires, RegistryDNS throws NPE error: > {code:java} > 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT > Renewer for rm/host1.example@example.com,5,main] threw an Exception. > java.lang.NullPointerException > at > javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) > at > org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) > at java.lang.Thread.run(Thread.java:745){code}
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539156#comment-16539156 ] genericqa commented on YARN-4606: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 22s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 58s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 10s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 18s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 71m 42s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}131m 6s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-4606 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931034/YARN-4606.006.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 0c5bd396 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d503f65 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21204/testReport/ | | Max. process+thread count | 912 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21204/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > CapacityScheduler: applications could get
[jira] [Updated] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires
[ https://issues.apache.org/jira/browse/YARN-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8514: Description: After Kerberos ticket expires, RegistryDNS throws NPE error: {code:java} 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT Renewer for rm/host1.example@example.com,5,main] threw an Exception. java.lang.NullPointerException at javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) at java.lang.Thread.run(Thread.java:745){code} was: After Kerberos ticket expires, RegistryDNS throws NPE error: {code:java} 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT Renewer for rm/y001.l42scl.hortonworks@l42scl.hortonworks.com,5,main] threw an Exception. java.lang.NullPointerException at javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) at java.lang.Thread.run(Thread.java:745){code} > YARN RegistryDNS throws NPE when Kerberos tgt expires > - > > Key: YARN-8514 > URL: https://issues.apache.org/jira/browse/YARN-8514 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Yang >Priority: Critical > > After Kerberos ticket expires, RegistryDNS throws NPE error: > {code:java} > 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT > Renewer for rm/host1.example@example.com,5,main] threw an Exception. 
> java.lang.NullPointerException > at > javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) > at > org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) > at java.lang.Thread.run(Thread.java:745){code}
[jira] [Updated] (YARN-7129) Application Catalog for YARN applications
[ https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7129: Attachment: YARN-7129.003.patch > Application Catalog for YARN applications > - > > Key: YARN-7129 > URL: https://issues.apache.org/jira/browse/YARN-7129 > Project: Hadoop YARN > Issue Type: New Feature > Components: applications >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN Appstore.pdf, YARN-7129.001.patch, > YARN-7129.002.patch, YARN-7129.003.patch > > > YARN native services provides web services API to improve usability of > application deployment on Hadoop using a collection of docker images. It would > be nice to have an application catalog system which provides an editorial and > search interface for YARN applications. This improves the usability of YARN for > managing the life cycle of applications.
[jira] [Created] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires
Eric Yang created YARN-8514: --- Summary: YARN RegistryDNS throws NPE when Kerberos tgt expires Key: YARN-8514 URL: https://issues.apache.org/jira/browse/YARN-8514 Project: Hadoop YARN Issue Type: Bug Reporter: Eric Yang After Kerberos ticket expires, RegistryDNS throws NPE error: {code:java} 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT Renewer for rm/y001.l42scl.hortonworks@l42scl.hortonworks.com,5,main] threw an Exception. java.lang.NullPointerException at javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) at java.lang.Thread.run(Thread.java:745){code}
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539097#comment-16539097 ] Szilard Nemeth commented on YARN-8502: -- Thanks [~giovanni.fumarola] for the quick responses and for the commit! > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
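The preference for constants can be shown with a small sketch. The RMWSConsts name and the path value below are taken from the issue text; the enclosing demo class is an illustrative stand-in (the real class lives in the resourcemanager webapp package and is used from JAX-RS @Path annotations):

```java
// Sketch of the preferred style: each route string lives in exactly one
// constants class, so endpoints, clients, and tests cannot drift apart.
public class PathConstantsDemo {

    /** Stand-in for RMWSConsts: the single authoritative copy of each route. */
    static final class RMWSConsts {
        static final String APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS =
            "/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}";
        private RMWSConsts() {} // constants holder, never instantiated
    }

    public static void main(String[] args) {
        // An endpoint annotated with the constant and a test asserting on it
        // always agree, because there is only one definition to change:
        System.out.println(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS);
    }
}
```

With inline string literals, by contrast, renaming a path segment requires finding every copy by hand.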
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539092#comment-16539092 ] Hudson commented on YARN-8502: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14551 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14551/]) YARN-8502. Use path strings consistently for webservice endpoints in (gifuma: rev 82ac3aa6d0a83235cfac2805a444dd26efe5f9ce) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWSConsts.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539060#comment-16539060 ] genericqa commented on YARN-8468: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 33s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 7 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 1s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 7s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 25s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 69m 13s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}148m 37s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-8468 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931027/YARN-8468.003.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539027#comment-16539027 ] Wangda Tan commented on YARN-8512: -- Patch LGTM as well, thanks [~rohithsharma] for the fix. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > This is observed when the first attempt's master container dies and the second > attempt's master container is launched on an NM where old containers are running > but the master container is not. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container, i.e. container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In this scenario, the NM does not identify the flowContext and logs the warning > below: > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
[jira] [Updated] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola updated YARN-8502: --- Fix Version/s: 3.2.0 > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539009#comment-16539009 ] Giovanni Matteo Fumarola commented on YARN-8502: Thanks [~snemeth] for working on this. Committed to trunk. > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
[jira] [Created] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
Che Yufei created YARN-8513: --- Summary: CapacityScheduler infinite loop when queue is near fully utilized Key: YARN-8513 URL: https://issues.apache.org/jira/browse/YARN-8513 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 2.9.1 Environment: Ubuntu 14.04.5 YARN is configured with one label and 5 queues. Reporter: Che Yufei The ResourceManager sometimes stops responding to any request when a queue is nearly fully utilized. Sending SIGTERM won't stop the RM; only SIGKILL can. After an RM restart, it recovers running jobs and starts accepting new ones. The CapacityScheduler appears to be stuck in an infinite loop, printing the following log messages (more than 25,000 lines per second): {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.99816763 absoluteUsedCapacity=0.99816763 used= cluster=}} {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal}} {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1530619767030_1652_01 container=null queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 clusterResource= type=NODE_LOCAL requestedPartition=}} I have encountered this problem several times after upgrading to YARN 2.9.1, while the same configuration works fine under version 2.7.3. YARN-4477 is an infinite loop bug in the FairScheduler; I am not sure whether this is a similar problem.
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538998#comment-16538998 ] Giovanni Matteo Fumarola commented on YARN-8502: Ok. I will open a Jira to fix those. +1 from my side. Committing to trunk. > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538996#comment-16538996 ] Rohith Sharma K S commented on YARN-8512: - It looks like QA is trying to execute native binaries which are not part of the patch, so the {color:red}-1 unit {color} is unrelated to the patch. {code} [ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.2.0-SNAPSHOT:cmake-test (test-container-executor) on project hadoop-yarn-server-nodemanager: Test /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/target/usr/local/bin/test-container-executor returned ERROR CODE 1 -> [Help 1] {code} > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > This is observed when the first attempt's master container dies and the second > attempt's master container is launched on an NM where old containers are running > but the master container is not. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container, i.e. container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In this scenario, the NM does not identify the flowContext and logs the warning > below: > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538967#comment-16538967 ] genericqa commented on YARN-8512: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 30s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 50s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 6s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 27s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 72m 32s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-8512 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931029/YARN-8512.03.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 874f755939ab 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d503f65 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/21203/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21203/testReport/ | | Max. process+thread count | 407 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21203/console | | Powered by |
[jira] [Updated] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manikandan R updated YARN-4606: --- Attachment: YARN-4606.006.patch > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belonging to the same user in a LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers that user > an active user. This can lead to starvation of active applications, for example: > - App1 (belongs to user1)/app2 (belongs to user2) are active; app3 (belongs to > user3)/app4 (belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, only two users (user1/user2) are able to allocate new > resources, so the computed user-limit-resource could be lower than expected.
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538949#comment-16538949 ] Manikandan R commented on YARN-4606: Fixed whitespace-related issues. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, > YARN-4606.POC.patch > > > Currently, if all applications belonging to the same user in a LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers that user > an active user. This can lead to starvation of active applications, for example: > - App1 (belongs to user1)/app2 (belongs to user2) are active; app3 (belongs to > user3)/app4 (belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, only two users (user1/user2) are able to allocate new > resources, so the computed user-limit-resource could be lower than expected.
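The starvation described in the issue is straightforward arithmetic: the user limit roughly divides the queue's resources by the active-user count, so counting users whose apps are all pending shrinks every real user's share. The sketch below is a deliberate simplification of that idea, not the CapacityScheduler's actual user-limit formula:

```java
// Simplified illustration of YARN-4606: user-limit ~= queueResource / #activeUsers.
// Counting pending-only users (4 instead of 2) halves the share of the users
// that can actually allocate. This is a sketch, not the scheduler's real code.
public class UserLimitSketch {

    static int userLimitMb(int queueResourceMb, int activeUsers) {
        return queueResourceMb / Math.max(1, activeUsers); // guard against zero users
    }

    public static void main(String[] args) {
        int queueMb = 100_000;
        // Before the fix: all four users counted, two of them pending-only.
        System.out.println("limit counting pending users: " + userLimitMb(queueMb, 4)); // 25000
        // Counting only the two users that can actually allocate resources.
        System.out.println("limit counting active users:  " + userLimitMb(queueMb, 2)); // 50000
    }
}
```

The gap between the two numbers is exactly the "lower than expected" user-limit-resource the description refers to.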
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538849#comment-16538849 ] Sunil Govindan commented on YARN-8512: -- Thanks [~rohithsharma]. Patch seems fine to me. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > This is observed when the first attempt's master container dies and the second > attempt's master container is launched on an NM where old containers are running > but the master container is not. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container, i.e. container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In this scenario, the NM does not identify the flowContext and logs the warning > below: > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
[jira] [Comment Edited] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538831#comment-16538831 ] Antal Bálint Steinbach edited comment on YARN-8468 at 7/10/18 3:54 PM: --- Hi [~snemeth] Thanks for the feedback. I applied all of your points except for removing _QueueMaxContainerAllocationValidator.createExceptionText_ from the test. I used it because I was testing whether the parameters passed to the exception are correct, not validating the error message text. Balint was (Author: bsteinbach): Hi [~snemeth] Thanks for the feedback. I applied all of them except for removing _QueueMaxContainerAllocationValidator.createExceptionText_ from the test. I used it because I was testing whether the parameters passed to the exception are correct, not validating the error message text. Balint > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch, YARN-8468.003.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: a user has two pools, one for ad hoc jobs and one for enterprise > apps, and wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb sets a default maximum > container size for all queues; the per-queue maximum is then set with the > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * setting it on the root would override the scheduler setting, so we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc. for the queue. > * write JUnit tests. > * update the scheduler documentation.
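The lookup the suggested solution calls for — a per-queue cap that falls back to the scheduler-wide cap and never exceeds it — can be sketched as follows. The flat map and all names are illustrative stand-ins, not the patch's actual types:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a per-queue maximum-allocation lookup. A queue
// without its own cap inherits the scheduler-wide maximum; a configured cap
// is clamped so it can never exceed the scheduler-wide maximum.
public class QueueMaxAllocationSketch {

    // Stand-in for yarn.scheduler.maximum-allocation-mb.
    static final int SCHEDULER_MAX_MB = 8192;

    // Stand-in for per-queue "maxContainerResources" configuration.
    static final Map<String, Integer> queueMaxMb = new HashMap<>();
    static {
        queueMaxMb.put("root.adhoc", 2048); // ad hoc jobs: small containers only
    }

    static int getMaximumAllocationMb(String queueName) {
        Integer queueCap = queueMaxMb.get(queueName);
        if (queueCap == null) {
            return SCHEDULER_MAX_MB;                 // not configured: use the default
        }
        return Math.min(queueCap, SCHEDULER_MAX_MB); // queue cap bounded by scheduler cap
    }

    public static void main(String[] args) {
        System.out.println(getMaximumAllocationMb("root.adhoc"));      // 2048
        System.out.println(getMaximumAllocationMb("root.enterprise")); // 8192
    }
}
```

The clamp in the last return implements the "queue resource cap cannot be larger than the scheduler max resource cap" item from the list above.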
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538831#comment-16538831 ] Antal Bálint Steinbach commented on YARN-8468: -- Hi [~snemeth] Thanks for the feedback. I applied all of them except for removing _QueueMaxContainerAllocationValidator.createExceptionText_ from the test. I used it because I was testing if the parameters are correct for the exception not for validating the error message text. Balint > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch, YARN-8468.003.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers and cannot be limited by queue or and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per queue basis. > > The use case: User has two pools, one for ad hoc jobs and one for enterprise > apps. User wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb sets a default value for maximum > container size for all queues and setting maximum resources per queue with > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf), this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting and we should > not allow that. 
> * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler. > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue. > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics, etc., for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
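For illustration, the per-queue cap from the use case above might look like this in the fair-scheduler allocation file. This is a sketch only: the element name follows the suggested "maxContainerResources" config value and the value format mimics existing FairScheduler resource settings; neither is taken from the actual patch.

```xml
<!-- fair-scheduler.xml (illustrative): cap ad hoc containers at 2 GB / 2 vcores,
     while the enterprise queue falls back to the global
     yarn.scheduler.maximum-allocation-mb limit. -->
<allocations>
  <queue name="adhoc">
    <maxContainerResources>2048 mb, 2 vcores</maxContainerResources>
  </queue>
  <queue name="enterprise">
    <!-- no per-queue cap: the scheduler-wide maximum applies -->
  </queue>
</allocations>
```

Per the suggested solution, setting such an element on the root queue would be rejected, since it would override the scheduler-wide setting.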
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538830#comment-16538830 ] Rohith Sharma K S commented on YARN-8512: - Updated the patch with a test case added. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > This is observed when the 1st attempt's master container dies and the 2nd attempt's > master container is launched on an NM where old containers are running but the > master container is not. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container, i.e. container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, the NM doesn't identify the flowContext and logs the warning > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
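The WARN lines quoted above come from a guard that skips the HBase write when any flow-context field is null. A minimal Python sketch of that guard's behavior (names are illustrative; the real check lives in HBaseTimelineWriterImpl.write, per the log):

```python
def should_write(flow_name, app_id, user_id, cluster_id):
    """Sketch of the guard behind the WARN above: refuse to write
    to HBase when any flow-context field is missing (None)."""
    fields = {"flowName": flow_name, "appId": app_id,
              "userId": user_id, "clusterId": cluster_id}
    missing = [name for name, value in fields.items() if value is None]
    if missing:
        # Mirrors: "Found null for one of: ... Not proceeding with writing to hbase"
        print("Found null for one of: %s. Not proceeding with writing to hbase"
              % ", ".join(missing))
        return False
    return True
```

When the second attempt's NM never learned the flow context, flowName stays None, so every entity write is dropped silently apart from this warning.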
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Attachment: YARN-8512.03.patch > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antal Bálint Steinbach updated YARN-8468: - Attachment: YARN-8468.003.patch > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch, YARN-8468.003.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: User has two pools, one for ad hoc jobs and one for enterprise > apps. User wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb sets a default maximum > container size for all queues; the per-queue maximum would then be set with the > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting, and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. 
> * implement getMaximumResourceCapability(String queueName) in the > FairScheduler. > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue. > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics, etc., for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538779#comment-16538779 ] Hudson commented on YARN-8473: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14549 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14549/]) YARN-8473. Containers being launched as app tears down can leave (sunilg: rev 705e2c1f7cba51496b0d019ecedffbe5fb55c28b) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/TestApplication.java > Containers being launched as app tears down can leave containers in NEW state > - > > Key: YARN-8473 > URL: https://issues.apache.org/jira/browse/YARN-8473 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.4 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Major > Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 2.8.5, 3.0.4 > > Attachments: YARN-8473.001.patch, YARN-8473.002.patch, > YARN-8473.003.patch > > > I saw a case where containers were stuck on a nodemanager in the NEW state > because they tried to launch just as an application was tearing down. The > container sent an INIT_CONTAINER event to the ApplicationImpl which then > executed an invalid transition since that event is not handled/expected when > the application is in the process of tearing down. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
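The failure mode described above — an event arriving in a state that has no transition registered for it — can be sketched with a toy state machine. Everything below is illustrative (YARN's real ApplicationImpl uses a much larger transition table); the gist of the fix is to register the late INIT_CONTAINER event as tolerated during teardown instead of letting it trigger an invalid transition:

```python
class InvalidTransitionError(Exception):
    """Raised when an event has no transition from the current state."""


class AppStateMachine:
    """Toy sketch of a transition-table state machine (not YARN's API)."""

    def __init__(self, transitions, tolerated=(), start="RUNNING"):
        self.state = start
        self._transitions = dict(transitions)  # (state, event) -> next state
        self._tolerated = set(tolerated)       # (state, event) pairs to ignore

    def handle(self, event):
        key = (self.state, event)
        if key in self._transitions:
            self.state = self._transitions[key]
        elif key in self._tolerated:
            pass  # e.g. a container launching just as the app tears down
        else:
            raise InvalidTransitionError(
                "event %s not expected in state %s" % (event, self.state))
```

Without the `tolerated` entry, the late INIT_CONTAINER raises and the container is left stranded in NEW; with it, the event is absorbed harmlessly.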
[jira] [Commented] (YARN-8501) Reduce complexity of RMWebServices' getApps method
[ https://issues.apache.org/jira/browse/YARN-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538770#comment-16538770 ] Szilard Nemeth commented on YARN-8501: -- Hi [~Zian Chen]! As my IntelliJ marks this method with a warning ("Method is too complex to analyze by data flow algorithm"), I was thinking about these: 1. Eliminate the boolean flags. 2. Separate validation and throwing exceptions from the rest of the code. 3. Use a builder that creates a {{GetApplicationsRequest}} from the provided query parameters. 4. Add some test cases in order to verify I don't break anything. Do you have anything to add to this list? Thanks! > Reduce complexity of RMWebServices' getApps method > -- > > Key: YARN-8501 > URL: https://issues.apache.org/jira/browse/YARN-8501 > Project: Hadoop YARN > Issue Type: Improvement > Components: restapi >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
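Point 3 in the list above can be sketched generically. A hedged Python illustration (the real class would be a Java builder for {{GetApplicationsRequest}}; every name below is hypothetical): validation fails fast, unset parameters are simply skipped, and no boolean flags survive to the end of the method.

```python
class GetApplicationsRequestBuilder:
    """Hypothetical sketch of point 3: collect query parameters,
    validate each one up front (point 2), keep no boolean flags
    (point 1). Only build() materializes the request."""

    def __init__(self):
        self._params = {}

    def with_states(self, states):
        if states:  # an absent parameter is simply not recorded
            self._params["states"] = set(states)
        return self

    def with_limit(self, limit):
        if limit is not None:
            if limit < 0:
                raise ValueError("limit must be non-negative")
            self._params["limit"] = limit
        return self

    def build(self):
        # Stand-in for constructing the actual GetApplicationsRequest.
        return dict(self._params)
```

With this shape, getApps reduces to one chained call per query parameter plus a single build(), which is what makes the data-flow analysis tractable again.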
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538763#comment-16538763 ] Rohith Sharma K S commented on YARN-8383: - Sure. I am doing verification and will commit it later today. Thanks > TimelineServer 1.5 start fails with NoClassDefFoundError > > > Key: YARN-8383 > URL: https://issues.apache.org/jira/browse/YARN-8383 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8383.001-branch-2.8.patch > > > TimelineServer 1.5 start fails with NoClassDefFoundError. > {noformat} > 2018-05-31 22:10:58,548 FATAL > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: > Error starting ApplicationHistoryServer > java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory > at > org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at > 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190) > Caused by: java.lang.ClassNotFoundException: > com.fasterxml.jackson.core.JsonFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 15 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538764#comment-16538764 ] Szilard Nemeth commented on YARN-8502: -- Hey [~giovanni.fumarola]! I would vote for a separate jira, as the thing you mentioned is not strictly related to constants or endpoint paths and could confuse anyone looking into the git log. Thanks! > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8502-001.patch > > > Currently there are 2 types of endpoint path definitions: > 1. with a string, example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Most preferably, constants should be used for all Paths. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538755#comment-16538755 ] Jason Lowe commented on YARN-8383: -- bq. change in HDFS is an incompatible change from branch-2.8 to branch-2.9 or branch-2 from jobs perspective right? Yes, you're right. We may be forced to do another round of shading of jackson in HDFS as we did for YARN in 2.8. Arguably that's a separate JIRA, and this one can focus on the fix for 2.8.x. > TimelineServer 1.5 start fails with NoClassDefFoundError > > > Key: YARN-8383 > URL: https://issues.apache.org/jira/browse/YARN-8383 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8383.001-branch-2.8.patch > > > TimelineServer 1.5 start fails with NoClassDefFoundError. > {noformat} > 2018-05-31 22:10:58,548 FATAL > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: > Error starting ApplicationHistoryServer > java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory > at > org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190) > Caused by: java.lang.ClassNotFoundException: > com.fasterxml.jackson.core.JsonFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 15 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538750#comment-16538750 ] genericqa commented on YARN-8512: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 34s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 0s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 25s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 17m 46s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 79m 9s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-8512 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931006/YARN-8512.02.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux d2154bee076c 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ca8b80b | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/21201/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21201/testReport/ | | Max. process+thread count | 312 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as a > countable resource. > However, GPU placement is also very important to deep learning jobs for better > efficiency. > For example, a 2-GPU job running on GPUs {0, 1} could be faster than one on GPUs > {0, 7}, if GPUs 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > supports fine-grained GPU placement. > A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding bit position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
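The bitmap encoding described above is simple to sketch. A minimal Python illustration of the semantics (the helper names and the 4-GPUs-per-switch grouping are assumptions for this example; the actual patch encodes the bitmap in the YARN Resource):

```python
def available_gpus(bitmap):
    """Decode the 64-bit availability bitmap: bit i set to '1'
    means GPU i is free, as described in the JIRA."""
    return {i for i in range(64) if (bitmap >> i) & 1}


def same_pcie_switch(gpu_a, gpu_b, gpus_per_switch=4):
    # Illustrative locality check: assume consecutive groups of
    # gpus_per_switch GPUs hang off the same PCI-E switch.
    return gpu_a // gpus_per_switch == gpu_b // gpus_per_switch
```

Under that grouping, a scheduler preferring {0, 1} over {0, 7} is simply picking two set bits that share a switch.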
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as a > countable resource. > However, GPU placement is also very important to deep learning jobs for better > efficiency. > For example, a 2-GPU job running on GPUs {0, 1} could be faster than one on GPUs > {0, 7}, if GPUs 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > supports fine-grained GPU placement. > A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding bit position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538734#comment-16538734 ] Rohith Sharma K S commented on YARN-8383: - Ahh.. I misunderstood your earlier comment. Thanks for clarifying it. bq. Instead I was proposing adding the dependency back in for branch-2 and branch-2.9, since the jackson dependency is already there in those release lines due to HDFS pulling it in. Considering HDFS is already pulling jackson-core, it should be fine. My doubt is, CMIIW, change in HDFS is an incompatible change from branch-2.8 to branch-2.9 or branch-2 from jobs perspective right? ..since application classpath also refer to hdfs/lib. > TimelineServer 1.5 start fails with NoClassDefFoundError > > > Key: YARN-8383 > URL: https://issues.apache.org/jira/browse/YARN-8383 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8383.001-branch-2.8.patch > > > TimelineServer 1.5 start fails with NoClassDefFoundError. 
> {noformat} > 2018-05-31 22:10:58,548 FATAL > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: > Error starting ApplicationHistoryServer > java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory > at > org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190) > Caused by: java.lang.ClassNotFoundException: > com.fasterxml.jackson.core.JsonFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at 
java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 15 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538701#comment-16538701 ] Sunil Govindan commented on YARN-8473: -- Thanks [~jlowe]. I'll help to commit this. > Containers being launched as app tears down can leave containers in NEW state > - > > Key: YARN-8473 > URL: https://issues.apache.org/jira/browse/YARN-8473 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.4 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Major > Attachments: YARN-8473.001.patch, YARN-8473.002.patch, > YARN-8473.003.patch > > > I saw a case where containers were stuck on a nodemanager in the NEW state > because they tried to launch just as an application was tearing down. The > container sent an INIT_CONTAINER event to the ApplicationImpl which then > executed an invalid transition since that event is not handled/expected when > the application is in the process of tearing down. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538699#comment-16538699 ] Jason Lowe commented on YARN-8383: -- bq. I think adding dependencies share/hadoop/yarn/lib is right way to fix. But this change going to bring back YARN-6628 which will become compatible issue for older jobs right? I'm not proposing adding the dependency back for 2.8. The attached patch shades even more than we did before, so if anything we're removing dependencies from an app's point of view if this patch goes into 2.8. Instead I was proposing adding the dependency back in for branch-2 and branch-2.9, since the jackson dependency is already there in those release lines due to HDFS pulling it in. On those two branches shading YARN's jackson dependency isn't buying us anything from an app's perspective. > TimelineServer 1.5 start fails with NoClassDefFoundError > > > Key: YARN-8383 > URL: https://issues.apache.org/jira/browse/YARN-8383 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8383.001-branch-2.8.patch > > > TimelineServer 1.5 start fails with NoClassDefFoundError. 
> {noformat} > 2018-05-31 22:10:58,548 FATAL > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: > Error starting ApplicationHistoryServer > java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory > at > org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190) > Caused by: java.lang.ClassNotFoundException: > com.fasterxml.jackson.core.JsonFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at 
java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 15 more > {noformat}
[jira] [Comment Edited] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538572#comment-16538572 ] Rohith Sharma K S edited comment on YARN-8512 at 7/10/18 1:49 PM: -- The 01 patch seems to be a problem if we create a new ApplicationImpl and update it in the context, since that resets its state machine. We just need to update the existing flowContext object inside ApplicationImpl. I will upload a new patch with this change and cancel the existing patch. was (Author: rohithsharma): The 01 patch seems to be a problem if we update the ApplicationImpl, since that resets its state machine. In this case, we just need to update the existing field value inside ApplicationImpl. I will upload a new patch with this change and cancel the existing patch. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . 
Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
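The fix described in the edited comment above, mutating the flow context in place on the existing ApplicationImpl rather than replacing the object (which would reset its state machine), can be sketched as follows. All class and method names here are illustrative assumptions, not the actual NodeManager API or the patch's code:

```java
// Hypothetical sketch of the 02 patch approach: update the existing flow
// context object instead of creating a new ApplicationImpl, so the
// application's state machine is left untouched. Names are illustrative.
class FlowContextSketch {
    volatile String flowName;
    volatile String flowVersion;
    volatile long flowRunId;
}

class ApplicationSketch {
    private final FlowContextSketch flowContext = new FlowContextSketch();

    // Invoked when a later attempt's AM container starts on this NM carrying
    // flow information that the first attempt never delivered here.
    void setFlowContext(String name, String version, long runId) {
        flowContext.flowName = name;
        flowContext.flowVersion = version;
        flowContext.flowRunId = runId;
    }

    FlowContextSketch getFlowContext() {
        return flowContext;
    }
}
```

With this shape, any component already holding the Application reference sees the updated flow fields immediately, which is why no state-store or context replacement is needed.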
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538599#comment-16538599 ] Rohith Sharma K S commented on YARN-8512: - Attached 02 patch that sets flow context in existing ApplicationImpl. [~sunilg] Could you please review? > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Attachment: YARN-8512.02.patch > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710_old.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPUs as a > countable resource. > However, GPU placement is also very important for deep learning jobs, for better > efficiency. > For example, a 2-GPU job running on GPUs {0,1} could be faster than one running on GPUs > {0,7}, if GPUs 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add support to Hadoop 2.7.2 to enable GPU locality scheduling, which > supports fine-grained GPU placement. > A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage > and locality information on a node (up to 64 GPUs per node): a '1' in a bit position means > the corresponding GPU is available, and a '0' means it is not.
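The 64-bit bitmap described above can be manipulated with plain long operations. The following is a minimal sketch; the helper names and the fixed group size standing in for a PCI-E switch are assumptions for illustration, not the patch's actual code:

```java
// Minimal sketch of operations on the 64-bit GPU bitmap described above:
// bit i set to 1 means GPU i is available. Helper names and the fixed
// switch group size are illustrative assumptions, not the patch's code.
public class GpuBitmapSketch {
    // True if the bit for the given GPU index is set.
    public static boolean isAvailable(long bitmap, int gpu) {
        return ((bitmap >>> gpu) & 1L) == 1L;
    }

    // Number of GPUs currently available on the node.
    public static int availableCount(long bitmap) {
        return Long.bitCount(bitmap);
    }

    // Mark the GPUs in requestMask as used by clearing their bits.
    public static long allocate(long bitmap, long requestMask) {
        return bitmap & ~requestMask;
    }

    // Locality check under the assumption that GPUs are grouped under
    // PCI-E switches of a fixed size: with switchSize = 4, GPUs 0 and 1
    // share a switch while GPUs 0 and 7 do not.
    public static boolean sameSwitch(int gpuA, int gpuB, int switchSize) {
        return gpuA / switchSize == gpuB / switchSize;
    }
}
```

A scheduler could then prefer a request mask whose set bits all fall in one switch group, matching the {0,1}-faster-than-{0,7} example in the description.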
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710_old.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538579#comment-16538579 ] genericqa commented on YARN-8480: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 35s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 20 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 43s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 33s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 23s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 7m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 11s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 24s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 26s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 26s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 47s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 19s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 69m 4s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 24m 5s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 39s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}190m 53s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api | | | org.apache.hadoop.yarn.conf.YarnConfiguration.DEFAULT_RM_CONFIGURATION_PROVIDER_CLASS isn't final but should be
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538572#comment-16538572 ] Rohith Sharma K S commented on YARN-8512: - The 01 patch seems to be a problem if we update the ApplicationImpl, since that resets its state machine. In this case, we just need to update the existing field value inside ApplicationImpl. I will upload a new patch with this change and cancel the existing patch. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Summary: ATSv2 entities are not published to HBase from second attempt onwards (was: ATSv2 entities are not published to HBase) > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538556#comment-16538556 ] Rohith Sharma K S commented on YARN-8512: - Attached a patch with the following modifications to update the FlowContext in ApplicationImpl: # _if_ an ApplicationImpl reference is found in the context while starting the master container, _then_ ** create a new reference to ApplicationImpl, ** update the context, and ** update the NMStateStore so that NM recovery will pick up the newer ApplicationImpl. > ATSv2 entities are not published to HBase > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . 
Not proceeding with writing to hbase > {noformat}
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Affects Version/s: 3.0.3 3.2.0 2.10.0 3.1.0 Target Version/s: 3.1.1 > ATSv2 entities are not published to HBase > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Attachment: YARN-8512.01.patch > ATSv2 entities are not published to HBase > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538478#comment-16538478 ] genericqa commented on YARN-8511:
*-1 overall*
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 42s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 1m 48s | Maven dependency ordering for branch |
| +1 | mvninstall | 27m 44s | trunk passed |
| +1 | compile | 29m 35s | trunk passed |
| +1 | checkstyle | 0m 23s | trunk passed |
| +1 | mvnsite | 1m 34s | trunk passed |
| +1 | shadedclient | 13m 14s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 18s | trunk passed |
| +1 | javadoc | 1m 5s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 19s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 10s | the patch passed |
| +1 | compile | 29m 29s | the patch passed |
| +1 | javac | 29m 29s | the patch passed |
| +1 | checkstyle | 0m 25s | the patch passed |
| +1 | mvnsite | 1m 35s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 12s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 32s | the patch passed |
| +1 | javadoc | 1m 6s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 69m 45s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | unit | 10m 21s | hadoop-sls in the patch passed. |
| +1 | asflicense | 0m 39s | The patch does not generate ASF License warnings. |
| | | 206m 53s | |
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl |
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8511 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12930962/YARN-8511.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 4fe7932f3d45 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 9bd5bef |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| unit |
[jira] [Commented] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM
[ https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538425#comment-16538425 ] Bibin A Chundatt commented on YARN-8505: {quote} maxApplications and maxApplicationsPerUser. {quote} The above properties limit the total number of applications in a queue, not the number of running applications, IIUC. > AMLimit and userAMLimit check should be skipped for unmanaged AM > > > Key: YARN-8505 > URL: https://issues.apache.org/jira/browse/YARN-8505 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 2.9.2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8505.001.patch > > > The AMLimit and userAMLimit checks in LeafQueue#activateApplications should be > skipped for an unmanaged AM, whose resource is not taken from the YARN cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
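The proposed skip can be sketched as below. This is a minimal illustration, not the actual LeafQueue#activateApplications code; the class and parameter names (`AmLimitCheck`, `mayActivate`) are hypothetical:

```java
// Illustrative sketch of the YARN-8505 proposal: an unmanaged AM runs
// outside the cluster, so its AM consumes no YARN resources and the
// AMLimit/userAMLimit checks can be bypassed for it.
// All names here are hypothetical, not the real scheduler API.
public class AmLimitCheck {

    /**
     * Returns true if the application may be activated.
     * For unmanaged AMs the limit check is skipped entirely;
     * otherwise the pending AM resource must fit under the limit.
     */
    public static boolean mayActivate(boolean unmanagedAM,
                                      long amResourceUsedMB,
                                      long amResourceRequestMB,
                                      long amLimitMB) {
        if (unmanagedAM) {
            return true; // skip AMLimit and userAMLimit checks
        }
        return amResourceUsedMB + amResourceRequestMB <= amLimitMB;
    }
}
```

With this shape, an unmanaged AM activates even when the queue's AM limit is already exhausted, while managed AMs keep the existing behavior.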
[jira] [Created] (YARN-8512) ATSv2 entities are not published to HBase
Rohith Sharma K S created YARN-8512: --- Summary: ATSv2 entities are not published to HBase Key: YARN-8512 URL: https://issues.apache.org/jira/browse/YARN-8512 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Assignee: Rohith Sharma K S This is observed when the first-attempt master container dies and the second-attempt master container is launched on an NM that is already running other containers of the application but had not run the master container before. ||Attempt||NM1||NM2||Action|| |attempt-1|master container, i.e. container-1-1|container-1-2|master container died| |attempt-2|NA|container-1-2 and master container container-2-1|NA| In the above scenario, the NM does not identify the flowContext and logs the warning below: {noformat} 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: flowName=null appId=application_1531175172425_0001 userId=hbase clusterId=yarn-cluster . Not proceeding with writing to hbase 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: flowName=null appId=application_1531175172425_0001 userId=hbase clusterId=yarn-cluster . Not proceeding with writing to hbase {noformat}
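The guard that produces the warning quoted above can be sketched as follows. This is an assumption-laden simplification of the check in HBaseTimelineWriterImpl#write, not the actual implementation; the class and method names here are illustrative:

```java
// Sketch of the flow-context null guard behind the warning above:
// if any field needed for the HBase row key is null, the writer
// logs a warning and skips the write instead of proceeding.
// Names are simplified; the real check lives in
// HBaseTimelineWriterImpl#write.
public class FlowContextGuard {

    /**
     * Returns true only when every flow-context field required to
     * build the HBase row key is present.
     */
    public static boolean canWrite(String clusterId, String userId,
                                   String flowName, String appId) {
        return clusterId != null && userId != null
            && flowName != null && appId != null;
    }
}
```

In the bug scenario the NM never learned the flowName for the second attempt, so `flowName` is null and every write for that application is dropped.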
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPUs as a > countable resource. > However, GPU placement is also very important to deep learning jobs for better > efficiency. > For example, a 2-GPU job running on GPUs {0,1} could be faster than one running on GPUs > {0,7}, if GPUs 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add support to Hadoop 2.7.2 to enable GPU locality scheduling, which > supports fine-grained GPU placement. > A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage > and locality information on a node (up to 64 GPUs per node): '1' means > available and '0' otherwise in the corresponding bit position.
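The 64-bit bitmap described above can be sketched as plain bit operations on a Java `long`. This is an illustrative sketch only: the group size of 4 GPUs per PCI-E switch is an assumed topology, and none of these names come from the patch:

```java
// Sketch of the 64-bit GPU bitmap from YARN-7481: bit i set to 1
// means GPU i is available on the node. The mapping of GPU index
// to PCI-E switch (4 GPUs per switch here) is a hypothetical
// topology chosen for illustration.
public class GpuBitmap {

    static final int GPUS_PER_SWITCH = 4; // assumed topology

    /** True if GPU {@code gpu} is marked available in the bitmap. */
    public static boolean isAvailable(long bitmap, int gpu) {
        return ((bitmap >>> gpu) & 1L) == 1L;
    }

    /**
     * True if both GPUs are available and sit under the same
     * PCI-E switch, i.e. a locality-friendly 2-GPU placement.
     */
    public static boolean colocated(long bitmap, int a, int b) {
        return isAvailable(bitmap, a) && isAvailable(bitmap, b)
            && a / GPUS_PER_SWITCH == b / GPUS_PER_SWITCH;
    }
}
```

With the example from the description, a bitmap with GPUs 0, 1, and 7 available lets the scheduler prefer {0,1} (same switch) over {0,7} (different switches).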
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch
[jira] [Commented] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538344#comment-16538344 ] Szilard Nemeth commented on YARN-8480: -- Uploaded a second patch that fixes the trailing-whitespace issues and one findbugs issue. The other findbugs issue, which complains that DEFAULT_RM_CONFIGURATION_PROVIDER_CLASS should be final, could not be fixed because the static initializer method in TestResource must modify this field in order to work correctly. > Add boolean option for resources > > > Key: YARN-8480 > URL: https://issues.apache.org/jira/browse/YARN-8480 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Daniel Templeton >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8480.001.patch, YARN-8480.002.patch > > > Make it possible to define a resource with a boolean value.
[jira] [Updated] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-8480: - Attachment: YARN-8480.002.patch
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538323#comment-16538323 ] Szilard Nemeth commented on YARN-8468: --
Hi [~bsteinbach]! Thanks for the patch. This is high-quality code. I noticed a couple of things:
- {{AllocationFileQueueParser: MAX_CONTAINER_RESOURCES}} could be package-private (without any modifier).
- {{QueueMaxContainerAllocationValidator.createExceptionText}}: please use {{String.format()}} instead of concatenating the parts of the string.
- {{QueueMaxContainerAllocationValidator}}: you used the method {{throwException}} twice, and you also used {{throw new YarnRuntimeException}} as is. You should either use the method for all three invocations or use {{throw new YarnRuntimeException()}} everywhere. I prefer the latter.
- {{QueueMaxContainerAllocationValidator.validate}}: I would use a message like this instead: "Invalid queue resource allocation, it is not allowed to override " + MAX_CONTAINER_RESOURCES + " for the root queue!"
- {{QueueMaxContainerAllocationValidator.validate}}: logging maxMem and maxCores on INFO level is unnecessary. I would not log these at all, not even on DEBUG level, as they hold no meaningful information for users.
- {{QueueMaxContainerAllocationValidator.checkContainerResources}}: same as above, remove the queueMem and queueCores log statements.
- {{AllocationConfiguration.queueMaxContainerResourcesMap}}: please add a comment about what this field is for, as we have comments for the other fields as well.
- {{FSLeafQueue.getMaximumResourceCapability // FSParentQueue.getMaximumResourceCapability}}: I noticed in passing that a space is missing between the "if" and the parenthesis.
- {{TestQueueMaxContainerAllocationValidator}}: I think the convention is to use method names like {{testXXX}}, so {{tooHighMemoryMaxContainerAllocationTest}} should change to {{testTooHighMemoryMaxContainerAllocation}}. In addition, I would rename it to {{testMaxContainerAllocationWithTooHighMemory}} and the rest of the methods similarly.
- {{TestQueueMaxContainerAllocationValidator}}: please don't use {{QueueMaxContainerAllocationValidator.createExceptionText}} in the tests; if the production code generates the text in a wrong format, this test won't fail. I would simply use plain Strings here to assert the message.
- {{TestFairScheduler}}: once again, the convention for method names is testXXX.
- In the {{FairScheduler.md}} documentation, I would replace "This property is invalid for root queue." with "This property cannot be defined for the root queue".
Please fix the lines longer than 80 chars; I saw at least one occurrence in {{FairSchedulerTestBase}} and {{TestFairScheduler}}.
> Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: a user has two pools, one for ad hoc jobs and one for enterprise > apps, and wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb provides a default maximum container > size for all queues, while the per-queue maximum is set with the > "maxContainerResources" queue config value.
> > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting, and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler. > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue. > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc. for the queue. > * write JUnit tests. > * update the scheduler documentation.
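The {{String.format()}} suggestion from the review above can be sketched as follows. The class, method, and message wording here are hypothetical, not the actual patch code:

```java
// Illustrates the review suggestion for YARN-8468: build the
// exception text with String.format instead of concatenating
// string fragments. All names and the message text are
// hypothetical examples, not the real createExceptionText.
public class ExceptionText {

    public static String create(String queueName, long maxMemMB,
                                int maxVcores) {
        // One template string keeps the message readable and makes
        // it hard to drop a space or separator between fragments.
        return String.format(
            "Queue %s cannot allow containers larger than "
                + "(memory=%d MB, vCores=%d).",
            queueName, maxMemMB, maxVcores);
    }
}
```

Tests would then assert against a literal expected string rather than calling the same production helper, so a formatting regression in the helper still fails the test.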
[jira] [Comment Edited] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538323#comment-16538323 ] Szilard Nemeth edited comment on YARN-8468 at 7/10/18 9:41 AM: ---
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710.patch)
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710.patch)
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch