[jira] [Comment Edited] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539595#comment-16539595 ]

Weiwei Yang edited comment on YARN-8511 at 7/11/18 5:56 AM:
------------------------------------------------------------
Fix UT failure in v2 patch.

was (Author: cheersyang):
Fix UT failure in v2 patch...

> When AM releases a container, RM removes allocation tags before it is
> released by NM
> ---------------------------------------------------------------------
>
>                 Key: YARN-8511
>                 URL: https://issues.apache.org/jira/browse/YARN-8511
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.1.0
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Major
>         Attachments: YARN-8511.001.patch, YARN-8511.002.patch
>
> Users leverage PC (placement constraints) with allocation tags to avoid port
> conflicts between apps, but we found they sometimes still get port conflicts.
> This is an issue similar to YARN-4148: the RM immediately removes allocation
> tags once AM#allocate asks to release a container, but the container on the
> NM has some delay until it is actually killed and releases the port. The RM
> should remove allocation tags only AFTER the NM confirms the containers are
> released.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
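The fix direction described in YARN-8511 — keep the allocation tags until the NM confirms the container is gone, rather than dropping them on AM#allocate — can be sketched roughly as below. This is an illustrative sketch only; the class and method names are hypothetical and not the actual patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: defer allocation-tag removal until the NM confirms
// the container has exited, so anti-affinity/port constraints stay visible
// to the scheduler during the NM-side kill delay.
public class DeferredTagRemoval {
    private final Map<String, Set<String>> tagsByContainer = new HashMap<>();
    private final Set<String> pendingRelease = new HashSet<>();

    public void addContainer(String containerId, Set<String> tags) {
        tagsByContainer.put(containerId, tags);
    }

    // AM asked to release: only mark the container; keep its tags so the
    // scheduler still treats the port/constraint as occupied.
    public void onAmRelease(String containerId) {
        pendingRelease.add(containerId);
    }

    // NM heartbeat confirms the container is actually dead: remove tags now.
    public void onNmConfirmedRelease(String containerId) {
        if (pendingRelease.remove(containerId)) {
            tagsByContainer.remove(containerId);
        }
    }

    public boolean hasTags(String containerId) {
        return tagsByContainer.containsKey(containerId);
    }
}
```

With this ordering, a new allocation requesting the same port tag cannot be placed on the node until the old container has verifiably exited.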
[jira] [Updated] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang updated YARN-8511:
------------------------------
    Attachment: YARN-8511.002.patch
[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539595#comment-16539595 ]

Weiwei Yang commented on YARN-8511:
-----------------------------------
Fix UT failure in v2 patch...
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539582#comment-16539582 ]

Rohith Sharma K S commented on YARN-8383:
-----------------------------------------
+1 lgtm.. Tested in a single-node cluster. Committing shortly..

> TimelineServer 1.5 start fails with NoClassDefFoundError
> --------------------------------------------------------
>
>                 Key: YARN-8383
>                 URL: https://issues.apache.org/jira/browse/YARN-8383
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.8.4
>            Reporter: Rohith Sharma K S
>            Assignee: Jason Lowe
>            Priority: Blocker
>         Attachments: YARN-8383.001-branch-2.8.patch
>
> TimelineServer 1.5 start fails with NoClassDefFoundError.
> {noformat}
> 2018-05-31 22:10:58,548 FATAL org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: Error starting ApplicationHistoryServer
> java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory
> 	at org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174)
> 	at java.lang.Class.forName0(Native Method)
> 	at java.lang.Class.forName(Class.java:348)
> 	at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306)
> 	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271)
> 	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367)
> 	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
> 	at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239)
> 	at org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> 	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180)
> 	at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190)
> Caused by: java.lang.ClassNotFoundException: com.fasterxml.jackson.core.JsonFactory
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> 	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> 	... 15 more
> {noformat}
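A `NoClassDefFoundError` like the one above means the class was not on the daemon's runtime classpath. A quick way to check before starting the server is a small probe run with the same classpath; this is a generic diagnostic sketch, not part of the YARN-8383 patch.

```java
// Probe whether a class is visible on the current classpath. Running this
// with the timeline server's classpath tells you whether jackson-core is
// present before starting the daemon.
public class ClassProbe {
    public static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("com.fasterxml.jackson.core.JsonFactory present: "
            + isPresent("com.fasterxml.jackson.core.JsonFactory"));
    }
}
```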
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539581#comment-16539581 ]

Rohith Sharma K S commented on YARN-8473:
-----------------------------------------
+1, compiled on branch-2.8 and succeeded. The QA report shows a pre-patch failure because of the compilation issue; after this patch, compilation succeeds. The javac error is unrelated to this patch. Committing shortly.

> Containers being launched as app tears down can leave containers in NEW state
> -----------------------------------------------------------------------------
>
>                 Key: YARN-8473
>                 URL: https://issues.apache.org/jira/browse/YARN-8473
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.4
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Major
>             Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 2.8.5, 3.0.4
>
>         Attachments: YARN-8473-branch-2.8.addendum.001.patch,
> YARN-8473.001.patch, YARN-8473.002.patch, YARN-8473.003.patch
>
> I saw a case where containers were stuck on a nodemanager in the NEW state
> because they tried to launch just as an application was tearing down. The
> container sent an INIT_CONTAINER event to the ApplicationImpl which then
> executed an invalid transition since that event is not handled/expected when
> the application is in the process of tearing down.
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Qingcha updated YARN-7481:
-------------------------------
    Attachment: hadoop-2.7.2.gpu-port-20180711.patch

> Gpu locality support for Better AI scheduling
> ---------------------------------------------
>
>                 Key: YARN-7481
>                 URL: https://issues.apache.org/jira/browse/YARN-7481
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: api, RM, yarn
>    Affects Versions: 2.7.2
>            Reporter: Chen Qingcha
>            Priority: Major
>             Fix For: 2.7.2
>
>         Attachments: GPU locality support for Job scheduling.pdf,
> hadoop-2.7.2.gpu-port-20180710.patch,
> hadoop-2.7.2.gpu-port-20180710_old.patch,
> hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch,
> hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch
>
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> We enhance Hadoop with GPU support for better AI job scheduling. Currently,
> YARN-3926 also supports GPU scheduling, which treats GPU as a countable
> resource. However, GPU placement is also very important to deep learning
> jobs for better efficiency. For example, a 2-GPU job running on GPUs {0,1}
> could be faster than running on GPUs {0,7}, if GPU 0 and 1 are under the
> same PCI-E switch while 0 and 7 are not. We add support to Hadoop 2.7.2 to
> enable GPU locality scheduling with fine-grained GPU placement: a 64-bit
> bitmap is added to the YARN Resource, which indicates both GPU usage and
> locality information in a node (up to 64 GPUs per node). '1' means
> available and '0' otherwise in the corresponding bit position.
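The 64-bit bitmap described in the YARN-7481 summary can be sketched as plain bit operations on a `long`. This is an illustrative sketch only: the class, the method names, and the 4-GPUs-per-switch topology are assumptions for demonstration, not the actual patch.

```java
// Sketch of a 64-bit GPU availability bitmap: bit i set means GPU i is free.
// Locality here assumes 4 GPUs per PCI-E switch (hypothetical topology).
public class GpuBitmap {
    private static final int GPUS_PER_SWITCH = 4;

    public static boolean isAvailable(long bitmap, int gpu) {
        return ((bitmap >>> gpu) & 1L) == 1L;
    }

    // Mark a GPU as in use by clearing its bit.
    public static long allocate(long bitmap, int gpu) {
        return bitmap & ~(1L << gpu);
    }

    // Find a pair of free adjacent GPUs under the same switch, so a 2-GPU
    // job lands on e.g. {0,1} rather than {0,7}; returns -1 if none exists.
    public static int findLocalPair(long bitmap) {
        for (int gpu = 0; gpu < 63; gpu++) {
            boolean sameSwitch =
                (gpu / GPUS_PER_SWITCH) == ((gpu + 1) / GPUS_PER_SWITCH);
            if (sameSwitch && isAvailable(bitmap, gpu)
                    && isAvailable(bitmap, gpu + 1)) {
                return gpu;
            }
        }
        return -1;
    }
}
```

For instance, a node with only GPUs 0 and 7 free has no switch-local pair, while a node with GPUs 0 and 1 free does — matching the {0,1}-vs-{0,7} example in the issue description.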
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Qingcha updated YARN-7481:
-------------------------------
    Attachment:     (was: hadoop-2.7.2.gpu-port-20180711.patch)
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Qingcha updated YARN-7481:
-------------------------------
    Attachment:     (was: hadoop-2.7.2.gpu-port-20180711.patch)
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539570#comment-16539570 ]

genericqa commented on YARN-8473:
---------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 18m 15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} branch-2.8 Compile Tests {color} ||
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 7m 10s{color} | {color:red} root in branch-2.8 failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 21s{color} | {color:red} hadoop-yarn-server-nodemanager in branch-2.8 failed. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} branch-2.8 passed {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 23s{color} | {color:red} hadoop-yarn-server-nodemanager in branch-2.8 failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} branch-2.8 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 30s{color} | {color:red} hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager generated 3 new + 1 unchanged - 1 fixed = 4 total (was 2) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 59s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 39m 55s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:749e106 |
| JIRA Issue | YARN-8473 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931110/YARN-8473-branch-2.8.addendum.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux f06dafd7d081 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-2.8 / c22e6c8 |
| maven | version: Apache Maven 3.0.5 |
| Default Java | 1.7.0_181 |
| mvninstall | https://builds.apache.org/job/PreCommit-YARN-Build/21209/artifact/out/branch-mvninstall-root.txt |
| compile | https://builds.apache.org/job/PreCommit-YARN-Build/21209/artifact/out/branch-compile-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| mvnsite | https://builds.apache.org/job/PreCommit-YARN-Build/21209/artifact/out/branch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| javac | https://builds.apache.org/job/PreCommit-YARN-Build/21209/artifact/out/diff-compile-javac-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results |
[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation
[ https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539558#comment-16539558 ]

Bibin A Chundatt commented on YARN-8434:
----------------------------------------
Thanks [~subru] for the clarification. Will try removing the configuration too. Sure.. no issues in fixing the doc as part of this jira.

> Nodemanager not registering to active RM in federation
> ------------------------------------------------------
>
>                 Key: YARN-8434
>                 URL: https://issues.apache.org/jira/browse/YARN-8434
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Blocker
>         Attachments: YARN-8434.001.patch, YARN-8434.002.patch
>
> FederationRMFailoverProxyProvider doesn't handle connecting to the active RM.
[jira] [Comment Edited] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM
[ https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539550#comment-16539550 ]

Bibin A Chundatt edited comment on YARN-8505 at 7/11/18 4:36 AM:
-----------------------------------------------------------------
[~Tao Yang] Thank you for the clarification. {{numPendingApps+numActiveApps}} --> *applications submitted to the queue*. So the limitation applies only to submission of applications to the queue, not to *RUNNING* applications (applications in running state whose AM has started running). Earlier we would have been able to limit RUNNING applications based on the AM limit for unmanaged AMs, so this change is a behaviour change from the old version too. IIUC, in the case of federation, applications are submitted to subclusters as unmanaged AMs, so the impact on federation should be evaluated for this change.

was (Author: bibinchundatt):
[~Tao Yang] Thank you for the clarification. {{numPendingApps+numActiveApps}} --> *applications submitted to the queue*. So the limitation applies only to submission of applications to the queue, not to *RUNNING* applications (applications in running state whose AM has started running). Earlier we would have been able to limit RUNNING applications based on the AM limit for unmanaged AMs, so this will be a behaviour change from the old version too.

> AMLimit and userAMLimit check should be skipped for unmanaged AM
> ----------------------------------------------------------------
>
>                 Key: YARN-8505
>                 URL: https://issues.apache.org/jira/browse/YARN-8505
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.0, 2.9.2
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-8505.001.patch
>
> AMLimit and userAMLimit check in LeafQueue#activateApplications should be
> skipped for unmanaged AM whose resource is not taken from the YARN cluster.
[jira] [Commented] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM
[ https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539550#comment-16539550 ]

Bibin A Chundatt commented on YARN-8505:
----------------------------------------
[~Tao Yang] Thank you for the clarification. {{numPendingApps+numActiveApps}} --> *applications submitted to the queue*. So the limitation applies only to submission of applications to the queue, not to *RUNNING* applications (applications in running state whose AM has started running). Earlier we would have been able to limit RUNNING applications based on the AM limit for unmanaged AMs, so this will be a behaviour change from the old version too.
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539531#comment-16539531 ]

genericqa commented on YARN-8473:
---------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red} 9m 47s{color} | {color:red} Docker failed to build yetus/hadoop:749e106. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-8473 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931110/YARN-8473-branch-2.8.addendum.001.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21208/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chen Qingcha updated YARN-7481:
-------------------------------
    Attachment: hadoop-2.7.2.gpu-port-20180711.patch
[jira] [Resolved] (YARN-8516) branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module
[ https://issues.apache.org/jira/browse/YARN-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunil Govindan resolved YARN-8516.
----------------------------------
    Resolution: Duplicate

Thanks [~rohithsharma]. I am handling this as an addendum patch for YARN-8473. Apologies for missing the branch-2.8 compilation.

> branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module
> -------------------------------------------------------------------------
>
>                 Key: YARN-8516
>                 URL: https://issues.apache.org/jira/browse/YARN-8516
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Rohith Sharma K S
>            Priority: Blocker
>
> branch-2.8 compilation is failing with the below error
> {noformat}
> [INFO]
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 6.142 s
> [INFO] Finished at: 2018-07-11T08:28:24+05:30
> [INFO] Final Memory: 64M/790M
> [INFO] ------------------------------------------------------------------------
> [WARNING] The requested profile "yarn-ui" could not be activated because it does not exist.
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project hadoop-yarn-server-nodemanager: Compilation failure
> [ERROR] /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2.8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[333,12] no suitable method found for warn(java.lang.String,org.apache.hadoop.yarn.api.records.ContainerId,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationState)
> [ERROR]     method org.apache.commons.logging.Log.warn(java.lang.Object) is not applicable
> [ERROR]       (actual and formal argument lists differ in length)
> [ERROR]     method org.apache.commons.logging.Log.warn(java.lang.Object,java.lang.Throwable) is not applicable
> [ERROR]       (actual and formal argument lists differ in length)
> {noformat}
[jira] [Updated] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunil Govindan updated YARN-8473:
---------------------------------
    Attachment: YARN-8473-branch-2.8.addendum.001.patch
[jira] [Reopened] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sunil Govindan reopened YARN-8473:
----------------------------------

Reopening the Jira to fix the branch-2.8 compile problem with the logging call. It seems the slf4j migration is not in branch-2.8. Apologies for missing this.
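The branch-2.8 breakage above comes from a logging API difference: commons-logging's `Log.warn(Object)` takes no format arguments, while slf4j's `Logger.warn(String, Object...)` does, so a parameterized `LOG.warn("...", a, b, c)` that compiles on trunk fails on branch-2.8. A branch-2.8-compatible call has to build the message itself. The sketch below is illustrative; the message text and helper are hypothetical, not the actual addendum patch.

```java
// Illustrative only: the same warning written two ways.
public class LogCompat {
    // trunk (slf4j):
    //   LOG.warn("Couldn't handle event {} for app {} in state {}", ev, app, st);
    // branch-2.8 (commons-logging) must pre-format the message:
    public static String format(String event, String app, String state) {
        return "Couldn't handle event " + event + " for app " + app
            + " in state " + state;
    }

    public static void main(String[] args) {
        // On branch-2.8 this string would be passed to Log.warn(Object).
        System.out.println(format("INIT_CONTAINER", "app_1", "FINISHING"));
    }
}
```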
[jira] [Commented] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM
[ https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539495#comment-16539495 ] Tao Yang commented on YARN-8505: {quote} Above properties are for total application in queue, not running application IIUC {quote} There is a validation in LeafQueue#validateSubmitApplication which makes sure that (numPendingApps + numActiveApps <= min(maxApplications, maxApplicationsPerUser)); the submission of an unmanaged app will be rejected once that limit is reached. That is the limitation I mean for unmanaged AM. > AMLimit and userAMLimit check should be skipped for unmanaged AM > > > Key: YARN-8505 > URL: https://issues.apache.org/jira/browse/YARN-8505 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 2.9.2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8505.001.patch > > > AMLimit and userAMLimit check in LeafQueue#activateApplications should be > skipped for unmanaged AM whose resource is not taken from YARN cluster.
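[Editorial note] The validation Tao Yang refers to can be sketched as a stand-alone check — a simplified illustration, not the actual LeafQueue#validateSubmitApplication code; method and parameter names are condensed for the example:

```java
// Simplified form of the check described above: a submission is rejected
// once pending + active apps would exceed the smaller of the two limits.
public class LeafQueueLimitDemo {
    static boolean rejectSubmission(int numPendingApps, int numActiveApps,
                                    int maxApplications, int maxApplicationsPerUser) {
        return numPendingApps + numActiveApps
            > Math.min(maxApplications, maxApplicationsPerUser);
    }

    public static void main(String[] args) {
        // 950 pending + 60 active = 1010 > min(10000, 1000), so reject:
        System.out.println(rejectSubmission(950, 60, 10000, 1000));
        // 900 pending + 60 active = 960 <= 1000, so accept:
        System.out.println(rejectSubmission(900, 60, 10000, 1000));
    }
}
```

This is the total-applications limit (pending plus active), distinct from the AMLimit/userAMLimit resource checks the Jira proposes to skip for unmanaged AMs.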
[jira] [Commented] (YARN-7129) Application Catalog for YARN applications
[ https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539494#comment-16539494 ] genericqa commented on YARN-7129: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 38s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 16 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 3s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 30m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 4s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 7s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 27s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 29m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 29m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 6m 21s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 0s{color} | {color:red} The patch generated 8 new + 0 unchanged - 0 fixed = 8 total (was 0) {color} | | {color:orange}-0{color} | {color:orange} shelldocs {color} | {color:orange} 0m 38s{color} | {color:orange} The patch generated 158 new + 402 unchanged - 0 fixed = 560 total (was 402) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 14s{color} | {color:green} The patch has no ill-formed XML file. 
{color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 24s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog/hadoop-yarn-applications-catalog-docker {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 56s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 32s{color} | {color:green} hadoop-project in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 29m 58s{color} | {color:red}
[jira] [Updated] (YARN-8516) branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module
[ https://issues.apache.org/jira/browse/YARN-8516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8516: Summary: branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module (was: Compilation error for branch-2.8) > branch-2.8 compilation failure for hadoop-yarn-server-nodemanager module > > > Key: YARN-8516 > URL: https://issues.apache.org/jira/browse/YARN-8516 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rohith Sharma K S >Priority: Blocker > > branch-2.8 compilation is failing with below error > {noformat} > INFO] > [INFO] BUILD FAILURE > [INFO] > > [INFO] Total time: 6.142 s > [INFO] Finished at: 2018-07-11T08:28:24+05:30 > [INFO] Final Memory: 64M/790M > [INFO] > > [WARNING] The requested profile "yarn-ui" could not be activated because it > does not exist. > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) > on project hadoop-yarn-server-nodemanager: Compilation failure > [ERROR] > /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2.8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[333,12] > no suitable method found for > warn(java.lang.String,org.apache.hadoop.yarn.api.records.ContainerId,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationState) > [ERROR] method org.apache.commons.logging.Log.warn(java.lang.Object) is > not applicable > [ERROR] (actual and formal argument lists differ in length) > [ERROR] method > org.apache.commons.logging.Log.warn(java.lang.Object,java.lang.Throwable) is > not applicable > [ERROR] (actual and formal argument lists differ in length) > {noformat}
[jira] [Created] (YARN-8516) Compilation error for branch-2.8
Rohith Sharma K S created YARN-8516: --- Summary: Compilation error for branch-2.8 Key: YARN-8516 URL: https://issues.apache.org/jira/browse/YARN-8516 Project: Hadoop YARN Issue Type: Bug Reporter: Rohith Sharma K S branch-2.8 compilation is failing with below error {noformat} INFO] [INFO] BUILD FAILURE [INFO] [INFO] Total time: 6.142 s [INFO] Finished at: 2018-07-11T08:28:24+05:30 [INFO] Final Memory: 64M/790M [INFO] [WARNING] The requested profile "yarn-ui" could not be activated because it does not exist. [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project hadoop-yarn-server-nodemanager: Compilation failure [ERROR] /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-2.8/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java:[333,12] no suitable method found for warn(java.lang.String,org.apache.hadoop.yarn.api.records.ContainerId,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl,org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationState) [ERROR] method org.apache.commons.logging.Log.warn(java.lang.Object) is not applicable [ERROR] (actual and formal argument lists differ in length) [ERROR] method org.apache.commons.logging.Log.warn(java.lang.Object,java.lang.Throwable) is not applicable [ERROR] (actual and formal argument lists differ in length) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
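[Editorial note] The compile error above is an API mismatch: trunk uses slf4j, whose Logger.warn accepts {} placeholders with varargs, while branch-2.8 still uses commons-logging, whose Log.warn takes only a single Object (plus an optional Throwable). A minimal illustration of the branch-2.8-compatible form — the message string and names here are illustrative, not the exact line from ApplicationImpl:

```java
// slf4j (trunk) would allow a parameterized call such as:
//   LOG.warn("couldn't handle event for container {} in state {}", id, state);
// commons-logging (branch-2.8) exposes only warn(Object), so the message
// must be pre-formatted into a single String first:
public class LogPortDemo {
    static String preformat(String containerId, String state) {
        return String.format(
            "couldn't handle event for container %s in state %s",
            containerId, state);
    }

    public static void main(String[] args) {
        // The resulting single String is what Log.warn(Object) can accept:
        System.out.println(preformat("container_0001_01_000001", "FINISHING"));
    }
}
```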
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180711.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180711.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
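[Editorial note] The 64-bit availability bitmap described in the issue can be illustrated with plain long bit operations — a stand-alone sketch under the issue's stated convention ('1' = available), not code taken from the attached patches:

```java
// '1' in bit position i of the bitmap means GPU i is available on the node
// (up to 64 GPUs per node, as described in the issue).
public class GpuBitmapDemo {
    static boolean allAvailable(long bitmap, int... gpus) {
        long wanted = 0L;
        for (int g : gpus) {
            wanted |= 1L << g;   // set the bit for each requested GPU
        }
        return (bitmap & wanted) == wanted;
    }

    public static void main(String[] args) {
        long bitmap = (1L << 0) | (1L << 1) | (1L << 7); // GPUs 0, 1 and 7 free
        // A locality-aware scheduler would prefer {0,1} (same PCI-E switch
        // in the issue's example) over {0,7}:
        System.out.println(allAvailable(bitmap, 0, 1)); // both free
        System.out.println(allAvailable(bitmap, 0, 2)); // GPU 2 is busy
    }
}
```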
[jira] [Commented] (YARN-8434) Nodemanager not registering to active RM in federation
[ https://issues.apache.org/jira/browse/YARN-8434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539415#comment-16539415 ] Subru Krishnan commented on YARN-8434: -- Thanks [~bibinchundatt] for the clarification, I understand the confusion now. That documentation is outdated and has to be fixed, as we now automatically set the {{FederationRMFailoverProxyProvider}} internally via {{FederationProxyProviderUtil}}, so the NM config overriding is not required. My bad, I apologize. If it works for you, we can re-purpose the Jira to fix the doc? > Nodemanager not registering to active RM in federation > -- > > Key: YARN-8434 > URL: https://issues.apache.org/jira/browse/YARN-8434 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Blocker > Attachments: YARN-8434.001.patch, YARN-8434.002.patch > > > FederationRMFailoverProxyProvider doesn't handle connecting to active RM. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration
[ https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539411#comment-16539411 ] Suma Shivaprasad edited comment on YARN-8361 at 7/11/18 12:38 AM: -- [~Zian Chen] Patch 002 LGTM. +1. Can you pls check UT failures and see if they are related? was (Author: suma.shivaprasad): [~Zian Chen] Patch 002 LGTM. +1 > Change App Name Placement Rule to use App Name instead of App Id for > configuration > -- > > Key: YARN-8361 > URL: https://issues.apache.org/jira/browse/YARN-8361 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Attachments: YARN-8361.001.patch, YARN-8361.002.patch > > > 1. AppNamePlacementRule used app id while specifying queue mapping placement > rules, should change to app name > 2. Change documentation as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8361) Change App Name Placement Rule to use App Name instead of App Id for configuration
[ https://issues.apache.org/jira/browse/YARN-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539411#comment-16539411 ] Suma Shivaprasad commented on YARN-8361: [~Zian Chen] Patch 002 LGTM. +1 > Change App Name Placement Rule to use App Name instead of App Id for > configuration > -- > > Key: YARN-8361 > URL: https://issues.apache.org/jira/browse/YARN-8361 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Major > Attachments: YARN-8361.001.patch, YARN-8361.002.patch > > > 1. AppNamePlacementRule used app id while specifying queue mapping placement > rules, should change to app name > 2. Change documentation as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8360) Yarn service conflict between restart policy and NM configuration
[ https://issues.apache.org/jira/browse/YARN-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539405#comment-16539405 ] Gour Saha commented on YARN-8360: - Thanks [~suma.shivaprasad], patch 1 looks good to me. +1. > Yarn service conflict between restart policy and NM configuration > -- > > Key: YARN-8360 > URL: https://issues.apache.org/jira/browse/YARN-8360 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Chandni Singh >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-8360.1.patch > > > For the below spec, the service will not stop even after container failures > because of the NM auto retry properties : > * "yarn.service.container-failure.retry.max": 1, > * "yarn.service.container-failure.validity-interval-ms": 5000 > The NM will continue auto-restarting containers. > {{fail_after 20}} fails after 20 seconds. Since the validity failure > interval is 5 seconds, NM will auto restart the container. > {code:java} > { > "name": "fail-demo2", > "version": "1.0.0", > "components" : > [ > { > "name": "comp1", > "number_of_containers": 1, > "launch_command": "fail_after 20", > "restart_policy": "NEVER", > "resource": { > "cpus": 1, > "memory": "256" > }, > "configuration": { > "properties": { > "yarn.service.container-failure.retry.max": 1, > "yarn.service.container-failure.validity-interval-ms": 5000 > } > } > } > ] > } > {code} > If {{restart_policy}} is NEVER, then the service should stop after the > container fails. > Since we have introduced, the service level Restart Policies, I think we > should make the NM auto retry configurations part of the {{RetryPolicy}} and > get rid of all {{yarn.service.container-failure.**}} properties. Otherwise it > gets confusing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
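[Editorial note] The conflict described in YARN-8360 is a precedence question between the service-level restart_policy and the NM-level auto-retry properties. A hypothetical sketch of the resolution the issue argues for (the restart policy wins) — the method and names are illustrative, not the actual RetryPolicy API:

```java
// Hypothetical resolution: with restart_policy NEVER, NM auto-retry settings
// such as yarn.service.container-failure.retry.max must not apply, so the
// service stops after the container fails.
public class RestartPolicyDemo {
    enum RestartPolicy { ALWAYS, ON_FAILURE, NEVER }

    static int effectiveMaxRetries(RestartPolicy policy, int nmRetryMax) {
        return policy == RestartPolicy.NEVER ? 0 : nmRetryMax;
    }

    public static void main(String[] args) {
        // restart_policy NEVER overrides retry.max=1 from the spec above:
        System.out.println(effectiveMaxRetries(RestartPolicy.NEVER, 1));
        // Other policies keep honoring the NM retry configuration:
        System.out.println(effectiveMaxRetries(RestartPolicy.ON_FAILURE, 1));
    }
}
```

This matches the closing suggestion in the comment: folding the NM auto-retry configuration into the service-level RetryPolicy removes the ambiguity.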
[jira] [Updated] (YARN-7129) Application Catalog for YARN applications
[ https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7129: Attachment: YARN-7129.004.patch > Application Catalog for YARN applications > - > > Key: YARN-7129 > URL: https://issues.apache.org/jira/browse/YARN-7129 > Project: Hadoop YARN > Issue Type: New Feature > Components: applications >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN Appstore.pdf, YARN-7129.001.patch, > YARN-7129.002.patch, YARN-7129.003.patch, YARN-7129.004.patch > > > YARN native services provides web services API to improve usability of > application deployment on Hadoop using collection of docker images. It would > be nice to have an application catalog system which provides an editorial and > search interface for YARN applications. This improves usability of YARN for > manage the life cycle of applications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8360) Yarn service conflict between restart policy and NM configuration
[ https://issues.apache.org/jira/browse/YARN-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539355#comment-16539355 ] genericqa commented on YARN-8360: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 37s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 35s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 15s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 11m 29s{color} | {color:green} hadoop-yarn-services-core in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 65m 4s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-8360 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931072/YARN-8360.1.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 7c7b4c6dcfef 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 4e59b92 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21206/testReport/ | | Max. process+thread count | 756 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21206/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Yarn service conflict
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180711.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180711.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180711.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, > hadoop-2.7.2.gpu-port-20180711.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7129) Application Catalog for YARN applications
[ https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539309#comment-16539309 ] genericqa commented on YARN-7129: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 16 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 44s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 55s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 29m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 32s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 27s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 2m 57s{color} | {color:red} hadoop-yarn-applications in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 1m 5s{color} | {color:red} hadoop-yarn-applications-catalog in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 1m 1s{color} | {color:red} hadoop-yarn-applications-catalog-webapp in the patch failed. {color} | | {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 18s{color} | {color:red} hadoop-yarn-applications-catalog-docker in the patch failed. {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 34m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 34m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} mvnsite {color} | {color:red} 0m 31s{color} | {color:red} hadoop-yarn-applications-catalog-docker in the patch failed. 
{color} | | {color:red}-1{color} | {color:red} shellcheck {color} | {color:red} 0m 1s{color} | {color:red} The patch generated 8 new + 0 unchanged - 0 fixed = 8 total (was 0) {color} | | {color:orange}-0{color} | {color:orange} shelldocs {color} | {color:orange} 0m 40s{color} | {color:orange} The patch generated 158 new + 400 unchanged - 0 fixed = 558 total (was 400) {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 16s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 9s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-project hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-catalog
[jira] [Updated] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-8515: -- Labels: Docker (was: ) > container-executor can crash with SIGPIPE after nodemanager restart > --- > > Key: YARN-8515 > URL: https://issues.apache.org/jira/browse/YARN-8515 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > Labels: Docker > > When running with docker on large clusters, we have noticed that sometimes > docker containers are not removed - they remain in the exited state, and the > corresponding container-executor is no longer running. Upon investigation, > we noticed that this always seemed to happen after a nodemanager restart. > The sequence leading to the stranded docker containers is: > # Nodemanager restarts > # Containers are recovered and then run for a while > # Containers are killed for some (legitimate) reason > # Container-executor exits without removing the docker container. > After reproducing this on a test cluster, we found that the > container-executor was exiting due to a SIGPIPE. > What is happening is that the shell command executor that is used to start > container-executor has threads reading from c-e's stdout and stderr. When > the NM is restarted, these threads are killed. Then when the > container-executor continues executing after the container exits with error, > it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is > not handled, this crashes the container-executor before it can actually > remove the docker container. > We ran into this in branch 2.8. The way docker containers are removed has > been completely redesigned in trunk, so I don't think it will lead to this > exact failure, but after an NM restart, potentially any write to stderr or > stdout in the container-executor could cause it to crash. 
[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539290#comment-16539290 ] Jim Brennan commented on YARN-8515: --- I have been able to repro this reliably on a test cluster. Repro steps are: # Start a sleep job with a lot of mappers sleeping for 50 seconds # On one worker node, kill the NM after a set of containers starts # Restart the NM # On the gateway, kill the application (before the current containers finish) This will leave the containers on the node where the nodemanager was restarted in the exited state. container-executor is not cleaning up the docker containers. Here is a strace of one of the container-executors when the application is killed: {noformat} -bash-4.2$ sudo strace -s 4096 -f -p 7176 strace: Process 7176 attached read(3, "143\n", 4096) = 4 close(3) = 0 wait4(7566, [\{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 7566 --- SIGCHLD \{si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7566, si_uid=0, si_status=0, si_utime=1, si_stime=0} --- munmap(0x7f233bfa4000, 4096) = 0 write(2, "Docker container exit code was not zero: 143\n", 45) = -1 EPIPE (Broken pipe) --- SIGPIPE \{si_signo=SIGPIPE, si_code=SI_USER, si_pid=7176, si_uid=0} --- +++ killed by SIGPIPE +++ {noformat} The problem is that when container-executor is started by the NM using the privileged operation executor, it attaches stream readers to stdout and stderr. When we restart the NM, these threads are killed. Then when the application is killed, it kills the running containers and container-executor returns from waiting for the docker container. When it tries to write an error message to stderr, it generates a SIGPIPE signal, because the other end of the pipe has been killed. Since we are not handling that signal, container-executor crashes and we never remove the docker container. I have verified that if I change container-executor to ignore SIGPIPE, the problem does not occur. 
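The effect of the fix described in this comment can be demonstrated with a minimal, self-contained C sketch. This is not the actual container-executor source; the function name and the simulated pipe are invented for illustration. It shows the POSIX behavior the strace captures: once SIGPIPE is ignored, a write() to a pipe whose reader has gone away fails with EPIPE instead of killing the process, so cleanup code after the write can still run.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Sketch: ignore SIGPIPE, then write to a pipe with no reader.
   Returns 0 when the write fails gracefully with EPIPE. */
int write_after_reader_gone(void) {
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    signal(SIGPIPE, SIG_IGN);   /* the one-line fix: ignore SIGPIPE */
    close(fds[0]);              /* simulate the NM-side reader thread dying */

    const char *msg = "Docker container exit code was not zero: 143\n";
    ssize_t n = write(fds[1], msg, strlen(msg));
    close(fds[1]);

    /* Without SIG_IGN the process would have been killed by SIGPIPE at
       the write() above; with it, we just observe EPIPE and continue,
       which is exactly what lets cleanup (docker rm) proceed. */
    return (n == -1 && errno == EPIPE) ? 0 : -1;
}
```

Ignoring SIGPIPE (or using sigaction with SA_RESTART-style handling) is the conventional remedy for daemons whose stdout/stderr may be a pipe with a transient reader.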
> container-executor can crash with SIGPIPE after nodemanager restart > --- > > Key: YARN-8515 > URL: https://issues.apache.org/jira/browse/YARN-8515 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jim Brennan >Assignee: Jim Brennan >Priority: Major > > When running with docker on large clusters, we have noticed that sometimes > docker containers are not removed - they remain in the exited state, and the > corresponding container-executor is no longer running. Upon investigation, > we noticed that this always seemed to happen after a nodemanager restart. > The sequence leading to the stranded docker containers is: > # Nodemanager restarts > # Containers are recovered and then run for a while > # Containers are killed for some (legitimate) reason > # Container-executor exits without removing the docker container. > After reproducing this on a test cluster, we found that the > container-executor was exiting due to a SIGPIPE. > What is happening is that the shell command executor that is used to start > container-executor has threads reading from c-e's stdout and stderr. When > the NM is restarted, these threads are killed. Then when the > container-executor continues executing after the container exits with error, > it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is > not handled, this crashes the container-executor before it can actually > remove the docker container. > We ran into this in branch 2.8. The way docker containers are removed has > been completely redesigned in trunk, so I don't think it will lead to this > exact failure, but after an NM restart, potentially any write to stderr or > stdout in the container-executor could cause it to crash. 
[jira] [Updated] (YARN-8360) Yarn service conflict between restart policy and NM configuration
[ https://issues.apache.org/jira/browse/YARN-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suma Shivaprasad updated YARN-8360: --- Attachment: YARN-8360.1.patch > Yarn service conflict between restart policy and NM configuration > -- > > Key: YARN-8360 > URL: https://issues.apache.org/jira/browse/YARN-8360 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Chandni Singh >Assignee: Suma Shivaprasad >Priority: Major > Attachments: YARN-8360.1.patch > > > For the below spec, the service will not stop even after container failures > because of the NM auto retry properties: > * "yarn.service.container-failure.retry.max": 1, > * "yarn.service.container-failure.validity-interval-ms": 5000 > The NM will continue auto-restarting containers. > {{fail_after 20}} fails after 20 seconds. Since the validity failure > interval is 5 seconds, the NM will auto restart the container. > {code:java} > { > "name": "fail-demo2", > "version": "1.0.0", > "components" : > [ > { > "name": "comp1", > "number_of_containers": 1, > "launch_command": "fail_after 20", > "restart_policy": "NEVER", > "resource": { > "cpus": 1, > "memory": "256" > }, > "configuration": { > "properties": { > "yarn.service.container-failure.retry.max": 1, > "yarn.service.container-failure.validity-interval-ms": 5000 > } > } > } > ] > } > {code} > If {{restart_policy}} is NEVER, then the service should stop after the > container fails. > Since we have introduced the service-level restart policies, I think we > should make the NM auto retry configurations part of the {{RetryPolicy}} and > get rid of all {{yarn.service.container-failure.**}} properties. Otherwise it > gets confusing.
[jira] [Commented] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539286#comment-16539286 ] Jim Brennan commented on YARN-8515: --- Here is an example case that we saw: Docker ps info for this container: {noformat} 968e4a1a0fca 90188f3d752e "bash /grid/4/tmp/..." 6 days ago Exited (143) 6 days ago container_e07_1528760012992_2875921_01_69 {noformat} NM Log with some added info from Docker container and journalctl to show where the docker container started/exited: {noformat} 2018-06-27 16:32:48,779 [IPC Server handler 9 on 8041] INFO containermanager.ContainerManagerImpl: Start request for container_e07_1528760012992_2875921_01_69 by user p_condor 2018-06-27 16:32:48,782 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Adding container_e07_1528760012992_2875921_01_69 to application application_1528760012992_2875921 2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from NEW to LOCALIZING 2018-06-27 16:32:48,783 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Initializing container container_e07_1528760012992_2875921_01_69 2018-06-27 16:32:48,786 [AsyncDispatcher event handler] INFO localizer.ResourceLocalizationService: Created localizer for container_e07_1528760012992_2875921_01_69 2018-06-27 16:32:48,786 [LocalizerRunner for container_e07_1528760012992_2875921_01_69] INFO localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /grid/4/tmp/yarn-local/nmPrivate/container_e07_1528760012992_2875921_01_69.tokens. 
Credentials list: 2018-06-27 16:32:52,654 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZING to LOCALIZED 2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZED to RUNNING 2018-06-27 16:32:52,684 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e07_1528760012992_2875921_01_69 2018-06-27 16:32:53.345 Docker container started 2018-06-27 16:32:54,429 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id container_e07_1528760012992_2875921_01_69: 132.5 MB of 3 GB physical memory used; 4.3 GB of 6.3 GB virtual memory used 2018-06-27 16:33:25,422 [main] INFO nodemanager.NodeManager: STARTUP_MSG: / STARTUP_MSG: Starting NodeManager STARTUP_MSG: user = mapred STARTUP_MSG: host = gsbl607n22.blue.ygrid.yahoo.com/10.213.59.232 STARTUP_MSG: args = [] STARTUP_MSG: version = 2.8.3.2.1806111934 2018-06-27 16:33:31,140 [main] INFO containermanager.ContainerManagerImpl: Recovering container_e07_1528760012992_2875921_01_69 in state LAUNCHED with exit code -1000 2018-06-27 16:33:31,140 [main] INFO application.ApplicationImpl: Adding container_e07_1528760012992_2875921_01_69 to application application_1528760012992_2875921 2018-06-27 16:33:32,771 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers: 2018-06-27 16:33:33,280 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers: 2018-06-27 16:33:33,178 [main] INFO containermanager.ContainerManagerImpl: Waiting for containers: 2018-06-27 16:33:33,776 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from NEW to LOCALIZING 2018-06-27 16:33:34,393 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: 
Initializing container container_e07_1528760012992_2875921_01_69 2018-06-27 16:33:34,433 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZING to LOCALIZED 2018-06-27 16:33:34,461 [ContainersLauncher #23] INFO nodemanager.ContainerExecutor: Reacquiring container_e07_1528760012992_2875921_01_69 with pid 103072 2018-06-27 16:33:34,463 [AsyncDispatcher event handler] INFO container.ContainerImpl: Container container_e07_1528760012992_2875921_01_69 transitioned from LOCALIZED to RUNNING 2018-06-27 16:33:34,482 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Starting resource-monitoring for container_e07_1528760012992_2875921_01_69 2018-06-27 16:33:35,304 [main] INFO nodemanager.NodeStatusUpdaterImpl: Sending out 598 NM container statuses: 2018-06-27 16:33:35,356 [main] INFO nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers 2018-06-27 16:33:35,902 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 103072 for container-id
[jira] [Commented] (YARN-8421) when moving app, activeUsers is increased, even though app does not have outstanding request
[ https://issues.apache.org/jira/browse/YARN-8421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539269#comment-16539269 ] Eric Payne commented on YARN-8421: -- [~kyungwan nam], Thank you for providing the fix for this problem. The fix looks good and the unit test is doing a good job of testing what I would expect it to test. The failed unit tests in the latest pre-commit build ({{TestAMRestart}} / {{TestQueueManagementDynamicEditPolicy}}) are not failing for me in my local build environment. The only minor problem with the latest patch is that the parameters to the assertions in the test are backwards. That is, the "expected" value should come first and the "actual" value should come second. > when moving app, activeUsers is increased, even though app does not have > outstanding request > - > > Key: YARN-8421 > URL: https://issues.apache.org/jira/browse/YARN-8421 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: kyungwan nam >Priority: Major > Attachments: YARN-8421.001.patch, YARN-8421.002.patch > > > All containers for app1 have been allocated. > Move app1 from the default Queue to the test Queue as follows. > {code} > yarn rmadmin application -movetoqueue app1 -queue test > {code} > _activeUsers_ of the test Queue is increased even though app1 does not > have an outstanding request.
[jira] [Created] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart
Jim Brennan created YARN-8515: - Summary: container-executor can crash with SIGPIPE after nodemanager restart Key: YARN-8515 URL: https://issues.apache.org/jira/browse/YARN-8515 Project: Hadoop YARN Issue Type: Bug Reporter: Jim Brennan Assignee: Jim Brennan When running with docker on large clusters, we have noticed that sometimes docker containers are not removed - they remain in the exited state, and the corresponding container-executor is no longer running. Upon investigation, we noticed that this always seemed to happen after a nodemanager restart. The sequence leading to the stranded docker containers is: # Nodemanager restarts # Containers are recovered and then run for a while # Containers are killed for some (legitimate) reason # Container-executor exits without removing the docker container. After reproducing this on a test cluster, we found that the container-executor was exiting due to a SIGPIPE. What is happening is that the shell command executor that is used to start container-executor has threads reading from c-e's stdout and stderr. When the NM is restarted, these threads are killed. Then when the container-executor continues executing after the container exits with error, it tries to write to stderr (ERRORFILE) and gets a SIGPIPE. Since SIGPIPE is not handled, this crashes the container-executor before it can actually remove the docker container. We ran into this in branch 2.8. The way docker containers are removed has been completely redesigned in trunk, so I don't think it will lead to this exact failure, but after an NM restart, potentially any write to stderr or stdout in the container-executor could cause it to crash.
[jira] [Updated] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires
[ https://issues.apache.org/jira/browse/YARN-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8514: Affects Version/s: 2.9.2 2.9.0 3.0.0 3.1.0 2.9.1 3.0.1 3.0.2 > YARN RegistryDNS throws NPE when Kerberos tgt expires > - > > Key: YARN-8514 > URL: https://issues.apache.org/jira/browse/YARN-8514 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.9.0, 3.0.0, 3.1.0, 2.9.1, 3.0.1, 3.0.2, 2.9.2 >Reporter: Eric Yang >Priority: Critical > > After Kerberos ticket expires, RegistryDNS throws NPE error: > {code:java} > 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT > Renewer for rm/host1.example@example.com,5,main] threw an Exception. > java.lang.NullPointerException > at > javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) > at > org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) > at java.lang.Thread.run(Thread.java:745){code}
[jira] [Commented] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires
[ https://issues.apache.org/jira/browse/YARN-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539164#comment-16539164 ] Eric Yang commented on YARN-8514: - This NPE is introduced by YARN-4983. UgiMetrics will not be initialized in the UGI class unless there is external code that calls: {code:java} UserGroupInformation.reattachMetrics();{code} There is a possibility that other new processes will encounter the same NPE in a Kerberos-enabled environment. It would be great if the reattachMetrics call could happen automatically, without external invocation. > YARN RegistryDNS throws NPE when Kerberos tgt expires > - > > Key: YARN-8514 > URL: https://issues.apache.org/jira/browse/YARN-8514 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Yang >Priority: Critical > > After Kerberos ticket expires, RegistryDNS throws NPE error: > {code:java} > 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT > Renewer for rm/host1.example@example.com,5,main] threw an Exception. > java.lang.NullPointerException > at > javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) > at > org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) > at java.lang.Thread.run(Thread.java:745){code}
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539156#comment-16539156 ] genericqa commented on YARN-4606: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 22s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 58s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 10s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 18s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 71m 42s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}131m 6s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-4606 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931034/YARN-4606.006.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 0c5bd396 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d503f65 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21204/testReport/ | | Max. process+thread count | 912 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21204/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > CapacityScheduler: applications could get
[jira] [Updated] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires
[ https://issues.apache.org/jira/browse/YARN-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8514: Description: After Kerberos ticket expires, RegistryDNS throws NPE error: {code:java} 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT Renewer for rm/host1.example@example.com,5,main] threw an Exception. java.lang.NullPointerException at javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) at java.lang.Thread.run(Thread.java:745){code} was: After Kerberos ticket expires, RegistryDNS throws NPE error: {code:java} 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT Renewer for rm/y001.l42scl.hortonworks@l42scl.hortonworks.com,5,main] threw an Exception. java.lang.NullPointerException at javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) at java.lang.Thread.run(Thread.java:745){code} > YARN RegistryDNS throws NPE when Kerberos tgt expires > - > > Key: YARN-8514 > URL: https://issues.apache.org/jira/browse/YARN-8514 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Eric Yang >Priority: Critical > > After Kerberos ticket expires, RegistryDNS throws NPE error: > {code:java} > 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT > Renewer for rm/host1.example@example.com,5,main] threw an Exception. 
> java.lang.NullPointerException > at > javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) > at > org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) > at java.lang.Thread.run(Thread.java:745){code}
[jira] [Updated] (YARN-7129) Application Catalog for YARN applications
[ https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7129: Attachment: YARN-7129.003.patch > Application Catalog for YARN applications > - > > Key: YARN-7129 > URL: https://issues.apache.org/jira/browse/YARN-7129 > Project: Hadoop YARN > Issue Type: New Feature > Components: applications >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN Appstore.pdf, YARN-7129.001.patch, > YARN-7129.002.patch, YARN-7129.003.patch > > > YARN native services provides web services API to improve usability of > application deployment on Hadoop using a collection of docker images. It would > be nice to have an application catalog system which provides an editorial and > search interface for YARN applications. This improves the usability of YARN for > managing the life cycle of applications.
[jira] [Created] (YARN-8514) YARN RegistryDNS throws NPE when Kerberos tgt expires
Eric Yang created YARN-8514: --- Summary: YARN RegistryDNS throws NPE when Kerberos tgt expires Key: YARN-8514 URL: https://issues.apache.org/jira/browse/YARN-8514 Project: Hadoop YARN Issue Type: Bug Reporter: Eric Yang After Kerberos ticket expires, RegistryDNS throws NPE error: {code:java} 2018-07-06 01:26:25,025 ERROR yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[TGT Renewer for rm/y001.l42scl.hortonworks@l42scl.hortonworks.com,5,main] threw an Exception. java.lang.NullPointerException at javax.security.auth.kerberos.KerberosTicket.getEndTime(KerberosTicket.java:482) at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:894) at java.lang.Thread.run(Thread.java:745){code}
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539097#comment-16539097 ] Szilard Nemeth commented on YARN-8502: -- Thanks [~giovanni.fumarola] for the quick responses and for the commit! > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
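The preference for constants can be shown with a small sketch. The RMWSConsts name and the path value below are taken from the issue text; the enclosing demo class is an illustrative stand-in (the real class lives in the resourcemanager webapp package and is used from JAX-RS @Path annotations):

```java
// Sketch of the preferred style: each route string lives in exactly one
// constants class, so endpoints, clients, and tests cannot drift apart.
public class PathConstantsDemo {

    /** Stand-in for RMWSConsts: the single authoritative copy of each route. */
    static final class RMWSConsts {
        static final String APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS =
            "/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}";
        private RMWSConsts() {} // constants holder, never instantiated
    }

    public static void main(String[] args) {
        // An endpoint annotated with the constant and a test asserting on it
        // always agree, because there is only one definition to change:
        System.out.println(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS);
    }
}
```

With inline string literals, by contrast, renaming a path segment requires finding every copy by hand.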
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539092#comment-16539092 ] Hudson commented on YARN-8502: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14551 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14551/]) YARN-8502. Use path strings consistently for webservice endpoints in (gifuma: rev 82ac3aa6d0a83235cfac2805a444dd26efe5f9ce) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWSConsts.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539060#comment-16539060 ] genericqa commented on YARN-8468: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 33s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 7 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 1s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 7s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Skipped patched modules with no Java source: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 25s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 69m 13s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s{color} | {color:green} hadoop-yarn-site in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}148m 37s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-8468 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931027/YARN-8468.003.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539027#comment-16539027 ] Wangda Tan commented on YARN-8512: -- Patch LGTM as well, thanks [~rohithsharma] for the fix. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > This is observed when the first attempt's master container dies and the second > attempt's master container is launched on an NM where old containers are running > but the master container is not. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container, i.e. container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In this scenario, the NM does not identify the flowContext and logs the warning > below: > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
[jira] [Updated] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giovanni Matteo Fumarola updated YARN-8502: --- Fix Version/s: 3.2.0 > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539009#comment-16539009 ] Giovanni Matteo Fumarola commented on YARN-8502: Thanks [~snemeth] for working on this. Committed to trunk. > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
[jira] [Created] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
Che Yufei created YARN-8513: --- Summary: CapacityScheduler infinite loop when queue is near fully utilized Key: YARN-8513 URL: https://issues.apache.org/jira/browse/YARN-8513 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, yarn Affects Versions: 2.9.1 Environment: Ubuntu 14.04.5 YARN is configured with one label and 5 queues. Reporter: Che Yufei The ResourceManager sometimes stops responding to any request when a queue is nearly fully utilized. Sending SIGTERM won't stop the RM; only SIGKILL can. After an RM restart, it recovers running jobs and starts accepting new ones. The CapacityScheduler appears to be stuck in an infinite loop, printing the following log messages (more than 25,000 lines per second): {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.99816763 absoluteUsedCapacity=0.99816763 used= cluster=}} {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal}} {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1530619767030_1652_01 container=null queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 clusterResource= type=NODE_LOCAL requestedPartition=}} I have encountered this problem several times after upgrading to YARN 2.9.1, while the same configuration works fine under version 2.7.3. YARN-4477 is an infinite loop bug in the FairScheduler; I am not sure whether this is a similar problem.
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538998#comment-16538998 ] Giovanni Matteo Fumarola commented on YARN-8502: Ok. I will open a Jira to fix those. +1 from my side. Committing to trunk. > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8502-001.patch > > > Currently, there are two types of endpoint path definitions: > 1. with a string, for example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, for example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Preferably, constants should be used for all Paths.
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538996#comment-16538996 ] Rohith Sharma K S commented on YARN-8512: - It looks like QA is trying to execute native binaries which are not part of the patch, so the {color:red}-1 unit {color} is unrelated to the patch. {code} [ERROR] Failed to execute goal org.apache.hadoop:hadoop-maven-plugins:3.2.0-SNAPSHOT:cmake-test (test-container-executor) on project hadoop-yarn-server-nodemanager: Test /testptch/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/target/usr/local/bin/test-container-executor returned ERROR CODE 1 -> [Help 1] {code} > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > This is observed when the first attempt's master container dies and the second > attempt's master container is launched on an NM where old containers are running > but the master container is not. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container, i.e. container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In this scenario, the NM does not identify the flowContext and logs the warning > below: > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538967#comment-16538967 ] genericqa commented on YARN-8512: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 30s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 51s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 50s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 6s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 27s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 72m 32s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-8512 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931029/YARN-8512.03.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 874f755939ab 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d503f65 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/21203/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21203/testReport/ | | Max. process+thread count | 407 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21203/console | | Powered by |
[jira] [Updated] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manikandan R updated YARN-4606: --- Attachment: YARN-4606.006.patch > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belonging to the same user in a LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers that user > an active user. This can lead to starvation of active applications, for example: > - App1 (belongs to user1)/app2 (belongs to user2) are active; app3 (belongs to > user3)/app4 (belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, only two users (user1/user2) are able to allocate new > resources, so the computed user-limit-resource could be lower than expected.
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538949#comment-16538949 ] Manikandan R commented on YARN-4606: Fixed whitespace-related issues. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, > YARN-4606.POC.patch > > > Currently, if all applications belonging to the same user in a LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers that user > an active user. This can lead to starvation of active applications, for example: > - App1 (belongs to user1)/app2 (belongs to user2) are active; app3 (belongs to > user3)/app4 (belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, only two users (user1/user2) are able to allocate new > resources, so the computed user-limit-resource could be lower than expected.
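The starvation described in the issue is straightforward arithmetic: the user limit roughly divides the queue's resources by the active-user count, so counting users whose apps are all pending shrinks every real user's share. The sketch below is a deliberate simplification of that idea, not the CapacityScheduler's actual user-limit formula:

```java
// Simplified illustration of YARN-4606: user-limit ~= queueResource / #activeUsers.
// Counting pending-only users (4 instead of 2) halves the share of the users
// that can actually allocate. This is a sketch, not the scheduler's real code.
public class UserLimitSketch {

    static int userLimitMb(int queueResourceMb, int activeUsers) {
        return queueResourceMb / Math.max(1, activeUsers); // guard against zero users
    }

    public static void main(String[] args) {
        int queueMb = 100_000;
        // Before the fix: all four users counted, two of them pending-only.
        System.out.println("limit counting pending users: " + userLimitMb(queueMb, 4)); // 25000
        // Counting only the two users that can actually allocate resources.
        System.out.println("limit counting active users:  " + userLimitMb(queueMb, 2)); // 50000
    }
}
```

The gap between the two numbers is exactly the "lower than expected" user-limit-resource the description refers to.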
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538849#comment-16538849 ] Sunil Govindan commented on YARN-8512: -- Thanks [~rohithsharma]. Patch seems fine to me. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > This is observed when the first attempt's master container dies and the second > attempt's master container is launched on an NM where old containers are running > but the master container is not. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container, i.e. container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In this scenario, the NM does not identify the flowContext and logs the warning > below: > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
[jira] [Comment Edited] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538831#comment-16538831 ] Antal Bálint Steinbach edited comment on YARN-8468 at 7/10/18 3:54 PM: --- Hi [~snemeth] Thanks for the feedback. I applied all of your points except for removing _QueueMaxContainerAllocationValidator.createExceptionText_ from the test. I used it because I was testing whether the parameters passed to the exception are correct, not validating the error message text. Balint was (Author: bsteinbach): Hi [~snemeth] Thanks for the feedback. I applied all of them except for removing _QueueMaxContainerAllocationValidator.createExceptionText_ from the test. I used it because I was testing whether the parameters passed to the exception are correct, not validating the error message text. Balint > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch, YARN-8468.003.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: a user has two pools, one for ad hoc jobs and one for enterprise > apps, and wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb sets a default maximum > container size for all queues; the per-queue maximum is then set with the > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * setting it on the root would override the scheduler setting, so we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc. for the queue. > * write JUnit tests. > * update the scheduler documentation.
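The lookup the suggested solution calls for — a per-queue cap that falls back to the scheduler-wide cap and never exceeds it — can be sketched as follows. The flat map and all names are illustrative stand-ins, not the patch's actual types:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a per-queue maximum-allocation lookup. A queue
// without its own cap inherits the scheduler-wide maximum; a configured cap
// is clamped so it can never exceed the scheduler-wide maximum.
public class QueueMaxAllocationSketch {

    // Stand-in for yarn.scheduler.maximum-allocation-mb.
    static final int SCHEDULER_MAX_MB = 8192;

    // Stand-in for per-queue "maxContainerResources" configuration.
    static final Map<String, Integer> queueMaxMb = new HashMap<>();
    static {
        queueMaxMb.put("root.adhoc", 2048); // ad hoc jobs: small containers only
    }

    static int getMaximumAllocationMb(String queueName) {
        Integer queueCap = queueMaxMb.get(queueName);
        if (queueCap == null) {
            return SCHEDULER_MAX_MB;                 // not configured: use the default
        }
        return Math.min(queueCap, SCHEDULER_MAX_MB); // queue cap bounded by scheduler cap
    }

    public static void main(String[] args) {
        System.out.println(getMaximumAllocationMb("root.adhoc"));      // 2048
        System.out.println(getMaximumAllocationMb("root.enterprise")); // 8192
    }
}
```

The clamp in the last return implements the "queue resource cap cannot be larger than the scheduler max resource cap" item from the list above.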
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538831#comment-16538831 ] Antal Bálint Steinbach commented on YARN-8468: -- Hi [~snemeth] Thanks for the feedback. I applied all of them except for removing _QueueMaxContainerAllocationValidator.createExceptionText_ from the test. I used it because I was testing if the parameters are correct for the exception not for validating the error message text. Balint > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch, YARN-8468.003.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers and cannot be limited by queue or and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per queue basis. > > The use case: User has two pools, one for ad hoc jobs and one for enterprise > apps. User wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb sets a default value for maximum > container size for all queues and setting maximum resources per queue with > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf), this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting and we should > not allow that. 
> * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler. > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue. > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics, etc., for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
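For illustration, the per-queue cap from the use case above might look like this in the fair-scheduler allocation file. This is a sketch only: the element name follows the suggested "maxContainerResources" config value and the value format mimics existing FairScheduler resource settings; neither is taken from the actual patch.

```xml
<!-- fair-scheduler.xml (illustrative): cap ad hoc containers at 2 GB / 2 vcores,
     while the enterprise queue falls back to the global
     yarn.scheduler.maximum-allocation-mb limit. -->
<allocations>
  <queue name="adhoc">
    <maxContainerResources>2048 mb, 2 vcores</maxContainerResources>
  </queue>
  <queue name="enterprise">
    <!-- no per-queue cap: the scheduler-wide maximum applies -->
  </queue>
</allocations>
```

Per the suggested solution, setting such an element on the root queue would be rejected, since it would override the scheduler-wide setting.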
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538830#comment-16538830 ] Rohith Sharma K S commented on YARN-8512: - Updated the patch with a test case added. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > This is observed when the 1st attempt's master container dies and the 2nd attempt's > master container is launched on an NM where old containers are running but the > master container is not. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container, i.e. container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, the NM doesn't identify the flowContext and logs the warning > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
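The WARN lines quoted above come from a guard that skips the HBase write when any flow-context field is null. A minimal Python sketch of that guard's behavior (names are illustrative; the real check lives in HBaseTimelineWriterImpl.write, per the log):

```python
def should_write(flow_name, app_id, user_id, cluster_id):
    """Sketch of the guard behind the WARN above: refuse to write
    to HBase when any flow-context field is missing (None)."""
    fields = {"flowName": flow_name, "appId": app_id,
              "userId": user_id, "clusterId": cluster_id}
    missing = [name for name, value in fields.items() if value is None]
    if missing:
        # Mirrors: "Found null for one of: ... Not proceeding with writing to hbase"
        print("Found null for one of: %s. Not proceeding with writing to hbase"
              % ", ".join(missing))
        return False
    return True
```

When the second attempt's NM never learned the flow context, flowName stays None, so every entity write is dropped silently apart from this warning.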
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Attachment: YARN-8512.03.patch > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch, > YARN-8512.03.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antal Bálint Steinbach updated YARN-8468: - Attachment: YARN-8468.003.patch > Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch, YARN-8468.003.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: User has two pools, one for ad hoc jobs and one for enterprise > apps. User wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb sets a default maximum > container size for all queues; the per-queue maximum would then be set with the > “maxContainerResources” queue config value. > > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting, and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. 
> * implement getMaximumResourceCapability(String queueName) in the > FairScheduler. > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue. > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics, etc., for the queue. > * write JUnit tests. > * update the scheduler documentation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538779#comment-16538779 ] Hudson commented on YARN-8473: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14549 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14549/]) YARN-8473. Containers being launched as app tears down can leave (sunilg: rev 705e2c1f7cba51496b0d019ecedffbe5fb55c28b) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/ApplicationImpl.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/application/TestApplication.java > Containers being launched as app tears down can leave containers in NEW state > - > > Key: YARN-8473 > URL: https://issues.apache.org/jira/browse/YARN-8473 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.4 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Major > Fix For: 2.10.0, 3.2.0, 3.1.1, 2.9.2, 2.8.5, 3.0.4 > > Attachments: YARN-8473.001.patch, YARN-8473.002.patch, > YARN-8473.003.patch > > > I saw a case where containers were stuck on a nodemanager in the NEW state > because they tried to launch just as an application was tearing down. The > container sent an INIT_CONTAINER event to the ApplicationImpl which then > executed an invalid transition since that event is not handled/expected when > the application is in the process of tearing down. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
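The failure mode described above — an event arriving in a state that has no transition registered for it — can be sketched with a toy state machine. Everything below is illustrative (YARN's real ApplicationImpl uses a much larger transition table); the gist of the fix is to register the late INIT_CONTAINER event as tolerated during teardown instead of letting it trigger an invalid transition:

```python
class InvalidTransitionError(Exception):
    """Raised when an event has no transition from the current state."""


class AppStateMachine:
    """Toy sketch of a transition-table state machine (not YARN's API)."""

    def __init__(self, transitions, tolerated=(), start="RUNNING"):
        self.state = start
        self._transitions = dict(transitions)  # (state, event) -> next state
        self._tolerated = set(tolerated)       # (state, event) pairs to ignore

    def handle(self, event):
        key = (self.state, event)
        if key in self._transitions:
            self.state = self._transitions[key]
        elif key in self._tolerated:
            pass  # e.g. a container launching just as the app tears down
        else:
            raise InvalidTransitionError(
                "event %s not expected in state %s" % (event, self.state))
```

Without the `tolerated` entry, the late INIT_CONTAINER raises and the container is left stranded in NEW; with it, the event is absorbed harmlessly.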
[jira] [Commented] (YARN-8501) Reduce complexity of RMWebServices' getApps method
[ https://issues.apache.org/jira/browse/YARN-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538770#comment-16538770 ] Szilard Nemeth commented on YARN-8501: -- Hi [~Zian Chen]! As my IntelliJ marks this method with a warning ("Method is too complex to analyze by data flow algorithm"), I was thinking about these: 1. Eliminate the boolean flags. 2. Separate validation and throwing exceptions from the rest of the code. 3. Use a builder that creates a {{GetApplicationsRequest}} from the provided query parameters. 4. Add some test cases in order to verify I don't break anything. Do you have anything to add to this list? Thanks! > Reduce complexity of RMWebServices' getApps method > -- > > Key: YARN-8501 > URL: https://issues.apache.org/jira/browse/YARN-8501 > Project: Hadoop YARN > Issue Type: Improvement > Components: restapi >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
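Point 3 in the list above can be sketched generically. A hedged Python illustration (the real class would be a Java builder for {{GetApplicationsRequest}}; every name below is hypothetical): validation fails fast, unset parameters are simply skipped, and no boolean flags survive to the end of the method.

```python
class GetApplicationsRequestBuilder:
    """Hypothetical sketch of point 3: collect query parameters,
    validate each one up front (point 2), keep no boolean flags
    (point 1). Only build() materializes the request."""

    def __init__(self):
        self._params = {}

    def with_states(self, states):
        if states:  # an absent parameter is simply not recorded
            self._params["states"] = set(states)
        return self

    def with_limit(self, limit):
        if limit is not None:
            if limit < 0:
                raise ValueError("limit must be non-negative")
            self._params["limit"] = limit
        return self

    def build(self):
        # Stand-in for constructing the actual GetApplicationsRequest.
        return dict(self._params)
```

With this shape, getApps reduces to one chained call per query parameter plus a single build(), which is what makes the data-flow analysis tractable again.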
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538763#comment-16538763 ] Rohith Sharma K S commented on YARN-8383: - Sure. I am doing verification and will commit it later today. Thanks > TimelineServer 1.5 start fails with NoClassDefFoundError > > > Key: YARN-8383 > URL: https://issues.apache.org/jira/browse/YARN-8383 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8383.001-branch-2.8.patch > > > TimelineServer 1.5 start fails with NoClassDefFoundError. > {noformat} > 2018-05-31 22:10:58,548 FATAL > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: > Error starting ApplicationHistoryServer > java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory > at > org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at > 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190) > Caused by: java.lang.ClassNotFoundException: > com.fasterxml.jackson.core.JsonFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 15 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8502) Use path strings consistently for webservice endpoints in RMWebServices
[ https://issues.apache.org/jira/browse/YARN-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538764#comment-16538764 ] Szilard Nemeth commented on YARN-8502: -- Hey [~giovanni.fumarola]! I would vote for a separate jira, as the thing you mentioned is not strictly related to constants or endpoint paths and could confuse anyone looking into the git log. Thanks! > Use path strings consistently for webservice endpoints in RMWebServices > --- > > Key: YARN-8502 > URL: https://issues.apache.org/jira/browse/YARN-8502 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8502-001.patch > > > Currently there are 2 types of endpoint path definitions: > 1. with a string, example: > @Path("/apps/{appid}/appattempts/{appattemptid}/containers/{containerid}") > 2. with a constant, example: > @Path(RMWSConsts.APPS_APPID_APPATTEMPTS_APPATTEMPTID_CONTAINERS) > Most preferably, constants should be used for all Paths. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538755#comment-16538755 ] Jason Lowe commented on YARN-8383: -- bq. change in HDFS is an incompatible change from branch-2.8 to branch-2.9 or branch-2 from jobs perspective right? Yes, you're right. We may be forced to do another round of shading of jackson in HDFS as we did for YARN in 2.8. Arguably that's a separate JIRA, and this one can focus on the fix for 2.8.x. > TimelineServer 1.5 start fails with NoClassDefFoundError > > > Key: YARN-8383 > URL: https://issues.apache.org/jira/browse/YARN-8383 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8383.001-branch-2.8.patch > > > TimelineServer 1.5 start fails with NoClassDefFoundError. > {noformat} > 2018-05-31 22:10:58,548 FATAL > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: > Error starting ApplicationHistoryServer > java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory > at > org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190) > Caused by: java.lang.ClassNotFoundException: > com.fasterxml.jackson.core.JsonFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 15 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538750#comment-16538750 ] genericqa commented on YARN-8512: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 34s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 0s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 25s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 17m 46s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 79m 9s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-8512 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931006/YARN-8512.02.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux d2154bee076c 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ca8b80b | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/21201/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21201/testReport/ | | Max. process+thread count | 312 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as a > countable resource. > However, GPU placement is also very important to deep learning jobs for better > efficiency. > For example, a 2-GPU job running on GPUs {0, 1} could be faster than one on GPUs > {0, 7}, if GPUs 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > supports fine-grained GPU placement. > A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding bit position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
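The bitmap encoding described above is simple to sketch. A minimal Python illustration of the semantics (the helper names and the 4-GPUs-per-switch grouping are assumptions for this example; the actual patch encodes the bitmap in the YARN Resource):

```python
def available_gpus(bitmap):
    """Decode the 64-bit availability bitmap: bit i set to '1'
    means GPU i is free, as described in the JIRA."""
    return {i for i in range(64) if (bitmap >> i) & 1}


def same_pcie_switch(gpu_a, gpu_b, gpus_per_switch=4):
    # Illustrative locality check: assume consecutive groups of
    # gpus_per_switch GPUs hang off the same PCI-E switch.
    return gpu_a // gpus_per_switch == gpu_b // gpus_per_switch
```

Under that grouping, a scheduler preferring {0, 1} over {0, 7} is simply picking two set bits that share a switch.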
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as a > countable resource. > However, GPU placement is also very important to deep learning jobs for better > efficiency. > For example, a 2-GPU job running on GPUs {0, 1} could be faster than one on GPUs > {0, 7}, if GPUs 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > supports fine-grained GPU placement. > A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding bit position. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538734#comment-16538734 ] Rohith Sharma K S commented on YARN-8383: - Ahh.. I misunderstood your earlier comment. Thanks for clarifying it. bq. Instead I was proposing adding the dependency back in for branch-2 and branch-2.9, since the jackson dependency is already there in those release lines due to HDFS pulling it in. Considering HDFS is already pulling jackson-core, it should be fine. My doubt is, CMIIW, change in HDFS is an incompatible change from branch-2.8 to branch-2.9 or branch-2 from jobs perspective right? ..since application classpath also refer to hdfs/lib. > TimelineServer 1.5 start fails with NoClassDefFoundError > > > Key: YARN-8383 > URL: https://issues.apache.org/jira/browse/YARN-8383 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8383.001-branch-2.8.patch > > > TimelineServer 1.5 start fails with NoClassDefFoundError. 
> {noformat} > 2018-05-31 22:10:58,548 FATAL > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: > Error starting ApplicationHistoryServer > java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory > at > org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190) > Caused by: java.lang.ClassNotFoundException: > com.fasterxml.jackson.core.JsonFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at 
java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 15 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state
[ https://issues.apache.org/jira/browse/YARN-8473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538701#comment-16538701 ] Sunil Govindan commented on YARN-8473: -- Thanks [~jlowe]. I'll help to commit this. > Containers being launched as app tears down can leave containers in NEW state > - > > Key: YARN-8473 > URL: https://issues.apache.org/jira/browse/YARN-8473 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.8.4 >Reporter: Jason Lowe >Assignee: Jason Lowe >Priority: Major > Attachments: YARN-8473.001.patch, YARN-8473.002.patch, > YARN-8473.003.patch > > > I saw a case where containers were stuck on a nodemanager in the NEW state > because they tried to launch just as an application was tearing down. The > container sent an INIT_CONTAINER event to the ApplicationImpl which then > executed an invalid transition since that event is not handled/expected when > the application is in the process of tearing down. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8383) TimelineServer 1.5 start fails with NoClassDefFoundError
[ https://issues.apache.org/jira/browse/YARN-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538699#comment-16538699 ] Jason Lowe commented on YARN-8383: -- bq. I think adding dependencies share/hadoop/yarn/lib is right way to fix. But this change going to bring back YARN-6628 which will become compatible issue for older jobs right? I'm not proposing adding the dependency back for 2.8. The attached patch shades even more than we did before, so if anything we're removing dependencies from an app's point of view if this patch goes into 2.8. Instead I was proposing adding the dependency back in for branch-2 and branch-2.9, since the jackson dependency is already there in those release lines due to HDFS pulling it in. On those two branches shading YARN's jackson dependency isn't buying us anything from an app's perspective. > TimelineServer 1.5 start fails with NoClassDefFoundError > > > Key: YARN-8383 > URL: https://issues.apache.org/jira/browse/YARN-8383 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.8.4 >Reporter: Rohith Sharma K S >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8383.001-branch-2.8.patch > > > TimelineServer 1.5 start fails with NoClassDefFoundError. 
> {noformat} > 2018-05-31 22:10:58,548 FATAL > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: > Error starting ApplicationHistoryServer > java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/JsonFactory > at > org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore.(RollingLevelDBTimelineStore.java:174) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2306) > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2271) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2367) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.createSummaryStore(EntityGroupFSTimelineStore.java:239) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.serviceInit(EntityGroupFSTimelineStore.java:146) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:115) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:180) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:190) > Caused by: java.lang.ClassNotFoundException: > com.fasterxml.jackson.core.JsonFactory > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at 
java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 15 more > {noformat}
[jira] [Comment Edited] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538572#comment-16538572 ] Rohith Sharma K S edited comment on YARN-8512 at 7/10/18 1:49 PM: -- The 01 patch seems to be a problem if we create a new ApplicationImpl and update it in the context, since that resets its state machine. We just need to update the existing flowContext object inside ApplicationImpl. I will upload a new patch with this change and cancel the existing patch. was (Author: rohithsharma): The 01 patch seems to be a problem if we update the ApplicationImpl, since that resets its state machine. In this case, we just need to update the existing field value inside ApplicationImpl. I will upload a new patch with this change and cancel the existing patch. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . 
Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
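The fix described in the edited comment above, mutating the flow context in place on the existing ApplicationImpl rather than replacing the object (which would reset its state machine), can be sketched as follows. All class and method names here are illustrative assumptions, not the actual NodeManager API or the patch's code:

```java
// Hypothetical sketch of the 02 patch approach: update the existing flow
// context object instead of creating a new ApplicationImpl, so the
// application's state machine is left untouched. Names are illustrative.
class FlowContextSketch {
    volatile String flowName;
    volatile String flowVersion;
    volatile long flowRunId;
}

class ApplicationSketch {
    private final FlowContextSketch flowContext = new FlowContextSketch();

    // Invoked when a later attempt's AM container starts on this NM carrying
    // flow information that the first attempt never delivered here.
    void setFlowContext(String name, String version, long runId) {
        flowContext.flowName = name;
        flowContext.flowVersion = version;
        flowContext.flowRunId = runId;
    }

    FlowContextSketch getFlowContext() {
        return flowContext;
    }
}
```

With this shape, any component already holding the Application reference sees the updated flow fields immediately, which is why no state-store or context replacement is needed.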
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538599#comment-16538599 ] Rohith Sharma K S commented on YARN-8512: - Attached 02 patch that sets flow context in existing ApplicationImpl. [~sunilg] Could you please review? > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Attachment: YARN-8512.02.patch > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch, YARN-8512.02.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710_old.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPUs as a > countable resource. > However, GPU placement is also very important for deep learning jobs, for better > efficiency. > For example, a 2-GPU job running on GPUs {0,1} could be faster than one running on GPUs > {0,7}, if GPUs 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add support to Hadoop 2.7.2 to enable GPU locality scheduling, which > supports fine-grained GPU placement. > A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage > and locality information on a node (up to 64 GPUs per node): a '1' in a bit position means > the corresponding GPU is available, and a '0' means it is not.
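The 64-bit bitmap described above can be manipulated with plain long operations. The following is a minimal sketch; the helper names and the fixed group size standing in for a PCI-E switch are assumptions for illustration, not the patch's actual code:

```java
// Minimal sketch of operations on the 64-bit GPU bitmap described above:
// bit i set to 1 means GPU i is available. Helper names and the fixed
// switch group size are illustrative assumptions, not the patch's code.
public class GpuBitmapSketch {
    // True if the bit for the given GPU index is set.
    public static boolean isAvailable(long bitmap, int gpu) {
        return ((bitmap >>> gpu) & 1L) == 1L;
    }

    // Number of GPUs currently available on the node.
    public static int availableCount(long bitmap) {
        return Long.bitCount(bitmap);
    }

    // Mark the GPUs in requestMask as used by clearing their bits.
    public static long allocate(long bitmap, long requestMask) {
        return bitmap & ~requestMask;
    }

    // Locality check under the assumption that GPUs are grouped under
    // PCI-E switches of a fixed size: with switchSize = 4, GPUs 0 and 1
    // share a switch while GPUs 0 and 7 do not.
    public static boolean sameSwitch(int gpuA, int gpuB, int switchSize) {
        return gpuA / switchSize == gpuB / switchSize;
    }
}
```

A scheduler could then prefer a request mask whose set bits all fall in one switch group, matching the {0,1}-faster-than-{0,7} example in the description.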
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710_old.patch > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, > hadoop-2.7.2.gpu-port-20180710_old.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPU as > countable resource. > However, GPU placement is also very important to deep learning job for better > efficiency. > For example, a 2-GPU job runs on gpu {0,1} could be faster than run on gpu > {0, 7}, if GPU 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add the support to Hadoop 2.7.2 to enable GPU locality scheduling, which > support fine-grained GPU placement. > A 64-bits bitmap is added to yarn Resource, which indicates both GPU usage > and locality information in a node (up to 64 GPUs per node). '1' means > available and '0' otherwise in the corresponding position of the bit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538579#comment-16538579 ] genericqa commented on YARN-8480: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 35s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 20 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 43s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 33s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 23s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 7m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 11s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 24s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 26s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 26s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 47s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 19s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 69m 4s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 24m 5s{color} | {color:green} hadoop-yarn-client in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 39s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}190m 53s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api | | | org.apache.hadoop.yarn.conf.YarnConfiguration.DEFAULT_RM_CONFIGURATION_PROVIDER_CLASS isn't final but should be
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538572#comment-16538572 ] Rohith Sharma K S commented on YARN-8512: - The 01 patch seems to be a problem if we update the ApplicationImpl, since that resets its state machine. In this case, we just need to update the existing field value inside ApplicationImpl. I will upload a new patch with this change and cancel the existing patch. > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat}
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase from second attempt onwards
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Summary: ATSv2 entities are not published to HBase from second attempt onwards (was: ATSv2 entities are not published to HBase) > ATSv2 entities are not published to HBase from second attempt onwards > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8512) ATSv2 entities are not published to HBase
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538556#comment-16538556 ] Rohith Sharma K S commented on YARN-8512: - Attached a patch with the following modifications to update the FlowContext in ApplicationImpl: # _if_ an ApplicationImpl reference is found in the context while starting the master container, _then_ ** create a new reference to ApplicationImpl, ** update the context, and ** update the NMStateStore so that NM recovery will pick up the newer ApplicationImpl. > ATSv2 entities are not published to HBase > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . 
Not proceeding with writing to hbase > {noformat}
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Affects Version/s: 3.0.3 3.2.0 2.10.0 3.1.0 Target Version/s: 3.1.1 > ATSv2 entities are not published to HBase > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0, 2.10.0, 3.2.0, 3.0.3 >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8512) ATSv2 entities are not published to HBase
[ https://issues.apache.org/jira/browse/YARN-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith Sharma K S updated YARN-8512: Attachment: YARN-8512.01.patch > ATSv2 entities are not published to HBase > - > > Key: YARN-8512 > URL: https://issues.apache.org/jira/browse/YARN-8512 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Rohith Sharma K S >Priority: Major > Attachments: YARN-8512.01.patch > > > It is observed that if 1st attempt master container is died and 2nd attempt > master container is launched in a NM where old containers are running but not > master container. > ||Attempt||NM1||NM2||Action|| > |attempt-1|master container i.e container-1-1|container-1-2|master container > died| > |attempt-2|NA|container-1-2 and master container container-2-1|NA| > In the above scenario, NM doesn't identifies flowContext and will get log > below > {noformat} > 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl > (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: > flowName=null appId=application_1531175172425_0001 userId=hbase > clusterId=yarn-cluster . Not proceeding with writing to hbase > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8511) When AM releases a container, RM removes allocation tags before it is released by NM
[ https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538478#comment-16538478 ] genericqa commented on YARN-8511:
*-1 overall*
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 42s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 1m 48s | Maven dependency ordering for branch |
| +1 | mvninstall | 27m 44s | trunk passed |
| +1 | compile | 29m 35s | trunk passed |
| +1 | checkstyle | 0m 23s | trunk passed |
| +1 | mvnsite | 1m 34s | trunk passed |
| +1 | shadedclient | 13m 14s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 18s | trunk passed |
| +1 | javadoc | 1m 5s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 19s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 10s | the patch passed |
| +1 | compile | 29m 29s | the patch passed |
| +1 | javac | 29m 29s | the patch passed |
| +1 | checkstyle | 0m 25s | the patch passed |
| +1 | mvnsite | 1m 35s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 12s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 32s | the patch passed |
| +1 | javadoc | 1m 6s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 69m 45s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | unit | 10m 21s | hadoop-sls in the patch passed. |
| +1 | asflicense | 0m 39s | The patch does not generate ASF License warnings. |
| | | 206m 53s | |
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl |
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8511 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12930962/YARN-8511.001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 4fe7932f3d45 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 9bd5bef |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| unit |
[jira] [Commented] (YARN-8505) AMLimit and userAMLimit check should be skipped for unmanaged AM
[ https://issues.apache.org/jira/browse/YARN-8505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538425#comment-16538425 ] Bibin A Chundatt commented on YARN-8505: {quote} maxApplications and maxApplicationsPerUser. {quote} The above properties limit the total number of applications in a queue, not the number of running applications, IIUC. > AMLimit and userAMLimit check should be skipped for unmanaged AM > > > Key: YARN-8505 > URL: https://issues.apache.org/jira/browse/YARN-8505 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.0, 2.9.2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-8505.001.patch > > > The AMLimit and userAMLimit checks in LeafQueue#activateApplications should be > skipped for an unmanaged AM, whose resource is not taken from the YARN cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
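The proposed skip can be sketched as below. This is a minimal illustration, not the actual LeafQueue#activateApplications code; the class and parameter names (`AmLimitCheck`, `mayActivate`) are hypothetical:

```java
// Illustrative sketch of the YARN-8505 proposal: an unmanaged AM runs
// outside the cluster, so its AM consumes no YARN resources and the
// AMLimit/userAMLimit checks can be bypassed for it.
// All names here are hypothetical, not the real scheduler API.
public class AmLimitCheck {

    /**
     * Returns true if the application may be activated.
     * For unmanaged AMs the limit check is skipped entirely;
     * otherwise the pending AM resource must fit under the limit.
     */
    public static boolean mayActivate(boolean unmanagedAM,
                                      long amResourceUsedMB,
                                      long amResourceRequestMB,
                                      long amLimitMB) {
        if (unmanagedAM) {
            return true; // skip AMLimit and userAMLimit checks
        }
        return amResourceUsedMB + amResourceRequestMB <= amLimitMB;
    }
}
```

With this shape, an unmanaged AM activates even when the queue's AM limit is already exhausted, while managed AMs keep the existing behavior.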
[jira] [Created] (YARN-8512) ATSv2 entities are not published to HBase
Rohith Sharma K S created YARN-8512: --- Summary: ATSv2 entities are not published to HBase Key: YARN-8512 URL: https://issues.apache.org/jira/browse/YARN-8512 Project: Hadoop YARN Issue Type: Bug Reporter: Yesha Vora Assignee: Rohith Sharma K S This is observed when the first-attempt master container dies and the second-attempt master container is launched on an NM that is already running other containers of the application but had not run the master container before. ||Attempt||NM1||NM2||Action|| |attempt-1|master container, i.e. container-1-1|container-1-2|master container died| |attempt-2|NA|container-1-2 and master container container-2-1|NA| In the above scenario, the NM does not identify the flowContext and logs the warning below: {noformat} 2018-07-10 00:44:38,285 WARN storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: flowName=null appId=application_1531175172425_0001 userId=hbase clusterId=yarn-cluster . Not proceeding with writing to hbase 2018-07-10 00:44:38,560 WARN storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: flowName=null appId=application_1531175172425_0001 userId=hbase clusterId=yarn-cluster . Not proceeding with writing to hbase {noformat}
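The guard that produces the warning quoted above can be sketched as follows. This is an assumption-laden simplification of the check in HBaseTimelineWriterImpl#write, not the actual implementation; the class and method names here are illustrative:

```java
// Sketch of the flow-context null guard behind the warning above:
// if any field needed for the HBase row key is null, the writer
// logs a warning and skips the write instead of proceeding.
// Names are simplified; the real check lives in
// HBaseTimelineWriterImpl#write.
public class FlowContextGuard {

    /**
     * Returns true only when every flow-context field required to
     * build the HBase row key is present.
     */
    public static boolean canWrite(String clusterId, String userId,
                                   String flowName, String appId) {
        return clusterId != null && userId != null
            && flowName != null && appId != null;
    }
}
```

In the bug scenario the NM never learned the flowName for the second attempt, so `flowName` is null and every write for that application is dropped.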
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710.patch) > Gpu locality support for Better AI scheduling > - > > Key: YARN-7481 > URL: https://issues.apache.org/jira/browse/YARN-7481 > Project: Hadoop YARN > Issue Type: New Feature > Components: api, RM, yarn >Affects Versions: 2.7.2 >Reporter: Chen Qingcha >Priority: Major > Fix For: 2.7.2 > > Attachments: GPU locality support for Job scheduling.pdf, > hadoop-2.7.2.gpu-port-20180710.patch, hadoop-2.7.2.gpu-port.patch, > hadoop-2.9.0.gpu-port.patch, hadoop_2.9.0.patch > > Original Estimate: 1,344h > Remaining Estimate: 1,344h > > We enhance Hadoop with GPU support for better AI job scheduling. > Currently, YARN-3926 also supports GPU scheduling, which treats GPUs as a > countable resource. > However, GPU placement is also very important to deep learning jobs for better > efficiency. > For example, a 2-GPU job running on GPUs {0,1} could be faster than one running on GPUs > {0,7}, if GPUs 0 and 1 are under the same PCI-E switch while 0 and 7 are not. > We add support to Hadoop 2.7.2 to enable GPU locality scheduling, which > supports fine-grained GPU placement. > A 64-bit bitmap is added to the YARN Resource, which indicates both GPU usage > and locality information on a node (up to 64 GPUs per node): '1' means > available and '0' otherwise in the corresponding bit position.
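The 64-bit bitmap described above can be sketched as plain bit operations on a Java `long`. This is an illustrative sketch only: the group size of 4 GPUs per PCI-E switch is an assumed topology, and none of these names come from the patch:

```java
// Sketch of the 64-bit GPU bitmap from YARN-7481: bit i set to 1
// means GPU i is available on the node. The mapping of GPU index
// to PCI-E switch (4 GPUs per switch here) is a hypothetical
// topology chosen for illustration.
public class GpuBitmap {

    static final int GPUS_PER_SWITCH = 4; // assumed topology

    /** True if GPU {@code gpu} is marked available in the bitmap. */
    public static boolean isAvailable(long bitmap, int gpu) {
        return ((bitmap >>> gpu) & 1L) == 1L;
    }

    /**
     * True if both GPUs are available and sit under the same
     * PCI-E switch, i.e. a locality-friendly 2-GPU placement.
     */
    public static boolean colocated(long bitmap, int a, int b) {
        return isAvailable(bitmap, a) && isAvailable(bitmap, b)
            && a / GPUS_PER_SWITCH == b / GPUS_PER_SWITCH;
    }
}
```

With the example from the description, a bitmap with GPUs 0, 1, and 7 available lets the scheduler prefer {0,1} (same switch) over {0,7} (different switches).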
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch
[jira] [Commented] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538344#comment-16538344 ] Szilard Nemeth commented on YARN-8480: -- Uploaded a second patch that fixes the trailing-whitespace issues and one findbugs issue. The other findbugs issue, which complains that DEFAULT_RM_CONFIGURATION_PROVIDER_CLASS should be final, could not be fixed because the static initializer method in TestResource must modify this field in order to work correctly. > Add boolean option for resources > > > Key: YARN-8480 > URL: https://issues.apache.org/jira/browse/YARN-8480 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Daniel Templeton >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8480.001.patch, YARN-8480.002.patch > > > Make it possible to define a resource with a boolean value.
[jira] [Updated] (YARN-8480) Add boolean option for resources
[ https://issues.apache.org/jira/browse/YARN-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-8480: - Attachment: YARN-8480.002.patch
[jira] [Commented] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538323#comment-16538323 ] Szilard Nemeth commented on YARN-8468: --
Hi [~bsteinbach]! Thanks for the patch. This is high-quality code. I noticed a couple of things:
- {{AllocationFileQueueParser: MAX_CONTAINER_RESOURCES}} could be package-private (without any modifier).
- {{QueueMaxContainerAllocationValidator.createExceptionText}}: please use {{String.format()}} instead of concatenating the parts of the string.
- {{QueueMaxContainerAllocationValidator}}: you used the method {{throwException}} twice, and you also used {{throw new YarnRuntimeException}} as is. You should either use the method for all three invocations or use {{throw new YarnRuntimeException()}} everywhere. I prefer the latter.
- {{QueueMaxContainerAllocationValidator.validate}}: I would use a message like this instead: "Invalid queue resource allocation, it is not allowed to override " + MAX_CONTAINER_RESOURCES + " for the root queue!"
- {{QueueMaxContainerAllocationValidator.validate}}: logging maxMem and maxCores on INFO level is unnecessary. I would not log these at all, not even on DEBUG level, as they hold no meaningful information for users.
- {{QueueMaxContainerAllocationValidator.checkContainerResources}}: same as above, remove the queueMem and queueCores log statements.
- {{AllocationConfiguration.queueMaxContainerResourcesMap}}: please add a comment about what this field is for, as we have comments for the other fields as well.
- {{FSLeafQueue.getMaximumResourceCapability // FSParentQueue.getMaximumResourceCapability}}: I noticed in passing that a space is missing between the "if" and the parenthesis.
- {{TestQueueMaxContainerAllocationValidator}}: I think the convention is to use method names like {{testXXX}}, so {{tooHighMemoryMaxContainerAllocationTest}} should change to {{testTooHighMemoryMaxContainerAllocation}}. In addition, I would rename it to {{testMaxContainerAllocationWithTooHighMemory}} and the rest of the methods similarly.
- {{TestQueueMaxContainerAllocationValidator}}: please don't use {{QueueMaxContainerAllocationValidator.createExceptionText}} in the tests; if the production code generates the text in a wrong format, this test won't fail. I would simply use plain Strings here to assert the message.
- {{TestFairScheduler}}: once again, the convention for method names is testXXX.
- In the {{FairScheduler.md}} documentation, I would replace "This property is invalid for root queue." with "This property cannot be defined for the root queue".
Please fix the lines longer than 80 chars; I saw at least one occurrence in {{FairSchedulerTestBase}} and {{TestFairScheduler}}.
> Limit container sizes per queue in FairScheduler > > > Key: YARN-8468 > URL: https://issues.apache.org/jira/browse/YARN-8468 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Critical > Labels: patch > Attachments: YARN-8468.000.patch, YARN-8468.001.patch, > YARN-8468.002.patch > > > When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" > to limit the overall size of a container. This applies globally to all > containers, cannot be limited per queue, and is not scheduler dependent. > > The goal of this ticket is to allow this value to be set on a per-queue basis. > > The use case: a user has two pools, one for ad hoc jobs and one for enterprise > apps, and wants to limit ad hoc jobs to small containers but allow > enterprise apps to request as many resources as needed. Setting > yarn.scheduler.maximum-allocation-mb provides a default maximum container > size for all queues, while the per-queue maximum is set with the > "maxContainerResources" queue config value.
> > Suggested solution: > > All the infrastructure is already in the code. We need to do the following: > * add the setting to the queue properties for all queue types (parent and > leaf); this will cover dynamically created queues. > * if we set it on the root we override the scheduler setting, and we should > not allow that. > * make sure that the queue resource cap cannot be larger than the scheduler max > resource cap in the config. > * implement getMaximumResourceCapability(String queueName) in the > FairScheduler. > * implement getMaximumResourceCapability() in both FSParentQueue and > FSLeafQueue. > * expose the setting in the queue information in the RM web UI. > * expose the setting in the metrics etc. for the queue. > * write JUnit tests. > * update the scheduler documentation.
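The {{String.format()}} suggestion from the review above can be sketched as follows. The class, method, and message wording here are hypothetical, not the actual patch code:

```java
// Illustrates the review suggestion for YARN-8468: build the
// exception text with String.format instead of concatenating
// string fragments. All names and the message text are
// hypothetical examples, not the real createExceptionText.
public class ExceptionText {

    public static String create(String queueName, long maxMemMB,
                                int maxVcores) {
        // One template string keeps the message readable and makes
        // it hard to drop a space or separator between fragments.
        return String.format(
            "Queue %s cannot allow containers larger than "
                + "(memory=%d MB, vCores=%d).",
            queueName, maxMemMB, maxVcores);
    }
}
```

Tests would then assert against a literal expected string rather than calling the same production helper, so a formatting regression in the helper still fails the test.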
[jira] [Comment Edited] (YARN-8468) Limit container sizes per queue in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538323#comment-16538323 ] Szilard Nemeth edited comment on YARN-8468 at 7/10/18 9:41 AM: ---
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710.patch)
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: (was: hadoop-2.7.2.gpu-port-20180710.patch)
[jira] [Updated] (YARN-7481) Gpu locality support for Better AI scheduling
[ https://issues.apache.org/jira/browse/YARN-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Qingcha updated YARN-7481: --- Attachment: hadoop-2.7.2.gpu-port-20180710.patch