[jira] [Commented] (YARN-8891) Documentation of the pluggable device framework

2019-02-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772732#comment-16772732
 ] 

Hadoop QA commented on YARN-8891:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
27m  0s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} mvnsite {color} | {color:red}  0m 
16s{color} | {color:red} hadoop-yarn-site in the patch failed. {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red}  0m  
0s{color} | {color:red} The patch has 21 line(s) that end in whitespace. Use 
git apply --whitespace=fix <>. Refer 
https://git-scm.com/docs/git-apply {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 55s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
27s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 40m 37s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8891 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12959385/YARN-8891-trunk.001.patch
 |
| Optional Tests |  dupname  asflicense  mvnsite  |
| uname | Linux 8728af006da8 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 1d30fd9 |
| maven | version: Apache Maven 3.3.9 |
| mvnsite | 
https://builds.apache.org/job/PreCommit-YARN-Build/23450/artifact/out/patch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-site.txt
 |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/23450/artifact/out/whitespace-eol.txt
 |
| Max. process+thread count | 447 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/23450/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Documentation of the pluggable device framework
> ---
>
> Key: YARN-8891
> URL: https://issues.apache.org/jira/browse/YARN-8891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8891-trunk.001.patch, YARN-8891-trunk.002.patch, 
> YARN-8891-trunk.003.patch
>
>







[jira] [Updated] (YARN-8891) Documentation of the pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8891:
---
Attachment: YARN-8891-trunk.002.patch

> Documentation of the pluggable device framework
> ---
>
> Key: YARN-8891
> URL: https://issues.apache.org/jira/browse/YARN-8891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8891-trunk.001.patch, YARN-8891-trunk.002.patch
>
>







[jira] [Updated] (YARN-8891) Documentation of the pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8891:
---
Attachment: YARN-8891-trunk.003.patch

> Documentation of the pluggable device framework
> ---
>
> Key: YARN-8891
> URL: https://issues.apache.org/jira/browse/YARN-8891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8891-trunk.001.patch, YARN-8891-trunk.002.patch, 
> YARN-8891-trunk.003.patch
>
>







[jira] [Commented] (YARN-6538) Inter Queue preemption is not happening when DRF is configured

2019-02-19 Thread niu (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772706#comment-16772706
 ] 

niu commented on YARN-6538:
---

Hi All,

Any update on it?

> Inter Queue preemption is not happening when DRF is configured
> --
>
> Key: YARN-6538
> URL: https://issues.apache.org/jira/browse/YARN-6538
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler, scheduler preemption
>Affects Versions: 2.8.0
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
>
> Consider a cluster capacity where memory is plentiful and vcores are 
> comparatively scarce. If applications have enough demand, vcores can be 
> exhausted. Inter-queue preemption should ideally kick in once vcores are 
> over-utilized; however, preemption is not happening.
> Analysis:
> In {{AbstractPreemptableResourceCalculator.computeFixpointAllocation}}, 
> {code}
> // assign all cluster resources until no more demand, or no resources are
> // left
> while (!orderedByNeed.isEmpty() && Resources.greaterThan(rc, totGuarant,
> unassigned, Resources.none())) {
> {code}
> the loop keeps running even when the unassigned vcores are 0 (because memory 
> is still positive). Hence idealAssigned ends up with more vcores than it 
> should, which leads to the no-preemption cases.
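For illustration only (this is not the fix attached to this JIRA), a DRF-aware
variant would stop handing out a resource type once the unassigned amount of
that type is exhausted, for example by clamping each accepted share
component-wise against what is still unassigned:

{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class DrfAwareAssignment {
  /**
   * Clamp a queue's proposed share so that no dimension exceeds what is still
   * unassigned. This avoids the situation described above, where idealAssigned
   * keeps accumulating vcores even though the cluster's vcores are exhausted
   * and only memory remains.
   */
  static Resource accept(Resource proposed, Resource unassigned) {
    // Take at most what is left in every dimension (memory AND vcores).
    Resource accepted = Resources.componentwiseMin(proposed, unassigned);
    Resources.subtractFrom(unassigned, accepted); // shrink the remaining pool
    return accepted;
  }

  public static void main(String[] args) {
    Resource unassigned = Resource.newInstance(8192, 0); // memory left, vcores gone
    Resource proposed = Resource.newInstance(4096, 4);
    // Prints <memory:4096, vCores:0>: no phantom vcores enter idealAssigned.
    System.out.println(accept(proposed, unassigned));
  }
}
{code}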






[jira] [Updated] (YARN-8891) Documentation of the pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8891:
---
Attachment: YARN-8891-trunk.001.patch

> Documentation of the pluggable device framework
> ---
>
> Key: YARN-8891
> URL: https://issues.apache.org/jira/browse/YARN-8891
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8891-trunk.001.patch
>
>







[jira] [Commented] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2019-02-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772691#comment-16772691
 ] 

Prabhu Joseph commented on YARN-6929:
-

[~jlowe] Can you review this jira when you get time. Thanks.

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-6929.1.patch, YARN-6929.2.patch, YARN-6929.2.patch, 
> YARN-6929.3.patch, YARN-6929.4.patch, YARN-6929.5.patch, YARN-6929.6.patch, 
> YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. The maximum number of items per directory is 1048576 by default 
> (HDFS-6102). With a yarn.log-aggregation.retain-seconds retention of 7 days, 
> there is a good chance that LogAggregationService fails to create a new 
> directory with FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> <remote-app-log-dir>/<user>/logs/<applicationId>. This can be 
> improved by adding the date as a subdirectory, like 
> <remote-app-log-dir>/<user>/logs/<date>/<applicationId>.
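A minimal sketch of the proposed layout (illustrative only; the helper name and
the date format are assumptions, not necessarily what the patch implements):

{code}
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import org.apache.hadoop.fs.Path;

public class DatedAggregationDir {
  // Hypothetical date format; one directory per day keeps each level small.
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd");

  /** Build <remote-app-log-dir>/<user>/logs/<date>/<applicationId>. */
  static Path remoteAppLogDir(Path remoteRootLogDir, String user, String appId) {
    Path userDir = new Path(remoteRootLogDir, user);
    Path suffixDir = new Path(userDir, "logs");
    Path dateDir = new Path(suffixDir, LocalDate.now().format(FMT));
    return new Path(dateDir, appId);
  }

  public static void main(String[] args) {
    // e.g. /app-logs/yarn/logs/2019-02-20/application_1234567890123_0001
    System.out.println(remoteAppLogDir(new Path("/app-logs"), "yarn",
        "application_1234567890123_0001"));
  }
}
{code}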
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> 

[jira] [Commented] (YARN-9227) DistributedShell RelativePath is not removed at end

2019-02-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772689#comment-16772689
 ] 

Prabhu Joseph commented on YARN-9227:
-

[~sunilg] Can you review this jira when you get time. Thanks.

> DistributedShell RelativePath is not removed at end
> ---
>
> Key: YARN-9227
> URL: https://issues.apache.org/jira/browse/YARN-9227
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Affects Versions: 3.1.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: 0001-YARN-9227.patch, 0002-YARN-9227.patch, 
> 0003-YARN-9227.patch
>
>
> DistributedShell Job does not remove the relative path which contains jars 
> and localized files.
> {code}
> [ambari-qa@ash hadoop-yarn]$ hadoop fs -ls 
> /user/ambari-qa/DistributedShell/application_1542665708563_0017
> Found 2 items
> -rw-r--r--   3 ambari-qa hdfs  46636 2019-01-23 13:37 
> /user/ambari-qa/DistributedShell/application_1542665708563_0017/AppMaster.jar
> -rwx--x---   3 ambari-qa hdfs  4 2019-01-23 13:37 
> /user/ambari-qa/DistributedShell/application_1542665708563_0017/shellCommands
> {code}
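A minimal sketch of the kind of cleanup the client could perform once the
application finishes (illustrative only; the helper and where it would be
called from are assumptions, not the attached patch):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DistShellCleanup {
  /** Delete DistributedShell/<applicationId> from the default FS, if present. */
  static void removeAppDir(Configuration conf, String appId) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // Matches the layout shown above: /user/<submitter>/DistributedShell/<appId>
    Path appDir = new Path(fs.getHomeDirectory(), "DistributedShell/" + appId);
    if (fs.exists(appDir)) {
      fs.delete(appDir, true); // recursive: removes AppMaster.jar, shellCommands, ...
    }
  }
}
{code}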






[jira] [Commented] (YARN-9258) Support to specify allocation tags without constraint in distributed shell CLI

2019-02-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772687#comment-16772687
 ] 

Prabhu Joseph commented on YARN-9258:
-

[~cheersyang] Can you review this patch when you get time. Thanks.

> Support to specify allocation tags without constraint in distributed shell CLI
> --
>
> Key: YARN-9258
> URL: https://issues.apache.org/jira/browse/YARN-9258
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: distributed-shell
>Affects Versions: 3.1.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9258-001.patch, YARN-9258-002.patch
>
>
> DistributedShell PlacementSpec fails to parse 
> {color:#d04437}zk=1:spark=1,NOTIN,NODE,zk{color}
> {code}
> java.lang.IllegalArgumentException: Invalid placement spec: 
> zk=1:spark=1,NOTIN,NODE,zk
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.PlacementSpec.parse(PlacementSpec.java:108)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.Client.init(Client.java:462)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDistributedShellWithPlacementConstraint(TestDistributedShell.java:1780)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>   at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.yarn.util.constraint.PlacementConstraintParseException: 
> Source allocation tags is required for a multi placement constraint 
> expression.
>   at 
> org.apache.hadoop.yarn.util.constraint.PlacementConstraintParser.parsePlacementSpec(PlacementConstraintParser.java:740)
>   at 
> org.apache.hadoop.yarn.applications.distributedshell.PlacementSpec.parse(PlacementSpec.java:94)
>   ... 16 more
> {code}






[jira] [Commented] (YARN-9290) Invalid SchedulingRequest not rejected in Scheduler PlacementConstraintsHandler

2019-02-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772686#comment-16772686
 ] 

Prabhu Joseph commented on YARN-9290:
-

[~cheersyang] Can you review the patch for this jira when you get time. Thanks.

> Invalid SchedulingRequest not rejected in Scheduler 
> PlacementConstraintsHandler 
> 
>
> Key: YARN-9290
> URL: https://issues.apache.org/jira/browse/YARN-9290
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9290-001.patch, YARN-9290-002.patch, 
> YARN-9290-003.patch
>
>
> A SchedulingRequest with an invalid namespace is not rejected by the 
> Scheduler PlacementConstraintsHandler. The RM keeps trying to allocateOnNode, 
> logging the exception each time. The placement-processor handler, by 
> contrast, does reject such a request.
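For illustration only (not the attached fix), the namespace prefix could be
validated once when the SchedulingRequest is accepted, so an invalid value such
as {{notselfi}} is rejected immediately rather than failing on every allocation
attempt. The validator below is hypothetical; the valid prefixes are taken from
the log message in the stack trace that follows:

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NamespaceValidator {
  // Valid prefixes as listed in the log message below.
  private static final Set<String> VALID = new HashSet<>(
      Arrays.asList("all", "not-self", "app-id", "app-tag", "self"));

  /** Hypothetical up-front check; throws at submission time instead of per node. */
  static void validateNamespacePrefix(String namespace) {
    String prefix = namespace.split("/", 2)[0];
    if (!VALID.contains(prefix)) {
      throw new IllegalArgumentException("Invalid namespace prefix: " + prefix
          + ", valid values are: " + VALID);
    }
  }

  public static void main(String[] args) {
    validateNamespacePrefix("notselfi"); // rejected before any scheduling happens
  }
}
{code}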
> {code}
> 2019-02-08 16:51:27,548 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.SingleConstraintAppPlacementAllocator:
>  Failed to query node cardinality:
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.InvalidAllocationTagsQueryException:
>  Invalid namespace prefix: notselfi, valid values are: 
> all,not-self,app-id,app-tag,self
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.TargetApplicationsNamespace.fromString(TargetApplicationsNamespace.java:277)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.TargetApplicationsNamespace.parse(TargetApplicationsNamespace.java:234)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.AllocationTags.createAllocationTags(AllocationTags.java:93)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfySingleConstraintExpression(PlacementConstraintsUtil.java:78)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfySingleConstraint(PlacementConstraintsUtil.java:240)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfyConstraints(PlacementConstraintsUtil.java:321)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfyAndConstraint(PlacementConstraintsUtil.java:272)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfyConstraints(PlacementConstraintsUtil.java:324)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfyConstraints(PlacementConstraintsUtil.java:365)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.SingleConstraintAppPlacementAllocator.checkCardinalityAndPending(SingleConstraintAppPlacementAllocator.java:355)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.SingleConstraintAppPlacementAllocator.precheckNode(SingleConstraintAppPlacementAllocator.java:395)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.precheckNode(AppSchedulingInfo.java:779)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.preCheckForNodeCandidateSet(RegularContainerAllocator.java:145)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:837)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:890)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:977)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1173)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1630)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1624)
>   at 
> 

[jira] [Commented] (YARN-9208) Distributed shell allow LocalResourceVisibility to be specified

2019-02-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772683#comment-16772683
 ] 

Prabhu Joseph commented on YARN-9208:
-

[~bibinchundatt] I have changed the pattern to 
{{(PUBLIC=FileName1,FileName2,,),(PRIVATE=FileName3,FileName4,,),,}}. If only a 
PRIVATE file hdfs:/tmp/a is present, the pattern will be 
(PRIVATE=hdfs:/tmp/a). Can you review it?
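A minimal sketch of how such a pattern could be parsed (illustrative only; the
attached patch may well take a different approach, and the APPLICATION
visibility is included here only because it is part of LocalResourceVisibility):

{code}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VisibilityPatternParser {
  // Each group looks like (PUBLIC=file1,file2) or (PRIVATE=hdfs:/tmp/a).
  private static final Pattern GROUP =
      Pattern.compile("\\((PUBLIC|PRIVATE|APPLICATION)=([^)]*)\\)");

  static Map<String, List<String>> parse(String spec) {
    Map<String, List<String>> result = new LinkedHashMap<>();
    Matcher m = GROUP.matcher(spec);
    while (m.find()) {
      List<String> files = new ArrayList<>();
      for (String f : m.group(2).split(",")) {
        if (!f.isEmpty()) {        // skip empty entries from trailing commas
          files.add(f);
        }
      }
      result.put(m.group(1), files);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parse("(PRIVATE=hdfs:/tmp/a)")); // {PRIVATE=[hdfs:/tmp/a]}
  }
}
{code}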

> Distributed shell allow LocalResourceVisibility to be specified
> ---
>
> Key: YARN-9208
> URL: https://issues.apache.org/jira/browse/YARN-9208
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bibin A Chundatt
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: YARN-9208-001.patch, YARN-9208-002.patch, 
> YARN-9208-003.patch, YARN-9208-004.patch
>
>
> YARN-9008 added a feature to specify a list of files to be localized.
> It would be great to be able to specify the visibility type too, which would 
> allow testing of the PRIVATE and PUBLIC types as well.






[jira] [Comment Edited] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries

2019-02-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772665#comment-16772665
 ] 

Prabhu Joseph edited comment on YARN-8132 at 2/20/19 6:47 AM:
--

[~bibinchundatt] The existing test case {{TestRMAppTransitions#testAppNewKill}} 
covers the scenario: the {{currentAttempt}} is not created (null) and the 
{{RMAppImpl}} StateMachine currentState transitions properly to KILLED. The 
issue happens only when the job is killed after an attempt has been created, 
because the attempt's {{finalStatus}} is not updated.


was (Author: prabhu joseph):
[~bibinchundatt] The existing test case TestRMAppTransitions#testAppNewKill 
covers the scenario. The currentAttempt is not created (Null) and the 
StateMachine currentState is transitioned properly to KILLED. The issue happens 
only when the job is killed after attempt is created as the attempt finalStatus 
is not updated.

> Final Status of applications shown as UNDEFINED in ATS app queries
> --
>
> Key: YARN-8132
> URL: https://issues.apache.org/jira/browse/YARN-8132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2, timelineservice
>Reporter: Charan Hebri
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-8132-001.patch, YARN-8132-002.patch, 
> YARN-8132-003.patch, YARN-8132-004.patch
>
>
> Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. A 
> sample request/response with INFO field for an application,
> {noformat}
> 2018-04-09 13:10:02,126 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1693)) - Received URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user 
> hrt_qa
> 2018-04-09 13:10:02,156 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1716)) - Processed URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 
> ms.){noformat}
> {noformat}
> {
>   "metrics": [],
>   "events": [],
>   "createdtime": 1523263360719,
>   "idprefix": 0,
>   "id": "application_1523259757659_0003",
>   "type": "YARN_APPLICATION",
>   "info": {
> "YARN_APPLICATION_CALLER_CONTEXT": "CLI",
> "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application 
> application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX",
> "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED",
> "YARN_APPLICATION_NAME": "Sleep job",
> "YARN_APPLICATION_USER": "hrt_qa",
> "YARN_APPLICATION_UNMANAGED_APPLICATION": false,
> "FROM_ID": 
> "yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003",
> "UID": "yarn-cluster!application_1523259757659_0003",
> "YARN_APPLICATION_VIEW_ACLS": " ",
> "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718,
> "YARN_AM_CONTAINER_LAUNCH_COMMAND": [
>   "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp 
> -Dlog4j.configuration=container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 
> -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog 
> -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout 
> 2>/stderr "
> ],
> "YARN_APPLICATION_QUEUE": "default",
> "YARN_APPLICATION_TYPE": "MAPREDUCE",
> "YARN_APPLICATION_PRIORITY": 0,
> "YARN_APPLICATION_LATEST_APP_ATTEMPT": 
> "appattempt_1523259757659_0003_01",
> "YARN_APPLICATION_TAGS": [
>   "timeline_flow_name_tag:test_flow"
> ],
> "YARN_APPLICATION_STATE": "KILLED"
>   },
>   "configs": {},
>   "isrelatedto": {},
>   "relatesto": {}
> }{noformat}
> This is different to what the Resource Manager reports. For KILLED 
> applications the final status is KILLED and for FAILED applications it is 
> FAILED. This behavior is seen in ATSv2 as well as older versions of ATS. 






[jira] [Updated] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries

2019-02-19 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8132:

Attachment: YARN-8132-004.patch

> Final Status of applications shown as UNDEFINED in ATS app queries
> --
>
> Key: YARN-8132
> URL: https://issues.apache.org/jira/browse/YARN-8132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2, timelineservice
>Reporter: Charan Hebri
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-8132-001.patch, YARN-8132-002.patch, 
> YARN-8132-003.patch, YARN-8132-004.patch
>
>
> Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. A 
> sample request/response with INFO field for an application,
> {noformat}
> 2018-04-09 13:10:02,126 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1693)) - Received URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user 
> hrt_qa
> 2018-04-09 13:10:02,156 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1716)) - Processed URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 
> ms.){noformat}
> {noformat}
> {
>   "metrics": [],
>   "events": [],
>   "createdtime": 1523263360719,
>   "idprefix": 0,
>   "id": "application_1523259757659_0003",
>   "type": "YARN_APPLICATION",
>   "info": {
> "YARN_APPLICATION_CALLER_CONTEXT": "CLI",
> "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application 
> application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX",
> "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED",
> "YARN_APPLICATION_NAME": "Sleep job",
> "YARN_APPLICATION_USER": "hrt_qa",
> "YARN_APPLICATION_UNMANAGED_APPLICATION": false,
> "FROM_ID": 
> "yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003",
> "UID": "yarn-cluster!application_1523259757659_0003",
> "YARN_APPLICATION_VIEW_ACLS": " ",
> "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718,
> "YARN_AM_CONTAINER_LAUNCH_COMMAND": [
>   "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp 
> -Dlog4j.configuration=container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 
> -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog 
> -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout 
> 2>/stderr "
> ],
> "YARN_APPLICATION_QUEUE": "default",
> "YARN_APPLICATION_TYPE": "MAPREDUCE",
> "YARN_APPLICATION_PRIORITY": 0,
> "YARN_APPLICATION_LATEST_APP_ATTEMPT": 
> "appattempt_1523259757659_0003_01",
> "YARN_APPLICATION_TAGS": [
>   "timeline_flow_name_tag:test_flow"
> ],
> "YARN_APPLICATION_STATE": "KILLED"
>   },
>   "configs": {},
>   "isrelatedto": {},
>   "relatesto": {}
> }{noformat}
> This is different to what the Resource Manager reports. For KILLED 
> applications the final status is KILLED and for FAILED applications it is 
> FAILED. This behavior is seen in ATSv2 as well as older versions of ATS. 






[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772672#comment-16772672
 ] 

Hadoop QA commented on YARN-8821:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 5 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 34s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 43s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 20m 36s{color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 71m 42s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8821 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12959367/YARN-8821-trunk.010.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 29f8dd684c95 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 1d30fd9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/23448/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23448/testReport/ |
| Max. process+thread count | 308 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 

[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengyongshe updated YARN-9314:
--
Attachment: (was: 屏幕快照 2019-02-20 下午2.24.26.png)

> Fair Scheduler: Queue Info mistake when configured same queue name at same 
> level
> 
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: fengyongshe
>Priority: Major
> Fix For: 3.1.2
>
> Attachments: Fair Scheduler Mistake when configured same queue at 
> same level.png
>
>
> The Queue Info is configured in fair-scheduler.xml like below
> 
>       {color:#ff}{color}
>           3072mb,3vcores
>          4096mb,4vcores
>           
>                1024mb,1vcores
>               2048mb,2vcores
>                Charlie
>            
>        
>       {color:#ff}{color}
>            1024mb,1vcores
>            2048mb,2vcores
>        
>  
> The queue root.deva configured last will override the existing root.deva 
> in root.deva.sample, as shown in the attachment.
>  root.deva
> ||Used Resources:||
> ||Min Resources:|.   => should be <3072mb,3vcore>|
> ||Max Resources:|.    => should be<4096mb,4vcores>|
> ||Reserved Resources:||
> ||Steady Fair Share:||
> ||Instantaneous Fair Share:||
> root.deva.sample
> ||Min Resources:||
> ||Max Resources:||
> ||Reserved Resources:||
> ||Steady Fair Share:||
>      
>  






[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengyongshe updated YARN-9314:
--
Affects Version/s: 3.1.0

> Fair Scheduler: Queue Info mistake when configured same queue name at same 
> level
> 
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: fengyongshe
>Priority: Major
> Attachments: Fair Scheduler Mistake when configured same queue at 
> same level.png
>
>
> The Queue Info is configured in fair-scheduler.xml like below
> 
>       {color:#ff}{color}
>           3072mb,3vcores
>          4096mb,4vcores
>           
>                1024mb,1vcores
>               2048mb,2vcores
>                Charlie
>            
>        
>       {color:#ff}{color}
>            1024mb,1vcores
>            2048mb,2vcores
>        
>  
> The queue root.deva configured last will override the existing root.deva 
> in root.deva.sample, as shown in the attachment.
>  root.deva
> ||Used Resources:||
> ||Min Resources:|.   => should be <3072mb,3vcore>|
> ||Max Resources:|.    => should be<4096mb,4vcores>|
> ||Reserved Resources:||
> ||Steady Fair Share:||
> ||Instantaneous Fair Share:||
> root.deva.sample
> ||Min Resources:||
> ||Max Resources:||
> ||Reserved Resources:||
> ||Steady Fair Share:||
>      
>  






[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengyongshe updated YARN-9314:
--
Description: 
The Queue Info is configured in fair-scheduler.xml like below


      {color:#ff}{color}
          3072mb,3vcores
         4096mb,4vcores
          
               1024mb,1vcores
              2048mb,2vcores
               Charlie
           
       
      {color:#ff}{color}
           1024mb,1vcores
           2048mb,2vcores
       
 

The queue root.deva configured last will override the existing root.deva 
in root.deva.sample, as shown in the attachment.

 root.deva
||Used Resources:||
||Min Resources:|.   => should be <3072mb,3vcore>|
||Max Resources:|.    => should be<4096mb,4vcores>|
||Reserved Resources:||
||Steady Fair Share:||
||Instantaneous Fair Share:||

root.deva.sample
||Min Resources:||
||Max Resources:||
||Reserved Resources:||
||Steady Fair Share:||

     

 

  was:
The Queue Info is configured in fair-scheduler.xml like below


     {color:#FF}{color}
         3072mb,3vcores
        4096mb,4vcores
         
              1024mb,1vcores
             2048mb,2vcores
              Charlie
          
      
     {color:#FF}{color}
          1024mb,1vcores
          2048mb,2vcores
      


The queue root.deva configured last will override the existing root.deva 
in root.deva.sample, like this:

 

 


> Fair Scheduler: Queue Info mistake when configured same queue name at same 
> level
> 
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: fengyongshe
>Priority: Major
> Fix For: 3.1.2
>
> Attachments: 屏幕快照 2019-02-20 下午2.24.26.png
>
>
> The Queue Info is configured in fair-scheduler.xml like below
> 
>       {color:#ff}{color}
>           3072mb,3vcores
>          4096mb,4vcores
>           
>                1024mb,1vcores
>               2048mb,2vcores
>                Charlie
>            
>        
>       {color:#ff}{color}
>            1024mb,1vcores
>            2048mb,2vcores
>        
>  
> The queue root.deva configured last will override the existing root.deva 
> in root.deva.sample, as shown in the attachment.
>  root.deva
> ||Used Resources:||
> ||Min Resources:|.   => should be <3072mb,3vcore>|
> ||Max Resources:|.    => should be<4096mb,4vcores>|
> ||Reserved Resources:||
> ||Steady Fair Share:||
> ||Instantaneous Fair Share:||
> root.deva.sample
> ||Min Resources:||
> ||Max Resources:||
> ||Reserved Resources:||
> ||Steady Fair Share:||
>      
>  






[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengyongshe updated YARN-9314:
--
Description: 
The Queue Info is configured in fair-scheduler.xml like below


      {color:#ff}{color}
          3072mb,3vcores
         4096mb,4vcores
          
               1024mb,1vcores
              2048mb,2vcores
               Charlie
           
       
      {color:#ff}{color}
           1024mb,1vcores
           2048mb,2vcores
       
 

The queue root.deva configured last will override the existing root.deva 
in root.deva.sample, as shown in the attachment.
 
  root.deva
||Used Resources:||
||Min Resources:|.  => should be <3072mb,3vcores>|
||Max Resources:|.  => should be <4096mb,4vcores>|
||Reserved Resources:||
||Steady Fair Share:||
||Instantaneous Fair Share:||
 
root.deva.sample
||Min Resources:||
||Max Resources:||
||Reserved Resources:||
||Steady Fair Share:||

     

 

  was:
The Queue Info is configured in fair-scheduler.xml like below


      {color:#ff}{color}
          3072mb,3vcores
         4096mb,4vcores
          
               1024mb,1vcores
              2048mb,2vcores
               Charlie
           
       
      {color:#ff}{color}
           1024mb,1vcores
           2048mb,2vcores
       
 

The queue root.deva configured last will override the existing root.deva 
in root.deva.sample, as shown in the attachment.

 root.deva
||Used Resources:||
||Min Resources:|.  {color:#d04437} => should be 
<3072mb,3vcore>{color}|
||Max Resources:|.    {color:#d04437}=> should 
be<4096mb,4vcores>{color}|
||Reserved Resources:||
||Steady Fair Share:||
||Instantaneous Fair Share:||

root.deva.sample
||Min Resources:||
||Max Resources:||
||Reserved Resources:||
||Steady Fair Share:||

     

 


> Fair Scheduler: Queue Info mistake when configured same queue name at same 
> level
> 
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: fengyongshe
>Priority: Major
> Attachments: Fair Scheduler Mistake when configured same queue at 
> same level.png
>
>
> The Queue Info is configured in fair-scheduler.xml like below
> 
>       {color:#ff}{color}
>           3072mb,3vcores
>          4096mb,4vcores
>           
>                1024mb,1vcores
>               2048mb,2vcores
>                Charlie
>            
>        
>       {color:#ff}{color}
>            1024mb,1vcores
>            2048mb,2vcores
>        
>  
> The queue root.deva configured last will override the existing root.deva 
> in root.deva.sample, as shown in the attachment.
>  
>   root.deva
> ||Used Resources:||
> ||Min Resources:|.  => should be <3072mb,3vcores>|
> ||Max Resources:|.  => should be <4096mb,4vcores>|
> ||Reserved Resources:||
> ||Steady Fair Share:||
> ||Instantaneous Fair Share:||
>  
> root.deva.sample
> ||Min Resources:||
> ||Max Resources:||
> ||Reserved Resources:||
> ||Steady Fair Share:||
>      
>  






[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengyongshe updated YARN-9314:
--
Attachment: (was: 屏幕快照 2019-02-20 下午2.24.26.png)

> Fair Scheduler: Queue Info mistake when configured same queue name at same 
> level
> 
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: fengyongshe
>Priority: Major
> Fix For: 3.1.2
>
>
> The Queue Info is configured in fair-scheduler.xml like below
> 
>      {color:#FF}{color}
>          3072mb,3vcores
>         4096mb,4vcores
>          
>               1024mb,1vcores
>              2048mb,2vcores
>               Charlie
>           
>       
>      {color:#FF}{color}
>           1024mb,1vcores
>           2048mb,2vcores
>       
> 
> The queue root.deva configured last will override the existing root.deva 
> in root.deva.sample, like this:
>  
>  






[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengyongshe updated YARN-9314:
--
Description: 
The Queue Info is configured in fair-scheduler.xml like below


      {color:#ff}{color}
          3072mb,3vcores
         4096mb,4vcores
          
               1024mb,1vcores
              2048mb,2vcores
               Charlie
           
       
      {color:#ff}{color}
           1024mb,1vcores
           2048mb,2vcores
       
 

The queue root.deva configured last will override the existing root.deva 
in root.deva.sample, as shown in the attachment.

 root.deva
||Used Resources:||
||Min Resources:|.  {color:#d04437} => should be 
<3072mb,3vcore>{color}|
||Max Resources:|.    {color:#d04437}=> should 
be<4096mb,4vcores>{color}|
||Reserved Resources:||
||Steady Fair Share:||
||Instantaneous Fair Share:||

root.deva.sample
||Min Resources:||
||Max Resources:||
||Reserved Resources:||
||Steady Fair Share:||

     

 

  was:
The Queue Info is configured in fair-scheduler.xml like below


      {color:#ff}{color}
          3072mb,3vcores
         4096mb,4vcores
          
               1024mb,1vcores
              2048mb,2vcores
               Charlie
           
       
      {color:#ff}{color}
           1024mb,1vcores
           2048mb,2vcores
       
 

The queue root.deva configured last will override the existing root.deva 
in root.deva.sample, as shown in the attachment.

 root.deva
||Used Resources:||
||Min Resources:|.   => should be <3072mb,3vcore>|
||Max Resources:|.    => should be<4096mb,4vcores>|
||Reserved Resources:||
||Steady Fair Share:||
||Instantaneous Fair Share:||

root.deva.sample
||Min Resources:||
||Max Resources:||
||Reserved Resources:||
||Steady Fair Share:||

     

 


> Fair Scheduler: Queue Info mistake when configured same queue name at same 
> level
> 
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: fengyongshe
>Priority: Major
> Attachments: Fair Scheduler Mistake when configured same queue at 
> same level.png
>
>
> The Queue Info is configured in fair-scheduler.xml like below
> 
>       {color:#ff}{color}
>           3072mb,3vcores
>          4096mb,4vcores
>           
>                1024mb,1vcores
>               2048mb,2vcores
>                Charlie
>            
>        
>       {color:#ff}{color}
>            1024mb,1vcores
>            2048mb,2vcores
>        
>  
> The queue root.deva configured last will override the existing root.deva 
> in root.deva.sample, as shown in the attachment.
>  root.deva
> ||Used Resources:||
> ||Min Resources:|.  {color:#d04437} => should be 
> <3072mb,3vcore>{color}|
> ||Max Resources:|.    {color:#d04437}=> should 
> be<4096mb,4vcores>{color}|
> ||Reserved Resources:||
> ||Steady Fair Share:||
> ||Instantaneous Fair Share:||
> root.deva.sample
> ||Min Resources:||
> ||Max Resources:||
> ||Reserved Resources:||
> ||Steady Fair Share:||
>      
>  






[jira] [Commented] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries

2019-02-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772665#comment-16772665
 ] 

Prabhu Joseph commented on YARN-8132:
-

[~bibinchundatt] The existing test case TestRMAppTransitions#testAppNewKill 
covers the scenario. The currentAttempt is not created (Null) and the 
StateMachine currentState is transitioned properly to KILLED. The issue happens 
only when the job is killed after attempt is created as the attempt finalStatus 
is not updated.

> Final Status of applications shown as UNDEFINED in ATS app queries
> --
>
> Key: YARN-8132
> URL: https://issues.apache.org/jira/browse/YARN-8132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2, timelineservice
>Reporter: Charan Hebri
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-8132-001.patch, YARN-8132-002.patch, 
> YARN-8132-003.patch
>
>
> Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. A 
> sample request/response with INFO field for an application,
> {noformat}
> 2018-04-09 13:10:02,126 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1693)) - Received URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user 
> hrt_qa
> 2018-04-09 13:10:02,156 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1716)) - Processed URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 
> ms.){noformat}
> {noformat}
> {
>   "metrics": [],
>   "events": [],
>   "createdtime": 1523263360719,
>   "idprefix": 0,
>   "id": "application_1523259757659_0003",
>   "type": "YARN_APPLICATION",
>   "info": {
> "YARN_APPLICATION_CALLER_CONTEXT": "CLI",
> "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application 
> application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX",
> "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED",
> "YARN_APPLICATION_NAME": "Sleep job",
> "YARN_APPLICATION_USER": "hrt_qa",
> "YARN_APPLICATION_UNMANAGED_APPLICATION": false,
> "FROM_ID": 
> "yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003",
> "UID": "yarn-cluster!application_1523259757659_0003",
> "YARN_APPLICATION_VIEW_ACLS": " ",
> "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718,
> "YARN_AM_CONTAINER_LAUNCH_COMMAND": [
>   "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp 
> -Dlog4j.configuration=container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 
> -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog 
> -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout 
> 2>/stderr "
> ],
> "YARN_APPLICATION_QUEUE": "default",
> "YARN_APPLICATION_TYPE": "MAPREDUCE",
> "YARN_APPLICATION_PRIORITY": 0,
> "YARN_APPLICATION_LATEST_APP_ATTEMPT": 
> "appattempt_1523259757659_0003_01",
> "YARN_APPLICATION_TAGS": [
>   "timeline_flow_name_tag:test_flow"
> ],
> "YARN_APPLICATION_STATE": "KILLED"
>   },
>   "configs": {},
>   "isrelatedto": {},
>   "relatesto": {}
> }{noformat}
> This is different to what the Resource Manager reports. For KILLED 
> applications the final status is KILLED and for FAILED applications it is 
> FAILED. This behavior is seen in ATSv2 as well as older versions of ATS. 






[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengyongshe updated YARN-9314:
--
Attachment: Fair Scheduler Mistake when configured same queue at same 
level.png

> Fair Scheduler: Queue Info mistake when configured same queue name at same 
> level
> 
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: fengyongshe
>Priority: Major
> Fix For: 3.1.2
>
> Attachments: Fair Scheduler Mistake when configured same queue at 
> same level.png
>
>
> The Queue Info is configured in fair-scheduler.xml like below
> 
>       {color:#ff}{color}
>           3072mb,3vcores
>          4096mb,4vcores
>           
>                1024mb,1vcores
>               2048mb,2vcores
>                Charlie
>            
>        
>       {color:#ff}{color}
>            1024mb,1vcores
>            2048mb,2vcores
>        
>  
> The queue root.deva configured last will override the existing root.deva 
> in root.deva.sample, as shown in the attachment.
>  root.deva
> ||Used Resources:||
> ||Min Resources:|.   => should be <3072mb,3vcore>|
> ||Max Resources:|.    => should be<4096mb,4vcores>|
> ||Reserved Resources:||
> ||Steady Fair Share:||
> ||Instantaneous Fair Share:||
> root.deva.sample
> ||Min Resources:||
> ||Max Resources:||
> ||Reserved Resources:||
> ||Steady Fair Share:||
>      
>  






[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengyongshe updated YARN-9314:
--
Fix Version/s: (was: 3.1.2)

> Fair Scheduler: Queue Info mistake when configured same queue name at same 
> level
> 
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: fengyongshe
>Priority: Major
> Attachments: Fair Scheduler Mistake when configured same queue at 
> same level.png
>
>
> The Queue Info is configured in fair-scheduler.xml like below:
> <allocations>
>   <queue name="deva">
>     <minResources>3072mb,3vcores</minResources>
>     <maxResources>4096mb,4vcores</maxResources>
>     <queue name="sample">
>       <minResources>1024mb,1vcores</minResources>
>       <maxResources>2048mb,2vcores</maxResources>
>       <aclSubmitApps>Charlie</aclSubmitApps>
>     </queue>
>   </queue>
>   <queue name="deva">
>     <minResources>1024mb,1vcores</minResources>
>     <maxResources>2048mb,2vcores</maxResources>
>   </queue>
> </allocations>
> The queue root.deva configured last will override the existing root.deva 
> (and its child root.deva.sample), as shown in the attachment:
> root.deva
> ||Used Resources:||
> ||Min Resources:|. => should be <3072mb,3vcores>|
> ||Max Resources:|. => should be <4096mb,4vcores>|
> ||Reserved Resources:||
> ||Steady Fair Share:||
> ||Instantaneous Fair Share:||
> root.deva.sample
> ||Min Resources:||
> ||Max Resources:||
> ||Reserved Resources:||
> ||Steady Fair Share:||
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengyongshe updated YARN-9314:
--
Attachment: 屏幕快照 2019-02-20 下午2.24.26.png

> Fair Scheduler: Queue Info mistake when configured same queue name at same 
> level
> 
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: fengyongshe
>Priority: Major
> Fix For: 3.1.2
>
> Attachments: 屏幕快照 2019-02-20 下午2.24.26.png
>
>
> The Queue Info is configured in fair-scheduler.xml like below:
> <allocations>
>   <queue name="deva">
>     <minResources>3072mb,3vcores</minResources>
>     <maxResources>4096mb,4vcores</maxResources>
>     <queue name="sample">
>       <minResources>1024mb,1vcores</minResources>
>       <maxResources>2048mb,2vcores</maxResources>
>       <aclSubmitApps>Charlie</aclSubmitApps>
>     </queue>
>   </queue>
>   <queue name="deva">
>     <minResources>1024mb,1vcores</minResources>
>     <maxResources>2048mb,2vcores</maxResources>
>   </queue>
> </allocations>
> The queue root.deva configured last will override the existing root.deva 
> (and its child root.deva.sample), as shown in the attachment:
> root.deva
> ||Used Resources:||
> ||Min Resources:|. => should be <3072mb,3vcores>|
> ||Max Resources:|. => should be <4096mb,4vcores>|
> ||Reserved Resources:||
> ||Steady Fair Share:||
> ||Instantaneous Fair Share:||
> root.deva.sample
> ||Min Resources:||
> ||Max Resources:||
> ||Reserved Resources:||
> ||Steady Fair Share:||
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level

2019-02-19 Thread fengyongshe (JIRA)
fengyongshe created YARN-9314:
-

 Summary: Fair Scheduler: Queue Info mistake when configured same 
queue name at same level
 Key: YARN-9314
 URL: https://issues.apache.org/jira/browse/YARN-9314
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: fengyongshe
 Fix For: 3.1.2


The Queue Info is configured in fair-scheduler.xml like below:

<allocations>
  <queue name="deva">
    <minResources>3072mb,3vcores</minResources>
    <maxResources>4096mb,4vcores</maxResources>
    <queue name="sample">
      <minResources>1024mb,1vcores</minResources>
      <maxResources>2048mb,2vcores</maxResources>
      <aclSubmitApps>Charlie</aclSubmitApps>
    </queue>
  </queue>
  <queue name="deva">
    <minResources>1024mb,1vcores</minResources>
    <maxResources>2048mb,2vcores</maxResources>
  </queue>
</allocations>


The queue root.deva configured last will override the existing root.deva (and 
its child root.deva.sample), like this:

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772640#comment-16772640
 ] 

Zhankun Tang commented on YARN-8821:


[~cheersyang], thanks for the review!
{quote}1. {{NvidiaGPUPluginForRuntimeV2#topologyAwareSchedule}}

IIRC, at lines 396 and 402 they sort all combinations for a given count of 
devices every time. Why not just maintain an ordered list of these combinations 
in the map, so it only needs to be sorted once (when the cost table is 
initialized)?
{quote}

Zhankun=> Good point! Yeah, I changed the value type of costTable to a list of 
map entries, and when constructing costTable the list is sorted by cost value 
in ascending order. When doing topology scheduling we use an iterator over the 
list: for the PACK policy we just loop with the ascending iterator, but for the 
SPREAD policy we switch to a descending iterator.
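For illustration only, here is a minimal sketch of that structure (class and 
method names are mine, not necessarily those in the patch):
{code:java}
import java.util.*;

public class CostTableSketch {
  // The cost table maps a GPU count to a list of (device combination -> cost)
  // entries; each list is sorted once, in ascending cost order, when the
  // table is built.
  private final Map<Integer, List<Map.Entry<Set<Integer>, Integer>>> costTable =
      new HashMap<>();

  void sortOnceAtConstruction() {
    for (List<Map.Entry<Set<Integer>, Integer>> combos : costTable.values()) {
      combos.sort(Map.Entry.comparingByValue());
    }
  }

  // PACK walks from the cheapest combination (fast GPU-GPU links),
  // SPREAD walks from the most expensive one.
  Iterator<Map.Entry<Set<Integer>, Integer>> combinations(int count,
      boolean pack) {
    List<Map.Entry<Set<Integer>, Integer>> combos = costTable.get(count);
    return pack ? combos.iterator()
        : new LinkedList<>(combos).descendingIterator();
  }
}
{code}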

2,3,4,5 are fixed.

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch, 
> YARN-8821-trunk.010.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481, but 
> we'd like to move the related discussions here.
> Please note that YARN-8851 will provide a pluggable device framework which 
> supports plugging in a custom scheduler. Based on that framework, the GPU 
> plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements a topology algorithm as follows:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are all pairs of GPUs and whose values are the 
> communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs and caches a _+cost table+_ which holds all 
> combinations of GPUs and the corresponding cost between them. The cost table 
> is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the count of GPUs; its value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication cost 
> of that combination. The cost of a combination is the sum of the costs of all 
> non-duplicate pairs of GPUs in it. For instance, the total cost of GPUs 
> [0,1,2] is the sum of the costs "0 - 1", "0 - 2" and "1 - 2", and each pair 
> cost can be looked up in the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" prefers faster GPU-GPU communication; "SPREAD" prefers 
> faster CPU-GPU communication (since the GPUs then do not share the same bus 
> to the CPU). The key difference between the two policies is the sort order of 
> the inner map in the cost table. For instance, let's assume 2 GPUs are 
> wanted. costTable.get(2) returns a map containing all combinations of two 
> GPUs and their costs. If the policy is "PACK", we sort the map by cost in 
> ascending order, so the first entry is the pair of GPUs with the minimum 
> GPU-GPU cost. If the policy is "SPREAD", we sort it in descending order and 
> take the first entry, which has the highest GPU-GPU cost and therefore the 
> lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy) based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3) has been done. The figure below shows the performance gain of the 
> topology scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination can get a *5% to 185%* *performance gain* among the test cases 
> with various factors including CNN model, batch size, GPU subset, etc. The 
> scheduling algorithm should track this as closely as possible.
> 2. The "inception3" and "resnet50" networks do not seem topology sensitive. 
> Topology scheduling can only potentially get *about 6.8% 

[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes

2019-02-19 Thread Zhaohui Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772638#comment-16772638
 ] 

Zhaohui Xin edited comment on YARN-9278 at 2/20/19 5:32 AM:


Hi, [~yufeigu]. When the preemption thread satisfies a starved container with 
ANY as the resource name, it searches all nodes of the cluster for the best 
node. This will be costly when the cluster has more than 10k nodes.

I think we should limit the number of nodes considered in such a situation. 
What do you think? :D


was (Author: uranus):
Hi, [~yufeigu]. When preemption thread satisfies a starved container with ANY 
as resource name, it will find a best node in all nodes of this cluster. This 
will be costly when this cluster has more than 10k nodes.

I think we should limit the number of nodes in such a situation. How do you 
think this? :D

 

> Shuffle nodes when selecting to be preempted nodes
> --
>
> Key: YARN-9278
> URL: https://issues.apache.org/jira/browse/YARN-9278
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Zhaohui Xin
>Assignee: Zhaohui Xin
>Priority: Major
>
> We should *shuffle* the nodes to avoid some nodes being preempted frequently. 
> Also, we should *limit* the number of nodes to make preemption more efficient.
> Just like this:
> {code:java}
> // we should not iterate over all nodes, that would be very slow
> long maxTryNodeNum =
>     context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce();
> if (potentialNodes.size() > maxTryNodeNum) {
>   Collections.shuffle(potentialNodes);
>   List newPotentialNodes = new ArrayList();
>   for (int i = 0; i < maxTryNodeNum; i++) {
>     newPotentialNodes.add(potentialNodes.get(i));
>   }
>   potentialNodes = newPotentialNodes;
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9278) Shuffle nodes when selecting to be preempted nodes

2019-02-19 Thread Zhaohui Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772638#comment-16772638
 ] 

Zhaohui Xin commented on YARN-9278:
---

Hi, [~yufeigu]. When the preemption thread satisfies a starved container with 
ANY as the resource name, it searches all nodes of the cluster for the best 
node. This will be costly when the cluster has more than 10k nodes.

I think we should limit the number of nodes considered in such a situation. 
What do you think? :D
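
As a sketch of that idea (reusing the names from the snippet in the description 
below; this is not an actual patch), the cap could look like:
{code:java}
// Shuffle the candidate nodes, then keep at most maxTryNodeNum of them.
if (potentialNodes.size() > maxTryNodeNum) {
  Collections.shuffle(potentialNodes);
  potentialNodes =
      new ArrayList<>(potentialNodes.subList(0, (int) maxTryNodeNum));
}
{code}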

 

> Shuffle nodes when selecting to be preempted nodes
> --
>
> Key: YARN-9278
> URL: https://issues.apache.org/jira/browse/YARN-9278
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: fairscheduler
>Reporter: Zhaohui Xin
>Assignee: Zhaohui Xin
>Priority: Major
>
> We should *shuffle* the nodes to avoid some nodes being preempted frequently. 
> Also, we should *limit* the number of nodes to make preemption more efficient.
> Just like this:
> {code:java}
> // we should not iterate over all nodes, that would be very slow
> long maxTryNodeNum =
>     context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce();
> if (potentialNodes.size() > maxTryNodeNum) {
>   Collections.shuffle(potentialNodes);
>   List newPotentialNodes = new ArrayList();
>   for (int i = 0; i < maxTryNodeNum; i++) {
>     newPotentialNodes.add(potentialNodes.get(i));
>   }
>   potentialNodes = newPotentialNodes;
> }
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8821:
---
Attachment: YARN-8821-trunk.010.patch

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch, 
> YARN-8821-trunk.010.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481, but 
> we'd like to move the related discussions here.
> Please note that YARN-8851 will provide a pluggable device framework which 
> supports plugging in a custom scheduler. Based on that framework, the GPU 
> plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements a topology algorithm as follows:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are all pairs of GPUs and whose values are the 
> communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs and caches a _+cost table+_ which holds all 
> combinations of GPUs and the corresponding cost between them. The cost table 
> is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the count of GPUs; its value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication cost 
> of that combination. The cost of a combination is the sum of the costs of all 
> non-duplicate pairs of GPUs in it. For instance, the total cost of GPUs 
> [0,1,2] is the sum of the costs "0 - 1", "0 - 2" and "1 - 2", and each pair 
> cost can be looked up in the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" prefers faster GPU-GPU communication; "SPREAD" prefers 
> faster CPU-GPU communication (since the GPUs then do not share the same bus 
> to the CPU). The key difference between the two policies is the sort order of 
> the inner map in the cost table. For instance, let's assume 2 GPUs are 
> wanted. costTable.get(2) returns a map containing all combinations of two 
> GPUs and their costs. If the policy is "PACK", we sort the map by cost in 
> ascending order, so the first entry is the pair of GPUs with the minimum 
> GPU-GPU cost. If the policy is "SPREAD", we sort it in descending order and 
> take the first entry, which has the highest GPU-GPU cost and therefore the 
> lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy) based on performance tests on an AWS EC2 instance with 8 GPU cards 
> (P3) has been done. The figure below shows the performance gain of the 
> topology scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best GPU 
> combination can get a *5% to 185%* *performance gain* among the test cases 
> with various factors including CNN model, batch size, GPU subset, etc. The 
> scheduling algorithm should track this as closely as possible.
> 2. The "inception3" and "resnet50" networks do not seem topology sensitive. 
> Topology scheduling can only potentially get *about 6.8% to 10%* speedup in 
> the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *6.8% to 177.1%* performance gain in the best cases. On average, it also 
> outperforms the median performance (0.8% to 28.2%).
> 4. *The algorithm's allocations match the fastest GPUs needed by "vgg16" 
> best*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *That is roughly a maximum of 3X compared to a random GPU scheduling 
> algorithm in a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  
> 

[jira] [Commented] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries

2019-02-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772635#comment-16772635
 ] 

Prabhu Joseph commented on YARN-8132:
-

[~bibinchundatt] Yes, working on it, will update.

> Final Status of applications shown as UNDEFINED in ATS app queries
> --
>
> Key: YARN-8132
> URL: https://issues.apache.org/jira/browse/YARN-8132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2, timelineservice
>Reporter: Charan Hebri
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-8132-001.patch, YARN-8132-002.patch, 
> YARN-8132-003.patch
>
>
> Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. A 
> sample request/response with INFO field for an application,
> {noformat}
> 2018-04-09 13:10:02,126 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1693)) - Received URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user 
> hrt_qa
> 2018-04-09 13:10:02,156 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1716)) - Processed URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 
> ms.){noformat}
> {noformat}
> {
>   "metrics": [],
>   "events": [],
>   "createdtime": 1523263360719,
>   "idprefix": 0,
>   "id": "application_1523259757659_0003",
>   "type": "YARN_APPLICATION",
>   "info": {
> "YARN_APPLICATION_CALLER_CONTEXT": "CLI",
> "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application 
> application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX",
> "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED",
> "YARN_APPLICATION_NAME": "Sleep job",
> "YARN_APPLICATION_USER": "hrt_qa",
> "YARN_APPLICATION_UNMANAGED_APPLICATION": false,
> "FROM_ID": 
> "yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003",
> "UID": "yarn-cluster!application_1523259757659_0003",
> "YARN_APPLICATION_VIEW_ACLS": " ",
> "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718,
> "YARN_AM_CONTAINER_LAUNCH_COMMAND": [
>   "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp 
> -Dlog4j.configuration=container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 
> -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog 
> -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout 
> 2>/stderr "
> ],
> "YARN_APPLICATION_QUEUE": "default",
> "YARN_APPLICATION_TYPE": "MAPREDUCE",
> "YARN_APPLICATION_PRIORITY": 0,
> "YARN_APPLICATION_LATEST_APP_ATTEMPT": 
> "appattempt_1523259757659_0003_01",
> "YARN_APPLICATION_TAGS": [
>   "timeline_flow_name_tag:test_flow"
> ],
> "YARN_APPLICATION_STATE": "KILLED"
>   },
>   "configs": {},
>   "isrelatedto": {},
>   "relatesto": {}
> }{noformat}
> This is different to what the Resource Manager reports. For KILLED 
> applications the final status is KILLED and for FAILED applications it is 
> FAILED. This behavior is seen in ATSv2 as well as older versions of ATS. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-02-19 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9313:
---
Attachment: (was: YARN-9313.001.patch)

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-02-19 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9313:
---
Attachment: YARN-9313.001.patch

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7129) Application Catalog for YARN applications

2019-02-19 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772497#comment-16772497
 ] 

Eric Yang commented on YARN-7129:
-

I filed the 160 shelldocs false positives as YETUS-798 for future Yetus 
improvement.

> Application Catalog for YARN applications
> -
>
> Key: YARN-7129
> URL: https://issues.apache.org/jira/browse/YARN-7129
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: applications
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN Appstore.pdf, YARN-7129.001.patch, 
> YARN-7129.002.patch, YARN-7129.003.patch, YARN-7129.004.patch, 
> YARN-7129.005.patch, YARN-7129.006.patch, YARN-7129.007.patch, 
> YARN-7129.008.patch, YARN-7129.009.patch, YARN-7129.010.patch, 
> YARN-7129.011.patch, YARN-7129.012.patch, YARN-7129.013.patch, 
> YARN-7129.014.patch, YARN-7129.015.patch, YARN-7129.016.patch, 
> YARN-7129.017.patch, YARN-7129.018.patch, YARN-7129.019.patch, 
> YARN-7129.020.patch, YARN-7129.021.patch, YARN-7129.022.patch, 
> YARN-7129.023.patch, YARN-7129.024.patch
>
>
> YARN native services provides a web services API to improve the usability of 
> application deployment on Hadoop using collections of Docker images. It would 
> be nice to have an application catalog system which provides an editorial and 
> search interface for YARN applications. This improves the usability of YARN 
> for managing the life cycle of applications.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-999) In case of long running tasks, reduce node resource should balloon out resource quickly by calling preemption API and suspending running task.

2019-02-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772399#comment-16772399
 ] 

Hadoop QA commented on YARN-999:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
16s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
15s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 
 6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 21s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
15s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m 
12s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m 28s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 10 new + 336 unchanged - 10 fixed = 346 total (was 346) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  8s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
47s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}101m  6s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
48s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}179m 23s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestCapacitySchedulerMetrics |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-999 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12959318/YARN-999.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 6f1ec3c47cee 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 
10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 02d04bd |
| maven | version: Apache Maven 3.3.9 |
| 

[jira] [Commented] (YARN-2489) ResouceOption's overcommitTimeout should be respected during resource update on NM

2019-02-19 Thread JIRA


[ 
https://issues.apache.org/jira/browse/YARN-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772289#comment-16772289
 ] 

Íñigo Goiri commented on YARN-2489:
---

I added a patch to YARN-999 which pretty much covers the description of this 
JIRA plus the actual killing when we overcommit.
What we do there is change the resources of the NM and then, after the 
overcommit timeout expires, we kill.
Another option would be to have in this JIRA the mechanism to wait X seconds 
before changing the resources, and YARN-999 to just kill when we go negative.
I think the current approach in YARN-999 covers the functionality better, as it 
would allow reducing the size of the NM and waiting forever until containers 
are drained while showing the change in resources.
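
For reference, a minimal sketch of how the overcommit timeout can travel with 
an admin-side node resource update (the node id, the protocol handle and the 
60 second value are placeholders, not part of any patch here):
{code:java}
// Sketch: shrink the node to 4 GB / 4 vcores and give running containers up
// to 60 seconds before the overcommit is enforced.
Resource newCapacity = Resource.newInstance(4096, 4);
ResourceOption option = ResourceOption.newInstance(newCapacity, 60);
UpdateNodeResourceRequest request =
    UpdateNodeResourceRequest.newInstance(
        Collections.singletonMap(nodeId, option));
adminProtocol.updateNodeResource(request);
{code}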

> ResouceOption's overcommitTimeout should be respected during resource update 
> on NM
> --
>
> Key: YARN-2489
> URL: https://issues.apache.org/jira/browse/YARN-2489
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, scheduler
>Reporter: Junping Du
>Priority: Major
>
> The ResourceOption used to update an NM's resource has two properties: 
> Resource and OvercommitTimeout. The latter property is used to guarantee that 
> resource is withdrawn after the timeout is hit, if the resource is reduced to 
> a value and the current resource consumption exceeds the new value. It 
> currently uses the default value -1, which means no timeout, and we should 
> make this property work when updating NM resources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9286) [Timeline Server] Sorting based on FinalStatus shows pop-up message

2019-02-19 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772287#comment-16772287
 ] 

Hudson commented on YARN-9286:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15998 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15998/])
YARN-9286. [Timeline Server] Sorting based on FinalStatus shows pop-up 
(bibinchundatt: rev b8de78c570babe4f802d951957c495ea0a4b07da)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebPageUtils.java


> [Timeline Server] Sorting based on FinalStatus shows pop-up message
> ---
>
> Key: YARN-9286
> URL: https://issues.apache.org/jira/browse/YARN-9286
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Reporter: Nallasivan
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9286-001.patch, YARN-9286-002.patch, 
> image-2019-02-15-18-16-21-804.png
>
>
> In the Timeline Server GUI, if we try to sort the details based on 
> FinalStatus, a popup window is displayed. Further, any operation that 
> involves refreshing the page results in the same popup window being displayed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries

2019-02-19 Thread Bibin A Chundatt (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772284#comment-16772284
 ] 

Bibin A Chundatt commented on YARN-8132:


Thank you [~Prabhu Joseph] for the patch.

The latest patch fixes the issue when the attempt is available and the 
application is killed.

Could you add a test case to verify the FINAL status in the TIMELINE when the 
application is KILLED before an attempt is created?

> Final Status of applications shown as UNDEFINED in ATS app queries
> --
>
> Key: YARN-8132
> URL: https://issues.apache.org/jira/browse/YARN-8132
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2, timelineservice
>Reporter: Charan Hebri
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-8132-001.patch, YARN-8132-002.patch, 
> YARN-8132-003.patch
>
>
> Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. A 
> sample request/response with INFO field for an application,
> {noformat}
> 2018-04-09 13:10:02,126 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1693)) - Received URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user 
> hrt_qa
> 2018-04-09 13:10:02,156 INFO  reader.TimelineReaderWebServices 
> (TimelineReaderWebServices.java:getApp(1716)) - Processed URL 
> /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 
> ms.){noformat}
> {noformat}
> {
>   "metrics": [],
>   "events": [],
>   "createdtime": 1523263360719,
>   "idprefix": 0,
>   "id": "application_1523259757659_0003",
>   "type": "YARN_APPLICATION",
>   "info": {
> "YARN_APPLICATION_CALLER_CONTEXT": "CLI",
> "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application 
> application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX",
> "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED",
> "YARN_APPLICATION_NAME": "Sleep job",
> "YARN_APPLICATION_USER": "hrt_qa",
> "YARN_APPLICATION_UNMANAGED_APPLICATION": false,
> "FROM_ID": 
> "yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003",
> "UID": "yarn-cluster!application_1523259757659_0003",
> "YARN_APPLICATION_VIEW_ACLS": " ",
> "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718,
> "YARN_AM_CONTAINER_LAUNCH_COMMAND": [
>   "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp 
> -Dlog4j.configuration=container-log4j.properties 
> -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 
> -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog 
> -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout 
> 2>/stderr "
> ],
> "YARN_APPLICATION_QUEUE": "default",
> "YARN_APPLICATION_TYPE": "MAPREDUCE",
> "YARN_APPLICATION_PRIORITY": 0,
> "YARN_APPLICATION_LATEST_APP_ATTEMPT": 
> "appattempt_1523259757659_0003_01",
> "YARN_APPLICATION_TAGS": [
>   "timeline_flow_name_tag:test_flow"
> ],
> "YARN_APPLICATION_STATE": "KILLED"
>   },
>   "configs": {},
>   "isrelatedto": {},
>   "relatesto": {}
> }{noformat}
> This is different to what the Resource Manager reports. For KILLED 
> applications the final status is KILLED and for FAILED applications it is 
> FAILED. This behavior is seen in ATSv2 as well as older versions of ATS. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-999) In case of long running tasks, reduce node resource should balloon out resource quickly by calling preemption API and suspending running task.

2019-02-19 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/YARN-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri reassigned YARN-999:


Assignee: Íñigo Goiri

> In case of long running tasks, reduce node resource should balloon out 
> resource quickly by calling preemption API and suspending running task. 
> ---
>
> Key: YARN-999
> URL: https://issues.apache.org/jira/browse/YARN-999
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, scheduler
>Reporter: Junping Du
>Assignee: Íñigo Goiri
>Priority: Major
> Attachments: YARN-291.000.patch, YARN-999.001.patch
>
>
> In the current design and implementation, when we decrease a node's resource 
> to less than the resource consumption of the currently running tasks, the 
> tasks can still run to completion; it is just that no new task gets assigned 
> to this node (because AvailableResource < 0) until some tasks finish and 
> AvailableResource > 0 again. This is good for most cases, but in the case of 
> long running tasks it could be too slow for the resource setting to actually 
> take effect, so preemption could be employed here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-999) In case of long running tasks, reduce node resource should balloon out resource quickly by calling preemption API and suspending running task.

2019-02-19 Thread JIRA


[ 
https://issues.apache.org/jira/browse/YARN-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772286#comment-16772286
 ] 

Íñigo Goiri commented on YARN-999:
--

I think [^YARN-999.001.patch] is ready for review.
* When the resources were changed using Admin/REST interfaces, the NM didn't 
get updated. On the other hand, when we trigger it through the configuration, 
it does. I added {{RMNode#isUpdatedCapability()}} to handle this.
* I added the logic for the preemption in 
{{AbstractYarnScheduler#killContainersIfOvercommitted()}}. It could be done in 
FS or CS but I think this is more general. Maybe we can make it overridable.
* I tweaked {{TestCapacityScheduler#testResourceOverCommit()}} and at the end I 
added a sequence to test the feature. It could technically be split into 
smaller pieces.

Thoughts?
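
To make the second bullet concrete, here is a rough sketch of the kind of check 
being described (my illustration only, not the code in [^YARN-999.001.patch]; 
it assumes it runs inside the scheduler where {{completedContainer()}} is 
available):
{code:java}
// Sketch: once a node's capacity has been reduced below its current
// allocation and the overcommit timeout has expired, preempt containers
// until the remaining allocation fits the new capacity again.
for (RMContainer container : node.getCopiedListOfRunningContainers()) {
  if (Resources.fitsIn(node.getAllocatedResource(), node.getTotalResource())) {
    break; // the allocation fits again, stop preempting
  }
  completedContainer(container,
      SchedulerUtils.createPreemptedContainerStatus(container.getContainerId(),
          "Container preempted after node resource reduction"),
      RMContainerEventType.KILL);
}
{code}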

> In case of long running tasks, reduce node resource should balloon out 
> resource quickly by calling preemption API and suspending running task. 
> ---
>
> Key: YARN-999
> URL: https://issues.apache.org/jira/browse/YARN-999
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, scheduler
>Reporter: Junping Du
>Priority: Major
> Attachments: YARN-291.000.patch, YARN-999.001.patch
>
>
> In the current design and implementation, when we decrease a node's resource 
> to less than the resource consumption of the currently running tasks, the 
> tasks can still run to completion; it is just that no new task gets assigned 
> to this node (because AvailableResource < 0) until some tasks finish and 
> AvailableResource > 0 again. This is good for most cases, but in the case of 
> long running tasks it could be too slow for the resource setting to actually 
> take effect, so preemption could be employed here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-999) In case of long running tasks, reduce node resource should balloon out resource quickly by calling preemption API and suspending running task.

2019-02-19 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/YARN-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri updated YARN-999:
-
Attachment: YARN-999.001.patch

> In case of long running tasks, reduce node resource should balloon out 
> resource quickly by calling preemption API and suspending running task. 
> ---
>
> Key: YARN-999
> URL: https://issues.apache.org/jira/browse/YARN-999
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: graceful, nodemanager, scheduler
>Reporter: Junping Du
>Priority: Major
> Attachments: YARN-291.000.patch, YARN-999.001.patch
>
>
> In the current design and implementation, when we decrease a node's resource 
> to less than the resource consumption of the currently running tasks, the 
> tasks can still run to completion; it is just that no new task gets assigned 
> to this node (because AvailableResource < 0) until some tasks finish and 
> AvailableResource > 0 again. This is good for most cases, but in the case of 
> long running tasks it could be too slow for the resource setting to actually 
> take effect, so preemption could be employed here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9309) Improve graph text in SLS to avoid overlapping

2019-02-19 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772257#comment-16772257
 ] 

Hudson commented on YARN-9309:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15996 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15996/])
YARN-9309. Improve graph text in SLS to avoid overlapping. Contributed 
(bibinchundatt: rev 779dae4de7e518938d58badcef065ea457be911c)
* (edit) hadoop-tools/hadoop-sls/src/main/html/simulate.html.template


> Improve graph text in SLS to avoid overlapping
> --
>
> Key: YARN-9309
> URL: https://issues.apache.org/jira/browse/YARN-9309
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9309-001.patch, YARN-9309-002.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9286) [Timeline Server] Sorting based on FinalStatus shows pop-up message

2019-02-19 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9286:
---
Summary: [Timeline Server] Sorting based on FinalStatus shows pop-up 
message  (was: [Timeline Server] Sorting based on FinalStatus throws pop-up 
message)

> [Timeline Server] Sorting based on FinalStatus shows pop-up message
> ---
>
> Key: YARN-9286
> URL: https://issues.apache.org/jira/browse/YARN-9286
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Reporter: Nallasivan
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-9286-001.patch, YARN-9286-002.patch, 
> image-2019-02-15-18-16-21-804.png
>
>
> In the Timeline Server GUI, if we try to sort the details based on 
> FinalStatus, a popup window is displayed. Further, any operation that 
> involves refreshing the page results in the same popup window being displayed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9309) Improve graph text in SLS to avoid overlapping

2019-02-19 Thread Bibin A Chundatt (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bibin A Chundatt updated YARN-9309:
---
Summary: Improve graph text in SLS to avoid overlapping  (was: Improvise 
graphs in SLS as values displayed in graph are overlapping)

> Improve graph text in SLS to avoid overlapping
> --
>
> Key: YARN-9309
> URL: https://issues.apache.org/jira/browse/YARN-9309
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Bilwa S T
>Assignee: Bilwa S T
>Priority: Minor
> Attachments: YARN-9309-001.patch, YARN-9309-002.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9039) App ACLs are not validated when serving logs from LogWebService

2019-02-19 Thread Suma Shivaprasad (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suma Shivaprasad reassigned YARN-9039:
--

Assignee: (was: Suma Shivaprasad)

> App ACLs are not validated when serving logs from LogWebService
> ---
>
> Key: YARN-9039
> URL: https://issues.apache.org/jira/browse/YARN-9039
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Reporter: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-9039.1.patch, YARN-9039.2.patch, YARN-9039.3.patch
>
>
> App ACLs are not being validated while serving logs through REST and UI2 via 
> the Log Webservice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9039) App ACLs are not validated when serving logs from LogWebService

2019-02-19 Thread Suma Shivaprasad (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772149#comment-16772149
 ] 

Suma Shivaprasad commented on YARN-9039:


[~bibinchundatt] [~baktha] Apologies for the delayed response. I have not had a 
chance to look into this further after the previous discussions. Please feel 
free to pick this up if you are interested. Thanks.

> App ACLs are not validated when serving logs from LogWebService
> ---
>
> Key: YARN-9039
> URL: https://issues.apache.org/jira/browse/YARN-9039
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-9039.1.patch, YARN-9039.2.patch, YARN-9039.3.patch
>
>
> App ACLs are not being validated while serving logs through REST and UI2 via 
> the Log Webservice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772075#comment-16772075
 ] 

Hadoop QA commented on YARN-9265:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
25s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
20s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  9m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 43s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
53s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  9m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  9m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
15s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 
0 new + 260 unchanged - 10 fixed = 260 total (was 270) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 20s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
48s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  3m 
42s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 
33s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
58s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}112m 51s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9265 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12959250/YARN-9265-006.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux dd52b33e95f8 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 

[jira] [Assigned] (YARN-9048) Add znode hierarchy in Federation ZK State Store

2019-02-19 Thread Bilwa S T (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bilwa S T reassigned YARN-9048:
---

Assignee: Bilwa S T

> Add znode hierarchy in Federation ZK State Store
> 
>
> Key: YARN-9048
> URL: https://issues.apache.org/jira/browse/YARN-9048
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Major
>
> Similar to YARN-2962, consider having a hierarchy in the ZK federation store 
> for applications.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9264) [Umbrella] Follow-up on IntelOpenCL FPGA plugin

2019-02-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771997#comment-16771997
 ] 

Peter Bacsko commented on YARN-9264:


[~sunilg] [~tangzhankun] please review the first three patches: YARN-9265, 
YARN-9266 and YARN-9267.

After committing YARN-9265, I'll perform a rebase if necessary.

> [Umbrella] Follow-up on IntelOpenCL FPGA plugin
> ---
>
> Key: YARN-9264
> URL: https://issues.apache.org/jira/browse/YARN-9264
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
>
> The Intel FPGA resource type support was released in Hadoop 3.1.0.
> Right now the plugin implementation has some deficiencies that need to be 
> fixed. This JIRA lists all problems that need to be resolved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771993#comment-16771993
 ] 

Weiwei Yang commented on YARN-8821:
---

Thanks for working on this [~tangzhankun], it looks really good.

For the v9 patch, I think it's almost there, just some minor comments

1. {{NvidiaGPUPluginForRuntimeV2#topologyAwareSchedule}}

IIRC, at lines 396 and 402 they sort all combinations for a given count of 
devices every time. Why not just maintain an ordered list of these combinations 
in the map, so it only needs to be sorted once (when the cost table is 
initialized)?

2. {{NvidiaGPUPluginForRuntimeV2#allocateDevices}}
{code:java}
topologyAwareSchedule(allocation, count, envs, availableDevices, 
this.costTable);
if (allocation.size() != count) {
  LOG.error("Failed to do topology scheduling. Skip to use basic " + 
"scheduling");
}
return allocation;
{code}
this seems to return the allocation result from {{topologyAwareSchedule}} 
instead of falling back to basic scheduling when it fails.

3. {{NvidiaGPUPluginForRuntimeV2#allocateDevices}}

At line 249, this logging hides the actual error and stack trace; can we change 
it to LOG.error("", e)? The same comment applies to line 268.

4. NvidiaGPUPluginForRuntimeV2#allocateDevices

At lines 226-235, the second if can be removed and merged into the first one.

5. I am wondering if it makes sense to add debug logging to print the cost table, 
as that is the most important data for scheduling; we might need it while debugging 
issues.

Thanks
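
To illustrate point 1, here is a minimal sketch of a cost table that keeps each 
count's combinations pre-sorted, so allocation only reads the head (PACK) or tail 
(SPREAD) of an already-ordered list. The class and method names are hypothetical 
and this is not the actual plugin code:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: combinations are sorted once, when the cost table is
// built, instead of on every scheduling call.
public class SortedCostTable {
  // device count -> combinations ordered by total pairwise cost (ascending)
  private final Map<Integer, List<Map.Entry<List<Integer>, Integer>>> costTable =
      new HashMap<>();

  public void addCombinations(int count, Map<List<Integer>, Integer> combinationCosts) {
    List<Map.Entry<List<Integer>, Integer>> ordered =
        new ArrayList<>(combinationCosts.entrySet());
    ordered.sort(Map.Entry.comparingByValue());      // sort once, at init time
    costTable.put(count, ordered);
  }

  /** Cheapest combination for "PACK", most expensive for "SPREAD". */
  public List<Integer> pick(int count, boolean pack) {
    List<Map.Entry<List<Integer>, Integer>> ordered = costTable.get(count);
    if (ordered == null || ordered.isEmpty()) {
      return Collections.emptyList();   // caller can fall back to basic scheduling
    }
    return ordered.get(pack ? 0 : ordered.size() - 1).getKey();
  }
}
{code}

An empty result from such a lookup would also give a natural signal for the 
fallback to basic scheduling mentioned in point 2.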

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481, but 
> we'd like to move the related discussions here.
> And please note that YARN-8851 will provide a pluggable device framework 
> which can support plugging in a custom scheduler. Based on that framework, the GPU 
> plugin could have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch has a topology algorithm implemented as below:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are all pairs of GPUs and whose values are the 
> communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The cost 
> is set based on the connection type.
> *Step 2*. Then it constructs a _+cost table+_ which caches all 
> combinations of GPUs and the corresponding cost between them. The 
> cost table is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the map is the count of GPUs; the value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication cost 
> of that combination. The cost calculation algorithm sums the costs of all 
> non-duplicate pairs of GPUs. For instance, the total cost of the [0,1,2] 
> GPUs is the sum of the costs of "0 - 1", "0 - 2" and "1 - 2". Each pair cost can be 
> obtained from the map built in step 1.
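
A minimal sketch of this pairwise-sum calculation, assuming the step-1 map keys use 
the "0 - 1" format shown above; this is illustrative only, not the plugin's actual code:

{code:java}
// Sum the pairwise costs of all distinct GPU pairs in a combination,
// using the step-1 map (e.g. "0 - 1" => 2).
static int combinationCost(java.util.List<Integer> gpus,
    java.util.Map<String, Integer> pairCost) {
  int total = 0;
  for (int i = 0; i < gpus.size(); i++) {
    for (int j = i + 1; j < gpus.size(); j++) {
      total += pairCost.get(gpus.get(i) + " - " + gpus.get(j));
    }
  }
  return total;  // e.g. cost([0,1,2]) = cost(0-1) + cost(0-2) + cost(1-2)
}
{code}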
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which a container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" means it prefers faster GPU-GPU communication; 
> "SPREAD" means it prefers faster CPU-GPU communication (since the GPUs are not 
> using the same bus to the CPU). The key difference between the two policies is the 
> sort order of the inner map in the cost table. For instance, let's assume 2 
> GPUs are wanted. costTable.get(2) would return a map containing all 
> combinations of two GPUs and their cost. If the policy is "PACK", we sort 
> the map by cost in ascending order; the first entry will be the GPUs with the 
> minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in descending 
> order and take the first entry, which has the highest GPU-GPU cost and thus the 
> lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK policy) 
> based on performance tests on an AWS EC2 instance with 8 GPU cards (P3) has been 
> done. The figure below shows the 

[jira] [Updated] (YARN-7266) Timeline Server event handler threads locked

2019-02-19 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-7266:

Component/s: ATSv2

> Timeline Server event handler threads locked
> 
>
> Key: YARN-7266
> URL: https://issues.apache.org/jira/browse/YARN-7266
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2, timelineserver
>Affects Versions: 2.7.3
>Reporter: Venkata Puneet Ravuri
>Assignee: Prabhu Joseph
>Priority: Major
>
> Event handlers for Timeline Server seem to take a lock while parsing HTTP 
> headers of the request. This is causing all other threads to wait and slowing 
> down the overall performance of Timeline server. We have resourcemanager 
> metrics enabled to send to timeline server. Because of the high load on 
> ResourceManager, the metrics to be sent are getting backlogged and in turn 
> increasing heap footprint of Resource Manager (due to pending metrics).
> This is the complete stack trace of a blocked thread on timeline server:-
> "2079644967@qtp-1658980982-4560" #4632 daemon prio=5 os_prio=0 
> tid=0x7f6ba490a000 nid=0x5eb waiting for monitor entry 
> [0x7f6b9142c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector.prepare(AccessorInjector.java:82)
> - waiting to lock <0x0005c0621860> (a java.lang.Class for 
> com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector)
> at 
> com.sun.xml.bind.v2.runtime.reflect.opt.OptimizedAccessorFactory.get(OptimizedAccessorFactory.java:168)
> at 
> com.sun.xml.bind.v2.runtime.reflect.Accessor$FieldReflection.optimize(Accessor.java:282)
> at 
> com.sun.xml.bind.v2.runtime.property.SingleElementNodeProperty.(SingleElementNodeProperty.java:94)
> at sun.reflect.GeneratedConstructorAccessor52.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at 
> com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at 
> com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:551)
> at 
> com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.(ArrayElementProperty.java:112)
> at 
> com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.(ArrayElementNodeProperty.java:62)
> at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at 
> com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at 
> com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:347)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170)
> at 
> com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145)
> at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.find(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at 
> com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.buildModelAndSchemas(WadlGeneratorJAXBGrammarGenerator.java:412)
> at 
> com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.createExternalGrammar(WadlGeneratorJAXBGrammarGenerator.java:352)
> at 
> com.sun.jersey.server.wadl.WadlBuilder.generate(WadlBuilder.java:115)
> at 
> com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:104)
> at 
> com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:120)
> at 
> 

[jira] [Commented] (YARN-7266) Timeline Server event handler threads locked

2019-02-19 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771982#comment-16771982
 ] 

Prabhu Joseph commented on YARN-7266:
-

The HTTP threads in the problematic jstack create a new {{JAXBContextImpl}} 
every time they accept an HTTP request, which causes the synchronization 
issue.

There are two ways to explore:

1. Implement a custom JAXB context factory (javax.xml.bind.context.factory) 
which reuses the {{JAXBContextImpl}}. The default {{ContextFactory}} creates a 
new {{JAXBContextImpl}} every time.

2. Check if Jersey has a way to reuse the {{JAXBContextImpl}} / Jersey 
{{JSONJAXBContext}} while accepting HTTP requests, similar to what it does 
when writing responses through a {{ContextResolver}} ({{JAXBContextResolver}} / 
{{YarnJacksonJaxbJsonProvider}}).

The issue is applicable to other web services like the RM and AM as well. It also 
affects the ATSv2 Timeline Reader web service.
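
A rough sketch of what option 2 could look like with a cached context; the bound 
classes here are placeholders, and this is not how the Timeline Server is currently 
wired:

{code:java}
import javax.ws.rs.ext.ContextResolver;
import javax.ws.rs.ext.Provider;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;

// Sketch: cache a single JAXBContext in a Jersey ContextResolver so request
// handling does not call JAXBContext.newInstance() (and build a new
// JAXBContextImpl) on every HTTP request.
@Provider
public class CachedJAXBContextResolver implements ContextResolver<JAXBContext> {
  private final JAXBContext context;

  public CachedJAXBContextResolver() throws JAXBException {
    // JAXBContext instances are thread-safe and expensive to build,
    // so build once and reuse. Bound classes are placeholders here.
    this.context = JAXBContext.newInstance(/* bound classes go here */);
  }

  @Override
  public JAXBContext getContext(Class<?> type) {
    return context;
  }
}
{code}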

> Timeline Server event handler threads locked
> 
>
> Key: YARN-7266
> URL: https://issues.apache.org/jira/browse/YARN-7266
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Venkata Puneet Ravuri
>Assignee: Prabhu Joseph
>Priority: Major
>
> Event handlers for Timeline Server seem to take a lock while parsing HTTP 
> headers of the request. This is causing all other threads to wait and slowing 
> down the overall performance of Timeline server. We have resourcemanager 
> metrics enabled to send to timeline server. Because of the high load on 
> ResourceManager, the metrics to be sent are getting backlogged and in turn 
> increasing heap footprint of Resource Manager (due to pending metrics).
> This is the complete stack trace of a blocked thread on timeline server:-
> "2079644967@qtp-1658980982-4560" #4632 daemon prio=5 os_prio=0 
> tid=0x7f6ba490a000 nid=0x5eb waiting for monitor entry 
> [0x7f6b9142c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector.prepare(AccessorInjector.java:82)
> - waiting to lock <0x0005c0621860> (a java.lang.Class for 
> com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector)
> at 
> com.sun.xml.bind.v2.runtime.reflect.opt.OptimizedAccessorFactory.get(OptimizedAccessorFactory.java:168)
> at 
> com.sun.xml.bind.v2.runtime.reflect.Accessor$FieldReflection.optimize(Accessor.java:282)
> at 
> com.sun.xml.bind.v2.runtime.property.SingleElementNodeProperty.(SingleElementNodeProperty.java:94)
> at sun.reflect.GeneratedConstructorAccessor52.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at 
> com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at 
> com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:551)
> at 
> com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.(ArrayElementProperty.java:112)
> at 
> com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.(ArrayElementNodeProperty.java:62)
> at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at 
> com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at 
> com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:347)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170)
> at 
> com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145)
> at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.find(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at 

[jira] [Commented] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl

2019-02-19 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771981#comment-16771981
 ] 

Szilard Nemeth commented on YARN-9267:
--

Hi [~pbacsko]!
Latest patch LGTM, +1 (non-binding).

> Various fixes are needed in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports
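
The hash suggested in the first bullet above could be computed roughly like this 
(a sketch only, not the attached patch):

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: hash the localized aocx file so reprogramming decisions are based
// on file content rather than on the IP ID / file name.
public final class AocxHash {
  public static String sha256Hex(Path aocxFile)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    try (InputStream in = Files.newInputStream(aocxFile)) {
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        digest.update(buffer, 0, read);
      }
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }
}
{code}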



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl

2019-02-19 Thread Peter Bacsko (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771960#comment-16771960
 ] 

Peter Bacsko commented on YARN-9267:


[~snemeth] you can check it again.

> Various fixes are needed in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card

2019-02-19 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9265:
---
Attachment: YARN-9265-006.patch

> FPGA plugin fails to recognize Intel Processing Accelerator Card
> 
>
> Key: YARN-9265
> URL: https://issues.apache.org/jira/browse/YARN-9265
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.1.0
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Critical
> Attachments: YARN-9265-001.patch, YARN-9265-002.patch, 
> YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch, 
> YARN-9265-006.patch
>
>
> The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card).
> There are two major issues.
> Problem #1
> The output of aocl diagnose:
> {noformat}
> 
> Device Name:
> acl0
>  
> Package Pat:
> /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp
>  
> Vendor: Intel Corp
>  
> Physical Dev Name   StatusInformation
>  
> pac_a10_f20 PassedPAC Arria 10 Platform (pac_a10_f20)
>   PCIe 08:00.0
>   FPGA temperature = 79 degrees C.
>  
> DIAGNOSTIC_PASSED
> 
>  
> Call "aocl diagnose " to run diagnose for specified devices
> Call "aocl diagnose all" to run diagnose for all devices
> {noformat}
> The plugin fails to recognize this and fails with the following message:
> {noformat}
> 2019-01-25 06:46:02,834 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin:
>  Using FPGA vendor plugin: 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin
> 2019-01-25 06:46:02,943 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer:
>  Trying to diagnose FPGA information ...
> 2019-01-25 06:46:03,085 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule:
>  Using traffic control bandwidth handler
> 2019-01-25 06:46:03,108 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
>  Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn
> 2019-01-25 06:46:03,139 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl:
>  FPGA Plugin bootstrap success.
> 2019-01-25 06:46:03,247 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)bus:slot.func\s=\s.*, pattern
> 2019-01-25 06:46:03,248 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern
> 2019-01-25 06:46:03,251 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin:
>  Failed to get major-minor number from reading /dev/pac_a10_f30
> 2019-01-25 06:46:03,252 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
> bootstrap configured resource subsystems!
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
>  No FPGA devices detected!
> {noformat}
> Problem #2
> The plugin assumes that the file name under {{/dev}} can be derived from the 
> "Physical Dev Name", but this is wrong. For example, it thinks that the 
> device file is {{/dev/pac_a10_f30}} which is not the case, the actual 
> file is {{/dev/intel-fpga-port.0}}.
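
For illustration, the bus:slot.func could be picked out of the diagnose output 
shown above with a more tolerant pattern such as the following sketch; this is not 
the actual fix in the attached patches:

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: the "aocl diagnose" output above reports the PCIe bus:slot.func on
// its own line ("PCIe 08:00.0"), so a more tolerant pattern than
// "(?i)bus:slot.func\s=\s.*" is needed to pick it up.
public class PacDiagnoseParser {
  private static final Pattern PCIE_LINE =
      Pattern.compile("(?i)PCIe\\s+(\\d+:\\d+\\.\\d+)");

  public static String findBusSlotFunc(String diagnoseOutput) {
    Matcher m = PCIE_LINE.matcher(diagnoseOutput);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(findBusSlotFunc("pac_a10_f20 Passed ...\n  PCIe 08:00.0\n"));
    // prints 08:00.0
  }
}
{code}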



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl

2019-02-19 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771952#comment-16771952
 ] 

Hadoop QA commented on YARN-9267:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
12s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 17s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
 The patch generated 0 new + 111 unchanged - 8 fixed = 111 total (was 119) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 46s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 20m 
25s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 71m 15s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9267 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12959240/YARN-9267-003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux a70389523e96 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 
5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 1e0ae6e |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23445/testReport/ |
| Max. process+thread count | 340 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/23445/console |
| Powered by | Apache 

[jira] [Updated] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl

2019-02-19 Thread Peter Bacsko (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9267:
---
Attachment: YARN-9267-003.patch

> Various fixes are needed in FpgaResourceHandlerImpl
> ---
>
> Key: YARN-9267
> URL: https://issues.apache.org/jira/browse/YARN-9267
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9267-001.patch, YARN-9267-002.patch, 
> YARN-9267-003.patch
>
>
> Fix some problems in {{FpgaResourceHandlerImpl}}:
>  * {{preStart()}} does not reconfigure card with the same IP - we see it as a 
> problem. If you recompile the FPGA application, you must rename the aocx file 
> because the card will not be reprogrammed. Suggestion: instead of storing 
> Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the 
> localized file).
>  * Switch to slf4j from Apache Commons Logging
>  * Some unused imports



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9050) [Umbrella] Usability improvements for scheduler activities

2019-02-19 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9050:
---
Summary: [Umbrella] Usability improvements for scheduler activities  (was: 
Usability improvements for scheduler activities)

> [Umbrella] Usability improvements for scheduler activities
> --
>
> Key: YARN-9050
> URL: https://issues.apache.org/jira/browse/YARN-9050
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: image-2018-11-23-16-46-38-138.png
>
>
> We have made some usability improvements for scheduler activities, based on 
> YARN 3.1, in our cluster, as follows:
> 1. Not available for multi-threaded asynchronous scheduling. App and node 
> activities may be confused when multiple scheduling threads record activities of 
> different allocation processes in the same variables, like appsAllocation and 
> recordingNodesAllocation in ActivitiesManager. I think these variables should 
> be thread-local to keep activities clear among multiple threads.
> 2. Incomplete activities for the multi-node lookup mechanism, since 
> ActivitiesLogger will skip recording through {{if (node == null || 
> activitiesManager == null) }} when node is null, which represents an 
> allocation over multiple nodes. We need to support recording activities for the 
> multi-node lookup mechanism.
> 3. Current app activities cannot meet the requirements of diagnostics. For 
> example, we can know that a node doesn't match a request but it is hard to know 
> why, especially when using placement constraints, where it's difficult to make a 
> detailed diagnosis manually. So I propose to improve the diagnoses of 
> activities: add a diagnosis for placement-constraint checks, update the 
> insufficient-resource diagnosis with detailed info (like 'insufficient resource 
> names: [memory-mb]') and so on.
> 4. Add more useful fields to app activities. In some scenarios we need to 
> distinguish different requests but can't locate them based on the app 
> activities info; some other fields can help to filter what we want, 
> such as allocation tags. We have added containerPriority, allocationRequestId 
> and allocationTags fields in AppAllocation.
> 5. Filter app activities by key fields. Sometimes the result of app 
> activities is massive and it's hard to find what we want. We support filtering 
> by allocation tags to meet requirements from some apps; moreover, we can 
> take container-priority and allocation-request-id as candidates if necessary.
> 6. Aggregate app activities by diagnoses. For a single allocation process, 
> activities can still be massive in a large cluster, and we frequently want to 
> know why a request can't be allocated in the cluster. It's hard to check every 
> node manually in a large cluster, so aggregating app activities by 
> diagnoses is necessary to solve this trouble. We have added a groupingType 
> parameter to the app-activities REST API for this; it supports grouping by 
> diagnostics, for example:
>  !image-2018-11-23-16-46-38-138.png! 
> I think we can have a discussion about these points; useful improvements which 
> are accepted will be added into the patch. Thanks.
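
A minimal sketch of the thread-local recording state suggested in point 1; the 
names are placeholders, not the real ActivitiesManager fields:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch: keep per-thread recording state so concurrent scheduling threads
// do not mix their activities.
public class PerThreadActivityRecorder {
  private final ThreadLocal<List<String>> appsAllocation =
      ThreadLocal.withInitial(ArrayList::new);

  public void record(String activity) {
    appsAllocation.get().add(activity);   // only visible to the current thread
  }

  public List<String> drain() {
    List<String> recorded = appsAllocation.get();
    appsAllocation.remove();              // avoid leaking state across allocations
    return recorded;
  }
}
{code}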



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-02-19 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771815#comment-16771815
 ] 

Tao Yang commented on YARN-9313:


Hi, [~cheersyang], [~leftnoteasy].

I have attached the v1 patch. Could you please help review it and give some 
advice? Thanks.

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-02-19 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9313:
---
Attachment: YARN-9313.001.patch

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-02-19 Thread Tao Yang (JIRA)
Tao Yang created YARN-9313:
--

 Summary: Support asynchronized scheduling mode and multi-node 
lookup mechanism for scheduler activities
 Key: YARN-9313
 URL: https://issues.apache.org/jira/browse/YARN-9313
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Tao Yang
Assignee: Tao Yang


[Design 
doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6221) Entities missing from ATS when summary log file info got returned to the ATS before the domain log

2019-02-19 Thread Rakesh Shah (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771785#comment-16771785
 ] 

Rakesh Shah commented on YARN-6221:
---

[~ssreenivasan] can you elaborate a bit more?

> Entities missing from ATS when summary log file info got returned to the ATS 
> before the domain log
> --
>
> Key: YARN-6221
> URL: https://issues.apache.org/jira/browse/YARN-6221
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sushmitha Sreenivasan
>Assignee: Li Lu
>Priority: Critical
>
> Events data missing for the following entities:
> curl -k --negotiate -u: 
> http://:8188/ws/v1/timeline/TEZ_APPLICATION_ATTEMPT/tez_appattempt_1487706062210_0012_01
> {"events":[],"entitytype":"TEZ_APPLICATION_ATTEMPT","entity":"tez_appattempt_1487706062210_0012_01","starttime":1487711606077,"domain":"Tez_ATS_application_1487706062210_0012","relatedentities":{"TEZ_DAG_ID":["dag_1487706062210_0012_2","dag_1487706062210_0012_1"]},"primaryfilters":{},"otherinfo":{}}
> {code:title=Timeline Server log entry}
> WARN  timeline.TimelineDataManager 
> (TimelineDataManager.java:doPostEntities(366)) - Skip the timeline entity: { 
> id: tez_application_1487706062210_0012, type: TEZ_APPLICATION }
> org.apache.hadoop.yarn.exceptions.YarnException: Domain information of the 
> timeline entity { id: tez_application_1487706062210_0012, type: 
> TEZ_APPLICATION } doesn't exist.
> at 
> org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:122)
> at 
> org.apache.hadoop.yarn.server.timeline.TimelineDataManager.doPostEntities(TimelineDataManager.java:356)
> at 
> org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:316)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityLogInfo.doParse(LogInfo.java:204)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:156)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:113)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:682)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:657)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:870)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6214) NullPointer Exception while querying timeline server API

2019-02-19 Thread Rakesh Shah (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771791#comment-16771791
 ] 

Rakesh Shah commented on YARN-6214:
---

Hi [~raviorteja],

I did not get any error or exception while executing

http://:8188/ws/v1/applicationhistory/apps?applicationTypes=MAPREDUCE

> NullPointer Exception while querying timeline server API
> 
>
> Key: YARN-6214
> URL: https://issues.apache.org/jira/browse/YARN-6214
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.1
>Reporter: Ravi Teja Chilukuri
>Priority: Major
>
> The apps API works fine and gives all applications, including MapReduce and Tez:
> http://:8188/ws/v1/applicationhistory/apps
> But when queried with application types via these APIs, it fails with a 
> NullPointerException.
> http://:8188/ws/v1/applicationhistory/apps?applicationTypes=TEZ
> http://:8188/ws/v1/applicationhistory/apps?applicationTypes=MAPREDUCE
> NullPointerException: java.lang.NullPointerException
> We are blocked on this issue as we are not able to run analytics on the Tez job 
> counters of the prod jobs.
> Timeline Logs:
> |2017-02-22 11:47:57,183 WARN  webapp.GenericExceptionHandler 
> (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApps(WebServices.java:195)
>   at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices.getApps(AHSWebServices.java:96)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
>   at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>   at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
> Complete stacktrace:
> http://pastebin.com/bRgxVabf



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (YARN-6221) Entities missing from ATS when summary log file info got returned to the ATS before the domain log

2019-02-19 Thread Rakesh Shah (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Shah updated YARN-6221:
--
Comment: was deleted

(was: Hi,

Sushmitha Sreenivasan

Can you explain the issue little more.)

> Entities missing from ATS when summary log file info got returned to the ATS 
> before the domain log
> --
>
> Key: YARN-6221
> URL: https://issues.apache.org/jira/browse/YARN-6221
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Sushmitha Sreenivasan
>Assignee: Li Lu
>Priority: Critical
>
> Events data missing for the following entities:
> curl -k --negotiate -u: 
> http://:8188/ws/v1/timeline/TEZ_APPLICATION_ATTEMPT/tez_appattempt_1487706062210_0012_01
> {"events":[],"entitytype":"TEZ_APPLICATION_ATTEMPT","entity":"tez_appattempt_1487706062210_0012_01","starttime":1487711606077,"domain":"Tez_ATS_application_1487706062210_0012","relatedentities":{"TEZ_DAG_ID":["dag_1487706062210_0012_2","dag_1487706062210_0012_1"]},"primaryfilters":{},"otherinfo":{}}
> {code:title=Timeline Server log entry}
> WARN  timeline.TimelineDataManager 
> (TimelineDataManager.java:doPostEntities(366)) - Skip the timeline entity: { 
> id: tez_application_1487706062210_0012, type: TEZ_APPLICATION }
> org.apache.hadoop.yarn.exceptions.YarnException: Domain information of the 
> timeline entity { id: tez_application_1487706062210_0012, type: 
> TEZ_APPLICATION } doesn't exist.
> at 
> org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:122)
> at 
> org.apache.hadoop.yarn.server.timeline.TimelineDataManager.doPostEntities(TimelineDataManager.java:356)
> at 
> org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:316)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityLogInfo.doParse(LogInfo.java:204)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:156)
> at 
> org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:113)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:682)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:657)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:870)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6937) Admin cannot post entities when domain is not exists

2019-02-19 Thread Rakesh Shah (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771764#comment-16771764
 ] 

Rakesh Shah commented on YARN-6937:
---

Hi [~daemon],

could you please elaborate on the issue?

> Admin cannot post entities when domain is not exists
> 
>
> Key: YARN-6937
> URL: https://issues.apache.org/jira/browse/YARN-6937
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: YunFan Zhou
>Priority: Major
>
> When I post entities to timeline server, and found that it throw the 
> following exception:
> {code:java}
> org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:123)
> at 
> org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:273)
> at 
> org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260)
> at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
> at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
> {code}
> In TimelineACLsManager#checkAccess logic:
> {code:java}
>   public boolean checkAccess(UserGroupInformation callerUGI,
>   ApplicationAccessType applicationAccessType,
>   TimelineEntity entity) throws YarnException, IOException {
> if (LOG.isDebugEnabled()) {
>   LOG.debug("Verifying the access of "
>   + (callerUGI == null ? null : callerUGI.getShortUserName())
>   + " on the timeline entity "
>   + new EntityIdentifier(entity.getEntityId(), 
> entity.getEntityType()));
> }
> if (!adminAclsManager.areACLsEnabled()) {
>   return true;
> }
> // find domain owner and acls
> AccessControlListExt aclExt = aclExts.get(entity.getDomainId());
> if (aclExt == null) {
>   aclExt = loadDomainFromTimelineStore(entity.getDomainId());
> }
> if (aclExt == null) {
>   throw new YarnException("Domain information of the timeline entity "
>   + new EntityIdentifier(entity.getEntityId(), entity.getEntityType())
>   + " doesn't exist.");
> }
> {code}
> Even if you're an administrator, you do not have permission to do this.
> I think it would be better to proceed with the follow-up checks even though the 
> value of *aclExt* is null:
> {code:java}
> if (callerUGI != null
> && (adminAclsManager.isAdmin(callerUGI) ||
> callerUGI.getShortUserName().equals(owner) ||
> domainACL.isUserAllowed(callerUGI))) {
>   return true;
> }
> return false;
> {code}
> Any suggestions?
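
One possible arrangement of the proposed check, reusing the identifiers from the 
snippets above; this is only a sketch of the suggestion, not a committed fix:

{code:java}
// Sketch: when the domain does not exist, fall back to the admin check
// instead of failing outright.
AccessControlListExt aclExt = aclExts.get(entity.getDomainId());
if (aclExt == null) {
  aclExt = loadDomainFromTimelineStore(entity.getDomainId());
}
if (aclExt == null) {
  if (callerUGI != null && adminAclsManager.isAdmin(callerUGI)) {
    return true;   // admins may still post entities without a domain
  }
  throw new YarnException("Domain information of the timeline entity "
      + new EntityIdentifier(entity.getEntityId(), entity.getEntityType())
      + " doesn't exist.");
}
{code}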



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates

2019-02-19 Thread Rakesh Shah (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771735#comment-16771735
 ] 

Rakesh Shah commented on YARN-9080:
---

Thanks [~Prabhu Joseph]

> Bucket Directories as part of ATS done accumulates
> --
>
> Key: YARN-9080
> URL: https://issues.apache.org/jira/browse/YARN-9080
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, 
> 0003-YARN-9080.patch
>
>
> We have observed that older bucket directories (cluster_timestamp, bucket1 and 
> bucket2) accumulate under the ATS done directory. The cleanLogs part of 
> EntityLogCleaner removes only the app directories and not the bucket directories.
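
One possible direction, sketched with the standard Hadoop FileSystem API, would be 
to remove a bucket directory once it no longer contains any app directories; this 
is an assumption about the fix, not the attached patch:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: after app directories are cleaned, bucket directories that are left
// empty could be removed as well.
public class EmptyBucketCleaner {
  public static void deleteIfEmpty(FileSystem fs, Path bucketDir) throws IOException {
    if (fs.exists(bucketDir) && fs.listStatus(bucketDir).length == 0) {
      fs.delete(bucketDir, false);   // non-recursive: only removes the empty dir
    }
  }
}
{code}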



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-9312) NPE while rendering SLS simulate page

2019-02-19 Thread Bilwa S T (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bilwa S T reassigned YARN-9312:
---

Assignee: Bilwa S T

> NPE while rendering SLS simulate page
> -
>
> Key: YARN-9312
> URL: https://issues.apache.org/jira/browse/YARN-9312
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Minor
>
> http://localhost:10001/simulate
> {code}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.sls.web.SLSWebApp.printPageSimulate(SLSWebApp.java:240)
> at 
> org.apache.hadoop.yarn.sls.web.SLSWebApp.access$100(SLSWebApp.java:55)
> at 
> org.apache.hadoop.yarn.sls.web.SLSWebApp$1.handle(SLSWebApp.java:152)
> at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:539)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
> at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9238) Allocate on previous or removed or non existent application attempt

2019-02-19 Thread lujie (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie updated YARN-9238:

Summary: Allocate on previous or removed or non existent application 
attempt  (was: We get a wrong attempt  by an appAttemptId when AM crash at some 
point)

> Allocate on previous or removed or non existent application attempt
> ---
>
> Key: YARN-9238
> URL: https://issues.apache.org/jira/browse/YARN-9238
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Attachments: YARN-9238_1.patch, YARN-9238_2.patch, YARN-9238_3.patch, 
> hadoop-test-resourcemanager-hadoop11.log
>
>
> We have found a data race that can lead to an odd situation.
> See 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate{color:#ff}:(code1){color}
> {code:java}
>  // Allocate OPPORTUNISTIC containers.
> 171.  SchedulerApplicationAttempt appAttempt =
> 172.((AbstractYarnScheduler)rmContext.getScheduler())
> 173.  .getApplicationAttempt(appAttemptId);
> 174.
> 175.  OpportunisticContainerContext oppCtx =
> 176.  appAttempt.getOpportunisticContainerContext();
> 177.  oppCtx.updateNodeList(getLeastLoadedNodes());
> {code}
> If we crash the current AM (its attempt id is appattempt_0) just before 
> code1#171, and code1#171~173 then continue to execute and get the appAttempt by 
> appattempt_0, the obtained appAttempt should represent the current AM. But 
> we found that the obtained appAttempt represents the new AM and its 
> attempt id is appattempt_1. This obtained appAttempt has not initialized its 
> oppCtx, so an NPE happens at code1#177.
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
> So why does the old appAttempt disappear, and why do we use the old appattempt_0 
> but get the new appAttempt?
> We have found the reason. The code below ({color:#ff}code2{color}) is the 
> function body of getApplicationAttempt at code1#173:
> {code:java}
> 399. public T getApplicationAttempt(ApplicationAttemptId 
> applicationAttemptId) {
> 400   SchedulerApplication app = applications.get(
> 401  applicationAttemptId.getApplicationId());
> 402   return app == null ? null : app.getCurrentAppAttempt();
> 403  }
> {code}
> When the old AM crashes, a new AM and a new appAttempt come up. The currentAttempt 
> of the app will be set to the new appAttempt (see code3), so code2#402 will 
> return the new appAttempt.
> If the AM crashes at the head of the allocate function (code1), the bug won't 
> happen, due to ApplicationDoesNotExistInCacheException. If the AM crashes after 
> code1, everything is also ok.
> We should add a check (see the sketch after this description): whether the 
> obtained appAttempt has the same id as the given id.
> Patch comes soon!
> {color:#ff}code3{color}
> {code:java}
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T
>  currentAttempt){
> this.currentAttempt = currentAttempt;
> }
> {code}
>  
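
The proposed check, sketched in the same form as code2 above (illustrative only; 
the attached patch may differ):

{code:java}
// Sketch: if the application's current attempt is no longer the one the caller
// asked for, return null instead of silently handing back the newer attempt.
public T getApplicationAttempt(ApplicationAttemptId applicationAttemptId) {
  SchedulerApplication<T> app =
      applications.get(applicationAttemptId.getApplicationId());
  if (app == null) {
    return null;
  }
  T attempt = app.getCurrentAppAttempt();
  if (attempt == null
      || !attempt.getApplicationAttemptId().equals(applicationAttemptId)) {
    return null;   // the requested attempt is a previous or removed attempt
  }
  return attempt;
}
{code}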



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-9103) Fix the bug in DeviceMappingManager#getReleasingDevices

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-9103.

Resolution: Won't Fix

Resolving this as it is fixed by YARN-9060.

> Fix the bug in DeviceMappingManager#getReleasingDevices
> ---
>
> Key: YARN-9103
> URL: https://issues.apache.org/jira/browse/YARN-9103
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> When one container is assigned multiple devices and is in the releasing state, 
> looping over the same containerId causes the releasing device count to be summed 
> multiple times. It involves the same bug as the one mentioned in YARN-9099.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8888) Support device topology scheduling

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-.

Resolution: Won't Fix

Resolving this because the GPU topology algorithm is better implemented in the 
plugin for now.

An abstraction covering all device topologies is premature at this point.

See YARN-8821 for GPU topology scheduling.

> Support device topology scheduling
> --
>
> Key: YARN-
> URL: https://issues.apache.org/jira/browse/YARN-
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> An easy way for a vendor plugin to describe topology information should be 
> provided in the Device spec, and the topology information will be used in the 
> device shared local scheduler to boost performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9312) NPE while rendering SLS simulate page

2019-02-19 Thread Bibin A Chundatt (JIRA)
Bibin A Chundatt created YARN-9312:
--

 Summary: NPE while rendering SLS simulate page
 Key: YARN-9312
 URL: https://issues.apache.org/jira/browse/YARN-9312
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Bibin A Chundatt


http://localhost:10001/simulate

{code}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.sls.web.SLSWebApp.printPageSimulate(SLSWebApp.java:240)
at 
org.apache.hadoop.yarn.sls.web.SLSWebApp.access$100(SLSWebApp.java:55)
at org.apache.hadoop.yarn.sls.web.SLSWebApp$1.handle(SLSWebApp.java:152)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:539)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
at 
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:745)

{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8889) Add well-defined interface in container-executor to support vendor plugins isolation request

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-8889.

Resolution: Duplicate

Resolving this as it is already implemented in YARN-9060.

> Add well-defined interface in container-executor to support vendor plugins 
> isolation request
> 
>
> Key: YARN-8889
> URL: https://issues.apache.org/jira/browse/YARN-8889
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Because of the different container runtimes, the isolation request from a 
> vendor device plugin may be raised before container launch (cgroups 
> operations) or at container launch (Docker runtime).
> An easy-to-use interface in container-executor should be provided to support 
> the above requirements.
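
For illustration only, a minimal Java sketch of what such an interface could express follows; all names here are hypothetical assumptions and are not taken from container-executor or from any patch:

{code:java}
// Hypothetical sketch only: none of these names come from the actual patch or
// from container-executor. It just illustrates the two invocation points
// described above (cgroups before launch vs. Docker at launch).
public interface DeviceIsolationRequest {

  /** When the vendor plugin needs the isolation to be applied. */
  enum IsolationPhase {
    BEFORE_LAUNCH,   // e.g. cgroups device allow/deny rules
    AT_LAUNCH        // e.g. parameters handed to the Docker runtime
  }

  IsolationPhase phase();

  /** Device nodes (e.g. "/dev/nvidia0") the container is allowed to access. */
  java.util.List<String> allowedDevices();

  /** Device nodes the container must be denied. */
  java.util.List<String> deniedDevices();
}
{code}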



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9195) RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover

2019-02-19 Thread Shengyang Sha (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771710#comment-16771710
 ] 

Shengyang Sha commented on YARN-9195:
-

{quote}
Just read the patch. I am trying to understand 
refreshContainersFromPreviousAttempts(): if a container from a previous attempt 
is completed, then you are not removing it from outstanding requests. Why are 
you doing this?
{quote}
The refreshContainersFromPreviousAttempts method is used to maintain running 
containers that were originally obtained by previous app attempts, not 
outstanding requests.
You probably meant the removePreviousContainersFromOutstandingSchedulingRequests 
method. In that method, I filter out (1) containers obtained by the current app 
attempt and (2) known containers from previous app attempts (see the sketch at 
the end of this comment).

{quote}
I am also not sure why you need initApplicationAttempt(); this is retrieving 
the current app attempt id from the AM RM token. Since in the protocol we have 
getContainersFromPreviousAttempts() already, what is the attempt id used for 
here?
{quote}
I think the current app attempt id is needed because RM might return all the 
running containers as previous containers 
(RegisterApplicationMasterResponse#getNMTokensFromPreviousAttempts). If we 
don't filter out such containers, the outstanding requests will be decreased 
unexpectedly. And if the current outstanding request count is already zero, it 
will then be decreased below zero.

{quote}
Another thing: why would this issue cause the pending containers/resources in 
RM's queue to become negative? Can you add some more info?
{quote}
As described above, outstanding requests could turn into negative values. 
Since RM has no sanity check, the requests in RM will then become negative as 
well. By the way, the description of this issue also provides a more detailed 
explanation.
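
For illustration, here is a minimal sketch of the filtering described above. This is not the actual YARN-9195 patch; the class name and helper arguments are assumptions, while the Container/ContainerId/ApplicationAttemptId types are the standard YARN record classes.

{code:java}
// Minimal sketch of the filtering described above -- not the actual
// YARN-9195 patch; class and parameter names here are assumptions.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;

public class PreviousAttemptContainerFilter {

  /**
   * Keep only containers that (1) were NOT allocated by the current attempt
   * and (2) are not already known/tracked, so that only genuinely recovered
   * containers are used to reduce outstanding scheduling requests.
   */
  public static List<Container> filter(List<Container> fromPreviousAttempts,
      ApplicationAttemptId currentAttemptId,
      Set<ContainerId> knownContainers) {
    List<Container> result = new ArrayList<>();
    for (Container c : fromPreviousAttempts) {
      boolean allocatedByCurrentAttempt =
          c.getId().getApplicationAttemptId().equals(currentAttemptId);
      boolean alreadyKnown = knownContainers.contains(c.getId());
      if (!allocatedByCurrentAttempt && !alreadyKnown) {
        result.add(c);
      }
    }
    return result;
  }
}
{code}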


> RM Queue's pending container number might get decreased unexpectedly or even 
> become negative once RM failover
> -
>
> Key: YARN-9195
> URL: https://issues.apache.org/jira/browse/YARN-9195
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.1.0
>Reporter: Shengyang Sha
>Assignee: Shengyang Sha
>Priority: Critical
> Attachments: YARN-9195.001.patch, YARN-9195.002.patch, 
> cases_to_recreate_negative_pending_requests_scenario.diff
>
>
> Hi, all:
> We have encountered a serious problem in ResourceManager: the pending 
> container number of one RM queue became negative after RM failed over. Since 
> queues in RM are managed in a hierarchical structure, the root queue's 
> pending containers eventually became negative as well, so the scheduling of 
> the whole cluster was affected.
> Both our RM server and the AMRM client in our application are based on YARN 
> 3.1, and our application uses the AMRMClientAsync#addSchedulingRequests() 
> method to request resources from RM.
> After investigation, we found that the direct cause was that numAllocations 
> of some AMs' requests became negative after RM failed over. There are at 
> least three necessary conditions:
> (1) The application uses schedulingRequests in the AMRM client and sets the 
> numAllocations of a schedulingRequest to zero. In our batch job scenario, the 
> numAllocations of a schedulingRequest can turn to zero because theoretically 
> we can run a full batch job using only one container.
> (2) RM fails over.
> (3) Before the AM re-registers itself to RM after the RM restart, RM has 
> already recovered some of the containers previously assigned to the 
> application.
> Here are some more details about the implementation:
> (1) After RM recovers, it sends all alive containers to the AM once the AM 
> re-registers itself, through 
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl calls 
> removeFromOutstandingSchedulingRequests once the AM gets 
> ContainersFromPreviousAttempts, without checking whether these containers 
> were assigned before. As a consequence, its outstanding requests might be 
> decreased unexpectedly even if they do not become negative.
> (3) There is no sanity check in RM to validate requests from AMs.
> To better illustrate this case, I've written test cases based on the latest 
> Hadoop trunk, posted in the attachment. You may try 
> testAMRMClientWithNegativePendingRequestsOnRMRestart and 
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.
> To solve this issue, I propose to filter allocated containers before 
> removeFromOutstandingSchedulingRequests in AMRMClientImpl during 
> registerApplicationMaster; some sanity checks are also needed to prevent 
> things from getting worse (a rough sketch follows this description).
> More comments and suggestions are welcome.
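
Purely to illustrate the kind of sanity check proposed above, a minimal sketch follows; every name here is hypothetical and this is not ResourceManager or AMRMClient code.

{code:java}
// Purely illustrative guard with hypothetical names -- not actual RM code.
// It shows the kind of sanity check proposed above: never let a pending
// container count be decremented below zero.
public final class PendingRequestCounter {
  private int pendingContainers;

  /** Apply a delta from an AM request, clamping updates that would go negative. */
  public synchronized void applyDelta(int delta) {
    int updated = pendingContainers + delta;
    if (updated < 0) {
      // Reject or clamp instead of letting the queue's pending count go negative.
      updated = 0;
    }
    pendingContainers = updated;
  }

  public synchronized int getPendingContainers() {
    return pendingContainers;
  }
}
{code}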



--
This message was sent by 

[jira] [Resolved] (YARN-8883) Phase 1 - Provide an example of fake vendor plugin

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-8883.

Resolution: Duplicate

Resolving this because YARN-9060 includes an example Nvidia GPU plugin.

> Phase 1 - Provide an example of fake vendor plugin
> --
>
> Key: YARN-8883
> URL: https://issues.apache.org/jira/browse/YARN-8883
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8883-trunk.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8887) Support isolation in pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang resolved YARN-8887.

Resolution: Duplicate

Resolving this as a duplicate of YARN-9060.

> Support isolation in pluggable device framework
> ---
>
> Key: YARN-8887
> URL: https://issues.apache.org/jira/browse/YARN-8887
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Device isolation needs a complete description in the API spec 
> (DeviceRuntimeSpec) and a translator in the adapter to convert the 
> requirements into uniform parameters passed to the native container-executor. 
> It should support both the default and the Docker container runtime.
> For the default container, we use a new device module in container-executor 
> to isolate devices. For Docker containers, we depend on the current 
> DockerLinuxContainerRuntime.
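
As a rough illustration of the "translator in the adapter" idea above, here is a hedged sketch; the real DeviceRuntimeSpec API and the YARN-9060 implementation may look quite different, and the cgroups flag name below is invented, while "--device" is the standard docker run option.

{code:java}
// Conceptual sketch only: not the YARN-9060 implementation, and the simplified
// spec type below stands in for the real DeviceRuntimeSpec. It illustrates
// "translate the plugin's spec into uniform parameters" for the two runtimes.
import java.util.ArrayList;
import java.util.List;

public class DeviceIsolationAdapterSketch {

  /** Hypothetical, simplified stand-in for the plugin-provided runtime spec. */
  public static class SimpleDeviceSpec {
    public final List<String> deviceNodes;   // e.g. "/dev/nvidia0"
    public SimpleDeviceSpec(List<String> deviceNodes) {
      this.deviceNodes = deviceNodes;
    }
  }

  /** Default (non-Docker) container: parameters for a cgroups-based device module. */
  public List<String> toCgroupsParams(SimpleDeviceSpec spec) {
    List<String> params = new ArrayList<>();
    for (String dev : spec.deviceNodes) {
      params.add("--allowed-device=" + dev);   // hypothetical flag name
    }
    return params;
  }

  /** Docker container: parameters handed to the Docker runtime instead. */
  public List<String> toDockerRunArgs(SimpleDeviceSpec spec) {
    List<String> args = new ArrayList<>();
    for (String dev : spec.deviceNodes) {
      args.add("--device=" + dev);   // standard docker run option
    }
    return args;
  }
}
{code}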



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7266) Timeline Server event handler threads locked

2019-02-19 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-7266:
---

Assignee: Prabhu Joseph

> Timeline Server event handler threads locked
> 
>
> Key: YARN-7266
> URL: https://issues.apache.org/jira/browse/YARN-7266
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Venkata Puneet Ravuri
>Assignee: Prabhu Joseph
>Priority: Major
>
> Event handlers for the Timeline Server seem to take a lock while parsing the 
> HTTP headers of a request. This causes all other threads to wait and slows 
> down the overall performance of the Timeline Server. We have ResourceManager 
> metrics enabled to be sent to the Timeline Server. Because of the high load 
> on the ResourceManager, the metrics to be sent get backlogged, which in turn 
> increases the heap footprint of the ResourceManager (due to pending metrics).
> This is the complete stack trace of a blocked thread on the Timeline Server:
> "2079644967@qtp-1658980982-4560" #4632 daemon prio=5 os_prio=0 
> tid=0x7f6ba490a000 nid=0x5eb waiting for monitor entry 
> [0x7f6b9142c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector.prepare(AccessorInjector.java:82)
> - waiting to lock <0x0005c0621860> (a java.lang.Class for 
> com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector)
> at 
> com.sun.xml.bind.v2.runtime.reflect.opt.OptimizedAccessorFactory.get(OptimizedAccessorFactory.java:168)
> at 
> com.sun.xml.bind.v2.runtime.reflect.Accessor$FieldReflection.optimize(Accessor.java:282)
> at 
> com.sun.xml.bind.v2.runtime.property.SingleElementNodeProperty.<init>(SingleElementNodeProperty.java:94)
> at sun.reflect.GeneratedConstructorAccessor52.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at 
> com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at 
> com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.<init>(ClassBeanInfoImpl.java:183)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:551)
> at 
> com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.<init>(ArrayElementProperty.java:112)
> at 
> com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.<init>(ArrayElementNodeProperty.java:62)
> at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown 
> Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
> Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at 
> com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at 
> com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.<init>(ClassBeanInfoImpl.java:183)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl.<init>(JAXBContextImpl.java:347)
> at 
> com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170)
> at 
> com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145)
> at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.find(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at 
> com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.buildModelAndSchemas(WadlGeneratorJAXBGrammarGenerator.java:412)
> at 
> com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.createExternalGrammar(WadlGeneratorJAXBGrammarGenerator.java:352)
> at 
> com.sun.jersey.server.wadl.WadlBuilder.generate(WadlBuilder.java:115)
> at 
> com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:104)
> at 
> com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:120)
> at 
> 
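
The frames above show threads serializing on JAXB context/accessor creation triggered by WADL generation. As general background (not necessarily the fix chosen for this issue), the usual mitigation is to build a JAXBContext once and reuse it, since JAXBContext is thread-safe while Marshaller instances are not. A minimal sketch, with a hypothetical payload class:

{code:java}
// General illustration of reusing a JAXBContext instead of creating one per
// request; JAXBContext instances are thread-safe and expensive to build.
// This is not the fix applied in YARN-7266, just the usual pattern.
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlRootElement;

public final class CachedJaxbContext {

  /** Hypothetical payload class, only to make the sketch self-contained. */
  @XmlRootElement
  public static class SamplePayload {
    public String name;
  }

  // Built once; creating a JAXBContext repeatedly is what funnels threads
  // through the synchronized accessor-injection code seen in the stack trace.
  private static final JAXBContext CONTEXT;

  static {
    try {
      CONTEXT = JAXBContext.newInstance(SamplePayload.class);
    } catch (JAXBException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  public static Marshaller newMarshaller() throws JAXBException {
    // Marshallers are NOT thread-safe, so create one per use.
    return CONTEXT.createMarshaller();
  }

  private CachedJaxbContext() { }
}
{code}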

[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework

2019-02-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771683#comment-16771683
 ] 

Zhankun Tang commented on YARN-8821:


The unit test failure seems unrelated to this patch.

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable 
> device framework
> -
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, 
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, 
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, 
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, 
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch
>
>
> h2. Background
> GPU topology affects performance. There has been a discussion in YARN-7481, 
> but we'd like to move the related discussion here.
> Please note that YARN-8851 will provide a pluggable device framework which 
> supports a plugin-specific custom scheduler. Based on that framework, the GPU 
> plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements the topology algorithm as follows:
>  *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" 
> to build a hash map whose keys are all pairs of GPUs and whose values are the 
> communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 - 
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The 
> cost is set based on the connection type.
> *Step 2*. It then constructs and caches a _+cost table+_ which holds all 
> combinations of GPUs and the corresponding cost between them. The cost table 
> is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}.
> {code}
> The key of the outer map is the count of GPUs; its value is a map whose key 
> is a combination of GPUs and whose value is the calculated communication cost 
> of that combination. The cost calculation sums the costs of all non-duplicate 
> GPU pairs. For instance, the total cost of GPUs [0,1,2] is the sum of the 
> costs "0 - 1", "0 - 2" and "1 - 2", and each pair cost can be obtained from 
> the map built in Step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on 
> topology, we provide two policies which the container can set through the 
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or 
> "SPREAD". "PACK" means it prefers faster GPU-GPU communication; "SPREAD" 
> means it prefers faster CPU-GPU communication (since the GPUs then do not 
> share the same bus to the CPU). The key difference between the two policies 
> is the sort order of the inner map in the cost table. For instance, assume 2 
> GPUs are wanted. costTable.get(2) returns a map containing all combinations 
> of two GPUs and their cost. If the policy is "PACK", we sort the map by cost 
> in ascending order, and the first entry is the GPU combination with the 
> minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in descending 
> order and take the first entry, which has the highest GPU-GPU cost and 
> therefore the lowest CPU-GPU cost. (A sketch of these steps follows at the 
> end of this quoted description.)
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK 
> policy) was done based on performance tests on an AWS EC2 instance with 8 GPU 
> cards (P3). The figure below shows the performance gain of the topology 
> scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!  
> Some of the conclusions are:
> 1. The topology between GPUs impacts the performance dramatically. The best 
> GPU combination can get a *5% to 185%* *performance gain* among the test 
> cases with various factors including CNN model, batch size, GPU subset, etc. 
> The scheduling algorithm should get close to this best case.
> 2. The "inception3" and "resnet50" networks do not seem to be topology 
> sensitive. Topology scheduling can only potentially get *about 6.8% to 10%* 
> speedup in the best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a 
> *6.8% to 177.1%* *performance gain in the best cases. On average, it also 
> outperforms the median performance (0.8% to 28.2%).*
> *4. The algorithm's allocations best match the fastest GPUs needed by 
> "vgg16"*.
>  
> In summary, the GPU topology scheduling algorithm is effective and can 
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 
> 30% on average.
>  *That means up to about 3X compared to a random GPU scheduling algorithm in 
> a specific scenario*.
>  
> The spreadsheets are here for your reference.
>  
>
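
For reference, here is a simplified, illustrative sketch of the cost computation and policy selection described in Steps 1-3 of the quoted description; it is not the YARN-8821 patch code, and it skips building and caching the full cost table.

{code:java}
// Illustrative sketch of the cost-table idea described above -- not the
// YARN-8821 patch code. Pair costs are assumed to come from parsing
// "nvidia-smi topo -m" elsewhere (Step 1).
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class GpuTopologyCostSketch {

  /** Step 1 result: cost between every pair of GPUs, keyed as "i-j" with i < j. */
  private final Map<String, Integer> pairCost;

  public GpuTopologyCostSketch(Map<String, Integer> pairCost) {
    this.pairCost = pairCost;
  }

  /** Step 2: total cost of a GPU combination = sum over all distinct pairs. */
  public int combinationCost(List<Integer> gpus) {
    int total = 0;
    for (int i = 0; i < gpus.size(); i++) {
      for (int j = i + 1; j < gpus.size(); j++) {
        int a = Math.min(gpus.get(i), gpus.get(j));
        int b = Math.max(gpus.get(i), gpus.get(j));
        total += pairCost.getOrDefault(a + "-" + b, 0);
      }
    }
    return total;
  }

  /**
   * Step 3: among candidate combinations of the requested size, PACK picks the
   * lowest GPU-GPU cost and SPREAD picks the highest.
   */
  public List<Integer> choose(List<List<Integer>> candidates, String policy) {
    Comparator<List<Integer>> byCost = Comparator.comparingInt(this::combinationCost);
    return "SPREAD".equals(policy)
        ? candidates.stream().max(byCost).orElseThrow(IllegalArgumentException::new)
        : candidates.stream().min(byCost).orElseThrow(IllegalArgumentException::new);
  }
}
{code}

A caller would enumerate the candidate GPU combinations of the requested size (the inner map of the cached cost table in the real design) and pass them to choose() together with the policy read from the NVIDIA_TOPO_POLICY environment variable.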