[jira] [Commented] (YARN-8891) Documentation of the pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772732#comment-16772732 ]

Hadoop QA commented on YARN-8891:
---------------------------------

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| trunk Compile Tests ||
| +1 | mvninstall | 16m 16s | trunk passed |
| +1 | mvnsite | 0m 23s | trunk passed |
| +1 | shadedclient | 27m 0s | branch has no errors when building and testing our client artifacts. |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 13s | the patch passed |
| -1 | mvnsite | 0m 16s | hadoop-yarn-site in the patch failed. |
| -1 | whitespace | 0m 0s | The patch has 21 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | shadedclient | 11m 55s | patch has no errors when building and testing our client artifacts. |
|| Other Tests ||
| +1 | asflicense | 0m 27s | The patch does not generate ASF License warnings. |
| | | 40m 37s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8891 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12959385/YARN-8891-trunk.001.patch |
| Optional Tests | dupname asflicense mvnsite |
| uname | Linux 8728af006da8 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 1d30fd9 |
| maven | version: Apache Maven 3.3.9 |
| mvnsite | https://builds.apache.org/job/PreCommit-YARN-Build/23450/artifact/out/patch-mvnsite-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-site.txt |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/23450/artifact/out/whitespace-eol.txt |
| Max. process+thread count | 447 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23450/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.
> Documentation of the pluggable device framework > --- > > Key: YARN-8891 > URL: https://issues.apache.org/jira/browse/YARN-8891 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8891-trunk.001.patch, YARN-8891-trunk.002.patch, > YARN-8891-trunk.003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8891) Documentation of the pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-8891: --- Attachment: YARN-8891-trunk.002.patch > Documentation of the pluggable device framework > --- > > Key: YARN-8891 > URL: https://issues.apache.org/jira/browse/YARN-8891 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8891-trunk.001.patch, YARN-8891-trunk.002.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8891) Documentation of the pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-8891: --- Attachment: YARN-8891-trunk.003.patch > Documentation of the pluggable device framework > --- > > Key: YARN-8891 > URL: https://issues.apache.org/jira/browse/YARN-8891 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8891-trunk.001.patch, YARN-8891-trunk.002.patch, > YARN-8891-trunk.003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6538) Inter Queue preemption is not happening when DRF is configured
[ https://issues.apache.org/jira/browse/YARN-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772706#comment-16772706 ]

niu commented on YARN-6538:
---------------------------

Hi all, any update on this?

> Inter Queue preemption is not happening when DRF is configured
> --------------------------------------------------------------
>
> Key: YARN-6538
> URL: https://issues.apache.org/jira/browse/YARN-6538
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacity scheduler, scheduler preemption
> Affects Versions: 2.8.0
> Reporter: Sunil Govindan
> Assignee: Sunil Govindan
> Priority: Major
>
> The cluster capacity has plenty of memory but few vcores. If applications have
> enough demand, vcores may be exhausted. Inter-queue preemption ideally has to
> kick in once vcores are over-utilized. However, preemption is not happening.
> Analysis:
> In {{AbstractPreemptableResourceCalculator.computeFixpointAllocation}},
> {code}
> // assign all cluster resources until no more demand, or no resources are
> // left
> while (!orderedByNeed.isEmpty() && Resources.greaterThan(rc, totGuarant,
>     unassigned, Resources.none())) {
> {code}
> will loop even when vcores are 0 (because memory is still +ve). Hence we end
> up with more vcores in idealAssigned, which causes no-preemption cases.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
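The loop condition quoted in the YARN-6538 analysis above can be reduced to a minimal sketch. This is hypothetical code, not the Hadoop implementation: `greaterThanDominant` stands in for `Resources.greaterThan` with a `DominantResourceCalculator`, which treats a resource as "greater than none" whenever any component is positive.

```java
// Sketch (NOT Hadoop code) of why the fix-point loop never terminates on
// vcores exhaustion: greaterThanDominant is a hypothetical stand-in for
// Resources.greaterThan(rc, ...) under the DominantResourceCalculator.
public class DrfLoopSketch {
    // A dominant-resource comparison against Resources.none(): the
    // resource counts as "greater than none" if ANY component is positive.
    static boolean greaterThanDominant(long memoryMb, long vcores) {
        return Math.max(memoryMb, vcores) > 0;
    }

    public static void main(String[] args) {
        // vcores are fully assigned, only memory is left unassigned...
        long unassignedMemoryMb = 4096;
        long unassignedVcores = 0;
        // ...yet the while-condition still holds, so the loop keeps
        // assigning and idealAssigned accumulates extra vcores.
        System.out.println(greaterThanDominant(unassignedMemoryMb, unassignedVcores)); // true
    }
}
```

This is why the loop continues while memory is still positive even though the contended resource (vcores) is already at zero.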
[jira] [Updated] (YARN-8891) Documentation of the pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-8891: --- Attachment: YARN-8891-trunk.001.patch > Documentation of the pluggable device framework > --- > > Key: YARN-8891 > URL: https://issues.apache.org/jira/browse/YARN-8891 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8891-trunk.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable
[ https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772691#comment-16772691 ] Prabhu Joseph commented on YARN-6929: - [~jlowe] Can you review this jira when you get time? Thanks. > yarn.nodemanager.remote-app-log-dir structure is not scalable > - > > Key: YARN-6929 > URL: https://issues.apache.org/jira/browse/YARN-6929 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 2.7.3 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-6929.1.patch, YARN-6929.2.patch, YARN-6929.2.patch, > YARN-6929.3.patch, YARN-6929.4.patch, YARN-6929.5.patch, YARN-6929.6.patch, > YARN-6929.patch > > > The current directory structure for yarn.nodemanager.remote-app-log-dir is > not scalable. The maximum subdirectory limit is 1048576 by default (HDFS-6102). > With a retention yarn.log-aggregation.retain-seconds of 7 days, there is a good > chance that LogAggregationService fails to create a new directory with > FSLimitException$MaxDirectoryItemsExceededException. > The current structure is > {remote-app-log-dir}/{user}/logs/{applicationId}. 
This can be > improved by adding the date as a subdirectory, like > {remote-app-log-dir}/{user}/logs/{date}/{applicationId} > {code} > WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: > Application failed to init aggregation > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): > The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 > items=1048576 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021) > > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072) > > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194) > > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813) > > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600) > > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at 
javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308) > > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366) > > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320) > > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443) > > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): > The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 > items=1048576 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021) > > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072) > > at >
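The layout change proposed in YARN-6929 above can be sketched as a small path-building helper. This is a hypothetical illustration (not code from the attached patches): inserting a per-day level caps the children of the `.../logs` directory at the number of retained days, instead of the total number of retained applications.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch of the proposed layout
//   {remote-app-log-dir}/{user}/logs/{date}/{applicationId}
// Hypothetical helper, not the actual LogAggregationService code.
public class LogDirLayoutSketch {
    static String appLogDir(String remoteRoot, String user, String appId, LocalDate day) {
        // One subdirectory per day keeps each directory's child count
        // bounded by the retention window rather than application volume.
        return String.join("/", remoteRoot, user, "logs",
                day.format(DateTimeFormatter.ISO_LOCAL_DATE), appId);
    }

    public static void main(String[] args) {
        System.out.println(appLogDir("/app-logs", "yarn",
                "application_1523259757659_0003", LocalDate.of(2019, 2, 20)));
        // prints /app-logs/yarn/logs/2019-02-20/application_1523259757659_0003
    }
}
```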
[jira] [Commented] (YARN-9227) DistributedShell RelativePath is not removed at end
[ https://issues.apache.org/jira/browse/YARN-9227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772689#comment-16772689 ] Prabhu Joseph commented on YARN-9227: - [~sunilg] Can you review this jira when you get time. Thanks. > DistributedShell RelativePath is not removed at end > --- > > Key: YARN-9227 > URL: https://issues.apache.org/jira/browse/YARN-9227 > Project: Hadoop YARN > Issue Type: Bug > Components: distributed-shell >Affects Versions: 3.1.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Minor > Attachments: 0001-YARN-9227.patch, 0002-YARN-9227.patch, > 0003-YARN-9227.patch > > > DistributedShell Job does not remove the relative path which contains jars > and localized files. > {code} > [ambari-qa@ash hadoop-yarn]$ hadoop fs -ls > /user/ambari-qa/DistributedShell/application_1542665708563_0017 > Found 2 items > -rw-r--r-- 3 ambari-qa hdfs 46636 2019-01-23 13:37 > /user/ambari-qa/DistributedShell/application_1542665708563_0017/AppMaster.jar > -rwx--x--- 3 ambari-qa hdfs 4 2019-01-23 13:37 > /user/ambari-qa/DistributedShell/application_1542665708563_0017/shellCommands > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9258) Support to specify allocation tags without constraint in distributed shell CLI
[ https://issues.apache.org/jira/browse/YARN-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772687#comment-16772687 ] Prabhu Joseph commented on YARN-9258: - [~cheersyang] Can you review this patch when you get time. Thanks. > Support to specify allocation tags without constraint in distributed shell CLI > -- > > Key: YARN-9258 > URL: https://issues.apache.org/jira/browse/YARN-9258 > Project: Hadoop YARN > Issue Type: Sub-task > Components: distributed-shell >Affects Versions: 3.1.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9258-001.patch, YARN-9258-002.patch > > > DistributedShell PlacementSpec fails to parse > {color:#d04437}zk=1:spark=1,NOTIN,NODE,zk{color} > {code} > java.lang.IllegalArgumentException: Invalid placement spec: > zk=1:spark=1,NOTIN,NODE,zk > at > org.apache.hadoop.yarn.applications.distributedshell.PlacementSpec.parse(PlacementSpec.java:108) > at > org.apache.hadoop.yarn.applications.distributedshell.Client.init(Client.java:462) > at > org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDistributedShellWithPlacementConstraint(TestDistributedShell.java:1780) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at java.lang.Thread.run(Thread.java:745) > Caused by: > org.apache.hadoop.yarn.util.constraint.PlacementConstraintParseException: > Source allocation tags is required for a multi placement constraint > expression. > at > org.apache.hadoop.yarn.util.constraint.PlacementConstraintParser.parsePlacementSpec(PlacementConstraintParser.java:740) > at > org.apache.hadoop.yarn.applications.distributedshell.PlacementSpec.parse(PlacementSpec.java:94) > ... 16 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
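The parsing that YARN-9258 asks for can be illustrated with a toy splitter. This is hypothetical code, not the `PlacementConstraintParser`: a spec such as `zk=1:spark=1,NOTIN,NODE,zk` is a `:`-separated list where each entry is `tag=count`, optionally followed by `,op,scope,targetTag`; an entry with no constraint part (here `zk=1`) should parse to a null constraint instead of failing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch of tolerant PlacementSpec splitting (NOT the real parser).
public class PlacementSpecSketch {
    static Map<String, String> parse(String spec) {
        Map<String, String> tagToConstraint = new LinkedHashMap<>();
        for (String entry : spec.split(":")) {
            int comma = entry.indexOf(',');
            // "zk=1" -> allocation only; "spark=1,NOTIN,NODE,zk" -> allocation + constraint
            String allocation = comma < 0 ? entry : entry.substring(0, comma);
            String tag = allocation.split("=")[0];
            tagToConstraint.put(tag, comma < 0 ? null : entry.substring(comma + 1));
        }
        return tagToConstraint;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = parse("zk=1:spark=1,NOTIN,NODE,zk");
        System.out.println(parsed.get("zk"));     // null (allocation tag without constraint)
        System.out.println(parsed.get("spark"));  // NOTIN,NODE,zk
    }
}
```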
[jira] [Commented] (YARN-9290) Invalid SchedulingRequest not rejected in Scheduler PlacementConstraintsHandler
[ https://issues.apache.org/jira/browse/YARN-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772686#comment-16772686 ] Prabhu Joseph commented on YARN-9290: - [~cheersyang] Can you review the patch for this jira when you get time. Thanks. > Invalid SchedulingRequest not rejected in Scheduler > PlacementConstraintsHandler > > > Key: YARN-9290 > URL: https://issues.apache.org/jira/browse/YARN-9290 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9290-001.patch, YARN-9290-002.patch, > YARN-9290-003.patch > > > SchedulingRequest with Invalid namespace is not rejected in Scheduler > PlacementConstraintsHandler. RM keeps on trying to allocateOnNode with > logging the exception. This is rejected in case of placement-processor > handler. > {code} > 2019-02-08 16:51:27,548 WARN > org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.SingleConstraintAppPlacementAllocator: > Failed to query node cardinality: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.InvalidAllocationTagsQueryException: > Invalid namespace prefix: notselfi, valid values are: > all,not-self,app-id,app-tag,self > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.TargetApplicationsNamespace.fromString(TargetApplicationsNamespace.java:277) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.TargetApplicationsNamespace.parse(TargetApplicationsNamespace.java:234) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.AllocationTags.createAllocationTags(AllocationTags.java:93) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfySingleConstraintExpression(PlacementConstraintsUtil.java:78) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfySingleConstraint(PlacementConstraintsUtil.java:240) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfyConstraints(PlacementConstraintsUtil.java:321) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfyAndConstraint(PlacementConstraintsUtil.java:272) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfyConstraints(PlacementConstraintsUtil.java:324) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.PlacementConstraintsUtil.canSatisfyConstraints(PlacementConstraintsUtil.java:365) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.SingleConstraintAppPlacementAllocator.checkCardinalityAndPending(SingleConstraintAppPlacementAllocator.java:355) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.SingleConstraintAppPlacementAllocator.precheckNode(SingleConstraintAppPlacementAllocator.java:395) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.precheckNode(AppSchedulingInfo.java:779) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.preCheckForNodeCandidateSet(RegularContainerAllocator.java:145) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:837) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:890) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:977) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1173) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:795) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1630) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1624) > at >
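The fix direction in YARN-9290 above amounts to validating the namespace prefix once, up front, and rejecting the SchedulingRequest, instead of logging the same `InvalidAllocationTagsQueryException` on every allocation attempt. A minimal sketch (hypothetical helper; the real valid-prefix list lives in `TargetApplicationsNamespace`):

```java
import java.util.Set;

// Sketch of an up-front namespace-prefix check at SchedulingRequest
// submission time (hypothetical, not the RM code).
public class NamespacePrefixSketch {
    // Valid values per the exception message in the log above.
    static final Set<String> VALID_PREFIXES =
            Set.of("all", "not-self", "app-id", "app-tag", "self");

    static boolean isValidPrefix(String prefix) {
        return VALID_PREFIXES.contains(prefix);
    }

    public static void main(String[] args) {
        System.out.println(isValidPrefix("notselfi")); // false -> reject the request
        System.out.println(isValidPrefix("not-self")); // true
    }
}
```

Rejecting at submission keeps the scheduler's allocate path free of repeated failed cardinality queries, matching what the placement-processor handler already does.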
[jira] [Commented] (YARN-9208) Distributed shell allow LocalResourceVisibility to be specified
[ https://issues.apache.org/jira/browse/YARN-9208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772683#comment-16772683 ] Prabhu Joseph commented on YARN-9208: - [~bibinchundatt] Have changed the pattern into {{(PUBLIC=FileName1,FileName2,,),(PRIVATE=FileName3,FileName4,,),,}}. If only a PRIVATE file hdfs:/tmp/a is present - the pattern will be (PRIVATE=hdfs:/tmp/a). Can you review the same. > Distributed shell allow LocalResourceVisibility to be specified > --- > > Key: YARN-9208 > URL: https://issues.apache.org/jira/browse/YARN-9208 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin A Chundatt >Assignee: Prabhu Joseph >Priority: Minor > Attachments: YARN-9208-001.patch, YARN-9208-002.patch, > YARN-9208-003.patch, YARN-9208-004.patch > > > YARN-9008 add feature to add list of files to be localized. > Would be great to have Visibility type too. Allows testing of PRIVATE and > PUBLIC type too -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
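The visibility pattern described in the YARN-9208 comment above can be read with a small grouped-regex sketch. This is a hypothetical parser, not the code from the attached patches; it assumes the group syntax `(VISIBILITY=File1,File2,,)` with comma-separated files and tolerates the trailing `,,` separators.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy sketch of parsing "(PUBLIC=FileName1,FileName2,,),(PRIVATE=FileName3,FileName4,,),,"
// into visibility -> file-list (hypothetical, not the patch code).
public class VisibilityGroupsSketch {
    private static final Pattern GROUP =
            Pattern.compile("\\((PUBLIC|PRIVATE|APPLICATION)=([^)]*)\\)");

    static Map<String, List<String>> parse(String spec) {
        Map<String, List<String>> byVisibility = new LinkedHashMap<>();
        Matcher m = GROUP.matcher(spec);
        while (m.find()) {
            List<String> files = new ArrayList<>();
            for (String f : m.group(2).split(",")) {
                if (!f.isEmpty()) {   // skip empties from trailing ",," separators
                    files.add(f);
                }
            }
            byVisibility.put(m.group(1), files);
        }
        return byVisibility;
    }

    public static void main(String[] args) {
        // The single-PRIVATE-file case from the comment above:
        System.out.println(parse("(PRIVATE=hdfs:/tmp/a)").get("PRIVATE")); // [hdfs:/tmp/a]
    }
}
```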
[jira] [Comment Edited] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries
[ https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772665#comment-16772665 ] Prabhu Joseph edited comment on YARN-8132 at 2/20/19 6:47 AM: -- [~bibinchundatt] The existing test case {{TestRMAppTransitions#testAppNewKill}} covers the scenario. The {{currentAttempt}} is not created (Null) and the {{RMAppImpl}} StateMachine currentState is transitioned properly to KILLED. The issue happens only when the job is killed after attempt is created as the attempt's {{finalStatus}} is not updated. was (Author: prabhu joseph): [~bibinchundatt] The existing test case TestRMAppTransitions#testAppNewKill covers the scenario. The currentAttempt is not created (Null) and the StateMachine currentState is transitioned properly to KILLED. The issue happens only when the job is killed after attempt is created as the attempt finalStatus is not updated. > Final Status of applications shown as UNDEFINED in ATS app queries > -- > > Key: YARN-8132 > URL: https://issues.apache.org/jira/browse/YARN-8132 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2, timelineservice >Reporter: Charan Hebri >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-8132-001.patch, YARN-8132-002.patch, > YARN-8132-003.patch, YARN-8132-004.patch > > > Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. 
A > sample request/response with INFO field for an application, > {noformat} > 2018-04-09 13:10:02,126 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:getApp(1693)) - Received URL > /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user > hrt_qa > 2018-04-09 13:10:02,156 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:getApp(1716)) - Processed URL > /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 > ms.){noformat} > {noformat} > { > "metrics": [], > "events": [], > "createdtime": 1523263360719, > "idprefix": 0, > "id": "application_1523259757659_0003", > "type": "YARN_APPLICATION", > "info": { > "YARN_APPLICATION_CALLER_CONTEXT": "CLI", > "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application > application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX", > "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED", > "YARN_APPLICATION_NAME": "Sleep job", > "YARN_APPLICATION_USER": "hrt_qa", > "YARN_APPLICATION_UNMANAGED_APPLICATION": false, > "FROM_ID": > "yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003", > "UID": "yarn-cluster!application_1523259757659_0003", > "YARN_APPLICATION_VIEW_ACLS": " ", > "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718, > "YARN_AM_CONTAINER_LAUNCH_COMMAND": [ > "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp > -Dlog4j.configuration=container-log4j.properties > -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 > -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog > -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 > org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout > 2>/stderr " > ], > "YARN_APPLICATION_QUEUE": "default", > "YARN_APPLICATION_TYPE": "MAPREDUCE", > "YARN_APPLICATION_PRIORITY": 0, > "YARN_APPLICATION_LATEST_APP_ATTEMPT": > "appattempt_1523259757659_0003_01", > "YARN_APPLICATION_TAGS": [ > "timeline_flow_name_tag:test_flow" > ], > "YARN_APPLICATION_STATE": "KILLED" > }, 
> "configs": {}, > "isrelatedto": {}, > "relatesto": {} > }{noformat} > This is different to what the Resource Manager reports. For KILLED > applications the final status is KILLED and for FAILED applications it is > FAILED. This behavior is seen in ATSv2 as well as older versions of ATS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
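The YARN-8132 scenario above (kill after an attempt exists, so the attempt's finalStatus is never updated) can be sketched as a fallback resolution. These are hypothetical names, not the `RMAppImpl`/attempt code: when the current attempt still reports UNDEFINED, derive the final status from the app's terminal state instead of publishing UNDEFINED to ATS.

```java
// Sketch of a final-status fallback (hypothetical, not RMAppImpl code).
public class FinalStatusSketch {
    enum AppState { KILLED, FAILED, FINISHED, RUNNING }
    enum FinalStatus { UNDEFINED, KILLED, FAILED, SUCCEEDED }

    static FinalStatus resolve(FinalStatus attemptStatus, AppState appState) {
        if (attemptStatus != FinalStatus.UNDEFINED) {
            return attemptStatus;  // attempt recorded its own final status
        }
        // Attempt never updated finalStatus: fall back to the app state.
        switch (appState) {
            case KILLED: return FinalStatus.KILLED;
            case FAILED: return FinalStatus.FAILED;
            default:     return FinalStatus.UNDEFINED;
        }
    }

    public static void main(String[] args) {
        // App killed after the attempt was created, before the attempt
        // updated its final status -> report KILLED, not UNDEFINED.
        System.out.println(resolve(FinalStatus.UNDEFINED, AppState.KILLED)); // KILLED
    }
}
```

This matches the RM's own reporting (KILLED for killed apps, FAILED for failed apps), which is what the issue says ATS should agree with.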
[jira] [Updated] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries
[ https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-8132: Attachment: YARN-8132-004.patch > Final Status of applications shown as UNDEFINED in ATS app queries > -- > > Key: YARN-8132 > URL: https://issues.apache.org/jira/browse/YARN-8132 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2, timelineservice >Reporter: Charan Hebri >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-8132-001.patch, YARN-8132-002.patch, > YARN-8132-003.patch, YARN-8132-004.patch > > > Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. A > sample request/response with INFO field for an application, > {noformat} > 2018-04-09 13:10:02,126 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:getApp(1693)) - Received URL > /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user > hrt_qa > 2018-04-09 13:10:02,156 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:getApp(1716)) - Processed URL > /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 > ms.){noformat} > {noformat} > { > "metrics": [], > "events": [], > "createdtime": 1523263360719, > "idprefix": 0, > "id": "application_1523259757659_0003", > "type": "YARN_APPLICATION", > "info": { > "YARN_APPLICATION_CALLER_CONTEXT": "CLI", > "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application > application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX", > "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED", > "YARN_APPLICATION_NAME": "Sleep job", > "YARN_APPLICATION_USER": "hrt_qa", > "YARN_APPLICATION_UNMANAGED_APPLICATION": false, > "FROM_ID": > "yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003", > "UID": "yarn-cluster!application_1523259757659_0003", > "YARN_APPLICATION_VIEW_ACLS": " ", > "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718, > "YARN_AM_CONTAINER_LAUNCH_COMMAND": [ > 
"$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp > -Dlog4j.configuration=container-log4j.properties > -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 > -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog > -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 > org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout > 2>/stderr " > ], > "YARN_APPLICATION_QUEUE": "default", > "YARN_APPLICATION_TYPE": "MAPREDUCE", > "YARN_APPLICATION_PRIORITY": 0, > "YARN_APPLICATION_LATEST_APP_ATTEMPT": > "appattempt_1523259757659_0003_01", > "YARN_APPLICATION_TAGS": [ > "timeline_flow_name_tag:test_flow" > ], > "YARN_APPLICATION_STATE": "KILLED" > }, > "configs": {}, > "isrelatedto": {}, > "relatesto": {} > }{noformat} > This is different to what the Resource Manager reports. For KILLED > applications the final status is KILLED and for FAILED applications it is > FAILED. This behavior is seen in ATSv2 as well as older versions of ATS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772672#comment-16772672 ]

Hadoop QA commented on YARN-8821:
---------------------------------

(x) -1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 20s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 5 new or modified test files. |
|| trunk Compile Tests ||
| +1 | mvninstall | 17m 30s | trunk passed |
| +1 | compile | 1m 2s | trunk passed |
| +1 | checkstyle | 0m 28s | trunk passed |
| +1 | mvnsite | 0m 40s | trunk passed |
| +1 | shadedclient | 12m 34s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 58s | trunk passed |
| +1 | javadoc | 0m 25s | trunk passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 34s | the patch passed |
| +1 | compile | 0m 59s | the patch passed |
| +1 | javac | 0m 59s | the patch passed |
| +1 | checkstyle | 0m 22s | the patch passed |
| +1 | mvnsite | 0m 36s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 12m 43s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 3s | the patch passed |
| +1 | javadoc | 0m 24s | the patch passed |
|| Other Tests ||
| -1 | unit | 20m 36s | hadoop-yarn-server-nodemanager in the patch failed. |
| +1 | asflicense | 0m 25s | The patch does not generate ASF License warnings. |
| | | 71m 42s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-8821 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12959367/YARN-8821-trunk.010.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 29f8dd684c95 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 1d30fd9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/23448/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23448/testReport/ |
| Max. process+thread count | 308 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U:
[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengyongshe updated YARN-9314: -- Attachment: (was: 屏幕快照 2019-02-20 下午2.24.26.png)

> Fair Scheduler: Queue Info mistake when configured same queue name at same level
> --------------------------------------------------------------------------------
>
> Key: YARN-9314
> URL: https://issues.apache.org/jira/browse/YARN-9314
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: fengyongshe
> Priority: Major
> Fix For: 3.1.2
>
> Attachments: Fair Scheduler Mistake when configured same queue at same level.png
>
>
> The queue info is configured in fair-scheduler.xml as below:
> {code:xml}
> <queue name="deva">
>   <minResources>3072mb,3vcores</minResources>
>   <maxResources>4096mb,4vcores</maxResources>
>   <queue name="sample">
>     <minResources>1024mb,1vcores</minResources>
>     <maxResources>2048mb,2vcores</maxResources>
>     <aclSubmitApps>Charlie</aclSubmitApps>
>   </queue>
> </queue>
> <queue name="deva">
>   <minResources>1024mb,1vcores</minResources>
>   <maxResources>2048mb,2vcores</maxResources>
> </queue>
> {code}
> The queue root.deva configured last overrides the existing root.deva that contains root.deva.sample, as shown in the attachment:
>
> root.deva
> ||Used Resources:| |
> ||Min Resources:| => should be <3072mb,3vcores>|
> ||Max Resources:| => should be <4096mb,4vcores>|
> ||Reserved Resources:| |
> ||Steady Fair Share:| |
> ||Instantaneous Fair Share:| |
>
> root.deva.sample
> ||Min Resources:| |
> ||Max Resources:| |
> ||Reserved Resources:| |
> ||Steady Fair Share:| |

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
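The override behavior reported above is what you get whenever sibling queues are stored in a plain map keyed by their full queue name: the last same-named definition wins. The following is a minimal illustrative sketch; the class and method names are hypothetical and this is not the actual FairScheduler configuration-loader code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch (not the actual FairScheduler loader): if sibling queues
// are keyed by their full name in a plain map while parsing fair-scheduler.xml,
// a second <queue name="deva"> at the same level silently replaces the first,
// which matches the override described in the issue.
public class DuplicateQueueDemo {

    static Map<String, String> loadMinResources() {
        Map<String, String> minResources = new LinkedHashMap<>();
        // First root.deva definition and its child queue
        minResources.put("root.deva", "3072mb,3vcores");
        minResources.put("root.deva.sample", "1024mb,1vcores");
        // Second root.deva at the same level: last write wins
        minResources.put("root.deva", "1024mb,1vcores");
        return minResources;
    }

    public static void main(String[] args) {
        // The first definition's minResources (3072mb,3vcores) is lost.
        System.out.println(loadMinResources().get("root.deva")); // prints 1024mb,1vcores
    }
}
```

This matches the UI symptom in the attachment: root.deva shows the second definition's resources while the first definition's values are gone.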
[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengyongshe updated YARN-9314: -- Affects Version/s: 3.1.0
[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengyongshe updated YARN-9314: -- Description: (updated)
[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengyongshe updated YARN-9314: -- Description: (updated)
[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengyongshe updated YARN-9314: -- Attachment: (was: 屏幕快照 2019-02-20 下午2.24.26.png)
[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengyongshe updated YARN-9314: -- Description: (updated)
[jira] [Commented] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries
[ https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772665#comment-16772665 ] Prabhu Joseph commented on YARN-8132: - [~bibinchundatt] The existing test case TestRMAppTransitions#testAppNewKill covers the scenario: the currentAttempt is not created (null) and the StateMachine's currentState transitions properly to KILLED. The issue happens only when the job is killed after an attempt is created, because the attempt's finalStatus is not updated.

> Final Status of applications shown as UNDEFINED in ATS app queries
> ------------------------------------------------------------------
>
> Key: YARN-8132
> URL: https://issues.apache.org/jira/browse/YARN-8132
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: ATSv2, timelineservice
> Reporter: Charan Hebri
> Assignee: Prabhu Joseph
> Priority: Major
> Attachments: YARN-8132-001.patch, YARN-8132-002.patch, YARN-8132-003.patch
>
>
> Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. A sample request/response with the INFO field for an application:
> {noformat}
> 2018-04-09 13:10:02,126 INFO reader.TimelineReaderWebServices (TimelineReaderWebServices.java:getApp(1693)) - Received URL /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user hrt_qa
> 2018-04-09 13:10:02,156 INFO reader.TimelineReaderWebServices (TimelineReaderWebServices.java:getApp(1716)) - Processed URL /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 ms.){noformat}
> {noformat}
> {
> "metrics": [],
> "events": [],
> "createdtime": 1523263360719,
> "idprefix": 0,
> "id": "application_1523259757659_0003",
> "type": "YARN_APPLICATION",
> "info": {
> "YARN_APPLICATION_CALLER_CONTEXT": "CLI",
> "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX",
> "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED",
> "YARN_APPLICATION_NAME": "Sleep job",
> "YARN_APPLICATION_USER": "hrt_qa",
> "YARN_APPLICATION_UNMANAGED_APPLICATION": false,
> "FROM_ID": "yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003",
> "UID": "yarn-cluster!application_1523259757659_0003",
> "YARN_APPLICATION_VIEW_ACLS": " ",
> "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718,
> "YARN_AM_CONTAINER_LAUNCH_COMMAND": [
> "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout 2>/stderr "
> ],
> "YARN_APPLICATION_QUEUE": "default",
> "YARN_APPLICATION_TYPE": "MAPREDUCE",
> "YARN_APPLICATION_PRIORITY": 0,
> "YARN_APPLICATION_LATEST_APP_ATTEMPT": "appattempt_1523259757659_0003_01",
> "YARN_APPLICATION_TAGS": [
> "timeline_flow_name_tag:test_flow"
> ],
> "YARN_APPLICATION_STATE": "KILLED"
> },
> "configs": {},
> "isrelatedto": {},
> "relatesto": {}
> }{noformat}
> This differs from what the Resource Manager reports. For KILLED applications the final status is KILLED, and for FAILED applications it is FAILED. This behavior is seen in ATSv2 as well as older versions of ATS.
[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengyongshe updated YARN-9314: -- Attachment: Fair Scheduler Mistake when configured same queue at same level.png
[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengyongshe updated YARN-9314: -- Fix Version/s: (was: 3.1.2)
[jira] [Updated] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
[ https://issues.apache.org/jira/browse/YARN-9314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengyongshe updated YARN-9314: -- Attachment: 屏幕快照 2019-02-20 下午2.24.26.png
[jira] [Created] (YARN-9314) Fair Scheduler: Queue Info mistake when configured same queue name at same level
fengyongshe created YARN-9314: - Summary: Fair Scheduler: Queue Info mistake when configured same queue name at same level Key: YARN-9314 URL: https://issues.apache.org/jira/browse/YARN-9314 Project: Hadoop YARN Issue Type: Bug Reporter: fengyongshe Fix For: 3.1.2
[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772640#comment-16772640 ] Zhankun Tang commented on YARN-8821: [~cheersyang] , Thanks for the review! {quote}1. {{NvidiaGPUPluginForRuntimeV2#topologyAwareSchedule}} IIRC, line 396 and 402, they sort all combinations for a given count of devices every time. Why not just maintain an ordered list for these combinations in the map, so it only needs to sort once (when the cost table is initiated). {quote} Zhankun=> Good point! Yeah. I changed the value of the costTable to a list of map entries, and when constructing the costTable the list is sorted by cost value in ascending order. When doing topology scheduling, we use an iterator of the list: with the PACK policy we just loop with the iterator; with the SPREAD policy we switch to a descending iterator. 2, 3, 4 and 5 are fixed.

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework
> -----------------------------------------------------------------------------------------
>
> Key: YARN-8821
> URL: https://issues.apache.org/jira/browse/YARN-8821
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Major
> Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch, YARN-8821-trunk.010.patch
>
>
> h2. Background
> GPU topology affects performance. There's been a discussion in YARN-7481, but we'd like to move related discussions here. Please note that YARN-8851 will provide a pluggable device framework which supports plugging in a custom scheduler. Based on that framework, the GPU plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements a topology algorithm as below:
> *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" to build a hash map whose keys are all pairs of GPUs and whose values are the communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 - 2"=>4, ...}, which means the minimum cost from GPU 0 to GPU 1 is 2. The cost is set based on the connection type.
> *Step 2*. It then constructs a _+cost table+_ which caches all combinations of GPUs and the corresponding cost between them. The cost table is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}
> {code}
> The key of the map is the count of GPUs; the value is a map whose key is a combination of GPUs and whose value is the calculated communication cost of that combination. The cost of a combination is the sum of the costs of all non-duplicate pairs of its GPUs. For instance, the total cost of GPUs [0,1,2] is the sum of the costs "0 - 1", "0 - 2" and "1 - 2", each taken from the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on topology, we provide two policies which a container can set through the environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or "SPREAD". "PACK" prefers faster GPU-GPU communication; "SPREAD" prefers faster CPU-GPU communication (since the GPUs then don't share the same bus to the CPU). The key difference between the two policies is the sort order of the inner map in the cost table. For instance, assume 2 GPUs are wanted. costTable.get(2) returns a map containing all combinations of two GPUs and their costs. If the policy is "PACK", we sort the map by cost in ascending order; the first entry is the combination with the minimum GPU-GPU cost. If the policy is "SPREAD", we sort in descending order and take the first entry, which has the highest GPU-GPU cost and hence the lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK policy), based on performance tests on an AWS EC2 instance with 8 GPU cards (P3), has been done. The figure below shows the performance gain of the topology scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best combination of GPUs can get a *5% to 185%* *performance gain* among the test cases, with various factors including CNN model, batch size, GPU subset, etc. The scheduling algorithm should be close to this fact.
> 2. The "inception3" and "resnet50" networks seem not topology sensitive. The topology scheduling can only potentially get *about 6.8%
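The cost-table construction and PACK/SPREAD selection described in Steps 1-3 above can be sketched as follows. This is a simplified illustration with hypothetical names, not the actual NvidiaGPUPluginForRuntimeV2 implementation; the pairwise costs are hard-coded stand-ins for parsed "nvidia-smi topo -m" output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the cost-table idea from Steps 1-3 above.
// Names are hypothetical; this is not the actual plugin code.
public class TopoSketch {

    // Pairwise cost between two GPUs, keyed like "0-1" (built in Step 1).
    static int pairCost(Map<String, Integer> pairs, int a, int b) {
        return pairs.get(Math.min(a, b) + "-" + Math.max(a, b));
    }

    // Cost of a combination = sum over all non-duplicate GPU pairs (Step 2).
    static int comboCost(Map<String, Integer> pairs, int[] gpus) {
        int cost = 0;
        for (int i = 0; i < gpus.length; i++) {
            for (int j = i + 1; j < gpus.length; j++) {
                cost += pairCost(pairs, gpus[i], gpus[j]);
            }
        }
        return cost;
    }

    // Step 3: sort combinations ascending by cost; PACK takes the head
    // (cheapest GPU-GPU communication), SPREAD takes the tail (most
    // expensive GPU-GPU cost, i.e. GPUs spread across buses).
    static int[] pick(List<int[]> combos, Map<String, Integer> pairs, boolean pack) {
        combos.sort(Comparator.comparingInt((int[] c) -> comboCost(pairs, c)));
        return pack ? combos.get(0) : combos.get(combos.size() - 1);
    }

    public static void main(String[] args) {
        Map<String, Integer> pairs = new HashMap<>();
        pairs.put("0-1", 2);
        pairs.put("0-2", 4);
        pairs.put("1-2", 4);
        // Matches the cost table above: cost([0,1,2]) = 2 + 4 + 4 = 10
        System.out.println(comboCost(pairs, new int[]{0, 1, 2})); // prints 10
        List<int[]> combos = new ArrayList<>(Arrays.asList(
                new int[]{0, 1}, new int[]{0, 2}, new int[]{1, 2}));
        System.out.println(Arrays.toString(pick(combos, pairs, true))); // prints [0, 1]
    }
}
```

Keeping the combinations in a list sorted once, as the comment describes, means PACK reads from the head and SPREAD from a descending iterator without re-sorting per allocation.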
[jira] [Comment Edited] (YARN-9278) Shuffle nodes when selecting to be preempted nodes
[ https://issues.apache.org/jira/browse/YARN-9278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772638#comment-16772638 ] Zhaohui Xin edited comment on YARN-9278 at 2/20/19 5:32 AM: Hi, [~yufeigu]. When the preemption thread satisfies a starved container whose resource name is ANY, it searches all nodes of the cluster for the best node. This is costly when the cluster has more than 10k nodes, so I think we should limit the number of nodes considered in that situation. What do you think? :D

> Shuffle nodes when selecting to be preempted nodes
> --------------------------------------------------
>
> Key: YARN-9278
> URL: https://issues.apache.org/jira/browse/YARN-9278
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: fairscheduler
> Reporter: Zhaohui Xin
> Assignee: Zhaohui Xin
> Priority: Major
>
> We should *shuffle* the nodes to avoid some nodes being preempted frequently. Also, we should *limit* the number of nodes to make preemption more efficient. Just like this:
> {code:java}
> // we should not iterate all nodes, that would be very slow
> long maxTryNodeNum =
>     context.getPreemptionConfig().getToBePreemptedNodeMaxNumOnce();
> if (potentialNodes.size() > maxTryNodeNum) {
>   Collections.shuffle(potentialNodes);
>   List<FSSchedulerNode> newPotentialNodes = new ArrayList<>();
>   for (int i = 0; i < maxTryNodeNum; i++) {
>     newPotentialNodes.add(potentialNodes.get(i));
>   }
>   potentialNodes = newPotentialNodes;
> }
> {code}
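The shuffle-and-cap idea from the quoted snippet can be written as a small self-contained helper. The class name, method name, and maxTryNodeNum parameter here are illustrative only, not YARN API; this just demonstrates the bounded-random-subset technique the issue proposes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Cleaned-up sketch of the shuffle-and-cap idea from the snippet above.
// Names are illustrative, not YARN API.
public class PreemptionNodeLimiter {

    static <T> List<T> limitCandidates(List<T> potentialNodes, int maxTryNodeNum) {
        if (potentialNodes.size() <= maxTryNodeNum) {
            return potentialNodes;
        }
        // Copy so the caller's list is not mutated, then shuffle so the same
        // nodes are not preempted from every time.
        List<T> shuffled = new ArrayList<>(potentialNodes);
        Collections.shuffle(shuffled);
        // Keep only a bounded random subset so preemption stays cheap on 10k+ node clusters.
        return new ArrayList<>(shuffled.subList(0, maxTryNodeNum));
    }

    public static void main(String[] args) {
        List<Integer> nodes = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            nodes.add(i);
        }
        System.out.println(limitCandidates(nodes, 3).size()); // prints 3
    }
}
```

Shuffling before truncating gives each node an equal chance of being considered, which addresses both concerns in the issue: fairness across nodes and a bounded search cost.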
[jira] [Updated] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-8821: --- Attachment: YARN-8821-trunk.010.patch > [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable > device framework > - > > Key: YARN-8821 > URL: https://issues.apache.org/jira/browse/YARN-8821 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, > YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, > YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, > YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, > YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch, > YARN-8821-trunk.010.patch > > > h2. Background > GPU topology affects performance. There has been a discussion in YARN-7481, but > we'd like to move related discussions here. > Please note that YARN-8851 will provide a pluggable device framework > which supports custom plugin schedulers. Based on that framework, the GPU plugin > can have its own topology scheduler. > h2. Details of the proposed scheduling algorithm > The proposed patch implements a topology algorithm as follows: > *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" > to build a hash map whose keys are pairs of GPUs and whose values are the > communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 - > 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The cost is set > based on the connection type. > *Step 2*. It then constructs a _+cost table+_ which caches all > combinations of GPUs and the corresponding cost between them. The > cost table is a map whose structure is like > {code:java} > { 2=>{[0,1]=>2,..}, > 3=>{[0,1,2]=>10,..}, > 4=>{[0,1,2,3]=>18}}. 
> {code} > The key of the outer map is the count of GPUs; its value is a map whose key > is a combination of GPUs and whose value is the calculated communication cost > of that combination. The cost calculation algorithm sums the costs of all > non-duplicate GPU pairs. For instance, the total cost of GPUs [0,1,2] > is the sum of the costs "0 - 1", "0 - 2" and "1 - 2", and each pair's cost comes > from the map built in Step 1. > *Step 3*. After the cost table is built, when allocating GPUs based on > topology, we provide two policies which a container can set through the > environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or > "SPREAD". "PACK" prefers faster GPU-GPU communication; > "SPREAD" prefers faster CPU-GPU communication (since the GPUs then do not share > the same bus to the CPU). The key difference between the two policies is the > sort order of the inner map in the cost table. For instance, assume 2 > GPUs are wanted. costTable.get(2) returns a map containing all > combinations of two GPUs and their costs. If the policy is "PACK", we sort > the map by cost in ascending order; the first entry is the pair of GPUs with the > minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in descending > order and take the first entry, which has the highest GPU-GPU cost and thus the > lowest CPU-GPU cost. > h2. Estimation of the algorithm > An initial analysis of the topology scheduling algorithm (using the PACK policy), > based on performance tests on an AWS EC2 instance with 8 GPU cards (P3), has been done. > The figure below shows the performance gain of the topology scheduling > algorithm's allocation (PACK policy). > !GPUTopologyPerformance.png! > Some of the conclusions are: > 1. The topology between GPUs impacts performance dramatically. The best > GPU combinations can get a *5% to 185%* *performance gain* among the test cases > with various factors including CNN model, batch size, GPU subset, etc. 
The > scheduling algorithm should aim to match these best combinations. > 2. The "inception3" and "resnet50" networks do not seem topology-sensitive. The > topology scheduling can only potentially get *about 6.8% to 10%* speedup in the > best cases. > 3. Our current version of the topology scheduling algorithm can achieve a *6.8% to > 177.1% performance gain* in the best cases. On average, it also outperforms the > median performance (*0.8% to 28.2%*). > *4. And the algorithm's allocations match the fastest GPUs needed by "vgg16" > best*. > > In summary, the GPU topology scheduling algorithm is effective and can > potentially get a 6.8% to 185% performance gain in the best cases and 1% to 30% > on average. > *This is a maximum of about 3X compared to a random GPU scheduling algorithm in > a specific scenario*. > > The spreadsheets are here for your reference. > >
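[Editor's note] The cost-table lookup described in Steps 2 and 3 above can be sketched roughly as follows. The pairwise costs, class name, and "0-1"-style keys are hypothetical stand-ins; the real plugin parses the costs from "nvidia-smi topo -m" and uses its own data structures.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of cost-table scoring and PACK/SPREAD selection. */
public class GpuTopologySketch {

  /** Sum the pairwise costs of every non-duplicate GPU pair in the combination. */
  static int combinationCost(List<Integer> gpus, Map<String, Integer> pairCost) {
    int total = 0;
    for (int i = 0; i < gpus.size(); i++) {
      for (int j = i + 1; j < gpus.size(); j++) {
        total += pairCost.get(gpus.get(i) + "-" + gpus.get(j));
      }
    }
    return total;
  }

  /** Pick the combination with minimum (PACK) or maximum (SPREAD) total cost. */
  static List<Integer> select(List<List<Integer>> combinations,
                              Map<String, Integer> pairCost, boolean pack) {
    Comparator<List<Integer>> byCost =
        Comparator.comparingInt(c -> combinationCost(c, pairCost));
    return pack ? Collections.min(combinations, byCost)
                : Collections.max(combinations, byCost);
  }

  public static void main(String[] args) {
    // Hypothetical pairwise costs, as if parsed from "nvidia-smi topo -m".
    Map<String, Integer> pairCost = new HashMap<>();
    pairCost.put("0-1", 2);
    pairCost.put("0-2", 4);
    pairCost.put("1-2", 4);
    List<List<Integer>> twoGpuCombos = Arrays.asList(
        Arrays.asList(0, 1), Arrays.asList(0, 2), Arrays.asList(1, 2));
    // PACK prefers the cheapest GPU-GPU pair: [0, 1] with cost 2.
    System.out.println(select(twoGpuCombos, pairCost, true));
  }
}
```

With the example costs from the description, combinationCost([0,1,2]) sums "0-1", "0-2" and "1-2" to 10, matching the `3=>{[0,1,2]=>10,..}` entry of the cost table; PACK and SPREAD differ only in the sort direction over these totals.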
[jira] [Commented] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries
[ https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772635#comment-16772635 ] Prabhu Joseph commented on YARN-8132: - [~bibinchundatt] Yes, working on it, will update. > Final Status of applications shown as UNDEFINED in ATS app queries > -- > > Key: YARN-8132 > URL: https://issues.apache.org/jira/browse/YARN-8132 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2, timelineservice >Reporter: Charan Hebri >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-8132-001.patch, YARN-8132-002.patch, > YARN-8132-003.patch > > > Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. A > sample request/response with INFO field for an application, > {noformat} > 2018-04-09 13:10:02,126 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:getApp(1693)) - Received URL > /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user > hrt_qa > 2018-04-09 13:10:02,156 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:getApp(1716)) - Processed URL > /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 > ms.){noformat} > {noformat} > { > "metrics": [], > "events": [], > "createdtime": 1523263360719, > "idprefix": 0, > "id": "application_1523259757659_0003", > "type": "YARN_APPLICATION", > "info": { > "YARN_APPLICATION_CALLER_CONTEXT": "CLI", > "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application > application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX", > "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED", > "YARN_APPLICATION_NAME": "Sleep job", > "YARN_APPLICATION_USER": "hrt_qa", > "YARN_APPLICATION_UNMANAGED_APPLICATION": false, > "FROM_ID": > "yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003", > "UID": "yarn-cluster!application_1523259757659_0003", > "YARN_APPLICATION_VIEW_ACLS": " ", > "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718, > 
"YARN_AM_CONTAINER_LAUNCH_COMMAND": [ > "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp > -Dlog4j.configuration=container-log4j.properties > -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 > -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog > -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 > org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout > 2>/stderr " > ], > "YARN_APPLICATION_QUEUE": "default", > "YARN_APPLICATION_TYPE": "MAPREDUCE", > "YARN_APPLICATION_PRIORITY": 0, > "YARN_APPLICATION_LATEST_APP_ATTEMPT": > "appattempt_1523259757659_0003_01", > "YARN_APPLICATION_TAGS": [ > "timeline_flow_name_tag:test_flow" > ], > "YARN_APPLICATION_STATE": "KILLED" > }, > "configs": {}, > "isrelatedto": {}, > "relatesto": {} > }{noformat} > This is different to what the Resource Manager reports. For KILLED > applications the final status is KILLED and for FAILED applications it is > FAILED. This behavior is seen in ATSv2 as well as older versions of ATS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9313: --- Attachment: (was: YARN-9313.001.patch) > Support asynchronized scheduling mode and multi-node lookup mechanism for > scheduler activities > -- > > Key: YARN-9313 > URL: https://issues.apache.org/jira/browse/YARN-9313 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9313.001.patch > > > [Design > doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9313: --- Attachment: YARN-9313.001.patch > Support asynchronized scheduling mode and multi-node lookup mechanism for > scheduler activities > -- > > Key: YARN-9313 > URL: https://issues.apache.org/jira/browse/YARN-9313 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9313.001.patch > > > [Design > doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7129) Application Catalog for YARN applications
[ https://issues.apache.org/jira/browse/YARN-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772497#comment-16772497 ] Eric Yang commented on YARN-7129: - I filed the 160 shelldocs false-positive tests as YETUS-798 for future Yetus improvement. > Application Catalog for YARN applications > - > > Key: YARN-7129 > URL: https://issues.apache.org/jira/browse/YARN-7129 > Project: Hadoop YARN > Issue Type: New Feature > Components: applications >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN Appstore.pdf, YARN-7129.001.patch, > YARN-7129.002.patch, YARN-7129.003.patch, YARN-7129.004.patch, > YARN-7129.005.patch, YARN-7129.006.patch, YARN-7129.007.patch, > YARN-7129.008.patch, YARN-7129.009.patch, YARN-7129.010.patch, > YARN-7129.011.patch, YARN-7129.012.patch, YARN-7129.013.patch, > YARN-7129.014.patch, YARN-7129.015.patch, YARN-7129.016.patch, > YARN-7129.017.patch, YARN-7129.018.patch, YARN-7129.019.patch, > YARN-7129.020.patch, YARN-7129.021.patch, YARN-7129.022.patch, > YARN-7129.023.patch, YARN-7129.024.patch > > > YARN native services provides a web services API to improve the usability of > application deployment on Hadoop using a collection of docker images. It would > be nice to have an application catalog system which provides an editorial and > search interface for YARN applications. This improves the usability of YARN for > managing the life cycle of applications. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-999) In case of long running tasks, reduce node resource should balloon out resource quickly by calling preemption API and suspending running task.
[ https://issues.apache.org/jira/browse/YARN-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772399#comment-16772399 ] Hadoop QA commented on YARN-999: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 16s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 16m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 21s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 15s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 12s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 28s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 10 new + 336 unchanged - 10 fixed = 346 total (was 346) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 38s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 8s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 47s{color} | {color:green} hadoop-yarn-api in the patch passed. 
{color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red}101m 6s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 48s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}179m 23s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.TestCapacitySchedulerMetrics | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-999 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12959318/YARN-999.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 6f1ec3c47cee 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 02d04bd | | maven | version: Apache Maven 3.3.9 | |
[jira] [Commented] (YARN-2489) ResouceOption's overcommitTimeout should be respected during resource update on NM
[ https://issues.apache.org/jira/browse/YARN-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772289#comment-16772289 ] Íñigo Goiri commented on YARN-2489: --- I added a patch to YARN-999 which pretty much covers the description of this JIRA and the actual killing when we overcommit. What we do there is change the resources of the NM and then, after the overcommit timeout expires, we kill. Another option would be to have in this JIRA the mechanism to wait X seconds to change the resources and have YARN-999 just kill when we go negative. I think the current approach in YARN-999 covers the functionality better, as it would allow reducing the size of the NM and waiting forever until containers are drained while showing the change in resources. > ResouceOption's overcommitTimeout should be respected during resource update > on NM > -- > > Key: YARN-2489 > URL: https://issues.apache.org/jira/browse/YARN-2489 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, scheduler >Reporter: Junping Du >Priority: Major > > The ResourceOption to update an NM's resource has two properties: Resource and > OvercommitTimeout. The latter property is used to guarantee that resource is > withdrawn after the timeout is hit, if resource is reduced to a value and current > resource consumption exceeds the new value. It currently uses the default value -1, > which means no timeout, and we should make this property work when updating > NM resource. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9286) [Timeline Server] Sorting based on FinalStatus shows pop-up message
[ https://issues.apache.org/jira/browse/YARN-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772287#comment-16772287 ] Hudson commented on YARN-9286: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15998 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15998/]) YARN-9286. [Timeline Server] Sorting based on FinalStatus shows pop-up (bibinchundatt: rev b8de78c570babe4f802d951957c495ea0a4b07da) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/WebPageUtils.java > [Timeline Server] Sorting based on FinalStatus shows pop-up message > --- > > Key: YARN-9286 > URL: https://issues.apache.org/jira/browse/YARN-9286 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Nallasivan >Assignee: Bilwa S T >Priority: Minor > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9286-001.patch, YARN-9286-002.patch, > image-2019-02-15-18-16-21-804.png > > > In the Timeline Server GUI, if we try to sort the details based on FinalStatus, a > popup window is displayed. Further, any operation that involves refreshing > the page results in the same popup window being displayed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8132) Final Status of applications shown as UNDEFINED in ATS app queries
[ https://issues.apache.org/jira/browse/YARN-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772284#comment-16772284 ] Bibin A Chundatt commented on YARN-8132: Thank you [~Prabhu Joseph] for the patch. The latest patch fixes the issue when the attempt is available and the application is killed. Could you add a test case to verify the FINAL status in the TIMELINE when the application is KILLED before the attempt is created? > Final Status of applications shown as UNDEFINED in ATS app queries > -- > > Key: YARN-8132 > URL: https://issues.apache.org/jira/browse/YARN-8132 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2, timelineservice >Reporter: Charan Hebri >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-8132-001.patch, YARN-8132-002.patch, > YARN-8132-003.patch > > > Final Status is shown as UNDEFINED for applications that are KILLED/FAILED. A > sample request/response with INFO field for an application, > {noformat} > 2018-04-09 13:10:02,126 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:getApp(1693)) - Received URL > /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO from user > hrt_qa > 2018-04-09 13:10:02,156 INFO reader.TimelineReaderWebServices > (TimelineReaderWebServices.java:getApp(1716)) - Processed URL > /ws/v2/timeline/apps/application_1523259757659_0003?fields=INFO (Took 30 > ms.){noformat} > {noformat} > { > "metrics": [], > "events": [], > "createdtime": 1523263360719, > "idprefix": 0, > "id": "application_1523259757659_0003", > "type": "YARN_APPLICATION", > "info": { > "YARN_APPLICATION_CALLER_CONTEXT": "CLI", > "YARN_APPLICATION_DIAGNOSTICS_INFO": "Application > application_1523259757659_0003 was killed by user xxx_xx at XXX.XXX.XXX.XXX", > "YARN_APPLICATION_FINAL_STATUS": "UNDEFINED", > "YARN_APPLICATION_NAME": "Sleep job", > "YARN_APPLICATION_USER": "hrt_qa", > "YARN_APPLICATION_UNMANAGED_APPLICATION": false, > "FROM_ID": > 
"yarn-cluster!hrt_qa!test_flow!1523263360719!application_1523259757659_0003", > "UID": "yarn-cluster!application_1523259757659_0003", > "YARN_APPLICATION_VIEW_ACLS": " ", > "YARN_APPLICATION_SUBMITTED_TIME": 1523263360718, > "YARN_AM_CONTAINER_LAUNCH_COMMAND": [ > "$JAVA_HOME/bin/java -Djava.io.tmpdir=$PWD/tmp > -Dlog4j.configuration=container-log4j.properties > -Dyarn.app.container.log.dir= -Dyarn.app.container.log.filesize=0 > -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog > -Dhdp.version=3.0.0.0-1163 -Xmx819m -Dhdp.version=3.0.0.0-1163 > org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1>/stdout > 2>/stderr " > ], > "YARN_APPLICATION_QUEUE": "default", > "YARN_APPLICATION_TYPE": "MAPREDUCE", > "YARN_APPLICATION_PRIORITY": 0, > "YARN_APPLICATION_LATEST_APP_ATTEMPT": > "appattempt_1523259757659_0003_01", > "YARN_APPLICATION_TAGS": [ > "timeline_flow_name_tag:test_flow" > ], > "YARN_APPLICATION_STATE": "KILLED" > }, > "configs": {}, > "isrelatedto": {}, > "relatesto": {} > }{noformat} > This is different to what the Resource Manager reports. For KILLED > applications the final status is KILLED and for FAILED applications it is > FAILED. This behavior is seen in ATSv2 as well as older versions of ATS. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-999) In case of long running tasks, reduce node resource should balloon out resource quickly by calling preemption API and suspending running task.
[ https://issues.apache.org/jira/browse/YARN-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri reassigned YARN-999: Assignee: Íñigo Goiri > In case of long running tasks, reduce node resource should balloon out > resource quickly by calling preemption API and suspending running task. > --- > > Key: YARN-999 > URL: https://issues.apache.org/jira/browse/YARN-999 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, scheduler >Reporter: Junping Du >Assignee: Íñigo Goiri >Priority: Major > Attachments: YARN-291.000.patch, YARN-999.001.patch > > > In the current design and implementation, when we decrease resources on a node to > less than the resource consumption of the currently running tasks, the tasks can still > run until they finish, but no new tasks get assigned to this node > (because AvailableResource < 0) until some tasks finish and > AvailableResource > 0 again. This is good for most cases, but in the case of a long > running task, it could be too slow for the resource setting to actually take effect, so > preemption could be employed here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-999) In case of long running tasks, reduce node resource should balloon out resource quickly by calling preemption API and suspending running task.
[ https://issues.apache.org/jira/browse/YARN-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772286#comment-16772286 ] Íñigo Goiri commented on YARN-999: -- I think [^YARN-999.001.patch] is ready for review. * When the resources were changed using the Admin/REST interfaces, the NM didn't get updated. On the other hand, when we trigger it through the configuration, it does. I added {{RMNode#isUpdatedCapability()}} to handle this. * I added the logic for the preemption in {{AbstractYarnScheduler#killContainersIfOvercommitted()}}. It could be done in FS or CS but I think this is more general. Maybe we can make it overridable. * I tweaked {{TestCapacityScheduler#testResourceOverCommit()}} and at the end I added a sequence to test the feature. It could technically be split into smaller pieces. Thoughts? > In case of long running tasks, reduce node resource should balloon out > resource quickly by calling preemption API and suspending running task. > --- > > Key: YARN-999 > URL: https://issues.apache.org/jira/browse/YARN-999 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, scheduler >Reporter: Junping Du >Priority: Major > Attachments: YARN-291.000.patch, YARN-999.001.patch > > > In the current design and implementation, when we decrease resources on a node to > less than the resource consumption of the currently running tasks, the tasks can still > run until they finish, but no new tasks get assigned to this node > (because AvailableResource < 0) until some tasks finish and > AvailableResource > 0 again. This is good for most cases, but in the case of a long > running task, it could be too slow for the resource setting to actually take effect, so > preemption could be employed here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
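[Editor's note] The kill-on-overcommit logic described in the comment above can be sketched as a simplified stand-in. This is not the actual {{AbstractYarnScheduler#killContainersIfOvercommitted()}} code; the container ordering, resource model, and all names below are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch: pick containers to kill until usage fits reduced capacity. */
public class OvercommitSketch {

  /** Minimal stand-in for a running container's resource footprint. */
  static class Container {
    final String id;
    final int memoryMb;
    Container(String id, int memoryMb) {
      this.id = id;
      this.memoryMb = memoryMb;
    }
  }

  /**
   * Returns the containers to kill so that the remaining usage fits within
   * capacityMb. Containers at the end of the list are preempted first
   * (an arbitrary ordering chosen for this sketch).
   */
  static List<Container> selectContainersToKill(List<Container> running, int capacityMb) {
    int used = running.stream().mapToInt(c -> c.memoryMb).sum();
    List<Container> toKill = new ArrayList<>();
    for (int i = running.size() - 1; i >= 0 && used > capacityMb; i--) {
      toKill.add(running.get(i));
      used -= running.get(i).memoryMb;
    }
    return toKill;
  }

  public static void main(String[] args) {
    List<Container> running = new ArrayList<>();
    running.add(new Container("c1", 4096));
    running.add(new Container("c2", 2048));
    running.add(new Container("c3", 2048));
    // Node capacity reduced to 6 GB while 8 GB is in use: one container must go.
    System.out.println(selectContainersToKill(running, 6144).size()); // prints 1
  }
}
```

Pairing this selection with an overcommit timeout, as the comment describes, lets the node drain gracefully first and only kill once the timeout expires.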
[jira] [Updated] (YARN-999) In case of long running tasks, reduce node resource should balloon out resource quickly by calling preemption API and suspending running task.
[ https://issues.apache.org/jira/browse/YARN-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri updated YARN-999: - Attachment: YARN-999.001.patch > In case of long running tasks, reduce node resource should balloon out > resource quickly by calling preemption API and suspending running task. > --- > > Key: YARN-999 > URL: https://issues.apache.org/jira/browse/YARN-999 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, nodemanager, scheduler >Reporter: Junping Du >Priority: Major > Attachments: YARN-291.000.patch, YARN-999.001.patch > > > In the current design and implementation, when we decrease resources on a node to > less than the resource consumption of the currently running tasks, the tasks can still > run until they finish, but no new tasks get assigned to this node > (because AvailableResource < 0) until some tasks finish and > AvailableResource > 0 again. This is good for most cases, but in the case of a long > running task, it could be too slow for the resource setting to actually take effect, so > preemption could be employed here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9309) Improve graph text in SLS to avoid overlapping
[ https://issues.apache.org/jira/browse/YARN-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772257#comment-16772257 ] Hudson commented on YARN-9309: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15996 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15996/]) YARN-9309. Improve graph text in SLS to avoid overlapping. Contributed (bibinchundatt: rev 779dae4de7e518938d58badcef065ea457be911c) * (edit) hadoop-tools/hadoop-sls/src/main/html/simulate.html.template > Improve graph text in SLS to avoid overlapping > -- > > Key: YARN-9309 > URL: https://issues.apache.org/jira/browse/YARN-9309 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9309-001.patch, YARN-9309-002.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9286) [Timeline Server] Sorting based on FinalStatus shows pop-up message
[ https://issues.apache.org/jira/browse/YARN-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-9286: --- Summary: [Timeline Server] Sorting based on FinalStatus shows pop-up message (was: [Timeline Server] Sorting based on FinalStatus throws pop-up message) > [Timeline Server] Sorting based on FinalStatus shows pop-up message > --- > > Key: YARN-9286 > URL: https://issues.apache.org/jira/browse/YARN-9286 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Nallasivan >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-9286-001.patch, YARN-9286-002.patch, > image-2019-02-15-18-16-21-804.png > > > In the Timeline Server GUI, if we try to sort the details based on FinalStatus, a > popup window is displayed. Further, any operation that involves refreshing > the page results in the same popup window being displayed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9309) Improve graph text in SLS to avoid overlapping
[ https://issues.apache.org/jira/browse/YARN-9309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bibin A Chundatt updated YARN-9309: --- Summary: Improve graph text in SLS to avoid overlapping (was: Improvise graphs in SLS as values displayed in graph are overlapping) > Improve graph text in SLS to avoid overlapping > -- > > Key: YARN-9309 > URL: https://issues.apache.org/jira/browse/YARN-9309 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bilwa S T >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-9309-001.patch, YARN-9309-002.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9039) App ACLs are not validated when serving logs from LogWebService
[ https://issues.apache.org/jira/browse/YARN-9039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suma Shivaprasad reassigned YARN-9039: -- Assignee: (was: Suma Shivaprasad) > App ACLs are not validated when serving logs from LogWebService > --- > > Key: YARN-9039 > URL: https://issues.apache.org/jira/browse/YARN-9039 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Reporter: Suma Shivaprasad >Priority: Critical > Attachments: YARN-9039.1.patch, YARN-9039.2.patch, YARN-9039.3.patch > > > App Acls are not being validated while serving logs through REST and UI2 via > Log Webservice -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9039) App ACLs are not validated when serving logs from LogWebService
[ https://issues.apache.org/jira/browse/YARN-9039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772149#comment-16772149 ] Suma Shivaprasad commented on YARN-9039: [~bibinchundatt] [~baktha] Apologies for the delayed response. I have not had a chance to look into this further after the previous discussions. Please feel free to pick this up if you are interested. Thanks. > App ACLs are not validated when serving logs from LogWebService > --- > > Key: YARN-9039 > URL: https://issues.apache.org/jira/browse/YARN-9039 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Critical > Attachments: YARN-9039.1.patch, YARN-9039.2.patch, YARN-9039.3.patch > > > App ACLs are not being validated while serving logs through REST and UI2 via > Log Webservice
[jira] [Commented] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card
[ https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16772075#comment-16772075 ] Hadoop QA commented on YARN-9265: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 25s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 33s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 19s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 43s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 53s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 9m 13s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 9m 13s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 2m 15s{color} | {color:green} hadoop-yarn-project/hadoop-yarn: The patch generated 0 new + 260 unchanged - 10 fixed = 260 total (was 270) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 2s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 20s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 50s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 48s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 42s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 21m 33s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 58s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}112m 51s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9265 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12959250/YARN-9265-006.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml | | uname | Linux dd52b33e95f8 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64
[jira] [Assigned] (YARN-9048) Add znode hierarchy in Federation ZK State Store
[ https://issues.apache.org/jira/browse/YARN-9048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned YARN-9048: --- Assignee: Bilwa S T > Add znode hierarchy in Federation ZK State Store > > > Key: YARN-9048 > URL: https://issues.apache.org/jira/browse/YARN-9048 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Major > > Similar to YARN-2962 consider having hierarchy in ZK federation store for > applications
[jira] [Commented] (YARN-9264) [Umbrella] Follow-up on IntelOpenCL FPGA plugin
[ https://issues.apache.org/jira/browse/YARN-9264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771997#comment-16771997 ] Peter Bacsko commented on YARN-9264: [~sunilg] [~tangzhankun] please review the first three patches: YARN-9265, YARN-9266 and YARN-9267. After committing YARN-9265, I'll perform a rebase if necessary. > [Umbrella] Follow-up on IntelOpenCL FPGA plugin > --- > > Key: YARN-9264 > URL: https://issues.apache.org/jira/browse/YARN-9264 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > > The Intel FPGA resource type support was released in Hadoop 3.1.0. > Right now the plugin implementation has some deficiencies that need to be > fixed. This JIRA lists all problems that need to be resolved.
[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771993#comment-16771993 ] Weiwei Yang commented on YARN-8821: --- Thanks for working on this [~tangzhankun], it looks really good. For the v9 patch, I think it's almost there, just some minor comments: 1. {{NvidiaGPUPluginForRuntimeV2#topologyAwareSchedule}} IIRC, lines 396 and 402 sort all combinations for a given count of devices every time. Why not maintain an ordered list of these combinations in the map, so the sort only needs to happen once (when the cost table is initialized)? 2. {{NvidiaGPUPluginForRuntimeV2#allocateDevices}} {code:java} topologyAwareSchedule(allocation, count, envs, availableDevices, this.costTable); if (allocation.size() != count) { LOG.error("Failed to do topology scheduling. Skip to use basic " + "scheduling"); } return allocation; {code} this seems to return the allocation result from {{topologyAwareSchedule}} instead of falling back to basic scheduling when it fails. 3. {{NvidiaGPUPluginForRuntimeV2#allocateDevices}} line 249, this logging hides the actual error and stacktrace, can we change it to LOG.error("", e)? The same comment applies to line 268. 4. NvidiaGPUPluginForRuntimeV2#allocateDevices lines 226 - 235, the 2nd if can be removed and merged into the 1st one. 5. I am wondering if it makes sense to add debug logging to print the cost table; as that is the most important data for scheduling, we might need it while debugging issues. 
Thanks > [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable > device framework > - > > Key: YARN-8821 > URL: https://issues.apache.org/jira/browse/YARN-8821 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch, > YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch, > YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch, > YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch, > YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch > > > h2. Background > GPU topology affects performance. There's been a discussion in YARN-7481. But > we'd like to move related discussions here. > And please note that YARN-8851 will provide a pluggable device framework > which can support plugging in a custom scheduler. Based on the framework, the GPU plugin > could have its own topology scheduler. > h2. Details of the proposed scheduling algorithm > The proposed patch has a topology algorithm implemented as below: > *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m" > to build a hash map whose keys are all pairs of GPUs and whose values are the > communication cost between the two. The map is like \{"0 - 1"=> 2, "0 - > 2"=>4, ...} which means the minimum cost of GPU 0 to 1 is 2. The cost is set > based on the connection type. > *Step 2*. It then constructs a _+cost table+_ which caches all > combinations of GPUs and the corresponding cost between them. The > cost table is a map whose structure is like > {code:java} > { 2=>{[0,1]=>2,..}, > 3=>{[0,1,2]=>10,..}, > 4=>{[0,1,2,3]=>18}}. > {code} > The key of the map is the count of GPUs, and the value is a map whose key > is a combination of GPUs and whose value is the calculated communication cost > of that combination. The cost calculation algorithm sums the costs of all > non-duplicate GPU pairs. 
For instance, the total cost of the [0,1,2] > GPUs is the sum of the costs "0 - 1", "0 - 2" and "1 - 2". Each cost can be read > from the map built in step 1. > *Step 3*. After the cost table is built, when allocating GPUs based on > topology, we provide two policies which a container can set through an > environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or > "SPREAD". "PACK" prefers faster GPU-GPU communication. > "SPREAD" prefers faster CPU-GPU communication (since the GPUs then do not > share the same bus to the CPU). The key difference between the two policies is the > sort order of the inner map in the cost table. For instance, let's assume 2 > GPUs are wanted. costTable.get(2) would return a map containing all > combinations of two GPUs and their cost. If the policy is "PACK", we sort > the map by cost in ascending order; the first entry will be the GPUs with the > minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in descending > order and take the first entry, which has the highest GPU-GPU cost and hence the > lowest CPU-GPU cost. > h2. Estimation of the algorithm > An initial analysis of the topology scheduling algorithm (using the PACK policy), > based on performance tests in an AWS EC2 instance with 8 GPU cards (P3), has been done. > Below figure shows the
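The cost-table idea in steps 1-2 and the PACK/SPREAD ordering in step 3 can be sketched as a standalone illustration. This is not the NvidiaGPUPluginForRuntimeV2 code from the patch; the class and method names here are hypothetical:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the cost table from steps 1-2: pairwise GPU
// costs (as parsed from "nvidia-smi topo -m") are summed over all
// non-duplicate pairs of a combination.
public class GpuCostTable {
    // Pairwise costs keyed as "min-max", e.g. "0-1" => 2.
    private final Map<String, Integer> pairCost = new HashMap<>();

    public void setPairCost(int a, int b, int cost) {
        pairCost.put(Math.min(a, b) + "-" + Math.max(a, b), cost);
    }

    // Total cost of a combination = sum of the costs of all its pairs,
    // e.g. cost([0,1,2]) = cost(0-1) + cost(0-2) + cost(1-2).
    public int combinationCost(int[] gpus) {
        int total = 0;
        for (int i = 0; i < gpus.length; i++) {
            for (int j = i + 1; j < gpus.length; j++) {
                int lo = Math.min(gpus[i], gpus[j]);
                int hi = Math.max(gpus[i], gpus[j]);
                total += pairCost.get(lo + "-" + hi);
            }
        }
        return total;
    }

    // Step 3: PACK picks the minimum-cost combination (fast GPU-GPU
    // communication), SPREAD picks the maximum-cost one.
    public int[] pick(List<int[]> combos, boolean pack) {
        combos.sort(Comparator.comparingInt(this::combinationCost));
        return pack ? combos.get(0) : combos.get(combos.size() - 1);
    }
}
```

With the pairwise costs from the description ("0 - 1"=2, "0 - 2"=4, "1 - 2"=4), the combination cost of [0,1,2] comes out to 10, matching the cost-table example in step 2.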
[jira] [Updated] (YARN-7266) Timeline Server event handler threads locked
[ https://issues.apache.org/jira/browse/YARN-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-7266: Component/s: ATSv2 > Timeline Server event handler threads locked > > > Key: YARN-7266 > URL: https://issues.apache.org/jira/browse/YARN-7266 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2, timelineserver >Affects Versions: 2.7.3 >Reporter: Venkata Puneet Ravuri >Assignee: Prabhu Joseph >Priority: Major > > Event handlers for Timeline Server seem to take a lock while parsing HTTP > headers of the request. This is causing all other threads to wait and slowing > down the overall performance of Timeline server. We have resourcemanager > metrics enabled to send to timeline server. Because of the high load on > ResourceManager, the metrics to be sent are getting backlogged and in turn > increasing heap footprint of Resource Manager (due to pending metrics). > This is the complete stack trace of a blocked thread on timeline server:- > "2079644967@qtp-1658980982-4560" #4632 daemon prio=5 os_prio=0 > tid=0x7f6ba490a000 nid=0x5eb waiting for monitor entry > [0x7f6b9142c000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector.prepare(AccessorInjector.java:82) > - waiting to lock <0x0005c0621860> (a java.lang.Class for > com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector) > at > com.sun.xml.bind.v2.runtime.reflect.opt.OptimizedAccessorFactory.get(OptimizedAccessorFactory.java:168) > at > com.sun.xml.bind.v2.runtime.reflect.Accessor$FieldReflection.optimize(Accessor.java:282) > at > com.sun.xml.bind.v2.runtime.property.SingleElementNodeProperty.(SingleElementNodeProperty.java:94) > at sun.reflect.GeneratedConstructorAccessor52.newInstance(Unknown > Source) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown > Source) > at java.lang.reflect.Constructor.newInstance(Unknown Source) > at > 
com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128) > at > com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:551) > at > com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.(ArrayElementProperty.java:112) > at > com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.(ArrayElementNodeProperty.java:62) > at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown > Source) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown > Source) > at java.lang.reflect.Constructor.newInstance(Unknown Source) > at > com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128) > at > com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:347) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170) > at > com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145) > at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at javax.xml.bind.ContextFinder.newInstance(Unknown Source) > at javax.xml.bind.ContextFinder.newInstance(Unknown Source) > at javax.xml.bind.ContextFinder.find(Unknown Source) > at javax.xml.bind.JAXBContext.newInstance(Unknown Source) > at javax.xml.bind.JAXBContext.newInstance(Unknown Source) > at > com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.buildModelAndSchemas(WadlGeneratorJAXBGrammarGenerator.java:412) > at > 
com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.createExternalGrammar(WadlGeneratorJAXBGrammarGenerator.java:352) > at > com.sun.jersey.server.wadl.WadlBuilder.generate(WadlBuilder.java:115) > at > com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:104) > at > com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:120) > at >
[jira] [Commented] (YARN-7266) Timeline Server event handler threads locked
[ https://issues.apache.org/jira/browse/YARN-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771982#comment-16771982 ] Prabhu Joseph commented on YARN-7266: - The HTTP threads in the problematic jstack create a new {{JAXBContextImpl}} every time they accept an HTTP request, which causes the synchronization issue. There are two ways to explore: 1. Implement a custom Jaxb context factory (javax.xml.bind.context.factory) which reuses the {{JAXBContextImpl}}. The default {{ContextFactory}} creates a new {{JAXBContextImpl}} every time. 2. Check if Jersey has a way to reuse {{JAXBContextImpl}} / Jersey {{JSONJAXBContext}} while accepting an HTTP request, similar to what it does when writing the response through {{ContextResolver}} ({{JAXBContextResolver}} / {{YarnJacksonJaxbJsonProvider}}). The issue is applicable to other web services like the RM and AM, and affects the ATSv2 Timeline Reader WebService. > Timeline Server event handler threads locked > > > Key: YARN-7266 > URL: https://issues.apache.org/jira/browse/YARN-7266 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.7.3 >Reporter: Venkata Puneet Ravuri >Assignee: Prabhu Joseph >Priority: Major > > Event handlers for Timeline Server seem to take a lock while parsing HTTP > headers of the request. This is causing all other threads to wait and slowing > down the overall performance of Timeline server. We have resourcemanager > metrics enabled to send to timeline server. Because of the high load on > ResourceManager, the metrics to be sent are getting backlogged and in turn > increasing heap footprint of Resource Manager (due to pending metrics). 
> This is the complete stack trace of a blocked thread on timeline server:- > "2079644967@qtp-1658980982-4560" #4632 daemon prio=5 os_prio=0 > tid=0x7f6ba490a000 nid=0x5eb waiting for monitor entry > [0x7f6b9142c000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector.prepare(AccessorInjector.java:82) > - waiting to lock <0x0005c0621860> (a java.lang.Class for > com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector) > at > com.sun.xml.bind.v2.runtime.reflect.opt.OptimizedAccessorFactory.get(OptimizedAccessorFactory.java:168) > at > com.sun.xml.bind.v2.runtime.reflect.Accessor$FieldReflection.optimize(Accessor.java:282) > at > com.sun.xml.bind.v2.runtime.property.SingleElementNodeProperty.(SingleElementNodeProperty.java:94) > at sun.reflect.GeneratedConstructorAccessor52.newInstance(Unknown > Source) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown > Source) > at java.lang.reflect.Constructor.newInstance(Unknown Source) > at > com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128) > at > com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:551) > at > com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.(ArrayElementProperty.java:112) > at > com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.(ArrayElementNodeProperty.java:62) > at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown > Source) > at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown > Source) > at java.lang.reflect.Constructor.newInstance(Unknown Source) > at > com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128) > at > com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.(ClassBeanInfoImpl.java:183) > at > 
com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:347) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170) > at > com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145) > at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at javax.xml.bind.ContextFinder.newInstance(Unknown Source) > at javax.xml.bind.ContextFinder.newInstance(Unknown Source) > at javax.xml.bind.ContextFinder.find(Unknown Source) > at javax.xml.bind.JAXBContext.newInstance(Unknown Source) > at javax.xml.bind.JAXBContext.newInstance(Unknown Source) > at
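Option 1 above boils down to a "create once, reuse" cache around the expensive factory call (which would be JAXBContext.newInstance in the real fix). A generic sketch of that pattern follows; the class name is hypothetical, and the factory is kept pluggable so the example stays self-contained and does not depend on the javax.xml.bind API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Generic sketch of the reuse pattern suggested in option 1: the
// expensive factory (JAXBContext.newInstance in the real fix) runs
// at most once per key instead of once per HTTP request.
public class ContextCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> factory;

    public ContextCache(Function<K, V> factory) {
        this.factory = factory;
    }

    public V get(K key) {
        // computeIfAbsent guarantees the factory is invoked at most
        // once per key, even under concurrent access.
        return cache.computeIfAbsent(key, factory);
    }
}
```

This works for JAXB because a JAXBContext instance is documented as thread-safe, unlike the Marshaller/Unmarshaller objects created from it, which must remain per-request.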
[jira] [Commented] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771981#comment-16771981 ] Szilard Nemeth commented on YARN-9267: -- Hi [~pbacsko]! Latest patch LGTM, +1 (non-binding). > Various fixes are needed in FpgaResourceHandlerImpl > --- > > Key: YARN-9267 > URL: https://issues.apache.org/jira/browse/YARN-9267 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9267-001.patch, YARN-9267-002.patch, > YARN-9267-003.patch > > > Fix some problems in {{FpgaResourceHandlerImpl}}: > * {{preStart()}} does not reconfigure the card with the same IP - we see this as a > problem. If you recompile the FPGA application, you must rename the aocx file > because the card will not be reprogrammed otherwise. Suggestion: instead of storing the > Node<\->IPID mapping, store a Node<\->IPID hash (like the SHA-256 of the > localized file). > * Switch to slf4j from Apache Commons Logging > * Remove some unused imports
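The suggested hash-based bookkeeping could look roughly like the sketch below: hash the localized aocx content (e.g. with SHA-256) and reprogram only when the hash changes, so a renamed but identical file no longer forces or skips reprogramming incorrectly. The helper names are hypothetical, not the patch's code:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: decide whether an FPGA card must be
// reprogrammed by comparing the SHA-256 of the localized IP file
// content with the hash recorded at the last programming, instead
// of comparing file names.
public class IpFileHash {
    public static String sha256(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory for every JRE, so this cannot happen.
            throw new IllegalStateException(e);
        }
    }

    // Reprogram only when the content hash differs from the one
    // stored the last time this device was configured.
    public static boolean needsReprogram(String lastHash, byte[] aocx) {
        return lastHash == null || !lastHash.equals(sha256(aocx));
    }
}
```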
[jira] [Commented] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771960#comment-16771960 ] Peter Bacsko commented on YARN-9267: [~snemeth] you can check it again. > Various fixes are needed in FpgaResourceHandlerImpl > --- > > Key: YARN-9267 > URL: https://issues.apache.org/jira/browse/YARN-9267 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9267-001.patch, YARN-9267-002.patch, > YARN-9267-003.patch > > > Fix some problems in {{FpgaResourceHandlerImpl}}: > * {{preStart()}} does not reconfigure the card with the same IP - we see this as a > problem. If you recompile the FPGA application, you must rename the aocx file > because the card will not be reprogrammed otherwise. Suggestion: instead of storing the > Node<\->IPID mapping, store a Node<\->IPID hash (like the SHA-256 of the > localized file). > * Switch to slf4j from Apache Commons Logging > * Remove some unused imports
[jira] [Updated] (YARN-9265) FPGA plugin fails to recognize Intel Processing Accelerator Card
[ https://issues.apache.org/jira/browse/YARN-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9265: --- Attachment: YARN-9265-006.patch > FPGA plugin fails to recognize Intel Processing Accelerator Card > > > Key: YARN-9265 > URL: https://issues.apache.org/jira/browse/YARN-9265 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.1.0 >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Critical > Attachments: YARN-9265-001.patch, YARN-9265-002.patch, > YARN-9265-003.patch, YARN-9265-004.patch, YARN-9265-005.patch, > YARN-9265-006.patch > > > The plugin cannot autodetect Intel FPGA PAC (Processing Accelerator Card). > There are two major issues. > Problem #1 > The output of aocl diagnose: > {noformat} > > Device Name: > acl0 > > Package Pat: > /home/pbacsko/inteldevstack/intelFPGA_pro/hld/board/opencl_bsp > > Vendor: Intel Corp > > Physical Dev Name StatusInformation > > pac_a10_f20 PassedPAC Arria 10 Platform (pac_a10_f20) > PCIe 08:00.0 > FPGA temperature = 79 degrees C. > > DIAGNOSTIC_PASSED > > > Call "aocl diagnose " to run diagnose for specified devices > Call "aocl diagnose all" to run diagnose for all devices > {noformat} > The plugin fails to recognize this and fails with the following message: > {noformat} > 2019-01-25 06:46:02,834 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaResourcePlugin: > Using FPGA vendor plugin: > org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin > 2019-01-25 06:46:02,943 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.FpgaDiscoverer: > Trying to diagnose FPGA information ... 
> 2019-01-25 06:46:03,085 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerModule: > Using traffic control bandwidth handler > 2019-01-25 06:46:03,108 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl: > Initializing mounted controller cpu at /sys/fs/cgroup/cpu,cpuacct/yarn > 2019-01-25 06:46:03,139 INFO > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.fpga.FpgaResourceHandlerImpl: > FPGA Plugin bootstrap success. > 2019-01-25 06:46:03,247 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: > Couldn't find (?i)bus:slot.func\s=\s.*, pattern > 2019-01-25 06:46:03,248 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: > Couldn't find (?i)Total\sCard\sPower\sUsage\s=\s.* pattern > 2019-01-25 06:46:03,251 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.IntelFpgaOpenclPlugin: > Failed to get major-minor number from reading /dev/pac_a10_f30 > 2019-01-25 06:46:03,252 ERROR > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to > bootstrap configured resource subsystems! > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: > No FPGA devices detected! > {noformat} > Problem #2 > The plugin assumes that the file name under {{/dev}} can be derived from the > "Physical Dev Name", but this is wrong. For example, it thinks that the > device file is {{/dev/pac_a10_f30}} which is not the case, the actual > file is {{/dev/intel-fpga-port.0}}.
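Given Problem #2 above, one safer direction is to stop deriving the /dev entry from the "Physical Dev Name" field of "aocl diagnose" and instead scan /dev for names matching known FPGA device-file schemes (such as intel-fpga-port.N, which is what the PAC in the report actually creates). A hypothetical sketch of that direction, not the plugin's actual code; the patterns listed are illustrative assumptions:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: enumerate /dev and keep entries matching
// known FPGA device-file naming schemes instead of guessing the
// file name from the "Physical Dev Name" field.
public class FpgaDeviceScanner {
    // Naming schemes assumed here for illustration; the PAC in the
    // report above creates /dev/intel-fpga-port.0.
    private static final Pattern[] KNOWN = {
        Pattern.compile("intel-fpga-port\\.\\d+"),
        Pattern.compile("acl\\d+")
    };

    public static List<String> filterKnown(List<String> names) {
        List<String> found = new ArrayList<>();
        for (String name : names) {
            for (Pattern p : KNOWN) {
                if (p.matcher(name).matches()) {
                    found.add(name);
                    break;  // one match is enough for this entry
                }
            }
        }
        return found;
    }

    public static List<String> scan(File devDir) {
        List<String> names = new ArrayList<>();
        File[] entries = devDir.listFiles();
        if (entries != null) {
            for (File f : entries) {
                names.add(f.getName());
            }
        }
        return filterKnown(names);
    }
}
```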
[jira] [Commented] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771952#comment-16771952 ] Hadoop QA commented on YARN-9267: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 12s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 4s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 17s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 1s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s{color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 111 unchanged - 8 fixed = 111 total (was 119) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 46s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 20m 25s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 71m 15s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9267 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12959240/YARN-9267-003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux a70389523e96 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 1e0ae6e | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23445/testReport/ | | Max. process+thread count | 340 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23445/console | | Powered by | Apache
[jira] [Updated] (YARN-9267) Various fixes are needed in FpgaResourceHandlerImpl
[ https://issues.apache.org/jira/browse/YARN-9267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9267: --- Attachment: YARN-9267-003.patch > Various fixes are needed in FpgaResourceHandlerImpl > --- > > Key: YARN-9267 > URL: https://issues.apache.org/jira/browse/YARN-9267 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9267-001.patch, YARN-9267-002.patch, > YARN-9267-003.patch > > > Fix some problems in {{FpgaResourceHandlerImpl}}: > * {{preStart()}} does not reconfigure card with the same IP - we see it as a > problem. If you recompile the FPGA application, you must rename the aocx file > because the card will not be reprogrammed. Suggestion: instead of storing > Node<\->IPID mapping, store Node<\->IPID hash (like the SHA-256 of the > localized file). > * Switch to slf4j from Apache Commons Logging > * Some unused imports -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
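The hash-based suggestion above could look roughly like the sketch below. This is only illustrative (the class and method names are invented, not from the actual patch): a SHA-256 content hash distinguishes a recompiled aocx file even when its file name is unchanged, so storing the hash instead of the IP ID would trigger reprogramming after a recompile.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only: names are invented, not the actual patch.
public class AocxHash {

    public static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    // Hash of the localized file; this value, not the file name, would be
    // stored in the Node<->IPID mapping.
    public static String sha256HexOfFile(Path file) throws IOException {
        return sha256Hex(Files.readAllBytes(file));
    }

    public static void main(String[] args) {
        // Same file name, different content -> different hash, so the
        // card would be reprogrammed after a recompile.
        String h1 = sha256Hex("bitstream-v1".getBytes());
        String h2 = sha256Hex("bitstream-v2".getBytes());
        System.out.println(h1.equals(h2)); // false
    }
}
```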
[jira] [Updated] (YARN-9050) [Umbrella] Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9050: --- Summary: [Umbrella] Usability improvements for scheduler activities (was: Usability improvements for scheduler activities) > [Umbrella] Usability improvements for scheduler activities > -- > > Key: YARN-9050 > URL: https://issues.apache.org/jira/browse/YARN-9050 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: image-2018-11-23-16-46-38-138.png > > > We have made some usability improvements for scheduler activities based on > YARN 3.1 in our cluster, as follows: > 1. Not available for multi-threaded asynchronous scheduling. App and node > activities may be confused when multiple scheduling threads record activities of > different allocation processes in the same variables, like appsAllocation and > recordingNodesAllocation in ActivitiesManager. I think these variables should > be thread-local to keep activities clear among multiple threads. > 2. Incomplete activities for the multi-node lookup mechanism, since > ActivitiesLogger skips recording through {{if (node == null || > activitiesManager == null) }} when node is null, which represents an > allocation for multiple nodes. We need to support recording activities for > the multi-node lookup mechanism. > 3. Current app activities cannot meet the requirements of diagnostics. For > example, we can know that a node doesn't match a request, but it is hard to know why, > especially when using placement constraints; it's difficult to make a > detailed diagnosis manually. So I propose to improve the diagnoses of > activities: add a diagnosis for placement constraints checks, update the insufficient > resource diagnosis with detailed info (like 'insufficient resource > names:[memory-mb]'), and so on. > 4. 
Add more useful fields for app activities. In some scenarios we need to > distinguish different requests but can't locate them based on app > activities info; some other fields can help to filter for what we want, > such as allocation tags. We have added containerPriority, allocationRequestId > and allocationTags fields in AppAllocation. > 5. Filter app activities by key fields. Sometimes the results of app > activities are massive and it's hard to find what we want. We have supported filtering > by allocation tags to meet requirements from some apps; moreover, we can > take container-priority and allocation-request-id as candidates if necessary. > 6. Aggregate app activities by diagnoses. For a single allocation process, > activities can still be massive in a large cluster. We frequently want to > know why a request can't be allocated in the cluster, and it's hard to check every node > manually in a large cluster, so aggregating app activities by > diagnoses is necessary to solve this trouble. We have added a groupingType > parameter to the app-activities REST API for this; it supports grouping by > diagnostics, for example: > !image-2018-11-23-16-46-38-138.png! > I think we can have a discussion about these points; useful improvements which > are accepted will be added to the patch. Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
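Point 1 above proposes making the recording variables thread-local. A minimal, self-contained sketch (the field name mimics appsAllocation mentioned above, but this is not the real ActivitiesManager code) of how ThreadLocal state keeps concurrent scheduling threads from interleaving their recorded activities:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only, not the real ActivitiesManager code.
public class ThreadLocalActivities {
    // Each scheduling thread gets its own, independent list.
    private static final ThreadLocal<List<String>> APPS_ALLOCATION =
        ThreadLocal.withInitial(ArrayList::new);

    public static void record(String activity) {
        APPS_ALLOCATION.get().add(activity);
    }

    // Returns this thread's recorded activities and resets its state.
    public static List<String> drain() {
        List<String> out = APPS_ALLOCATION.get();
        APPS_ALLOCATION.remove();
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> record("assigned on node1"));
        Thread t2 = new Thread(() -> record("assigned on node2"));
        t1.start(); t2.start(); t1.join(); t2.join();
        // The main thread never recorded anything, so its view is empty:
        // the worker threads could not pollute it.
        System.out.println(drain().isEmpty()); // true
    }
}
```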
[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771815#comment-16771815 ] Tao Yang commented on YARN-9313: Hi, [~cheersyang], [~leftnoteasy]. I have attached the v1 patch; could you please help review it and give some advice? Thanks. > Support asynchronized scheduling mode and multi-node lookup mechanism for > scheduler activities > -- > > Key: YARN-9313 > URL: https://issues.apache.org/jira/browse/YARN-9313 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9313.001.patch > > > [Design > doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9313: --- Attachment: YARN-9313.001.patch > Support asynchronized scheduling mode and multi-node lookup mechanism for > scheduler activities > -- > > Key: YARN-9313 > URL: https://issues.apache.org/jira/browse/YARN-9313 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9313.001.patch > > > [Design > doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
Tao Yang created YARN-9313: -- Summary: Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities Key: YARN-9313 URL: https://issues.apache.org/jira/browse/YARN-9313 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tao Yang Assignee: Tao Yang [Design doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6221) Entities missing from ATS when summary log file info got returned to the ATS before the domain log
[ https://issues.apache.org/jira/browse/YARN-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771785#comment-16771785 ] Rakesh Shah commented on YARN-6221: --- [~ssreenivasan] can you elaborate it more > Entities missing from ATS when summary log file info got returned to the ATS > before the domain log > -- > > Key: YARN-6221 > URL: https://issues.apache.org/jira/browse/YARN-6221 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sushmitha Sreenivasan >Assignee: Li Lu >Priority: Critical > > Events data missing for the following entities: > curl -k --negotiate -u: > http://:8188/ws/v1/timeline/TEZ_APPLICATION_ATTEMPT/tez_appattempt_1487706062210_0012_01 > {"events":[],"entitytype":"TEZ_APPLICATION_ATTEMPT","entity":"tez_appattempt_1487706062210_0012_01","starttime":1487711606077,"domain":"Tez_ATS_application_1487706062210_0012","relatedentities":{"TEZ_DAG_ID":["dag_1487706062210_0012_2","dag_1487706062210_0012_1"]},"primaryfilters":{},"otherinfo":{}} > {code:title=Timeline Server log entry} > WARN timeline.TimelineDataManager > (TimelineDataManager.java:doPostEntities(366)) - Skip the timeline entity: { > id: tez_application_1487706062210_0012, type: TEZ_APPLICATION } > org.apache.hadoop.yarn.exceptions.YarnException: Domain information of the > timeline entity { id: tez_application_1487706062210_0012, type: > TEZ_APPLICATION } doesn't exist. 
> at > org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:122) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.doPostEntities(TimelineDataManager.java:356) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:316) > at > org.apache.hadoop.yarn.server.timeline.EntityLogInfo.doParse(LogInfo.java:204) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:156) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:113) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:682) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:657) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:870) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6214) NullPointer Exception while querying timeline server API
[ https://issues.apache.org/jira/browse/YARN-6214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771791#comment-16771791 ] Rakesh Shah commented on YARN-6214: --- Hi [~raviorteja], I did not get any exception while executing http://:8188/ws/v1/applicationhistory/apps?applicationTypes=MAPREDUCE > NullPointer Exception while querying timeline server API > > > Key: YARN-6214 > URL: https://issues.apache.org/jira/browse/YARN-6214 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.7.1 >Reporter: Ravi Teja Chilukuri >Priority: Major > > The apps API works fine and gives all applications, including Mapreduce and Tez > http://:8188/ws/v1/applicationhistory/apps > But when queried with application types via these APIs, it fails with > a NullPointerException. > http://:8188/ws/v1/applicationhistory/apps?applicationTypes=TEZ > http://:8188/ws/v1/applicationhistory/apps?applicationTypes=MAPREDUCE > java.lang.NullPointerException > We are blocked on this issue as we are not able to run analytics on the tez job > counters on the prod jobs. 
> Timeline Logs: > |2017-02-22 11:47:57,183 WARN webapp.GenericExceptionHandler > (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.webapp.WebServices.getApps(WebServices.java:195) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices.getApps(AHSWebServices.java:96) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > Complete stacktrace: > http://pastebin.com/bRgxVabf -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Issue Comment Deleted] (YARN-6221) Entities missing from ATS when summary log file info got returned to the ATS before the domain log
[ https://issues.apache.org/jira/browse/YARN-6221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rakesh Shah updated YARN-6221: -- Comment: was deleted (was: Hi, Sushmitha Sreenivasan Can you explain the issue little more.) > Entities missing from ATS when summary log file info got returned to the ATS > before the domain log > -- > > Key: YARN-6221 > URL: https://issues.apache.org/jira/browse/YARN-6221 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Sushmitha Sreenivasan >Assignee: Li Lu >Priority: Critical > > Events data missing for the following entities: > curl -k --negotiate -u: > http://:8188/ws/v1/timeline/TEZ_APPLICATION_ATTEMPT/tez_appattempt_1487706062210_0012_01 > {"events":[],"entitytype":"TEZ_APPLICATION_ATTEMPT","entity":"tez_appattempt_1487706062210_0012_01","starttime":1487711606077,"domain":"Tez_ATS_application_1487706062210_0012","relatedentities":{"TEZ_DAG_ID":["dag_1487706062210_0012_2","dag_1487706062210_0012_1"]},"primaryfilters":{},"otherinfo":{}} > {code:title=Timeline Server log entry} > WARN timeline.TimelineDataManager > (TimelineDataManager.java:doPostEntities(366)) - Skip the timeline entity: { > id: tez_application_1487706062210_0012, type: TEZ_APPLICATION } > org.apache.hadoop.yarn.exceptions.YarnException: Domain information of the > timeline entity { id: tez_application_1487706062210_0012, type: > TEZ_APPLICATION } doesn't exist. 
> at > org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:122) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.doPostEntities(TimelineDataManager.java:356) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:316) > at > org.apache.hadoop.yarn.server.timeline.EntityLogInfo.doParse(LogInfo.java:204) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parsePath(LogInfo.java:156) > at > org.apache.hadoop.yarn.server.timeline.LogInfo.parseForStore(LogInfo.java:113) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:682) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$AppLogs.parseSummaryLogs(EntityGroupFSTimelineStore.java:657) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore$ActiveLogParser.run(EntityGroupFSTimelineStore.java:870) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6937) Admin cannot post entities when domain is not exists
[ https://issues.apache.org/jira/browse/YARN-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771764#comment-16771764 ] Rakesh Shah commented on YARN-6937: --- Hi [~daemon], can you please elaborate on the issue? > Admin cannot post entities when domain is not exists > > > Key: YARN-6937 > URL: https://issues.apache.org/jira/browse/YARN-6937 > Project: Hadoop YARN > Issue Type: Bug >Reporter: YunFan Zhou >Priority: Major > > When I post entities to the timeline server, I found that it throws the > following exception: > {code:java} > org.apache.hadoop.yarn.server.timeline.security.TimelineACLsManager.checkAccess(TimelineACLsManager.java:123) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:273) > at > org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) > at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > {code} > In TimelineACLsManager#checkAccess logic: > {code:java} > public boolean checkAccess(UserGroupInformation callerUGI, > ApplicationAccessType applicationAccessType, > TimelineEntity entity) throws YarnException, IOException { > if (LOG.isDebugEnabled()) { > LOG.debug("Verifying the access of " > + (callerUGI == null ? 
null : callerUGI.getShortUserName()) > + " on the timeline entity " > + new EntityIdentifier(entity.getEntityId(), > entity.getEntityType())); > } > if (!adminAclsManager.areACLsEnabled()) { > return true; > } > // find domain owner and acls > AccessControlListExt aclExt = aclExts.get(entity.getDomainId()); > if (aclExt == null) { > aclExt = loadDomainFromTimelineStore(entity.getDomainId()); > } > if (aclExt == null) { > throw new YarnException("Domain information of the timeline entity " > + new EntityIdentifier(entity.getEntityId(), entity.getEntityType()) > + " doesn't exist."); > } > {code} > Even if you're an administrator, you do not have permission to do this. > I think it would be better to continue with the follow-up checks even when the value of *aclExt* is > null: > {code:java} > if (callerUGI != null > && (adminAclsManager.isAdmin(callerUGI) || > callerUGI.getShortUserName().equals(owner) || > domainACL.isUserAllowed(callerUGI))) { > return true; > } > return false; > {code} > Any suggestions? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
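The change proposed in the comment above can be sketched as follows. This is an illustrative stand-in, not the real TimelineACLsManager: the admin check runs even when no domain ACL entry (aclExt) exists, instead of throwing "Domain information ... doesn't exist".

```java
// Illustrative sketch of the proposed change; types are simplified
// stand-ins for the real TimelineACLsManager classes.
public class AdminCheckSketch {
    public interface AdminAcls { boolean isAdmin(String user); }

    public static boolean checkAccess(String callerUser, Object aclExt,
                                      AdminAcls adminAclsManager) {
        if (aclExt == null) {
            // Proposed behavior: an administrator may still post entities
            // for a domain that was never stored.
            return callerUser != null && adminAclsManager.isAdmin(callerUser);
        }
        // ... the existing owner / domain-ACL checks would go here ...
        return true;
    }

    public static void main(String[] args) {
        AdminAcls admin = "admin"::equals;
        System.out.println(checkAccess("admin", null, admin)); // true
        System.out.println(checkAccess("alice", null, admin)); // false
    }
}
```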
[jira] [Commented] (YARN-9080) Bucket Directories as part of ATS done accumulates
[ https://issues.apache.org/jira/browse/YARN-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771735#comment-16771735 ] Rakesh Shah commented on YARN-9080: --- Thanks [~Prabhu Joseph] > Bucket Directories as part of ATS done accumulates > -- > > Key: YARN-9080 > URL: https://issues.apache.org/jira/browse/YARN-9080 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: 0001-YARN-9080.patch, 0002-YARN-9080.patch, > 0003-YARN-9080.patch > > > Have observed older bucket directories cluster_timestamp, bucket1 and bucket2 > as part of ATS done accumulates. The cleanLogs part of EntityLogCleaner > removes only the app directories and not the bucket directories. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9312) NPE while rendering SLS simulate page
[ https://issues.apache.org/jira/browse/YARN-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bilwa S T reassigned YARN-9312: --- Assignee: Bilwa S T > NPE while rendering SLS simulate page > - > > Key: YARN-9312 > URL: https://issues.apache.org/jira/browse/YARN-9312 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Minor > > http://localhost:10001/simulate > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.sls.web.SLSWebApp.printPageSimulate(SLSWebApp.java:240) > at > org.apache.hadoop.yarn.sls.web.SLSWebApp.access$100(SLSWebApp.java:55) > at > org.apache.hadoop.yarn.sls.web.SLSWebApp$1.handle(SLSWebApp.java:152) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:539) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, 
e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9238) Allocate on previous or removed or non existent application attempt
[ https://issues.apache.org/jira/browse/YARN-9238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lujie updated YARN-9238: Summary: Allocate on previous or removed or non existent application attempt (was: We get a wrong attempt by an appAttemptId when AM crash at some point) > Allocate on previous or removed or non existent application attempt > --- > > Key: YARN-9238 > URL: https://issues.apache.org/jira/browse/YARN-9238 > Project: Hadoop YARN > Issue Type: Bug >Reporter: lujie >Assignee: lujie >Priority: Critical > Attachments: YARN-9238_1.patch, YARN-9238_2.patch, YARN-9238_3.patch, > hadoop-test-resourcemanager-hadoop11.log > > > We have found a data race that can cause an odd situation. > See > org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService.OpportunisticAMSProcessor.allocate{color:#ff}:(code1){color} > {code:java} > // Allocate OPPORTUNISTIC containers. > 171. SchedulerApplicationAttempt appAttempt = > 172.((AbstractYarnScheduler)rmContext.getScheduler()) > 173. .getApplicationAttempt(appAttemptId); > 174. > 175. OpportunisticContainerContext oppCtx = > 176. appAttempt.getOpportunisticContainerContext(); > 177. oppCtx.updateNodeList(getLeastLoadedNodes()); > {code} > If we crash the current AM (its attempt id is appattempt_0) just before > code1#171, when code1#171~173 continues to execute and gets the appAttempt by > appattempt_0, the obtained appAttempt should represent the current AM. But > we found that the obtained appAttempt represents the new AM and its > attempt id is appattempt_1. This obtained appAttempt has not initialized its oppCtx, > so an NPE happens at line code1#177. 
> {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.OpportunisticContainerAllocatorAMService$OpportunisticAMSProcessor.allocate(OpportunisticContainerAllocatorAMService.java:177) > at > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830) > {code} > So why does the old appAttempt disappear, and why do we use the old appattempt_0 but get > the new appAttempt? > We have found the reason. The code below ({color:#ff}code2{color}) is the > function body of getApplicationAttempt at code1#173 > {code:java} > 399. public T getApplicationAttempt(ApplicationAttemptId > applicationAttemptId) { > 400 SchedulerApplication app = applications.get( > 401 applicationAttemptId.getApplicationId()); > 402 return app == null ? null : app.getCurrentAppAttempt(); > 403 } > {code} > When the old AM crashes, a new AM and a new appAttempt come. The currentAttempt of > the app will be set to the new appAttempt (see code3). 
So code2#402 will > return the new appAttempt. > If the AM crashes at the head of the allocate function (code1), the bug won't happen due > to an ApplicationDoesNotExistInCacheException. If the AM crashes after code1, > everything is also ok. > We should add a check: whether the obtained appAttempt has the same id > as the given id. > A patch comes soon! > {color:#ff}code3{color} > {code:java} > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplication.setCurrentAppAttempt(T > currentAttempt){ > this.currentAttempt = currentAttempt; > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
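The check suggested in the report above can be sketched as follows (names are illustrative stand-ins for the scheduler code, not the actual patch): only return the attempt when its id matches the requested one, so a stale request from a crashed attempt yields null instead of a half-initialized new attempt.

```java
// Illustrative sketch, not the real scheduler code.
public class AttemptIdCheck {
    // Stand-in for app.getCurrentAppAttempt()'s id after the AM restarted.
    static final String CURRENT_ATTEMPT_ID = "appattempt_1";

    public static String getApplicationAttempt(String requestedId) {
        // Proposed guard: hand back the attempt only when the ids match,
        // avoiding the NPE at code1#177 on an uninitialized oppCtx.
        return CURRENT_ATTEMPT_ID.equals(requestedId) ? CURRENT_ATTEMPT_ID : null;
    }

    public static void main(String[] args) {
        System.out.println(getApplicationAttempt("appattempt_1")); // appattempt_1
        System.out.println(getApplicationAttempt("appattempt_0")); // null
    }
}
```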
[jira] [Resolved] (YARN-9103) Fix the bug in DeviceMappingManager#getReleasingDevices
[ https://issues.apache.org/jira/browse/YARN-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang resolved YARN-9103. Resolution: Won't Fix Resolving it as it is fixed in YARN-9060 > Fix the bug in DeviceMappingManager#getReleasingDevices > --- > > Key: YARN-9103 > URL: https://issues.apache.org/jira/browse/YARN-9103 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > > When one container is assigned multiple devices and is in the releasing state, > looping over the same containerId causes the releasing device count to be > summed multiple times. It involves the same bug as the one mentioned in YARN-9099. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
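The bug class described above can be reconstructed as a small sketch (illustrative names, not the real DeviceMappingManager code): a container holding several devices can appear once per device in the releasing list, and must be counted only once per distinct container id.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative reconstruction, not the real DeviceMappingManager code.
public class ReleasingDevicesSketch {

    public static int countReleasing(List<String> releasingContainerIds,
                                     Map<String, Integer> devicesPerContainer) {
        int sum = 0;
        // HashSet deduplicates repeated container ids before summing.
        for (String id : new HashSet<>(releasingContainerIds)) {
            sum += devicesPerContainer.getOrDefault(id, 0);
        }
        return sum;
    }

    public static void main(String[] args) {
        // c1 holds 2 devices and shows up twice in the releasing list;
        // without deduplication the buggy sum would be 4.
        System.out.println(countReleasing(List.of("c1", "c1"), Map.of("c1", 2))); // 2
    }
}
```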
[jira] [Resolved] (YARN-8888) Support device topology scheduling
[ https://issues.apache.org/jira/browse/YARN-8888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang resolved YARN-8888. Resolution: Won't Fix Resolving it because the GPU topology algorithm is better implemented in the plugin for now; an abstraction covering all device topologies is too early at this point. See YARN-8821 for GPU topology scheduling. > Support device topology scheduling > -- > > Key: YARN-8888 > URL: https://issues.apache.org/jira/browse/YARN-8888 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > > An easy way for a vendor plugin to describe topology information should be > provided in the Device spec, and the topology information will be used in the > device shared local scheduler to boost performance -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9312) NPE while rendering SLS simulate page
Bibin A Chundatt created YARN-9312: -- Summary: NPE while rendering SLS simulate page Key: YARN-9312 URL: https://issues.apache.org/jira/browse/YARN-9312 Project: Hadoop YARN Issue Type: Bug Reporter: Bibin A Chundatt http://localhost:10001/simulate {code} java.lang.NullPointerException at org.apache.hadoop.yarn.sls.web.SLSWebApp.printPageSimulate(SLSWebApp.java:240) at org.apache.hadoop.yarn.sls.web.SLSWebApp.access$100(SLSWebApp.java:55) at org.apache.hadoop.yarn.sls.web.SLSWebApp$1.handle(SLSWebApp.java:152) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) at org.eclipse.jetty.server.Server.handle(Server.java:539) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Resolved] (YARN-8889) Add well-defined interface in container-executor to support vendor plugins isolation request
[ https://issues.apache.org/jira/browse/YARN-8889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang resolved YARN-8889. Resolution: Duplicate Resolving this as it is already implemented in YARN-9060 > Add well-defined interface in container-executor to support vendor plugins > isolation request > > > Key: YARN-8889 > URL: https://issues.apache.org/jira/browse/YARN-8889 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > > Because of different container runtimes, the isolation request from a vendor > device plugin may be raised before container launch (cgroups operations) or > at container launch (Docker runtime). > An easy-to-use interface in container-executor should be provided to support > the above requirements. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9195) RM Queue's pending container number might get decreased unexpectedly or even become negative once RM failover
[ https://issues.apache.org/jira/browse/YARN-9195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771710#comment-16771710 ]

Shengyang Sha commented on YARN-9195:
-------------------------------------

{quote}
Just read the patch. I am trying to understand refreshContainersFromPreviousAttempts(): if a container from a previous attempt is completed, you are not removing it from the outstanding requests. Why are you doing this?
{quote}
The refreshContainersFromPreviousAttempts method maintains the running containers that were originally obtained by previous app attempts, not the outstanding requests. You probably meant the removePreviousContainersFromOutstandingSchedulingRequests method. In that method, I filter out (1) containers obtained by the current app attempt and (2) known containers from previous app attempts.
{quote}
I am also not sure why you need initApplicationAttempt(); this retrieves the current app attempt id from the AM RM token. Since the protocol already has getContainersFromPreviousAttempts(), what is the attempt id used for here?
{quote}
The current app attempt id is needed because RM might return all the running containers as previous containers (RegisterApplicationMasterResponse#getNMTokensFromPreviousAttempts). If we do not filter out such containers, the outstanding requests will be decreased unexpectedly. And if the current outstanding request count is already zero, it will then be decreased below zero.
{quote}
Another thing is, why would this issue cause the pending container/resource count in RM's queue to become negative? Can you add some more info?
{quote}
As described above, the outstanding requests can turn negative. Since RM has no sanity check, the requests in RM will then become negative as well. By the way, the description of this issue also provides a detailed explanation.
> RM Queue's pending container number might get decreased unexpectedly or even
> become negative once RM failover
> -----------------------------------------------------------------------------
>
>                 Key: YARN-9195
>                 URL: https://issues.apache.org/jira/browse/YARN-9195
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.1.0
>            Reporter: Shengyang Sha
>            Assignee: Shengyang Sha
>            Priority: Critical
>         Attachments: YARN-9195.001.patch, YARN-9195.002.patch,
> cases_to_recreate_negative_pending_requests_scenario.diff
>
> Hi, all:
> We previously encountered a serious problem in ResourceManager: the pending
> container number of one RM queue became negative after RM failed over. Since
> queues in RM are managed in a hierarchical structure, the root queue's pending
> containers eventually became negative as well, which affected the scheduling
> process of the whole cluster.
> Both our RM server and the AMRM client in our application are based on
> YARN 3.1, and we use the AMRMClientAsync#addSchedulingRequests() method
> in our application to request resources from RM.
> After investigation, we found that the direct cause was that numAllocations of
> some AMs' requests became negative after RM failed over. There are at
> least three necessary conditions:
> (1) schedulingRequests are used in the AMRM client, and the application sets
> numAllocations of a schedulingRequest to zero. In our batch job scenario, the
> numAllocations of a schedulingRequest can become zero because
> theoretically we can run a full batch job using only one container.
> (2) RM fails over.
> (3) Before the AM re-registers itself to RM after the RM restart, RM has already
> recovered some of the application's previously assigned containers.
> Here are some more details about the implementation:
> (1) After RM recovers, RM will send all alive containers to the AM once it
> re-registers itself, through
> RegisterApplicationMasterResponse#getContainersFromPreviousAttempts.
> (2) During registerApplicationMaster, AMRMClientImpl will call
> removeFromOutstandingSchedulingRequests once the AM gets
> containersFromPreviousAttempts, without checking whether these containers have
> been assigned before. As a consequence, its outstanding requests might be
> decreased unexpectedly, even when they do not become negative.
> (3) There is no sanity check in RM to validate requests from AMs.
> To better illustrate this case, I have written a test case based on the
> latest hadoop trunk, posted in the attachment. You may try the cases
> testAMRMClientWithNegativePendingRequestsOnRMRestart and
> testAMRMClientOnUnexpectedlyDecreasedPendingRequestsOnRMRestart.
> To solve this issue, I propose to filter allocated containers before
> removeFromOutstandingSchedulingRequests in AMRMClientImpl during
> registerApplicationMaster; some sanity checks are also needed to prevent
> things from getting worse.
> More comments and suggestions are welcome.
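The proposed filtering step can be sketched as follows. This is a minimal illustration only, not the actual AMRMClientImpl code: the dict-based container records, the `filter_for_removal` name, and the attempt-id field are all assumptions made for the sketch.

```python
def filter_for_removal(reported, known_previous_ids, current_attempt):
    """Keep only containers that should actually decrease the outstanding
    scheduling requests: skip containers belonging to the current attempt
    and containers already known from previous attempts."""
    return [c for c in reported
            if c["attempt"] != current_attempt
            and c["id"] not in known_previous_ids]

# Containers reported via getContainersFromPreviousAttempts after an RM failover.
reported = [
    {"id": "c1", "attempt": 1},  # already tracked from a previous attempt
    {"id": "c2", "attempt": 1},  # newly learned previous-attempt container
    {"id": "c3", "attempt": 2},  # belongs to the current attempt -> skip
]
to_remove = filter_for_removal(reported, {"c1"}, current_attempt=2)
print([c["id"] for c in to_remove])  # ['c2']
```

Without the current-attempt check, a container like c3 would wrongly decrease the outstanding request count, which is how the count can be driven below zero after a failover.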
[jira] [Resolved] (YARN-8883) Phase 1 - Provide an example of fake vendor plugin
[ https://issues.apache.org/jira/browse/YARN-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhankun Tang resolved YARN-8883.
--------------------------------
    Resolution: Duplicate

Resolving this as a duplicate, since YARN-9060 already includes an example Nvidia GPU plugin.

> Phase 1 - Provide an example of fake vendor plugin
> --------------------------------------------------
>
>                 Key: YARN-8883
>                 URL: https://issues.apache.org/jira/browse/YARN-8883
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-8883-trunk.001.patch
>
[jira] [Resolved] (YARN-8887) Support isolation in pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhankun Tang resolved YARN-8887.
--------------------------------
    Resolution: Duplicate

Resolving this as a duplicate of YARN-9060.

> Support isolation in pluggable device framework
> -----------------------------------------------
>
>                 Key: YARN-8887
>                 URL: https://issues.apache.org/jira/browse/YARN-8887
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>
> Device isolation needs a complete description in the API
> specs (DeviceRuntimeSpec) and a translator in the adapter to convert the
> requirements into uniform parameters passed to the native container-executor. It
> should support both the default and the Docker container runtime.
> For the default container runtime, we use a new device module in
> container-executor to isolate devices. For Docker containers, we depend on the
> current DockerLinuxContainerRuntime.
[jira] [Assigned] (YARN-7266) Timeline Server event handler threads locked
[ https://issues.apache.org/jira/browse/YARN-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prabhu Joseph reassigned YARN-7266:
-----------------------------------

    Assignee: Prabhu Joseph

> Timeline Server event handler threads locked
> --------------------------------------------
>
>                 Key: YARN-7266
>                 URL: https://issues.apache.org/jira/browse/YARN-7266
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: timelineserver
>    Affects Versions: 2.7.3
>            Reporter: Venkata Puneet Ravuri
>            Assignee: Prabhu Joseph
>            Priority: Major
>
> Event handlers for the Timeline Server seem to take a lock while parsing the HTTP
> headers of a request. This causes all other threads to wait and slows
> down the overall performance of the Timeline Server. We have ResourceManager
> metrics enabled to be sent to the Timeline Server. Because of the high load on
> the ResourceManager, the metrics to be sent are getting backlogged, in turn
> increasing the heap footprint of the ResourceManager (due to pending metrics).
> This is the complete stack trace of a blocked thread on the Timeline Server:
> "2079644967@qtp-1658980982-4560" #4632 daemon prio=5 os_prio=0
> tid=0x7f6ba490a000 nid=0x5eb waiting for monitor entry
> [0x7f6b9142c000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> at com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector.prepare(AccessorInjector.java:82)
> - waiting to lock <0x0005c0621860> (a java.lang.Class for
> com.sun.xml.bind.v2.runtime.reflect.opt.AccessorInjector)
> at com.sun.xml.bind.v2.runtime.reflect.opt.OptimizedAccessorFactory.get(OptimizedAccessorFactory.java:168)
> at com.sun.xml.bind.v2.runtime.reflect.Accessor$FieldReflection.optimize(Accessor.java:282)
> at com.sun.xml.bind.v2.runtime.property.SingleElementNodeProperty.<init>(SingleElementNodeProperty.java:94)
> at sun.reflect.GeneratedConstructorAccessor52.newInstance(Unknown Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.<init>(ClassBeanInfoImpl.java:183)
> at com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:551)
> at com.sun.xml.bind.v2.runtime.property.ArrayElementProperty.<init>(ArrayElementProperty.java:112)
> at com.sun.xml.bind.v2.runtime.property.ArrayElementNodeProperty.<init>(ArrayElementNodeProperty.java:62)
> at sun.reflect.GeneratedConstructorAccessor19.newInstance(Unknown Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
> at java.lang.reflect.Constructor.newInstance(Unknown Source)
> at com.sun.xml.bind.v2.runtime.property.PropertyFactory.create(PropertyFactory.java:128)
> at com.sun.xml.bind.v2.runtime.ClassBeanInfoImpl.<init>(ClassBeanInfoImpl.java:183)
> at com.sun.xml.bind.v2.runtime.JAXBContextImpl.getOrCreate(JAXBContextImpl.java:532)
> at com.sun.xml.bind.v2.runtime.JAXBContextImpl.<init>(JAXBContextImpl.java:347)
> at com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170)
> at com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145)
> at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.newInstance(Unknown Source)
> at javax.xml.bind.ContextFinder.find(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at javax.xml.bind.JAXBContext.newInstance(Unknown Source)
> at com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.buildModelAndSchemas(WadlGeneratorJAXBGrammarGenerator.java:412)
> at com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.createExternalGrammar(WadlGeneratorJAXBGrammarGenerator.java:352)
> at com.sun.jersey.server.wadl.WadlBuilder.generate(WadlBuilder.java:115)
> at com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:104)
> at com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:120)
> at >
[jira] [Commented] (YARN-8821) [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable device framework
[ https://issues.apache.org/jira/browse/YARN-8821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16771683#comment-16771683 ]

Zhankun Tang commented on YARN-8821:
------------------------------------

The unit test failure seems unrelated to this patch.

> [YARN-8851] GPU hierarchy/topology scheduling support based on pluggable
> device framework
> -------------------------------------------------------------------------
>
>                 Key: YARN-8821
>                 URL: https://issues.apache.org/jira/browse/YARN-8821
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zhankun Tang
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: GPUTopologyPerformance.png, YARN-8821-trunk.001.patch,
> YARN-8821-trunk.002.patch, YARN-8821-trunk.003.patch,
> YARN-8821-trunk.004.patch, YARN-8821-trunk.005.patch,
> YARN-8821-trunk.006.patch, YARN-8821-trunk.007.patch,
> YARN-8821-trunk.008.patch, YARN-8821-trunk.009.patch
>
> h2. Background
> GPU topology affects performance. There has been a discussion in YARN-7481, but
> we'd like to move the related discussion here.
> Please note that YARN-8851 will provide a pluggable device framework
> that supports plugging in a custom scheduler. Based on the framework, the GPU
> plugin can have its own topology scheduler.
> h2. Details of the proposed scheduling algorithm
> The proposed patch implements a topology algorithm as below:
> *Step 1*. When allocating devices, parse the output of "nvidia-smi topo -m"
> to build a hash map whose keys are all pairs of GPUs and whose values are the
> communication cost between the two. The map looks like \{"0 - 1"=> 2, "0 -
> 2"=>4, ...}, which means the minimum cost between GPU 0 and GPU 1 is 2. The
> cost is set based on the connection type.
> *Step 2*. It then constructs and caches a _+cost table+_ which holds all
> combinations of GPUs and the corresponding cost between them. The
> cost table is a map whose structure is like
> {code:java}
> { 2=>{[0,1]=>2,..},
>   3=>{[0,1,2]=>10,..},
>   4=>{[0,1,2,3]=>18}}
> {code}
> The key of the map is the count of GPUs; its value is a map whose key
> is a combination of GPUs and whose value is the calculated communication cost
> of that set of GPUs. The cost calculation algorithm sums all
> non-duplicate pairwise GPU costs. For instance, the total cost of GPUs [0,1,2]
> is the sum of the costs "0 - 1", "0 - 2" and "1 - 2", where each pairwise cost
> comes from the map built in step 1.
> *Step 3*. After the cost table is built, when allocating GPUs based on
> topology, we provide two policies which a container can set through the
> environment variable "NVIDIA_TOPO_POLICY". The value can be either "PACK" or
> "SPREAD". "PACK" means it prefers faster GPU-GPU communication;
> "SPREAD" means it prefers faster CPU-GPU communication (since the GPUs then do
> not share the same bus to the CPU). The key difference between the two policies
> is the sort order of the inner map in the cost table. For instance, assume 2
> GPUs are wanted. costTable.get(2) returns a map containing all
> combinations of two GPUs and their cost. If the policy is "PACK", we sort
> the map by cost in ascending order; the first entry is the GPU pair with the
> minimum GPU-GPU cost. If the policy is "SPREAD", we sort it in descending
> order and take the first entry, which has the highest GPU-GPU cost and thus the
> lowest CPU-GPU cost.
> h2. Estimation of the algorithm
> An initial analysis of the topology scheduling algorithm (using the PACK
> policy), based on performance tests on an AWS EC2 instance with 8 GPU cards
> (P3), has been done. The figure below shows the performance gain of the
> topology scheduling algorithm's allocation (PACK policy).
> !GPUTopologyPerformance.png!
> Some of the conclusions are:
> 1. The topology between GPUs impacts performance dramatically. The best
> combination of GPUs can get a *5% to 185% performance gain* among the test
> cases with various factors including CNN model, batch size, GPU subset, etc.
> The scheduling algorithm should come close to this.
> 2. The "inception3" and "resnet50" networks seem not to be topology sensitive.
> Topology scheduling can only potentially get *about 6.8% to 10%* speedup in the
> best cases.
> 3. Our current version of the topology scheduling algorithm can achieve a
> *6.8% to 177.1% performance gain in the best cases. On average, it also
> outperforms the median performance (0.8% to 28.2%).*
> *4. The algorithm's allocations match the fastest GPUs needed by "vgg16"
> best.*
>
> In summary, the GPU topology scheduling algorithm is effective and can
> potentially get a 6.8% to 185% performance gain in the best cases and 1% to 30%
> on average.
> *That is a maximum of about 3X compared to a random GPU scheduling algorithm in
> a specific scenario*.
>
> The spreadsheets are here for your reference.
>
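The three steps above can be condensed into a short sketch. The pairwise costs below are invented example values rather than real `nvidia-smi topo -m` output, and the function names (`build_cost_table`, `allocate`) are hypothetical, not the patch's actual API:

```python
from itertools import combinations

# Step 1: pairwise GPU communication costs (invented example values).
pair_cost = {(0, 1): 2, (0, 2): 4, (0, 3): 4, (1, 2): 4, (1, 3): 4, (2, 3): 2}

def build_cost_table(num_gpus, max_count):
    """Step 2: for each GPU count k, map every k-GPU combination to the
    sum of its non-duplicate pairwise costs."""
    return {
        k: {combo: sum(pair_cost[p] for p in combinations(combo, 2))
            for combo in combinations(range(num_gpus), k)}
        for k in range(2, max_count + 1)
    }

def allocate(table, count, policy="PACK"):
    """Step 3: PACK picks the cheapest combination (fastest GPU-GPU links);
    SPREAD picks the most expensive one (favoring CPU-GPU bandwidth)."""
    ranked = sorted(table[count].items(), key=lambda kv: kv[1],
                    reverse=(policy == "SPREAD"))
    return ranked[0][0]

table = build_cost_table(4, 4)
print(table[3][(0, 1, 2)])         # 10 (= cost(0-1) + cost(0-2) + cost(1-2))
print(allocate(table, 2, "PACK"))  # (0, 1)
```

Note the cost of [0,1,2] here matches the 3=>{[0,1,2]=>10} entry in the example cost table above; only the sort direction distinguishes the two policies.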