[jira] [Updated] (YARN-9639) DecommissioningNodesWatcher cause memory leak
[ https://issues.apache.org/jira/browse/YARN-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tan, Wangda updated YARN-9639: -- Priority: Blocker (was: Critical) > DecommissioningNodesWatcher cause memory leak > - > > Key: YARN-9639 > URL: https://issues.apache.org/jira/browse/YARN-9639 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bilwa S T >Priority: Blocker > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9639-001.patch > > > The missing cancel() of the Timer task in DecommissioningNodesWatcher could lead to a > memory leak. > PollTimerTask holds a reference to the RMContext -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
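A minimal, self-contained sketch of the leak pattern and fix described above. The class and field names below are illustrative only and are not the actual DecommissioningNodesWatcher code; the point is that a TimerTask which captures a heavyweight context object keeps that object reachable through the Timer thread until the timer is cancelled, so cancel() must be called on shutdown.

{code:java}
import java.util.Timer;
import java.util.TimerTask;

// Stands in for DecommissioningNodesWatcher and its PollTimerTask.
class NodesWatcherSketch {
  private final Object rmContextLike;                            // stands in for the RMContext reference
  private final Timer pollTimer = new Timer("poll-timer", true);

  NodesWatcherSketch(Object rmContextLike) {
    this.rmContextLike = rmContextLike;
    // The scheduled task captures "this", and through it the context object.
    pollTimer.scheduleAtFixedRate(new TimerTask() {
      @Override public void run() { poll(); }
    }, 0L, 30_000L);
  }

  private void poll() {
    if (rmContextLike == null) {
      return;                                                    // inspect decommissioning nodes via the context here
    }
  }

  // The fix: cancel the timer when the watcher stops, so the task (and the
  // context it references) can be garbage collected.
  void stop() {
    pollTimer.cancel();
  }
}
{code}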
[jira] [Commented] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16893792#comment-16893792 ] Tan, Wangda commented on YARN-9698: --- [~cane], is the feature you mentioned supported by FairScheduler? Or is it just a new feature? > [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler > > > Key: YARN-9698 > URL: https://issues.apache.org/jira/browse/YARN-9698 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Weiwei Yang >Priority: Major > Labels: fs2cs > > We see some users who want to migrate from Fair Scheduler to Capacity Scheduler, > so this Jira is created as an umbrella to track all related efforts for the > migration. The scope contains: > * Bug fixes > * Adding missing features > * Migration tools that help to generate CS configs based on FS, validate > configs, etc. > * Documentation > This is part of the CS component; the purpose is to make the migration process > smooth. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9660) Enhance documentation of Docker on YARN support
[ https://issues.apache.org/jira/browse/YARN-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878079#comment-16878079 ] Tan, Wangda commented on YARN-9660: --- Thanks [~pbacsko] for working on this. All great improvements! > Enhance documentation of Docker on YARN support > --- > > Key: YARN-9660 > URL: https://issues.apache.org/jira/browse/YARN-9660 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, nodemanager >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9660-001.patch > > > Right now, using Docker on YARN has some hard requirements. If these > requirements are not met, then launching the containers will fail and an > error message will be printed. Depending on how familiar the user is with > Docker, it might or might not be easy for them to understand what went wrong > and how to fix the underlying problem. > It would be important to explicitly document these requirements along with > the error messages. > *#1: CGroups handler cannot be systemd* > If the docker daemon runs with the systemd cgroups handler, we receive the following > error upon launching a container: > {noformat} > Container id: container_1561638268473_0006_01_02 > Exit code: 7 > Exception message: Launch container failed > Shell error output: /usr/bin/docker-current: Error response from daemon: > cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice". > See '/usr/bin/docker-current run --help'. > Shell output: main : command provided 4 > main : run as user is johndoe > main : requested yarn user is johndoe > {noformat} > Solution: switch to cgroupfs. Doing so can be OS-specific, but we can > document a {{systemctl}} example. > > *#2: {{/bin/bash}} must be present on the {{$PATH}} inside the container* > Some smaller images like "busybox" or "alpine" do not have {{/bin/bash}}. > It's because all commands under {{/bin}} are linked to {{/bin/busybox}} and > there's only {{/bin/sh}}. > If we try to use these kinds of images, we'll see the following error message: > {noformat} > Container id: container_1561638268473_0015_01_02 > Exit code: 7 > Exception message: Launch container failed > Shell error output: /usr/bin/docker-current: Error response from daemon: oci > runtime error: container_linux.go:235: starting container process caused > "exec: \"bash\": executable file not found in $PATH". > Shell output: main : command provided 4 > main : run as user is johndoe > main : requested yarn user is johndoe > {noformat} > > *#3: {{find}} command must be available on the {{$PATH}}* > It seems obvious that we have the {{find}} command, but even very popular > images like {{fedora}} require that we install it separately. > If we don't have {{find}} available, then {{launch_container.sh}} fails > with: > {noformat} > [2019-07-01 03:51:25.053]Container exited with a non-zero exit code 127. > Error file: prelaunch.err. > Last 4096 bytes of prelaunch.err : > /tmp/hadoop-systest/nm-local-dir/usercache/systest/appcache/application_1561638268473_0017/container_1561638268473_0017_01_02/launch_container.sh: > line 44: find: command not found > Last 4096 bytes of stderr.txt : > [2019-07-01 03:51:25.053]Container exited with a non-zero exit code 127. > Error file: prelaunch.err. 
> Last 4096 bytes of prelaunch.err : > /tmp/hadoop-systest/nm-local-dir/usercache/systest/appcache/application_1561638268473_0017/container_1561638268473_0017_01_02/launch_container.sh: > line 44: find: command not found > Last 4096 bytes of stderr.txt : > {noformat} > *#4 Add cmd-line example of how to tag local images* > This is actually documented under "Privileged Container Security > Consideration", but a one-liner would be helpful. I had trouble running a > local docker image and tagging it appropriately. Just an example like > {{docker tag local_ubuntu local/ubuntu:latest}} is already very informative. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9559) Create AbstractContainersLauncher for pluggable ContainersLauncher logic
[ https://issues.apache.org/jira/browse/YARN-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867128#comment-16867128 ] Tan, Wangda commented on YARN-9559: --- Gotcha, makes sense, thanks for the clarification. [~jhung] > Create AbstractContainersLauncher for pluggable ContainersLauncher logic > > > Key: YARN-9559 > URL: https://issues.apache.org/jira/browse/YARN-9559 > Project: Hadoop YARN > Issue Type: Task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Attachments: YARN-9559.001.patch, YARN-9559.002.patch, > YARN-9559.003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9559) Create AbstractContainersLauncher for pluggable ContainersLauncher logic
[ https://issues.apache.org/jira/browse/YARN-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867030#comment-16867030 ] Tan, Wangda commented on YARN-9559: --- [~jhung], could you share what the use cases for this are? Why do we want to make it abstract? > Create AbstractContainersLauncher for pluggable ContainersLauncher logic > > > Key: YARN-9559 > URL: https://issues.apache.org/jira/browse/YARN-9559 > Project: Hadoop YARN > Issue Type: Task >Reporter: Jonathan Hung >Assignee: Jonathan Hung >Priority: Major > Attachments: YARN-9559.001.patch, YARN-9559.002.patch, > YARN-9559.003.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9327) ProtoUtils#convertToProtoFormat block Application Master Service and many more
[ https://issues.apache.org/jira/browse/YARN-9327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862433#comment-16862433 ] Tan, Wangda edited comment on YARN-9327 at 6/12/19 8:22 PM: This seems tricky, because the ResourcePBImpl doesn't have proper protection as I can see. A similar fix is https://issues.apache.org/jira/browse/YARN-2387 I think we should make sure at least {{getProto}} and {{maybeInitBuilder}} protected by synchronized lock. Now the synchronized lock is only on {{mergeLocalToBuilder}}, which is not sufficient. This won't protect read stale value of resource information, but if we want to protect read/write resource information path, we need to carefully look at performance impact. Remove the synchronized static lock seems like a right fix, it looks like a mistake in previous patch. was (Author: wangda): This seems tricky, because the ResourcePBImpl doesn't have proper protection as I can see. A similar fix is https://issues.apache.org/jira/browse/YARN-2387 I think we should make sure at least {{getProto}} and {{maybeInitBuilder }}protected by synchronized lock. Now the synchronized lock is only on {{mergeLocalToBuilder}}, which is not sufficient. This won't protect read stale value of resource information, but if we want to protect read/write resource information path, we need to carefully look at performance impact. Remove the synchronized static lock seems like a right fix, it looks like a mistake in previous patch. > ProtoUtils#convertToProtoFormat block Application Master Service and many more > -- > > Key: YARN-9327 > URL: https://issues.apache.org/jira/browse/YARN-9327 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-9327.001.patch > > > {code} > public static synchronized ResourceProto convertToProtoFormat(Resource r) { > return ResourcePBImpl.getProto(r); > } > {code} > {noformat} > "IPC Server handler 41 on 23764" #324 daemon prio=5 os_prio=0 > tid=0x7f181de72800 nid=0x222 waiting for monitor entry > [0x7ef153dad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:404) > - waiting to lock <0x7ef2d8bcf6d8> (a java.lang.Class for > org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:315) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:262) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:289) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:228) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:844) > - locked <0x7f0fed968a30> (a > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:72) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:810) > - locked <0x7f0fed96f500> (a > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:799) > at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:13810) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:158) > - locked <0x7f0fed968a30> (a > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:198) > - eliminated <0x7f0fed968a30> (a > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:103) > - locked <0x7f0fed968a30> (a > org.apache.hadoo
[jira] [Commented] (YARN-9327) ProtoUtils#convertToProtoFormat block Application Master Service and many more
[ https://issues.apache.org/jira/browse/YARN-9327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862433#comment-16862433 ] Tan, Wangda commented on YARN-9327: --- This seems tricky, because the ResourcePBImpl doesn't have proper protection as I can see. A similar fix is https://issues.apache.org/jira/browse/YARN-2387 I think we should make sure at least {{getProto}} and {{maybeInitBuilder }}protected by synchronized lock. Now the synchronized lock is only on {{mergeLocalToBuilder}}, which is not sufficient. This won't protect read stale value of resource information, but if we want to protect read/write resource information path, we need to carefully look at performance impact. Remove the synchronized static lock seems like a right fix, it looks like a mistake in previous patch. > ProtoUtils#convertToProtoFormat block Application Master Service and many more > -- > > Key: YARN-9327 > URL: https://issues.apache.org/jira/browse/YARN-9327 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Attachments: YARN-9327.001.patch > > > {code} > public static synchronized ResourceProto convertToProtoFormat(Resource r) { > return ResourcePBImpl.getProto(r); > } > {code} > {noformat} > "IPC Server handler 41 on 23764" #324 daemon prio=5 os_prio=0 > tid=0x7f181de72800 nid=0x222 waiting for monitor entry > [0x7ef153dad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:404) > - waiting to lock <0x7ef2d8bcf6d8> (a java.lang.Class for > org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:315) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:262) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:289) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:228) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:844) > - locked <0x7f0fed968a30> (a > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:72) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:810) > - locked <0x7f0fed96f500> (a > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:799) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:13810) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:158) > - locked <0x7f0fed968a30> (a > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl) > at > 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:198) > - eliminated <0x7f0fed968a30> (a > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:103) > - locked <0x7f0fed968a30> (a > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:824) > at java.security.Ac
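A rough illustration of the locking direction discussed in the YARN-9327 comments above. This is a simplified, hypothetical sketch (the field types are placeholders), not the actual ResourcePBImpl or YARN-2387 code: getProto() and maybeInitBuilder() synchronize on the record instance itself rather than on a shared static lock, so callers working on different records no longer block each other the way the static synchronized ProtoUtils#convertToProtoFormat does.

{code:java}
// Hypothetical PBImpl-style record; Object/StringBuilder stand in for the
// generated protobuf message and its builder.
class RecordPBImplSketch {
  private Object proto;
  private StringBuilder builder;
  private boolean viaProto = false;

  // Both getProto() and maybeInitBuilder() hold the instance lock, so a
  // concurrent writer cannot observe or produce a half-merged builder.
  synchronized Object getProto() {
    mergeLocalToProto();
    viaProto = true;
    return proto;
  }

  synchronized void maybeInitBuilder() {
    if (viaProto || builder == null) {
      builder = new StringBuilder(proto == null ? "" : proto.toString());
      viaProto = false;
    }
  }

  private synchronized void mergeLocalToProto() {
    maybeInitBuilder();
    proto = builder.toString();   // stands in for builder.build()
  }
}
{code}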
[jira] [Commented] (YARN-6875) New aggregated log file format for YARN log aggregation.
[ https://issues.apache.org/jira/browse/YARN-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858095#comment-16858095 ] Tan, Wangda commented on YARN-6875: --- [~larsfrancke], this is already usable on appendable file systems like HDFS. I'd prefer to close this ticket and leave the remaining tasks open. > New aggregated log file format for YARN log aggregation. > > > Key: YARN-6875 > URL: https://issues.apache.org/jira/browse/YARN-6875 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Xuan Gong >Assignee: Xuan Gong >Priority: Major > Attachments: YARN-6875-NewLogAggregationFormat-design-doc.pdf > > > T-file is the underlying log format for the aggregated logs in YARN. We have > seen several performance issues, especially for very large log files. > We will introduce a new log format which has better performance for large > log files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9607) Auto-configuring rollover-size of IFile format for non-appendable filesystems
[ https://issues.apache.org/jira/browse/YARN-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858066#comment-16858066 ] Tan, Wangda commented on YARN-9607: --- Thanks [~adam.antal], I would vote for enforcing the roll size to zero if the scheme is known to be a non-appendable file system. (We may need a list of non-appendable FSs to make it more comprehensive.) cc: [~ste...@apache.org] > Auto-configuring rollover-size of IFile format for non-appendable filesystems > - > > Key: YARN-9607 > URL: https://issues.apache.org/jira/browse/YARN-9607 > Project: Hadoop YARN > Issue Type: Sub-task > Components: log-aggregation, yarn >Affects Versions: 3.3.0 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > > In YARN-9525, we made the IFile format compatible with remote folders with the s3a > scheme. In rolling-fashion log aggregation, IFile still fails with the > "append is not supported" error message, which is a known limitation of the > format by design. > There is a workaround though: by setting the rollover size in the configuration > of the IFile format, a new aggregated log file will be created in each rolling cycle, > thus eliminating the append from the process. Setting this config > globally would cause performance problems in the regular log-aggregation, so > I'm suggesting enforcing this config to zero if the scheme of the URI is > s3a (or any other non-appendable filesystem). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
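A sketch of the idea in the comment above. The class name and the scheme list are illustrative assumptions, not the actual IFile controller API: if the remote log directory's filesystem scheme is known not to support append, force the effective roll-over size to zero so every rolling cycle starts a fresh file.

{code:java}
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class RollOverPolicySketch {
  // Hypothetical list of non-appendable schemes; a real implementation would
  // likely make this configurable.
  private static final Set<String> NON_APPENDABLE_SCHEMES =
      new HashSet<>(Arrays.asList("s3a", "wasb", "gs"));

  static long effectiveRollOverSize(URI remoteLogDir, long configuredSize) {
    String scheme = remoteLogDir.getScheme();
    if (scheme != null && NON_APPENDABLE_SCHEMES.contains(scheme.toLowerCase())) {
      return 0L;   // zero => write a new aggregated file each cycle, never append
    }
    return configuredSize;
  }

  public static void main(String[] args) {
    long tenGiB = 10L * 1024 * 1024 * 1024;
    System.out.println(effectiveRollOverSize(URI.create("s3a://bucket/logs"), tenGiB));  // 0
    System.out.println(effectiveRollOverSize(URI.create("hdfs://nn/logs"), tenGiB));     // 10737418240
  }
}
{code}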
[jira] [Commented] (YARN-3213) Respect labels in Capacity Scheduler when computing user-limit
[ https://issues.apache.org/jira/browse/YARN-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857131#comment-16857131 ] Tan, Wangda commented on YARN-3213: --- This Jira is about respecting the queue's user-limit on a per-label basis. So far, YARN CS doesn't support specifying different user limits for different node partitions. Please feel free to file a Jira and work on that if you have such a need; we can help with reviews. cc: [~sunil.gov...@gmail.com] > Respect labels in Capacity Scheduler when computing user-limit > -- > > Key: YARN-3213 > URL: https://issues.apache.org/jira/browse/YARN-3213 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > > Now we can support node-labels in Capacity Scheduler, but user-limit > computing doesn't respect node-labels enough; we should fix that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856807#comment-16856807 ] Tan, Wangda commented on YARN-9525: --- Good news, Patch looks good, thanks [~adam.antal] and [~pbacsko] for working on the patch and validate it! :) Have we tested the patch when rolling aggregation is enabled, and file system is appendable? Just want to make sure append-rolling scenario is not broken by it. > IFile format is not working against s3a remote folder > - > > Key: YARN-9525 > URL: https://issues.apache.org/jira/browse/YARN-9525 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 3.1.2 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch, > YARN-9525.002.patch, YARN-9525.003.patch > > > Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} > configured to an s3a URI throws the following exception during log > aggregation: > {noformat} > Cannot create writer for app application_1556199768861_0001. Skip log upload > this time. > java.io.IOException: java.io.FileNotFoundException: No such file or > directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at > 
org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195) > ... 7 more > {noformat} > This stack trace points to > {{LogAggregationIndexedFileController$initializeWriter}} where we do the > following steps (in a non-rolling log aggregation setup): > - creating an FSDataOutputStream > - writing out a UUID > - flushing > - immediately after that we call a GetFileStatus to get the length of the log > file (the bytes we just wrote out), and that's where the failure happens: > the file is not there yet due to eventual consistency. > Maybe we can get rid of that, so we can use the IFile format against an s3a target. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852175#comment-16852175 ] Tan, Wangda commented on YARN-9525: --- Thanks [~adam.antal], For the rolling issue, is it still existed once we change the rolling size to 0? {code:java} @Private @VisibleForTesting public long getRollOverLogMaxSize(Configuration conf) { return 1024L * 1024 * 1024 * conf.getInt( LOG_ROLL_OVER_MAX_FILE_SIZE_GB, 10); }{code} > IFile format is not working against s3a remote folder > - > > Key: YARN-9525 > URL: https://issues.apache.org/jira/browse/YARN-9525 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 3.1.2 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch > > > Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} > configured to an s3a URI throws the following exception during log > aggregation: > {noformat} > Cannot create writer for app application_1556199768861_0001. Skip log upload > this time. > java.io.IOException: java.io.FileNotFoundException: No such file or > directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at > 
org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195) > ... 7 more > {noformat} > This stack trace points to > {{LogAggregationIndexedFileController$initializeWriter}} where we do the > following steps (in a non-rolling log aggregation setup): > - creating an FSDataOutputStream > - writing out a UUID > - flushing > - immediately after that we call a GetFileStatus to get the length of the log > file (the bytes we just wrote out), and that's where the failure happens: > the file is not there yet due to eventual consistency. > Maybe we can get rid of that, so we can use the IFile format against an s3a target. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
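To make the question in the comment above concrete, a small worked example. Only the arithmetic of the quoted getRollOverLogMaxSize() method is taken from the snippet; the config key string and the wrapper class below are assumptions made for illustration. Setting the roll-over GB setting to 0 makes the method return 0, so a new aggregated file is started in every cycle and append is never attempted.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RollOverSizeCheck {
  // Assumed key name for illustration; the real constant lives in the IFile controller.
  static final String LOG_ROLL_OVER_MAX_FILE_SIZE_GB =
      "yarn.log-aggregation.ifile.roll-over.max-file-size-gb";

  // Same arithmetic as the quoted getRollOverLogMaxSize(): GB setting * 1 GiB.
  static long getRollOverLogMaxSize(Configuration conf) {
    return 1024L * 1024 * 1024 * conf.getInt(LOG_ROLL_OVER_MAX_FILE_SIZE_GB, 10);
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    System.out.println(getRollOverLogMaxSize(conf));   // default: 10737418240 (10 GiB)
    conf.setInt(LOG_ROLL_OVER_MAX_FILE_SIZE_GB, 0);
    System.out.println(getRollOverLogMaxSize(conf));   // 0 => roll over every cycle
  }
}
{code}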
[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even though running state
[ https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852169#comment-16852169 ] Tan, Wangda commented on YARN-9386: --- Thanks [~kyungwan nam], IIUC, only the app owner or a queue/cluster admin can destroy an app, is that correct? To me, stop + destroy and destroy alone are the same; I don't find destroying a running service substantially more dangerous than stopping a running app and then destroying it. Could you share your thoughts? If we want to support more granular permissions, a better way is to define access rules about who can perform which operations, where the operations include start/stop/destroy. > destroying yarn-service is allowed even though running state > > > Key: YARN-9386 > URL: https://issues.apache.org/jira/browse/YARN-9386 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: kyungwan nam >Assignee: kyungwan nam >Priority: Major > Attachments: YARN-9386.001.patch, YARN-9386.002.patch > > > It looks very dangerous to destroy a running app. It should not be allowed. > {code} > [yarn-ats@test ~]$ yarn app -list > 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > Total number of applications (application-types: [], states: [SUBMITTED, > ACCEPTED, RUNNING] and tags: []):3 > Application-Id Application-Name Application-Type > User Queue State Final-State > Progress Tracking-URL > application_1551250841677_0003 fb yarn-service > ambari-qa default RUNNING UNDEFINED > 100% N/A > application_1552379723611_0002 fb1 yarn-service > yarn-ats default RUNNING UNDEFINED > 100% N/A > application_1550801435420_0001 ats-hbase yarn-service > yarn-ats default RUNNING UNDEFINED > 100% N/A > [yarn-ats@test ~]$ yarn app -destroy fb1 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at > test1.com/10.1.1.11:8050 > 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History > server at test1.com/10.1.1.101:10200 > 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms > 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed > service fb1 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9581) LogsCli getAMContainerInfoForRMWebService ignores rm2
[ https://issues.apache.org/jira/browse/YARN-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16847962#comment-16847962 ] Tan, Wangda commented on YARN-9581: --- Nice catch, thanks [~Prabhu Joseph]. > LogsCli getAMContainerInfoForRMWebService ignores rm2 > - > > Key: YARN-9581 > URL: https://issues.apache.org/jira/browse/YARN-9581 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Yarn Logs fails for a running job in case of RM HA with rm2 active. > {code} > hrt_qa@prabhuYarn:~> /usr/hdp/current/hadoop-yarn-client/bin/yarn logs > -applicationId application_1558613472348_0004 -am 1 > 19/05/24 18:04:49 INFO client.AHSProxy: Connecting to Application History > server at prabhuYarn/172.27.23.55:10200 > 19/05/24 18:04:50 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Unable to get AM container informations for the > application:application_1558613472348_0004 > java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > Error while authenticating with endpoint: > https://prabhuYarn:8090/ws/v1/cluster/apps/application_1558613472348_0004/appattempts > Can not get AMContainers logs for the > application:application_1558613472348_0004 with the appOwner:hrt_qa > {code} > LogsCli getRMWebAppURLWithoutScheme only checks the first one from the RM > list yarn.resourcemanager.ha.rm-ids. > {code} > yarnConfig.set(YarnConfiguration.RM_HA_ID, rmIds.get(0)); > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
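A rough sketch of the direction implied by the YARN-9581 description above. The class and method names are invented for illustration and are not the actual LogsCLI code: rather than pinning the RM id to the first entry of yarn.resourcemanager.ha.rm-ids, try each configured RM in turn and fall through to the next one when a standby or unreachable RM rejects the request.

{code:java}
import java.io.IOException;
import java.util.List;

// Illustrative only: fetch AM container info from whichever RM answers.
class AmInfoFetcherSketch {
  String fetchFromAnyRm(List<String> rmIds) throws IOException {
    IOException last = null;
    for (String rmId : rmIds) {                 // e.g. ["rm1", "rm2"]
      try {
        return fetchFromRm(rmId);               // call this RM's web service
      } catch (IOException e) {
        last = e;                               // standby/unreachable: try the next id
      }
    }
    throw last != null ? last : new IOException("no RM ids configured");
  }

  // Placeholder for the REST call to /ws/v1/cluster/apps/<appId>/appattempts.
  private String fetchFromRm(String rmId) throws IOException {
    throw new IOException("not implemented in this sketch: " + rmId);
  }
}
{code}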
[jira] [Commented] (YARN-9521) RM filed to start due to system services
[ https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846230#comment-16846230 ] Tan, Wangda commented on YARN-9521: --- Thanks [~kyungwan nam] for the patch. cc: [~rohithsharma], [~billie.rina...@gmail.com] for patch reviewers. > RM filed to start due to system services > > > Key: YARN-9521 > URL: https://issues.apache.org/jira/browse/YARN-9521 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.2 >Reporter: kyungwan nam >Priority: Major > Attachments: YARN-9521.001.patch > > > when starting RM, listing system services directory has failed as follows. > {code} > 2019-04-30 17:18:25,441 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory > is configured to /services > 2019-04-30 17:18:25,467 INFO client.SystemServiceManagerImpl > (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation > initialized to yarn (auth:SIMPLE) > 2019-04-30 17:18:25,467 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in > state STARTED > org.apache.hadoop.service.ServiceStateException: java.io.IOException: > Filesystem closed > at > org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:203) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501) > Caused by: java.io.IOException: Filesystem closed > at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179) > at > org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187) > at > 
org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282) > at > org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) > ... 13 more > {code} > It looks like this is due to the usage of the FileSystem cache. > This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to > yarn-site -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
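One possible direction for the cache problem described above, sketched under assumptions (the class below is illustrative and not the actual SystemServiceManagerImpl fix): obtain a private FileSystem via FileSystem.newInstance() instead of the cached FileSystem.get(), so a close() issued elsewhere in the process cannot invalidate the instance used for scanning system services. Disabling the cache via fs.hdfs.impl.disable.cache achieves a similar effect through configuration.

{code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class SystemServiceDirScanner implements AutoCloseable {
  private final FileSystem fs;

  SystemServiceDirScanner(URI uri, Configuration conf) throws IOException {
    // newInstance() bypasses the shared per-scheme FileSystem cache, unlike get().
    this.fs = FileSystem.newInstance(uri, conf);
  }

  FileStatus[] listServices(String dir) throws IOException {
    return fs.listStatus(new Path(dir));   // e.g. the configured /services directory
  }

  @Override
  public void close() throws IOException {
    fs.close();   // only closes this private instance, not a shared cached one
  }
}
{code}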
[jira] [Commented] (YARN-9576) ResourceUsageMultiNodeLookupPolicy may cause Application starve forever
[ https://issues.apache.org/jira/browse/YARN-9576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846213#comment-16846213 ] Tan, Wangda commented on YARN-9576: --- [~jutia], actually this behavior is not caused by the multi-node lookup policy, it is caused by resource fragmentation. There's no good solution for this except queue-priority-based preemption. See YARN-5864. > ResourceUsageMultiNodeLookupPolicy may cause Application starve forever > > > Key: YARN-9576 > URL: https://issues.apache.org/jira/browse/YARN-9576 > Project: Hadoop YARN > Issue Type: Bug >Reporter: tianjuan >Priority: Major > > Seems that ResourceUsageMultiNodeLookupPolicy in YARN-7494 may cause an > application to starve forever. > For example, there are 10 nodes (h1, h2, ..., h9, h10) in the cluster, each with 8G memory, > and two queues A and B, each configured with 50% capacity. > First, 10 jobs (each requesting 6G of resource) are submitted to queue A, > and each of the 10 nodes will have a container allocated. > Afterwards, another job JobB which requests 3G of resource is submitted to queue > B, and there will be one container with 3G size reserved on node h1. > With ResourceUsageMultiNodeLookupPolicy, the order policy will always be > h1, h2, ..., h9, h10, and there will always be one container re-reserved on node > h1; no other reservation happens, so JobB will hang forever. [~sunilg] what's > your thought about this situation? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846088#comment-16846088 ] Tan, Wangda commented on YARN-9567: --- [~Tao Yang] {quote}Is it ok to support like this in UI1? Please feel free to give your suggestions. {quote} Of course! > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: no_diagnostic_at_first.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9569) Auto-created leaf queues do not honor cluster-wide min/max memory/vcores
[ https://issues.apache.org/jira/browse/YARN-9569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846085#comment-16846085 ] Tan, Wangda commented on YARN-9569: --- Thanks [~ccondit], good catch. I remember the reason why we initialize CSConf w/o loading default is we try to not pollute configs. [~suma.shivaprasad] do you remember? Could you suggest what should be the proper fix? > Auto-created leaf queues do not honor cluster-wide min/max memory/vcores > > > Key: YARN-9569 > URL: https://issues.apache.org/jira/browse/YARN-9569 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Affects Versions: 3.2.0 >Reporter: Craig Condit >Priority: Major > > Auto-created leaf queues do not honor cluster-wide settings for maximum > CPU/vcores allocation. > To reproduce: > # Set auto-create-child-queue.enabled=true for a parent queue. > # Set leaf-queue-template.maximum-allocation-mb=16384. > # Set yarn.resource-types.memory-mb.maximum-allocation=16384 in > resource-types.xml > # Launch a YARN app with a container requesting 16 GB RAM. > > This scenario should work, but instead you get an error similar to this: > {code:java} > java.lang.IllegalArgumentException: Queue maximum allocation cannot be larger > than the cluster setting for queue root.auto.test max allocation per queue: > cluster setting: {code} > > This seems to be caused by this code in > ManagedParentQueue.getLeafQueueConfigs: > {code:java} > CapacitySchedulerConfiguration leafQueueConfigTemplate = new > CapacitySchedulerConfiguration(new Configuration(false), false);{code} > > This initializes a new leaf queue configuration that does not read > resource-types.xml (or any other config). Later, this > CapacitySchedulerConfiguration instance calls > ResourceUtils.fetchMaximumAllocationFromConfig() from its > getMaximumAllocationPerQueue() method and passes itself as the configuration > to use. Since the resource types are not present, ResourceUtils falls back to > compiled-in defaults of 8GB RAM, 4 cores. > > I was able to work around this with a custom AutoCreatedQueueManagementPolicy > implementation which does something like this in init() and reinitialize(): > {code:java} > for (Map.Entry entry : this.scheduler.getConfiguration()) { > if (entry.getKey().startsWith("yarn.resource-types")) { > parentQueue.getLeafQueueTemplate().getLeafQueueConfigs() > .set(entry.getKey(), entry.getValue()); > } > } > {code} > However, this is obviously a very hacky way to solve the problem. > I can submit a proper patch if someone can provide some direction as to the > best way to proceed. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7494) Add muti-node lookup mechanism and pluggable nodes sorting policies to optimize placement decision
[ https://issues.apache.org/jira/browse/YARN-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846081#comment-16846081 ] Tan, Wangda commented on YARN-7494: --- + [~cheersyang] > Add muti-node lookup mechanism and pluggable nodes sorting policies to > optimize placement decision > -- > > Key: YARN-7494 > URL: https://issues.apache.org/jira/browse/YARN-7494 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-7494.001.patch, YARN-7494.002.patch, > YARN-7494.003.patch, YARN-7494.004.patch, YARN-7494.005.patch, > YARN-7494.006.patch, YARN-7494.007.patch, YARN-7494.008.patch, > YARN-7494.009.patch, YARN-7494.010.patch, YARN-7494.11.patch, > YARN-7494.12.patch, YARN-7494.13.patch, YARN-7494.14.patch, > YARN-7494.15.patch, YARN-7494.16.patch, YARN-7494.17.patch, > YARN-7494.18.patch, YARN-7494.19.patch, YARN-7494.20.patch, > YARN-7494.v0.patch, YARN-7494.v1.patch, multi-node-designProposal.png > > > Instead of single node, for effectiveness we can consider a multi node lookup > based on partition to start with. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16845054#comment-16845054 ] Tan, Wangda commented on YARN-9525: --- Thanks [~pbacsko] for the patch. The change looks good to me, [~pbacsko], could you update what is the test you have done? > IFile format is not working against s3a remote folder > - > > Key: YARN-9525 > URL: https://issues.apache.org/jira/browse/YARN-9525 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 3.1.2 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch > > > Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} > configured to an s3a URI throws the following exception during log > aggregation: > {noformat} > Cannot create writer for app application_1556199768861_0001. Skip log upload > this time. > java.io.IOException: java.io.FileNotFoundException: No such file or > directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195) > ... 
7 more > {noformat} > This stack trace points to > {{LogAggregationIndexedFileController$initializeWriter}} where we do the > following steps (in a non-rolling log aggregation setup): > - creating an FSDataOutputStream > - writing out a UUID > - flushing > - immediately after that we call a GetFileStatus to get the length of the log > file (the bytes we just wrote out), and that's where the failure happens: > the file is not there yet due to eventual consistency. > Maybe we can get rid of that, so we can use the IFile format against an s3a target. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page
[ https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844506#comment-16844506 ] Tan, Wangda commented on YARN-9567: --- This looks great! Huge thanks to [~Tao Yang] for pushing it! How can user request activities? Only through REST API? It might be better if we can make it as part of the UI. cc: [~akhilpb], [~sunil.gov...@gmail.com]: This might be easier to support in UI2 than UI1, and users can get a more integrated experience from the new UI. + Other folks might be interested: [~vinodkv], [~aiden_zhang], [~jeffreyz], [~yzhangal], [~liuxun323]. > Add diagnostics for outstanding resource requests on app attempts page > -- > > Key: YARN-9567 > URL: https://issues.apache.org/jira/browse/YARN-9567 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: no_diagnostic_at_first.png, > show_diagnostics_after_requesting_app_activities_REST_API.png > > > Currently on app attempt page we can see outstanding resource requests, it > will be helpful for users to know why if we can join diagnostics of this app > with these. > Discussed with [~cheersyang], we can passively load diagnostics from cache of > completed app activities instead of actively triggering which may bring > uncontrollable risks. > For example: > (1) At first we can see no diagnostic in cache if app activities not > triggered below the outstanding requests. > !no_diagnostic_at_first.png|width=793,height=248! > (2) After requesting the application activities REST API, we can see > diagnostics now. > !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844496#comment-16844496 ] Tan, Wangda commented on YARN-4946: --- Thanks [~ccondit], I'm not entirely convinced that fix of this patch is correct, I don't see a strong need to backport this patch, and introducing a new config for this seems unnecessary. I will be fine with: a. Revert this patch and update MAPREDUCE-6415 (if any changes needed). b. Keep this patch as-is since it solved a minor problem but also introduces risks. Adding a config parameter seems over-kill to me, I would like to avoid it if possible. > RM should not consider an application as COMPLETED when log aggregation is > not in a terminal state > -- > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-4946.001.patch, YARN-4946.002.patch, > YARN-4946.003.patch, YARN-4946.004.patch > > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > The RM should not consider an app to be fully completed (and thus removed > from its history) until the aggregation status has reached a terminal state > (e.g. SUCCEEDED, FAILED, TIME_OUT). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9570) application in pending-ordering-policy is not considered while container allocation
[ https://issues.apache.org/jira/browse/YARN-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tan, Wangda updated YARN-9570: -- Summary: application in pending-ordering-policy is not considered while container allocation (was: pplication in pending-ordering-policy is not considered while container allocation) > application in pending-ordering-policy is not considered while container > allocation > --- > > Key: YARN-9570 > URL: https://issues.apache.org/jira/browse/YARN-9570 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Yesha Vora >Priority: Major > > This is a 5-node cluster with 15GB total capacity. > 1) Configure Capacity Scheduler and set max cluster priority=10 > 2) Launch app1 with no priority and wait for it to occupy the full cluster: > application_1558135983180_0001 is launched with Priority=0 > 3) Launch app2 with priority=2 and check it's in ACCEPTED state: > application_1558135983180_0002 is launched with Priority=2 > 4) Launch app3 with priority=3 and check it's in ACCEPTED state: > application_1558135983180_0003 is launched with Priority=2 > 5) Kill a container from app1 > 6) Verify app3 with higher priority goes to RUNNING state. > When max-application-master-percentage is set to 0.1, app2 goes to RUNNING > state even though app3 has higher priority. > Root cause: > In CS LeafQueue, there are two ordering lists: > If the queue's total application master usage is below > maxAMResourcePerQueuePercent, the app will be added to the "ordering-policy" > list. > Otherwise, the app will be added to the "pending-ordering-policy" list. > During allocation, only apps in "ordering-policy" are considered. > If any app finishes, the queue config changes, or a node is added/removed, > "pending-ordering-policy" will be reconsidered, and some apps from > "pending-ordering-policy" will be added to "ordering-policy". > This behavior leads to the issue of this JIRA: > The cluster has 15GB of resource, and max-application-master-percentage is set > to 0.1. So it can at most accept 2GB (rounded by 1GB) of AM resource, which > equals 2 applications. > When app2 is submitted, it will be added to ordering-policy. > When app3 is submitted, it will be added to pending-ordering-policy. > When we kill app1, it won't finish immediately. Instead, it will still be > part of "ordering-policy" until all containers of app1 are released. (That keeps > app3 in pending-ordering-policy.) > So app3 cannot pick up any resource released by app1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9543) UI2 should handle missing ATSv2 gracefully
[ https://issues.apache.org/jira/browse/YARN-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844309#comment-16844309 ] Tan, Wangda commented on YARN-9543: --- Cool, thanks [~giovanni.fumarola] :) > UI2 should handle missing ATSv2 gracefully > -- > > Key: YARN-9543 > URL: https://issues.apache.org/jira/browse/YARN-9543 > Project: Hadoop YARN > Issue Type: Improvement > Components: ATSv2, yarn-ui-v2 >Affects Versions: 3.1.2 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Attachments: YARN-9543.001.patch > > > Resource Manager UI2 is throwing some console errors and an error page on > the flows page. > Suggested improvements: > * Disable or remove the flows tab if ATSv2 is not available/installed > * Handle all connection errors to ATSv2 gracefully -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state
[ https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844308#comment-16844308 ] Tan, Wangda commented on YARN-4946: --- Thanks [~snemeth], [~ccondit] for commenting. The question I was trying to answer is: should we backport this patch to older releases? After digging into the details, I'm wondering whether we should do this or not. YARN-7952 should solve part of the problem: log aggregation status is saved on the NM as well. So the only issue this Jira could solve is: if the number of apps grows beyond the configured ZK state store limit, we keep the apps whose log aggregation is not finished yet. I agree with what [~ccondit] mentioned: this exception (keeping the app in the state store) seems safe. However, if something bad happens, like a log aggregation bug or slowness of the log aggregation HDFS cluster, it will bring down the RM. My understanding of this problem is: if RM recovery is enabled (which I believe is the case on most prod clusters), an app is removed from the state-store only after a long time (which should be plenty of buffer for log aggregation). If the log aggregation is still not finished by then, we should still remove the app from the RM state store and move on. The description of the Jira: {quote}When the RM "forgets" about an older completed Application (e.g. RM failover, enough time has passed, etc), the tool won't find the Application in the RM and will just assume that its log aggregation succeeded, even if it actually failed or is still running. {quote} This seems like the right behavior when completed apps are forgotten by the RM. Thoughts? > RM should not consider an application as COMPLETED when log aggregation is > not in a terminal state > -- > > Key: YARN-4946 > URL: https://issues.apache.org/jira/browse/YARN-4946 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 2.8.0 >Reporter: Robert Kanter >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-4946.001.patch, YARN-4946.002.patch, > YARN-4946.003.patch, YARN-4946.004.patch > > > MAPREDUCE-6415 added a tool that combines the aggregated log files for each > Yarn App into a HAR file. When run, it seeds the list by looking at the > aggregated logs directory, and then filters out ineligible apps. One of the > criteria involves checking with the RM that an Application's log aggregation > status is not still running and has not failed. When the RM "forgets" about > an older completed Application (e.g. RM failover, enough time has passed, > etc), the tool won't find the Application in the RM and will just assume that > its log aggregation succeeded, even if it actually failed or is still running. > We can solve this problem by doing the following: > The RM should not consider an app to be fully completed (and thus removed > from its history) until the aggregation status has reached a terminal state > (e.g. SUCCEEDED, FAILED, TIME_OUT). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
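For readers following the debate above, here is a rough Java sketch of the trade-off being discussed: once the completed-app list exceeds the state-store limit, should apps with non-terminal log aggregation be kept (the patch's behaviour) or removed anyway (the comment's suggestion)? The class and method names are hypothetical, not the actual RMAppManager code; only the property name in the comment is an existing YARN setting.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

class CompletedAppRetentionSketch {
  enum LogAggStatus { RUNNING, SUCCEEDED, FAILED, TIME_OUT }

  static boolean isTerminal(LogAggStatus s) {
    return s != LogAggStatus.RUNNING;
  }

  // e.g. yarn.resourcemanager.state-store.max-completed-applications
  final int maxCompletedAppsInStateStore = 1000;
  final Deque<String> completedApps = new ArrayDeque<>();   // oldest first

  void checkRetention(Map<String, LogAggStatus> statusOf, boolean keepUntilTerminal) {
    while (completedApps.size() > maxCompletedAppsInStateStore) {
      String oldest = completedApps.peekFirst();
      if (keepUntilTerminal && !isTerminal(statusOf.get(oldest))) {
        // Patch behaviour: keep the app so external tools still see a truthful
        // status -- but a stuck aggregation can grow the store without bound.
        break;
      }
      // Comment's suggestion: remove the app anyway and move on.
      completedApps.removeFirst();
    }
  }
}
{code}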
[jira] [Commented] (YARN-9543) UI2 should handle missing ATSv2 gracefully
[ https://issues.apache.org/jira/browse/YARN-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844212#comment-16844212 ] Tan, Wangda commented on YARN-9543: --- [~giovanni.fumarola], do you have the same request internally? It gonna be good if you or someone from MS can help with feature testing. Appreciated. > UI2 should handle missing ATSv2 gracefully > -- > > Key: YARN-9543 > URL: https://issues.apache.org/jira/browse/YARN-9543 > Project: Hadoop YARN > Issue Type: Improvement > Components: ATSv2, yarn-ui-v2 >Affects Versions: 3.1.2 >Reporter: Zoltan Siegl >Assignee: Zoltan Siegl >Priority: Major > Attachments: YARN-9543.001.patch > > > Resource manager UI2 is throwing some console errors and a error page on > flows page. > Suggested improvements: > * Disable or remove flows tab if ATSv2 is not available/installed > * Handle all connection errors to ATSv2 gracefully -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844133#comment-16844133 ] Tan, Wangda commented on YARN-9525: --- Thanks [~pbacsko]. > IFile format is not working against s3a remote folder > - > > Key: YARN-9525 > URL: https://issues.apache.org/jira/browse/YARN-9525 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 3.1.2 >Reporter: Adam Antal >Assignee: Peter Bacsko >Priority: Major > Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch > > > Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} > configured to an s3a URI throws the following exception during log > aggregation: > {noformat} > Cannot create writer for app application_1556199768861_0001. Skip log upload > this time. > java.io.IOException: java.io.FileNotFoundException: No such file or > directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195) > ... 
7 more > {noformat} > This stack trace points to > {{LogAggregationIndexedFileController$initializeWriter}} where we do the > following steps (in a non-rolling log aggregation setup): > - create an FSDataOutputStream > - write out a UUID > - flush > - immediately after that we call GetFileStatus to get the length of the log > file (the bytes we just wrote out), and that's where the failure happens: > the file is not there yet due to eventual consistency. > Maybe we can get rid of that, so we can use the IFile format against an s3a target. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
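As a side note, the following minimal Java sketch shows the write-then-stat pattern described above and the kind of workaround hinted at ("get rid of that"): track the offset from the open stream instead of calling getFileStatus() right after the flush. The path and the UUID payload are placeholders; this is not the controller's actual code.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IFileOffsetSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // e.g. an s3a:// or hdfs:// URI passed on the command line
    Path remoteLog = new Path(args.length > 0 ? args[0] : "hdfs:///tmp/ifile-offset-demo");
    FileSystem fs = remoteLog.getFileSystem(conf);

    byte[] uuid = java.util.UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
    try (FSDataOutputStream out = fs.create(remoteLog, true)) {
      out.write(uuid);
      out.hflush();

      // Problematic on s3a: the object may not be visible yet, so this can throw
      // FileNotFoundException even though the bytes were just written:
      // long offset = fs.getFileStatus(remoteLog).getLen();

      // Consistency-friendly alternative: the stream already knows how much was written.
      long offset = out.getPos();
      System.out.println("current offset = " + offset);
    }
  }
}
{code}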
[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder
[ https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844097#comment-16844097 ] Tan, Wangda commented on YARN-9525: --- [~pbacsko], can you rename the patch to YARN-9525.001.poc.patch, and change the status to PA? So Jenkins can pick it up. > IFile format is not working against s3a remote folder > - > > Key: YARN-9525 > URL: https://issues.apache.org/jira/browse/YARN-9525 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 3.1.2 >Reporter: Adam Antal >Assignee: Adam Antal >Priority: Major > Attachments: IFile-S3A-POC01.patch > > > Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} > configured to an s3a URI throws the following exception during log > aggregation: > {noformat} > Cannot create writer for app application_1556199768861_0001. Skip log upload > this time. > java.io.IOException: java.io.FileNotFoundException: No such file or > directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: No such file or directory: > s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041 > at > org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382) > at > org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244) > at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at > org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195) > ... 
7 more > {noformat} > This stack trace points to > {{LogAggregationIndexedFileController$initializeWriter}} where we do the > following steps (in a non-rolling log aggregation setup): > - create an FSDataOutputStream > - write out a UUID > - flush > - immediately after that we call GetFileStatus to get the length of the log > file (the bytes we just wrote out), and that's where the failure happens: > the file is not there yet due to eventual consistency. > Maybe we can get rid of that, so we can use the IFile format against an s3a target. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9517) When aggregation is not enabled, can't see the container log
[ https://issues.apache.org/jira/browse/YARN-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829570#comment-16829570 ] Tan, Wangda commented on YARN-9517: --- [~shurong.mai], thanks for putting up a patch. However, I'm not sure why you closed this Jira. Is the patch or fix already in the mentioned branches? > When aggregation is not enabled, can't see the container log > > > Key: YARN-9517 > URL: https://issues.apache.org/jira/browse/YARN-9517 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.2.0, 2.3.0, 2.4.1, 2.5.2, 2.6.5, 3.2.0, 2.9.2, 2.8.5, > 2.7.7, 3.1.2 >Reporter: Shurong Mai >Priority: Major > Labels: patch > Attachments: YARN-9517.patch > > > yarn-site.xml > {code:java} > <property> > <name>yarn.log-aggregation-enable</name> > <value>false</value> > </property> > {code} > > When aggregation is not enabled, we click the "container log link" (in the web > page > "http://xx:19888/jobhistory/attempts/job_1556431770792_0001/m/SUCCESSFUL") > after a job has finished successfully. > It will jump to the webpage displaying "Aggregation is not enabled. Try the > nodemanager at yy:48038" after we click, and the url is > "http://xx:19888/jobhistory/logs/yy:48038/container_1556431770792_0001_01_02/attempt_1556431770792_0001_m_00_0/hadoop" > I also found this problem in all hadoop versions 2.x.y and 3.x.y, and I have submitted > a patch which is simple and can be applied to these hadoop versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.
[ https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tan, Wangda updated YARN-8193: -- Target Version/s: 2.9.2 Priority: Blocker (was: Critical) > YARN RM hangs abruptly (stops allocating resources) when running successive > applications. > - > > Key: YARN-8193 > URL: https://issues.apache.org/jira/browse/YARN-8193 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Zian Chen >Assignee: Zian Chen >Priority: Blocker > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8193-branch-2-001.patch, > YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, YARN-8193.002.patch > > > When running massive queries successively, at some point the RM just hangs and > stops allocating resources. At the point the RM hangs, YARN throws a > NullPointerException at RegularContainerAllocator.getLocalityWaitFactor. > There is sufficient space given to yarn.nodemanager.local-dirs (not a node > health issue; the RM didn't report any node being unhealthy). There is no fixed > trigger for this (query or operation). > This problem goes away on restarting the ResourceManager. No NM restart is > required. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9445) yarn.admin.acl is futile
[ https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812881#comment-16812881 ] Tan, Wangda commented on YARN-9445: --- [~shuzirra], [~snemeth], Changing yarn.admin.acl could cause more security issues (like allowing cluster ops to run jobs and consume all the resources they were disallowed from before), and it is an incompatible change to me. I suggest not doing that. Changing the default values of admin.acl and queue ACLs are also incompatible changes, but the latter are important since they can potentially prevent the cluster from being hacked. It's better to start an email thread to discuss this. > yarn.admin.acl is futile > > > Key: YARN-9445 > URL: https://issues.apache.org/jira/browse/YARN-9445 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 3.3.0 >Reporter: Peter Simon >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-9445.001.patch > > > * Define a queue with restrictive administerApps settings (e.g. yarn) > * Set yarn.admin.acl to "*". > * Try to submit an application with user yarn; it is denied. > This way my expected behaviour would be that, since everyone is an admin, I can > submit to whatever pool. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
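To make the scope of the question concrete, the small snippet below illustrates that the admin ACL and a queue-level ACL are evaluated as independent checks; it is not the scheduler's actual code path, and the queue ACL string and user names are made up for the example.

{code:java}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;

public class QueueAclSketch {
  public static void main(String[] args) {
    // yarn.admin.acl -- wide open
    AccessControlList adminAcl = new AccessControlList("*");
    // a restrictive, made-up queue-level ACL (e.g. administerApps / aclSubmitApps)
    AccessControlList queueAcl = new AccessControlList("alice");

    UserGroupInformation user = UserGroupInformation.createRemoteUser("bob");

    boolean isClusterAdmin = adminAcl.isUserAllowed(user);   // true: everyone matches "*"
    boolean hasQueueAccess = queueAcl.isUserAllowed(user);   // false: only "alice" is listed

    // Whether isClusterAdmin should imply hasQueueAccess is exactly the
    // compatibility question raised in the comment above.
    System.out.println("cluster admin: " + isClusterAdmin + ", queue access: " + hasQueueAccess);
  }
}
{code}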
[jira] [Updated] (YARN-9319) Fix compilation issue of handling typedef an existing name by gcc compiler
[ https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tan, Wangda updated YARN-9319: -- Summary: Fix compilation issue of handling typedef an existing name by gcc compiler (was: Fix compilation issue of ) > Fix compilation issue of handling typedef an existing name by gcc compiler > --- > > Key: YARN-9319 > URL: https://issues.apache.org/jira/browse/YARN-9319 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 > Environment: RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 > 20120313 (Red Hat 4.4.7-17) (GCC) >Reporter: Wei-Chiu Chuang >Assignee: Zhankun Tang >Priority: Blocker > Attachments: YARN-9319-trunk.001.patch > > > When I do: > mvn clean install -DskipTests -Pdist,native -Dmaven.javadoc.skip=true > It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc > version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)) > {noformat} > [WARNING] [ 54%] Built target test-container-executor > [WARNING] Linking CXX static library libgtest.a > [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P > CMakeFiles/gtest.dir/cmake_clean_target.cmake > [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script > CMakeFiles/gtest.dir/link.txt --verbose=1 > [WARNING] /usr/bin/ar cq libgtest.a > CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o > [WARNING] /usr/bin/ranlib libgtest.a > [WARNING] make[2]: Leaving directory > `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native' > [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles > 26 > [WARNING] [ 54%] Built target gtest > [WARNING] make[1]: Leaving directory > `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native' > [WARNING] In file included from > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27: > [WARNING] > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31: > error: redefinition of typedef 'update_cgroups_parameters_function' > [WARNING] > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31: > note: previous declaration of 'update_cgroups_parameters_function' was here > [WARNING] make[2]: *** > [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o] > Error 1 > [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2 > [WARNING] make[1]: *** Waiting for unfinished jobs > [WARNING] make: *** [all] Error 2 > {noformat} > The code compiles once I revert YARN-9060. > [~tangzhankun], [~sunilg] care to take a look? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9319) YARN-9060 does not compile
[ https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774430#comment-16774430 ] Tan, Wangda commented on YARN-9319: --- Committing this patch now, thanks [~tangzhankun] and everybody for review. > YARN-9060 does not compile > -- > > Key: YARN-9319 > URL: https://issues.apache.org/jira/browse/YARN-9319 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 > Environment: RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 > 20120313 (Red Hat 4.4.7-17) (GCC) >Reporter: Wei-Chiu Chuang >Assignee: Zhankun Tang >Priority: Blocker > Attachments: YARN-9319-trunk.001.patch > > > When I do: > mvn clean install -DskipTests -Pdist,native -Dmaven.javadoc.skip=true > It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc > version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)) > {noformat} > [WARNING] [ 54%] Built target test-container-executor > [WARNING] Linking CXX static library libgtest.a > [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P > CMakeFiles/gtest.dir/cmake_clean_target.cmake > [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script > CMakeFiles/gtest.dir/link.txt --verbose=1 > [WARNING] /usr/bin/ar cq libgtest.a > CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o > [WARNING] /usr/bin/ranlib libgtest.a > [WARNING] make[2]: Leaving directory > `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native' > [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles > 26 > [WARNING] [ 54%] Built target gtest > [WARNING] make[1]: Leaving directory > `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native' > [WARNING] In file included from > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27: > [WARNING] > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31: > error: redefinition of typedef 'update_cgroups_parameters_function' > [WARNING] > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31: > note: previous declaration of 'update_cgroups_parameters_function' was here > [WARNING] make[2]: *** > [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o] > Error 1 > [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2 > [WARNING] make[1]: *** Waiting for unfinished jobs > [WARNING] make: *** [all] Error 2 > {noformat} > The code compiles once I revert YARN-9060. > [~tangzhankun], [~sunilg] care to take a look? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9319) Fix compilation issue of
[ https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tan, Wangda updated YARN-9319: -- Summary: Fix compilation issue of (was: YARN-9060 does not compile) > Fix compilation issue of > - > > Key: YARN-9319 > URL: https://issues.apache.org/jira/browse/YARN-9319 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0 > Environment: RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 > 20120313 (Red Hat 4.4.7-17) (GCC) >Reporter: Wei-Chiu Chuang >Assignee: Zhankun Tang >Priority: Blocker > Attachments: YARN-9319-trunk.001.patch > > > When I do: > mvn clean install -DskipTests -Pdist,native -Dmaven.javadoc.skip=true > It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc > version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)) > {noformat} > [WARNING] [ 54%] Built target test-container-executor > [WARNING] Linking CXX static library libgtest.a > [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P > CMakeFiles/gtest.dir/cmake_clean_target.cmake > [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script > CMakeFiles/gtest.dir/link.txt --verbose=1 > [WARNING] /usr/bin/ar cq libgtest.a > CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o > [WARNING] /usr/bin/ranlib libgtest.a > [WARNING] make[2]: Leaving directory > `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native' > [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles > 26 > [WARNING] [ 54%] Built target gtest > [WARNING] make[1]: Leaving directory > `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native' > [WARNING] In file included from > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27: > [WARNING] > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31: > error: redefinition of typedef 'update_cgroups_parameters_function' > [WARNING] > /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31: > note: previous declaration of 'update_cgroups_parameters_function' was here > [WARNING] make[2]: *** > [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o] > Error 1 > [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2 > [WARNING] make[1]: *** Waiting for unfinished jobs > [WARNING] make: *** [all] Error 2 > {noformat} > The code compiles once I revert YARN-9060. > [~tangzhankun], [~sunilg] care to take a look? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9310) Test submarine maven module build
[ https://issues.apache.org/jira/browse/YARN-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tan, Wangda updated YARN-9310: -- Attachment: YARN-9310.001.patch > Test submarine maven module build > - > > Key: YARN-9310 > URL: https://issues.apache.org/jira/browse/YARN-9310 > Project: Hadoop YARN > Issue Type: Task >Reporter: Tan, Wangda >Priority: Major > Attachments: YARN-9310.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9310) Test submarine maven module build
Tan, Wangda created YARN-9310: - Summary: Test submarine maven module build Key: YARN-9310 URL: https://issues.apache.org/jira/browse/YARN-9310 Project: Hadoop YARN Issue Type: Task Reporter: Tan, Wangda -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5600) Add a parameter to ContainerLaunchContext to emulate yarn.nodemanager.delete.debug-delay-sec on a per-application basis
[ https://issues.apache.org/jira/browse/YARN-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655703#comment-15655703 ] Tan, Wangda commented on YARN-5600: --- This is a very useful feature without any doubt. Thanks [~miklos.szeg...@cloudera.com] for working on this JIRA, and thanks to [~Naganarasimha] / [~templedf] for reviewing the patch. Apologies for my very late review; I only looked at the API of the patch: Have you considered the other approach, where we turn on debug-delay-sec by passing a pre-defined environment variable? The biggest benefit is that we don't need to update most applications to use this feature; for example, MR/Spark already support specifying environment variables. Making changes to all major applications to use this feature sounds like a big task. As an example, LinuxDockerContainerExecutor uses this approach, specifying configurations by passing env vars. In addition, it would be better to have a global max-debug-delay-sec in yarn-site (which could be MAX_INT by default): considering disk space and security, we may not want applications to occupy disk beyond some specified time. + [~djp] > Add a parameter to ContainerLaunchContext to emulate > yarn.nodemanager.delete.debug-delay-sec on a per-application basis > --- > > Key: YARN-5600 > URL: https://issues.apache.org/jira/browse/YARN-5600 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Affects Versions: 3.0.0-alpha1 >Reporter: Daniel Templeton >Assignee: Miklos Szegedi > Labels: oct16-medium > Attachments: YARN-5600.000.patch, YARN-5600.001.patch, > YARN-5600.002.patch, YARN-5600.003.patch, YARN-5600.004.patch, > YARN-5600.005.patch, YARN-5600.006.patch, YARN-5600.007.patch > > > To make debugging application launch failures simpler, I'd like to add a > parameter to the CLC to allow an application owner to request delayed > deletion of the application's launch artifacts. > This JIRA solves largely the same problem as YARN-5599, but for cases where > ATS is not in use, e.g. branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
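Below is a rough sketch of the environment-variable alternative floated in the comment above. The variable name and the NM-side cap are assumptions made for illustration (nothing in YARN defines them); only ContainerLaunchContext.setEnvironment() is an existing API here.

{code:java}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class DebugDelayEnvSketch {
  // Hypothetical variable name; not defined anywhere in YARN today.
  static final String DEBUG_DELAY_ENV = "YARN_CONTAINER_DEBUG_DELAY_SEC";

  // Client side: tag the launch context with the requested delay; no new CLC
  // field or application-level code change is needed beyond setting an env var.
  public static ContainerLaunchContext buildContext() {
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    Map<String, String> env = new HashMap<>();
    env.put(DEBUG_DELAY_ENV, "3600");   // keep launch artifacts for one hour
    ctx.setEnvironment(env);
    return ctx;
  }

  // NM side (hypothetical): clamp the requested delay with a cluster-wide maximum
  // so a single app cannot pin local dirs indefinitely.
  static long effectiveDelay(long requestedSec, long clusterMaxSec) {
    return Math.min(requestedSec, clusterMaxSec);
  }
}
{code}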
[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster
[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655514#comment-15655514 ] Tan, Wangda commented on YARN-5864: --- Thanks [~curino] for sharing these insightful suggestions. The problem you mentioned is totally true: we have been putting lots of effort into adding features for various resource constraints (such as limits, node partition, priority, etc.) but we paid less attention to making the semantics easier and more consistent. I also agree that we do need to spend some time thinking about what semantics the YARN scheduler should have. For example, the minimum guarantee of CS is that a queue should get at least its configured capacity, but a picky app could make an under-utilized queue wait forever for the resource. And as you mentioned above, a non-preemptable queue can invalidate configured capacity as well. However, I would argue that the scheduler is not able to run perfectly without ever violating some of the constraints. It is not just a group of formulas we need to define and let a solver optimize; it involves a lot of human emotions and preferences. For example, a user may not understand, or be glad to accept, that a picky request cannot be allocated even if the queue/cluster has available capacity. And it may not be acceptable in a production cluster that a long-running service for realtime queries cannot be launched because we don't want to kill some less-important batch jobs. My point is, if we can have these rules defined in the docs and the user can know what happened from the UI/logs, we can add them. To improve this, I think your suggestion (1) will be more helpful and achievable in the short term. We can definitely remove some parameters; for example, the existing user-limit definition is not good enough, and user-limit-factor can always prevent a queue from fully utilizing its capacity. And we can better define these semantics in the docs and UI. (2) looks beautiful, but it may not be able to solve the root problem directly: the first priority is to make our users happy to accept it, rather than beautifully solving it mathematically. For example, for the problem I put in the description of the JIRA, I don't think (2) can get the allocation without harming other applications. And from an implementation perspective, I'm not sure how a solver-based solution can handle both fast allocation (we want to do allocation within milliseconds for interactive queries) and good placement (such as gang scheduling with other constraints like anti-affinity). It seems to me that we would sacrifice low latency to get better placement quality with option (2). bq. This opens up many abuses, one that comes to mind ... Actually, this feature will only be used in a pretty controlled environment: important long-running services run in a separate queue, and the admin/user agrees that it can preempt other batch jobs to get new containers. ACLs will be set to prevent normal users from running inside these queues; all apps running in the queue should be trusted apps such as YARN native services (Slider), Spark, etc. And we can also make sure these apps will try their best to respect other apps. And please advise if you think we can improve the semantics of this feature. 
Thanks, > Capacity Scheduler preemption for fragmented cluster > - > > Key: YARN-5864 > URL: https://issues.apache.org/jira/browse/YARN-5864 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-5864.poc-0.patch > > > YARN-4390 added preemption for reserved containers. However, we found one case > where a large container cannot be allocated even if all queues are under their > limits. > For example, we have: > {code} > Two queues, a and b, capacity 50:50 > Two nodes: n1 and n2, each of them have 50 resource > Now queue-a uses 10 on n1 and 10 on n2 > queue-b asks for one single container with resource=45. > {code} > The container could be reserved on any of the hosts, but no preemption will > happen because all queues are under their limits. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
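For readers who want the arithmetic behind the example spelled out, the tiny stand-alone illustration below shows why the 45-unit ask can only be reserved and why guarantee-based preemption never fires; the numbers are taken directly from the description above and nothing here is the scheduler's actual code.

{code:java}
public class FragmentationSketch {
  public static void main(String[] args) {
    int nodeCapacity = 50;
    int[] usedPerNode = {10, 10};   // queue-a's containers on n1 and n2
    int request = 45;               // queue-b's single large container

    int queueAUsed = usedPerNode[0] + usedPerNode[1];   // 20 < 50: under its guarantee
    int queueBUsed = 0;                                  // 0  < 50: under its guarantee

    for (int i = 0; i < usedPerNode.length; i++) {
      int free = nodeCapacity - usedPerNode[i];          // 40 free on each node
      System.out.printf("n%d free=%d, fits request=%b%n", i + 1, free, free >= request);
    }

    // Guarantee-based preemption never triggers because neither queue is over its
    // 50-unit capacity, so the 45-unit container can only be reserved, never placed.
    System.out.println("queue-a used=" + queueAUsed + ", queue-b used=" + queueBUsed);
  }
}
{code}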