[jira] [Updated] (YARN-9639) DecommissioningNodesWatcher causes memory leak

2019-07-30 Thread Tan, Wangda (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tan, Wangda updated YARN-9639:
--
Priority: Blocker  (was: Critical)

> DecommissioningNodesWatcher causes memory leak
> -
>
> Key: YARN-9639
> URL: https://issues.apache.org/jira/browse/YARN-9639
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bilwa S T
>Priority: Blocker
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9639-001.patch
>
>
> Missing cancel() of the Timer task in DecommissioningNodesWatcher could lead 
> to a memory leak.
> PollTimerTask holds a reference to the RMContext.
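
For illustration, a minimal sketch of the fix pattern described above: cancel the 
Timer when the watcher stops so the scheduled task (and the context it references) 
can be garbage collected. The class and field names below are hypothetical, not the 
actual DecommissioningNodesWatcher code.

{code:java}
import java.util.Timer;
import java.util.TimerTask;

// Minimal sketch (not the actual YARN class): a watcher that schedules a
// polling TimerTask and cancels it on stop so the task, and the context
// object it references, can be garbage collected.
public class WatcherSketch {
  private final Object context;              // stands in for the RMContext reference
  private final Timer pollTimer = new Timer(true);

  public WatcherSketch(Object context) {
    this.context = context;
  }

  public void start(long intervalMs) {
    pollTimer.schedule(new TimerTask() {
      @Override
      public void run() {
        // poll decommissioning nodes using the captured context ...
      }
    }, 0, intervalMs);
  }

  // Without this cancel(), the timer thread and its scheduled task keep the
  // context reachable even after the watcher is no longer needed.
  public void stop() {
    pollTimer.cancel();
    pollTimer.purge();
  }
}
{code}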






[jira] [Commented] (YARN-9698) [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler

2019-07-26 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893792#comment-16893792
 ] 

Tan, Wangda commented on YARN-9698:
---

[~cane], is the feature you mentioned supported by FairScheduler? Or is it 
just a new feature?

> [Umbrella] Tools to help migration from Fair Scheduler to Capacity Scheduler
> 
>
> Key: YARN-9698
> URL: https://issues.apache.org/jira/browse/YARN-9698
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Reporter: Weiwei Yang
>Priority: Major
>  Labels: fs2cs
>
> We see some users who want to migrate from Fair Scheduler to Capacity 
> Scheduler. This Jira is created as an umbrella to track all related efforts 
> for the migration. The scope contains:
>  * Bug fixes
>  * Adding missing features
>  * Migration tools that help to generate CS configs based on FS configs, 
> validate configs, etc.
>  * Documentation
> This is part of the CS component; the purpose is to make the migration 
> process smooth.






[jira] [Commented] (YARN-9660) Enhance documentation of Docker on YARN support

2019-07-03 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878079#comment-16878079
 ] 

Tan, Wangda commented on YARN-9660:
---

Thanks [~pbacsko] for working on this. All great improvements!

> Enhance documentation of Docker on YARN support
> ---
>
> Key: YARN-9660
> URL: https://issues.apache.org/jira/browse/YARN-9660
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: documentation, nodemanager
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-9660-001.patch
>
>
> Right now, using Docker on YARN has some hard requirements. If these 
> requirements are not met, then launching the containers will fail and an 
> error message will be printed. Depending on how familiar the user is with 
> Docker, it might or might not be easy for them to understand what went wrong 
> and how to fix the underlying problem.
> It would be important to explicitly document these requirements along with 
> the error messages.
> *#1: CGroups handler cannot be systemd*
> If the docker daemon runs with the systemd cgroups handler, we receive the following 
> error upon launching a container:
> {noformat}
> Container id: container_1561638268473_0006_01_02
> Exit code: 7
> Exception message: Launch container failed
> Shell error output: /usr/bin/docker-current: Error response from daemon: 
> cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice".
> See '/usr/bin/docker-current run --help'.
> Shell output: main : command provided 4
> main : run as user is johndoe
> main : requested yarn user is johndoe
> {noformat}
> Solution: switch to cgroupfs. Doing so can be OS-specific, but we can 
> document a {{systemctl}} example.
>  
> *#2: {{/bin/bash}} must be present on the {{$PATH}} inside the container*
> Some smaller images like "busybox" or "alpine" do not have {{/bin/bash}}. 
> This is because all commands under {{/bin}} are linked to {{/bin/busybox}} 
> and there's only {{/bin/sh}}.
> If we try to use these kinds of images, we'll see the following error message:
> {noformat}
> Container id: container_1561638268473_0015_01_02
> Exit code: 7
> Exception message: Launch container failed
> Shell error output: /usr/bin/docker-current: Error response from daemon: oci 
> runtime error: container_linux.go:235: starting container process caused 
> "exec: \"bash\": executable file not found in $PATH".
> Shell output: main : command provided 4
> main : run as user is johndoe
> main : requested yarn user is johndoe
> {noformat}
>  
> *#3: {{find}} command must be available on the {{$PATH}}*
> It seems obvious that we have the {{find}} command, but even very popular 
> images like {{fedora}} require that we install it separately.
> If we don't have {{find}} available, then {{launcher_container.sh}} fails 
> with:
> {noformat}
> [2019-07-01 03:51:25.053]Container exited with a non-zero exit code 127. 
> Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /tmp/hadoop-systest/nm-local-dir/usercache/systest/appcache/application_1561638268473_0017/container_1561638268473_0017_01_02/launch_container.sh:
>  line 44: find: command not found
> Last 4096 bytes of stderr.txt :
> [2019-07-01 03:51:25.053]Container exited with a non-zero exit code 127. 
> Error file: prelaunch.err.
> Last 4096 bytes of prelaunch.err :
> /tmp/hadoop-systest/nm-local-dir/usercache/systest/appcache/application_1561638268473_0017/container_1561638268473_0017_01_02/launch_container.sh:
>  line 44: find: command not found
> Last 4096 bytes of stderr.txt :
> {noformat}
> *#4 Add cmd-line example of how to tag local images*
> This is actually documented under "Privileged Container Security 
> Consideration", but an one-liner would be helpful. I had trouble running a 
> local docker image and tagging it appropriately. Just an example like 
> {{docker tag local_ubuntu local/ubuntu:latest}} is already very informative.






[jira] [Commented] (YARN-9559) Create AbstractContainersLauncher for pluggable ContainersLauncher logic

2019-06-18 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867128#comment-16867128
 ] 

Tan, Wangda commented on YARN-9559:
---

Gotcha, makes sense. Thanks for the clarification, [~jhung].

> Create AbstractContainersLauncher for pluggable ContainersLauncher logic
> 
>
> Key: YARN-9559
> URL: https://issues.apache.org/jira/browse/YARN-9559
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9559.001.patch, YARN-9559.002.patch, 
> YARN-9559.003.patch
>
>







[jira] [Commented] (YARN-9559) Create AbstractContainersLauncher for pluggable ContainersLauncher logic

2019-06-18 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867030#comment-16867030
 ] 

Tan, Wangda commented on YARN-9559:
---

[~jhung], could you share what the use cases of this are? Why do we want to make it 
abstract?
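
To illustrate what a pluggable, abstract launcher generally looks like, here is a 
generic sketch of the pattern suggested by the Jira title. It is not the YARN-9559 
patch: the interface, config key, and class names are hypothetical.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical sketch of a pluggable launcher: an abstract base class with a
// config key that selects the concrete implementation at runtime.
public class PluggableLauncherSketch {

  public abstract static class AbstractLauncher {
    public abstract void launch(String containerId);
  }

  public static class DefaultLauncher extends AbstractLauncher {
    @Override
    public void launch(String containerId) {
      System.out.println("launching " + containerId);
    }
  }

  // Hypothetical config key; the real one would live in YarnConfiguration.
  static final String LAUNCHER_CLASS_KEY =
      "yarn.nodemanager.containers-launcher.class";

  public static AbstractLauncher createLauncher(Configuration conf) {
    Class<? extends AbstractLauncher> clazz = conf.getClass(
        LAUNCHER_CLASS_KEY, DefaultLauncher.class, AbstractLauncher.class);
    // ReflectionUtils also injects the Configuration if the class is Configurable.
    return ReflectionUtils.newInstance(clazz, conf);
  }
}
{code}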

> Create AbstractContainersLauncher for pluggable ContainersLauncher logic
> 
>
> Key: YARN-9559
> URL: https://issues.apache.org/jira/browse/YARN-9559
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
>Priority: Major
> Attachments: YARN-9559.001.patch, YARN-9559.002.patch, 
> YARN-9559.003.patch
>
>







[jira] [Comment Edited] (YARN-9327) ProtoUtils#convertToProtoFormat block Application Master Service and many more

2019-06-12 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862433#comment-16862433
 ] 

Tan, Wangda edited comment on YARN-9327 at 6/12/19 8:22 PM:


This seems tricky, because ResourcePBImpl doesn't have proper protection as far 
as I can see. A similar fix is https://issues.apache.org/jira/browse/YARN-2387

I think we should make sure at least {{getProto}} and {{maybeInitBuilder}} are 
protected by a synchronized lock. Right now the synchronized lock is only on 
{{mergeLocalToBuilder}}, which is not sufficient.

This won't protect against reading stale resource-information values, but if we 
want to protect the read/write resource-information path, we need to carefully 
look at the performance impact.

Removing the static synchronized lock seems like the right fix; it looks like a 
mistake in a previous patch.


was (Author: wangda):
This seems tricky, because the ResourcePBImpl doesn't have proper protection as 
I can see. A similar fix is https://issues.apache.org/jira/browse/YARN-2387

I think we should make sure at least {{getProto}} and {{maybeInitBuilder}} 
protected by synchronized lock. Now the synchronized lock is only on 
{{mergeLocalToBuilder}}, which is not sufficient. 

This won't protect read stale value of resource information, but if we want to 
protect read/write resource information path, we need to carefully look at 
performance impact. 

Remove the synchronized static lock seems like a right fix, it looks like a 
mistake in previous patch.

> ProtoUtils#convertToProtoFormat block Application Master Service and many more
> --
>
> Key: YARN-9327
> URL: https://issues.apache.org/jira/browse/YARN-9327
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9327.001.patch
>
>
> {code}
>   public static synchronized ResourceProto convertToProtoFormat(Resource r) {
> return ResourcePBImpl.getProto(r);
>   }
> {code}
> {noformat}
> "IPC Server handler 41 on 23764" #324 daemon prio=5 os_prio=0 
> tid=0x7f181de72800 nid=0x222 waiting for monitor entry 
> [0x7ef153dad000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:404)
>   - waiting to lock <0x7ef2d8bcf6d8> (a java.lang.Class for 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:315)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:262)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:289)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:228)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:844)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:72)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:810)
>   - locked <0x7f0fed96f500> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:799)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:13810)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:158)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:198)
>   - eliminated <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:103)
>   - locked <0x7f0fed968a30> (a 
> 

[jira] [Commented] (YARN-9327) ProtoUtils#convertToProtoFormat block Application Master Service and many more

2019-06-12 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16862433#comment-16862433
 ] 

Tan, Wangda commented on YARN-9327:
---

This seems tricky, because ResourcePBImpl doesn't have proper protection as far 
as I can see. A similar fix is https://issues.apache.org/jira/browse/YARN-2387

I think we should make sure at least {{getProto}} and {{maybeInitBuilder}} are 
protected by a synchronized lock. Right now the synchronized lock is only on 
{{mergeLocalToBuilder}}, which is not sufficient.

This won't protect against reading stale resource-information values, but if we 
want to protect the read/write resource-information path, we need to carefully 
look at the performance impact.

Removing the static synchronized lock seems like the right fix; it looks like a 
mistake in a previous patch.
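
For illustration, a minimal sketch of the locking pattern suggested above: guard 
{{getProto}} and {{maybeInitBuilder}} with the record instance's own monitor instead 
of funnelling every conversion through a static synchronized helper, which serializes 
all callers on the class lock. This is a simplified stand-in, not the actual 
ResourcePBImpl/ProtoUtils code.

{code:java}
// Illustrative sketch of the suggested locking pattern, not YARN code:
// builder access is guarded by the instance monitor rather than a
// class-wide static synchronized method.
public class RecordPBImplSketch {
  private Object proto;     // stands in for the generated protobuf message
  private Object builder;   // stands in for the protobuf builder
  private boolean viaProto;

  // Proto accessor and builder initialization hold the same instance lock,
  // so concurrent callers working on different records do not block each other.
  public synchronized Object getProto() {
    mergeLocalToProto();
    return proto;
  }

  private synchronized void maybeInitBuilder() {
    if (viaProto || builder == null) {
      builder = new Object();   // builder = ProtoType.newBuilder(proto) in real code
    }
    viaProto = false;
  }

  private synchronized void mergeLocalToProto() {
    maybeInitBuilder();
    proto = builder;            // proto = builder.build() in real code
    viaProto = true;
  }
}
{code}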

> ProtoUtils#convertToProtoFormat block Application Master Service and many more
> --
>
> Key: YARN-9327
> URL: https://issues.apache.org/jira/browse/YARN-9327
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Bibin A Chundatt
>Assignee: Bibin A Chundatt
>Priority: Critical
> Attachments: YARN-9327.001.patch
>
>
> {code}
>   public static synchronized ResourceProto convertToProtoFormat(Resource r) {
> return ResourcePBImpl.getProto(r);
>   }
> {code}
> {noformat}
> "IPC Server handler 41 on 23764" #324 daemon prio=5 os_prio=0 
> tid=0x7f181de72800 nid=0x222 waiting for monitor entry 
> [0x7ef153dad000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils.convertToProtoFormat(ProtoUtils.java:404)
>   - waiting to lock <0x7ef2d8bcf6d8> (a java.lang.Class for 
> org.apache.hadoop.yarn.api.records.impl.pb.ProtoUtils)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:315)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:262)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:289)
>   at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:228)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:844)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$500(AllocateResponsePBImpl.java:72)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:810)
>   - locked <0x7f0fed96f500> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$7$1.next(AllocateResponsePBImpl.java:799)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>   at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>   at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:13810)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:158)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:198)
>   - eliminated <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:103)
>   - locked <0x7f0fed968a30> (a 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:878)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:824)
>   at 

[jira] [Commented] (YARN-6875) New aggregated log file format for YARN log aggregation.

2019-06-06 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858095#comment-16858095
 ] 

Tan, Wangda commented on YARN-6875:
---

[~larsfrancke], this is already usable on appendable file systems like HDFS. 

I'd prefer to close this ticket and leave the remaining tasks open.

> New aggregated log file format for YARN log aggregation.
> 
>
> Key: YARN-6875
> URL: https://issues.apache.org/jira/browse/YARN-6875
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Xuan Gong
>Assignee: Xuan Gong
>Priority: Major
> Attachments: YARN-6875-NewLogAggregationFormat-design-doc.pdf
>
>
> T-file is the underlying log format for the aggregated logs in YARN. We have 
> seen several performance issues, especially for very large log files.
> We will introduce a new log format which has better performance for large 
> log files.






[jira] [Commented] (YARN-9607) Auto-configuring rollover-size of IFile format for non-appendable filesystems

2019-06-06 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858066#comment-16858066
 ] 

Tan, Wangda commented on YARN-9607:
---

Thanks [~adam.antal],

I would vote for enforcing the rollover size to zero if the scheme is known to 
be a non-appendable file system. (We may need a list of non-appendable 
filesystems to make it more comprehensive.)

cc: [~ste...@apache.org]

> Auto-configuring rollover-size of IFile format for non-appendable filesystems
> -
>
> Key: YARN-9607
> URL: https://issues.apache.org/jira/browse/YARN-9607
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: log-aggregation, yarn
>Affects Versions: 3.3.0
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
>
> In YARN-9525, we made the IFile format compatible with remote folders that 
> use the s3a scheme. In rolling-fashioned log aggregation, IFile still fails 
> with the "append is not supported" error message, which is a known 
> limitation of the format by design.
> There is a workaround though: by setting the rollover size in the 
> configuration of the IFile format, a new aggregated log file is created in 
> each rolling cycle, thus eliminating the append from the process. Setting 
> this config globally would cause performance problems in regular log 
> aggregation, so I'm suggesting enforcing this config to zero if the scheme 
> of the URI is s3a (or any other non-appendable filesystem).
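
A minimal sketch of the check suggested above, assuming a hard-coded set of 
non-appendable schemes. The helper name, scheme list, and constants are assumptions 
for illustration, not the actual IFile controller code.

{code:java}
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: force the rollover size to 0 (a new aggregated file
// per cycle) when the remote log dir uses a scheme that cannot append.
public final class RollOverPolicySketch {
  // Extend with other known non-appendable filesystems as needed.
  private static final Set<String> NON_APPENDABLE_SCHEMES =
      new HashSet<>(Arrays.asList("s3a"));

  private RollOverPolicySketch() {
  }

  public static long effectiveRollOverSize(URI remoteLogDir, long configuredSize) {
    String scheme = remoteLogDir.getScheme();
    if (scheme != null && NON_APPENDABLE_SCHEMES.contains(scheme.toLowerCase())) {
      return 0L;   // never append; start a new aggregated file each cycle
    }
    return configuredSize;
  }

  public static void main(String[] args) {
    // Prints 0 for an s3a target, the configured size otherwise.
    System.out.println(effectiveRollOverSize(
        URI.create("s3a://bucket/logs"), 10L * 1024 * 1024 * 1024));
  }
}
{code}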






[jira] [Commented] (YARN-3213) Respect labels in Capacity Scheduler when computing user-limit

2019-06-05 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16857131#comment-16857131
 ] 

Tan, Wangda commented on YARN-3213:
---

This Jira is about respecting the queue's user-limit on a per-label basis. So 
far, YARN CS doesn't support specifying different user limits for different 
node partitions. Please feel free to file a Jira and work on that if you have 
such needs; we can help with reviews.

 

cc: [~sunil.gov...@gmail.com]

> Respect labels in Capacity Scheduler when computing user-limit
> --
>
> Key: YARN-3213
> URL: https://issues.apache.org/jira/browse/YARN-3213
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>
> Now we can support node-labels in Capacity Scheduler, but user-limit 
> computing doesn't respect node-labels enough, we should fix that.






[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder

2019-06-05 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16856807#comment-16856807
 ] 

Tan, Wangda commented on YARN-9525:
---

Good news, the patch looks good. Thanks [~adam.antal] and [~pbacsko] for 
working on the patch and validating it! :)

Have we tested the patch when rolling aggregation is enabled and the file 
system is appendable? Just want to make sure the append-rolling scenario is 
not broken by it.

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch, 
> YARN-9525.002.patch, YARN-9525.003.patch
>
>
> Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI throws the following exception during log 
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>   ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log aggregation setup):
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that, so we can use the IFile format against an s3a 
> target.
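
A minimal sketch of the pattern described above and one way around it: instead of 
calling getFileStatus() right after the flush (which can miss the object on an 
eventually consistent store), the writer can take the offset from the output stream 
itself. The file path and surrounding scaffolding are illustrative only, not the 
IFile controller code.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only: write the UUID header and read the current offset
// from the stream (getPos()) rather than issuing a getFileStatus() call that
// may not yet see the freshly written object on s3a.
public class OffsetAfterWriteSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path logFile = new Path(args.length > 0 ? args[0] : "target/ifile-sketch.log");
    FileSystem fs = logFile.getFileSystem(conf);

    try (FSDataOutputStream out = fs.create(logFile, true)) {
      byte[] uuid = UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8);
      out.write(uuid);
      out.hflush();                   // may be a no-op on object stores
      long offset = out.getPos();     // known locally; no remote lookup needed
      System.out.println("current length = " + offset);
    }
  }
}
{code}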






[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder

2019-05-30 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852175#comment-16852175
 ] 

Tan, Wangda commented on YARN-9525:
---

Thanks [~adam.antal],

For the rolling issue, does it still exist once we change the rollover size to 0?
{code:java}
@Private
@VisibleForTesting
public long getRollOverLogMaxSize(Configuration conf) {
  return 1024L * 1024 * 1024 * conf.getInt(
  LOG_ROLL_OVER_MAX_FILE_SIZE_GB, 10);
}{code}

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch
>
>
> Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI throws the following exception during log 
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>   ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log aggregation setup):
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that, so we can use the IFile format against an s3a 
> target.






[jira] [Commented] (YARN-9386) destroying yarn-service is allowed even in RUNNING state

2019-05-30 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852169#comment-16852169
 ] 

Tan, Wangda commented on YARN-9386:
---

Thanks [~kyungwan nam]. IIUC, only the app owner or a queue/cluster admin can 
destroy an app, is that correct? To me, stop + destroy and destroy alone are 
the same; I don't find destroying a running service substantially more 
dangerous than stopping a running app and then destroying it. Could you share 
your thoughts?

If we want to support more granular permissions, a better way is to define 
access rules about who can perform which operations, where operations include 
start/stop/destroy.

> destroying yarn-service is allowed even in RUNNING state
> 
>
> Key: YARN-9386
> URL: https://issues.apache.org/jira/browse/YARN-9386
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: kyungwan nam
>Assignee: kyungwan nam
>Priority: Major
> Attachments: YARN-9386.001.patch, YARN-9386.002.patch
>
>
> It looks very dangerous to destroy a running app. It should not be allowed.
> {code}
> [yarn-ats@test ~]$ yarn app -list
> 19/03/12 17:48:49 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:48:50 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> Total number of applications (application-types: [], states: [SUBMITTED, 
> ACCEPTED, RUNNING] and tags: []):3
> Application-Id  Application-NameApplication-Type  
> User   Queue   State Final-State  
>ProgressTracking-URL
> application_1551250841677_0003fbyarn-service  
>ambari-qa default RUNNING   UNDEFINED  
>100% N/A
> application_1552379723611_0002   fb1yarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> application_1550801435420_0001 ats-hbaseyarn-service  
> yarn-ats default RUNNING   UNDEFINED  
>100% N/A
> [yarn-ats@test ~]$ yarn app -destroy fb1
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO client.RMProxy: Connecting to ResourceManager at 
> test1.com/10.1.1.11:8050
> 19/03/12 17:49:02 INFO client.AHSProxy: Connecting to Application History 
> server at test1.com/10.1.1.101:10200
> 19/03/12 17:49:02 INFO util.log: Logging initialized @1637ms
> 19/03/12 17:49:07 INFO client.ApiServiceClient: Successfully destroyed 
> service fb1
> {code}






[jira] [Commented] (YARN-9581) LogsCli getAMContainerInfoForRMWebService ignores rm2

2019-05-24 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847962#comment-16847962
 ] 

Tan, Wangda commented on YARN-9581:
---

Nice catch, thanks [~Prabhu Joseph].

> LogsCli getAMContainerInfoForRMWebService ignores rm2
> -
>
> Key: YARN-9581
> URL: https://issues.apache.org/jira/browse/YARN-9581
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
>
> Yarn Logs fails for a running job in case of RM HA with rm2 active. 
> {code}
> hrt_qa@prabhuYarn:~> /usr/hdp/current/hadoop-yarn-client/bin/yarn  logs 
> -applicationId application_1558613472348_0004 -am 1
> 19/05/24 18:04:49 INFO client.AHSProxy: Connecting to Application History 
> server at prabhuYarn/172.27.23.55:10200
> 19/05/24 18:04:50 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> Unable to get AM container informations for the 
> application:application_1558613472348_0004
> java.io.IOException: 
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> Error while authenticating with endpoint: 
> https://prabhuYarn:8090/ws/v1/cluster/apps/application_1558613472348_0004/appattempts
> Can not get AMContainers logs for the 
> application:application_1558613472348_0004 with the appOwner:hrt_qa
> {code}
> LogsCli getRMWebAppURLWithoutScheme only checks the first entry of the RM 
> list yarn.resourcemanager.ha.rm-ids.
> {code}
> yarnConfig.set(YarnConfiguration.RM_HA_ID, rmIds.get(0));
> {code}
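
A minimal sketch of the kind of fix implied above: try each configured RM id in 
turn instead of hard-coding the first entry. The method shape and the reachability 
probe are illustrative assumptions, not the actual LogsCLI code.

{code:java}
import java.io.IOException;
import java.util.Collection;

import org.apache.hadoop.yarn.conf.HAUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative sketch: walk all yarn.resourcemanager.ha.rm-ids instead of
// only the first one, returning the first RM whose web address answers.
public class RmWebAddressSketch {
  public static String findReachableRmWebAddress(YarnConfiguration conf)
      throws IOException {
    Collection<String> rmIds = HAUtil.getRMHAIds(conf);
    for (String rmId : rmIds) {
      conf.set(YarnConfiguration.RM_HA_ID, rmId);
      String address = conf.get(YarnConfiguration.RM_WEBAPP_ADDRESS + "." + rmId);
      if (address != null && isReachable(address)) {
        return address;
      }
    }
    throw new IOException("No reachable ResourceManager web address found");
  }

  private static boolean isReachable(String address) {
    // Placeholder probe; a real implementation would issue an HTTP request
    // to the RM web service and check for an ACTIVE ResourceManager.
    return true;
  }
}
{code}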






[jira] [Commented] (YARN-9521) RM failed to start due to system services

2019-05-22 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846230#comment-16846230
 ] 

Tan, Wangda commented on YARN-9521:
---

Thanks [~kyungwan nam] for the patch.  

cc: [~rohithsharma], [~billie.rina...@gmail.com] for patch review.

> RM failed to start due to system services
> 
>
> Key: YARN-9521
> URL: https://issues.apache.org/jira/browse/YARN-9521
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.2
>Reporter: kyungwan nam
>Priority: Major
> Attachments: YARN-9521.001.patch
>
>
> When starting the RM, listing the system services directory fails as follows.
> {code}
> 2019-04-30 17:18:25,441 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(114)) - System Service Directory 
> is configured to /services
> 2019-04-30 17:18:25,467 INFO  client.SystemServiceManagerImpl 
> (SystemServiceManagerImpl.java:serviceInit(120)) - UserGroupInformation 
> initialized to yarn (auth:SIMPLE)
> 2019-04-30 17:18:25,467 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service ResourceManager failed in 
> state STARTED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> Filesystem closed
> at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
> at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:869)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1228)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1265)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1265)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1316)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1501)
> Caused by: java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:473)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1639)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1217)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1233)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1200)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1179)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1175)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1187)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.list(SystemServiceManagerImpl.java:375)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.scanForUserServices(SystemServiceManagerImpl.java:282)
> at 
> org.apache.hadoop.yarn.service.client.SystemServiceManagerImpl.serviceStart(SystemServiceManagerImpl.java:126)
> at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
> ... 13 more
> {code}
> It looks like this is due to the usage of the filesystem cache.
> This issue does not happen when I add "fs.hdfs.impl.disable.cache=true" to 
> yarn-site.
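
A minimal sketch of two ways around the shared-cache problem described above: 
disable the cache for the hdfs scheme, or request a private FileSystem instance 
whose lifecycle the caller owns. This is illustrative, not the actual 
SystemServiceManagerImpl fix; the URI and paths are placeholders.

{code:java}
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: avoid sharing a cached FileSystem that another
// component may close underneath us.
public class FsCacheSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    URI uri = URI.create("hdfs://nameservice1/services");

    // Option 1: turn off the cache, equivalent to fs.hdfs.impl.disable.cache=true.
    conf.setBoolean("fs.hdfs.impl.disable.cache", true);
    FileSystem uncached = FileSystem.get(uri, conf);

    // Option 2: keep the cache on but obtain a private, uncached instance
    // that only this caller closes.
    FileSystem privateFs = FileSystem.newInstance(uri, conf);
    try {
      System.out.println(privateFs.exists(new Path("/services")));
    } finally {
      privateFs.close();
      uncached.close();
    }
  }
}
{code}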






[jira] [Commented] (YARN-9576) ResourceUsageMultiNodeLookupPolicy may cause an Application to starve forever

2019-05-22 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846213#comment-16846213
 ] 

Tan, Wangda commented on YARN-9576:
---

[~jutia], actually this behavior is not caused by the multi-node lookup policy; 
it is caused by resource fragmentation. There's no good solution for this 
except queue-priority-based preemption. See YARN-5864.

> ResourceUsageMultiNodeLookupPolicy may cause an Application to starve forever
> 
>
> Key: YARN-9576
> URL: https://issues.apache.org/jira/browse/YARN-9576
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: tianjuan
>Priority: Major
>
> Seems that ResourceUsageMultiNodeLookupPolicy in YARN-7494 may cause an 
> application to starve forever.
> For example, there are 10 nodes (h1, h2, ... h9, h10) in the cluster, each 
> with 8G memory, and two queues A and B, each configured with 50% capacity.
> First, 10 jobs (each requesting 6G of resource) are submitted to queue A, 
> and each of the 10 nodes will have a container allocated.
> Afterwards, another job, JobB, which requests 3G of resource, is submitted 
> to queue B, and one container of 3G size will be reserved on node h1.
> With ResourceUsageMultiNodeLookupPolicy, the node order will always be 
> h1, h2, ... h9, h10, so the container will always be re-reserved on node h1 
> and no other reservation happens. JobB will hang forever. [~sunilg], what's 
> your thought about this situation?






[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2019-05-22 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846088#comment-16846088
 ] 

Tan, Wangda commented on YARN-9567:
---

[~Tao Yang]
{quote}Is it ok to support like this in UI1?  Please feel free to give your 
suggestions.
{quote}
 
Of course! 

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; 
> it would be helpful for users to understand why they are outstanding if we 
> can join this app's diagnostics with them.
> Discussed with [~cheersyang]: we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we can see no diagnostic in cache if app activities not 
> triggered below the outstanding requests.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  






[jira] [Commented] (YARN-9569) Auto-created leaf queues do not honor cluster-wide min/max memory/vcores

2019-05-22 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846085#comment-16846085
 ] 

Tan, Wangda commented on YARN-9569:
---

Thanks [~ccondit], good catch.

I remember the reason why we initialize CSConf without loading defaults is that 
we try not to pollute configs. [~suma.shivaprasad], do you remember? Could you 
suggest what the proper fix should be?

> Auto-created leaf queues do not honor cluster-wide min/max memory/vcores
> 
>
> Key: YARN-9569
> URL: https://issues.apache.org/jira/browse/YARN-9569
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler
>Affects Versions: 3.2.0
>Reporter: Craig Condit
>Priority: Major
>
> Auto-created leaf queues do not honor cluster-wide settings for maximum 
> memory/vcores allocation.
> To reproduce:
>  # Set auto-create-child-queue.enabled=true for a parent queue.
>  # Set leaf-queue-template.maximum-allocation-mb=16384.
>  # Set yarn.resource-types.memory-mb.maximum-allocation=16384 in 
> resource-types.xml
>  # Launch a YARN app with a container requesting 16 GB RAM.
>  
> This scenario should work, but instead you get an error similar to this:
> {code:java}
> java.lang.IllegalArgumentException: Queue maximum allocation cannot be larger 
> than the cluster setting for queue root.auto.test max allocation per queue: 
>  cluster setting:    {code}
>  
> This seems to be caused by this code in 
> ManagedParentQueue.getLeafQueueConfigs:
> {code:java}
> CapacitySchedulerConfiguration leafQueueConfigTemplate = new
> CapacitySchedulerConfiguration(new Configuration(false), false);{code}
>  
> This initializes a new leaf queue configuration that does not read 
> resource-types.xml (or any other config). Later, this 
> CapacitySchedulerConfiguration instance calls 
> ResourceUtils.fetchMaximumAllocationFromConfig()  from its 
> getMaximumAllocationPerQueue() method and passes itself as the configuration 
> to use. Since the resource types are not present, ResourceUtils falls back to 
> compiled-in defaults of 8GB RAM, 4 cores.
>  
> I was able to work around this with a custom AutoCreatedQueueManagementPolicy 
> implementation which does something like this in init() and reinitialize():
> {code:java}
> for (Map.Entry<String, String> entry : this.scheduler.getConfiguration()) {
>   if (entry.getKey().startsWith("yarn.resource-types")) {
>     parentQueue.getLeafQueueTemplate().getLeafQueueConfigs()
>         .set(entry.getKey(), entry.getValue());
>   }
> }
> {code}
> However, this is obviously a very hacky way to solve the problem.
> I can submit a proper patch if someone can provide some direction as to the 
> best way to proceed.
>  






[jira] [Commented] (YARN-7494) Add multi-node lookup mechanism and pluggable nodes sorting policies to optimize placement decision

2019-05-22 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846081#comment-16846081
 ] 

Tan, Wangda commented on YARN-7494:
---

+ [~cheersyang]

> Add multi-node lookup mechanism and pluggable nodes sorting policies to 
> optimize placement decision
> --
>
> Key: YARN-7494
> URL: https://issues.apache.org/jira/browse/YARN-7494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7494.001.patch, YARN-7494.002.patch, 
> YARN-7494.003.patch, YARN-7494.004.patch, YARN-7494.005.patch, 
> YARN-7494.006.patch, YARN-7494.007.patch, YARN-7494.008.patch, 
> YARN-7494.009.patch, YARN-7494.010.patch, YARN-7494.11.patch, 
> YARN-7494.12.patch, YARN-7494.13.patch, YARN-7494.14.patch, 
> YARN-7494.15.patch, YARN-7494.16.patch, YARN-7494.17.patch, 
> YARN-7494.18.patch, YARN-7494.19.patch, YARN-7494.20.patch, 
> YARN-7494.v0.patch, YARN-7494.v1.patch, multi-node-designProposal.png
>
>
> Instead of a single node, for effectiveness we can consider a multi-node 
> lookup based on partition to start with.






[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder

2019-05-21 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845054#comment-16845054
 ] 

Tan, Wangda commented on YARN-9525:
---

Thanks [~pbacsko] for the patch.

The change looks good to me. [~pbacsko], could you share what tests you 
have done?

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch
>
>
> Using the IndexedFileFormat {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI throws the following exception during log 
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>   ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log aggregation setup):
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that, so we can use the IFile format against an s3a 
> target.






[jira] [Commented] (YARN-9567) Add diagnostics for outstanding resource requests on app attempts page

2019-05-20 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844506#comment-16844506
 ] 

Tan, Wangda commented on YARN-9567:
---

This looks great! Huge thanks to [~Tao Yang] for pushing it! How can a user 
request activities? Only through the REST API? It might be better if we can 
make it part of the UI.

cc: [~akhilpb], [~sunil.gov...@gmail.com]: This might be easier to support in 
UI2 than UI1, and users can get a more integrated experience from the new UI.

+ Other folks might be interested: [~vinodkv], [~aiden_zhang], [~jeffreyz], 
[~yzhangal], [~liuxun323]. 

> Add diagnostics for outstanding resource requests on app attempts page
> --
>
> Key: YARN-9567
> URL: https://issues.apache.org/jira/browse/YARN-9567
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: no_diagnostic_at_first.png, 
> show_diagnostics_after_requesting_app_activities_REST_API.png
>
>
> Currently on the app attempt page we can see outstanding resource requests; 
> it would be helpful for users to understand why they are outstanding if we 
> can join this app's diagnostics with them.
> Discussed with [~cheersyang]: we can passively load diagnostics from the 
> cache of completed app activities instead of actively triggering them, which 
> may bring uncontrollable risks.
> For example:
> (1) At first we can see no diagnostic in cache if app activities not 
> triggered below the outstanding requests.
> !no_diagnostic_at_first.png|width=793,height=248!
> (2) After requesting the application activities REST API, we can see 
> diagnostics now.
> !show_diagnostics_after_requesting_app_activities_REST_API.png|width=1046,height=276!
>  






[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2019-05-20 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844496#comment-16844496
 ] 

Tan, Wangda commented on YARN-4946:
---

Thanks [~ccondit], 

I'm not entirely convinced that the fix in this patch is correct, I don't see a 
strong need to backport this patch, and introducing a new config for this seems 
unnecessary. I would be fine with:

a. Revert this patch and update MAPREDUCE-6415 (if any changes are needed).

b. Keep this patch as-is, since it solved a minor problem but also introduces 
risks.

Adding a config parameter seems like overkill to me; I would like to avoid it 
if possible.

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9570) application in pending-ordering-policy is not considered while container allocation

2019-05-20 Thread Tan, Wangda (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tan, Wangda updated YARN-9570:
--
Summary: application in pending-ordering-policy is not considered while 
container allocation  (was: pplication in pending-ordering-policy is not 
considered while container allocation)

> application in pending-ordering-policy is not considered while container 
> allocation
> ---
>
> Key: YARN-9570
> URL: https://issues.apache.org/jira/browse/YARN-9570
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Yesha Vora
>Priority: Major
>
> This is 5 node cluster with total 15GB capacity.
> 1) Configure Capacity scheduler and set max cluster priority=10
> 2) launch app1 with no priority and wait for it to occupy the full cluster
> application_1558135983180_0001 is launched with Priority=0
> 3) launch app2 with priority=2 and check it is in ACCEPTED state
> application_1558135983180_0002 is launched with Priority=2
> 4) launch app3 with priority=3 and check it is in ACCEPTED state
> application_1558135983180_0003 is launched with Priority=3
> 5) kill container from app1
> 6) Verify app3 with higher priority goes to RUNNING state.
> When max-application-master-percentage is set to 0.1, app2 goes to RUNNING 
> state even though app3 has higher priority.
> Root cause:
> In the CS LeafQueue, there are two ordering lists:
> If the queue's total application master usage is below 
> maxAMResourcePerQueuePercent, the app will be added to the "ordering-policy" 
> list.
> Otherwise, the app will be added to the "pending-ordering-policy" list.
> During allocation, only apps in "ordering-policy" are considered. 
> If any app finishes, the queue config changes, or a node is added/removed, 
> "pending-ordering-policy" will be reconsidered, and some apps from 
> "pending-ordering-policy" will be moved to "ordering-policy".
> This behavior leads to the issue in this JIRA:
> The cluster has 15GB of resource and max-application-master-percentage is set 
> to 0.1, so it can accept at most 2GB of AM resource (1.5GB rounded up to the 
> 1GB allocation granularity), which equals 2 applications.
> When app2 is submitted, it will be added to ordering-policy.
> When app3 is submitted, it will be added to pending-ordering-policy.
> When we kill app1, it won't finish immediately. Instead, it will still be 
> part of "ordering-policy" until all of app1's containers are released. (That 
> keeps app3 in pending-ordering-policy.)
> So app3 cannot pick up any resource released by app1.
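
To illustrate the AM-resource arithmetic described above, here is a simplified 
sketch (not the actual LeafQueue code; rounding up to the 1GB minimum allocation 
is an assumption based on the description):

{code:java}
public class AmLimitExample {
  public static void main(String[] args) {
    int clusterMemoryGb = 15;      // total cluster resource from the description
    double maxAmPercent = 0.1;     // max-application-master-percentage
    int minAllocationGb = 1;       // assumed minimum container allocation

    // 15 * 0.1 = 1.5GB, rounded up to the allocation granularity -> 2GB
    double rawLimit = clusterMemoryGb * maxAmPercent;
    int amLimitGb = (int) Math.ceil(rawLimit / minAllocationGb) * minAllocationGb;

    int amSizeGb = 1;              // each AM container assumed to be 1GB
    int maxActivatedApps = amLimitGb / amSizeGb;

    System.out.println("AM resource limit: " + amLimitGb + "GB");
    System.out.println("Apps that fit under the limit: " + maxActivatedApps);
    // app1 and app2 fit under the limit and land in "ordering-policy";
    // app3 exceeds it and stays in "pending-ordering-policy" until an app
    // fully finishes (not merely gets killed).
  }
}
{code}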



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9543) UI2 should handle missing ATSv2 gracefully

2019-05-20 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844309#comment-16844309
 ] 

Tan, Wangda commented on YARN-9543:
---

Cool, thanks [~giovanni.fumarola] :)

> UI2 should handle missing ATSv2 gracefully
> --
>
> Key: YARN-9543
> URL: https://issues.apache.org/jira/browse/YARN-9543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2, yarn-ui-v2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9543.001.patch
>
>
> Resource Manager UI2 is throwing some console errors and an error page on 
> the flows page.
> Suggested improvements:
>  * Disable or remove flows tab if ATSv2 is not available/installed
>  * Handle all connection errors to ATSv2 gracefully



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4946) RM should not consider an application as COMPLETED when log aggregation is not in a terminal state

2019-05-20 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844308#comment-16844308
 ] 

Tan, Wangda commented on YARN-4946:
---

Thanks [~snemeth], [~ccondit] for commenting. 

The question I was trying to answer is: should we backport this patch to 
older releases? 

After digging into the details, I'm wondering whether we should do this or not. 
YARN-7952 should solve part of the problem: log aggregation status is saved on 
the NM as well. So the only issue this Jira could solve is: if the number of 
apps grows beyond the configured ZK state store limits, we will keep the apps 
whose log aggregation has not finished yet. I agree with what [~ccondit] 
mentioned, this exception (keeping the app in the state store) seems safe; 
however, if something bad happens, like a log aggregation bug or slowness of 
the log aggregation HDFS cluster, it will bring down the RM.

My understanding of this problem is: if RM recovery is enabled (which I believe 
most prod clusters have), an app is removed from the state store after a period 
that should be a long enough buffer for log aggregation. If log aggregation has 
still not finished by then, we should still remove the app from the RM state 
store and move on.

The description of the Jira: 
{quote}When the RM "forgets" about an older completed Application (e.g. RM 
failover, enough time has passed, etc), the tool won't find the Application in 
the RM and will just assume that its log aggregation succeeded, even if it 
actually failed or is still running.
{quote}
That seems like the right behavior when completed apps have been forgotten by the RM. 

Thoughts?

> RM should not consider an application as COMPLETED when log aggregation is 
> not in a terminal state
> --
>
> Key: YARN-4946
> URL: https://issues.apache.org/jira/browse/YARN-4946
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: log-aggregation
>Affects Versions: 2.8.0
>Reporter: Robert Kanter
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-4946.001.patch, YARN-4946.002.patch, 
> YARN-4946.003.patch, YARN-4946.004.patch
>
>
> MAPREDUCE-6415 added a tool that combines the aggregated log files for each 
> Yarn App into a HAR file.  When run, it seeds the list by looking at the 
> aggregated logs directory, and then filters out ineligible apps.  One of the 
> criteria involves checking with the RM that an Application's log aggregation 
> status is not still running and has not failed.  When the RM "forgets" about 
> an older completed Application (e.g. RM failover, enough time has passed, 
> etc), the tool won't find the Application in the RM and will just assume that 
> its log aggregation succeeded, even if it actually failed or is still running.
> We can solve this problem by doing the following:
> The RM should not consider an app to be fully completed (and thus removed 
> from its history) until the aggregation status has reached a terminal state 
> (e.g. SUCCEEDED, FAILED, TIME_OUT).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9543) UI2 should handle missing ATSv2 gracefully

2019-05-20 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844212#comment-16844212
 ] 

Tan, Wangda commented on YARN-9543:
---

[~giovanni.fumarola], do you have the same request internally? It would be great 
if you or someone from MS could help with feature testing. Appreciated.

> UI2 should handle missing ATSv2 gracefully
> --
>
> Key: YARN-9543
> URL: https://issues.apache.org/jira/browse/YARN-9543
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: ATSv2, yarn-ui-v2
>Affects Versions: 3.1.2
>Reporter: Zoltan Siegl
>Assignee: Zoltan Siegl
>Priority: Major
> Attachments: YARN-9543.001.patch
>
>
> Resource Manager UI2 is throwing some console errors and an error page on 
> the flows page.
> Suggested improvements:
>  * Disable or remove flows tab if ATSv2 is not available/installed
>  * Handle all connection errors to ATSv2 gracefully



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder

2019-05-20 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844133#comment-16844133
 ] 

Tan, Wangda commented on YARN-9525:
---

Thanks [~pbacsko].

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: IFile-S3A-POC01.patch, YARN-9525-001.patch
>
>
> Using the IndexedFileFormat with {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI throws the following exception during log 
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>   ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log aggregation setup):
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that call, so the IFile format can be used against an s3a target.
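
A minimal sketch of the write-then-stat pattern described above, assuming the 
Hadoop FileContext API (simplified illustration, not the actual 
LogAggregationIndexedFileController code):

{code:java}
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class WriteThenStat {
  public static void main(String[] args) throws Exception {
    // Remote log path, e.g. an s3a:// URI passed on the command line.
    Path remoteLog = new Path(args[0]);
    FileContext fc =
        FileContext.getFileContext(remoteLog.toUri(), new Configuration());

    FSDataOutputStream out = fc.create(remoteLog,
        EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE));
    out.write("some-uuid".getBytes("UTF-8"));  // write a marker, like the UUID
    out.hflush();                              // flush, but don't close yet

    // On S3A this can throw FileNotFoundException: the object is typically not
    // visible until the stream is closed, so stat-after-flush is unreliable.
    long len = fc.getFileStatus(remoteLog).getLen();
    System.out.println("log length = " + len);

    out.close();
  }
}
{code}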



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9525) IFile format is not working against s3a remote folder

2019-05-20 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844097#comment-16844097
 ] 

Tan, Wangda commented on YARN-9525:
---

[~pbacsko], can you rename the patch to YARN-9525.001.poc.patch and change the 
status to PA so that Jenkins can pick it up?

> IFile format is not working against s3a remote folder
> -
>
> Key: YARN-9525
> URL: https://issues.apache.org/jira/browse/YARN-9525
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.2
>Reporter: Adam Antal
>Assignee: Adam Antal
>Priority: Major
> Attachments: IFile-S3A-POC01.patch
>
>
> Using the IndexedFileFormat with {{yarn.nodemanager.remote-app-log-dir}} 
> configured to an s3a URI throws the following exception during log 
> aggregation:
> {noformat}
> Cannot create writer for app application_1556199768861_0001. Skip log upload 
> this time. 
> java.io.IOException: java.io.FileNotFoundException: No such file or 
> directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:247)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:306)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:464)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:420)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:276)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://adamantal-log-test/logs/systest/ifile/application_1556199768861_0001/adamantal-3.gce.cloudera.com_8041
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2488)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2382)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2321)
>   at 
> org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:128)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1244)
>   at org.apache.hadoop.fs.FileContext$15.next(FileContext.java:1240)
>   at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>   at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1246)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:228)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>   at 
> org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:195)
>   ... 7 more
> {noformat}
> This stack trace points to 
> {{LogAggregationIndexedFileController$initializeWriter}}, where we do the 
> following steps (in a non-rolling log aggregation setup):
> - create an FSDataOutputStream
> - write out a UUID
> - flush
> - immediately after that, call getFileStatus to get the length of the log 
> file (the bytes we just wrote out), and that's where the failure happens: 
> the file is not there yet due to eventual consistency.
> Maybe we can get rid of that call, so the IFile format can be used against an s3a target.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9517) When aggregation is not enabled, can't see the container log

2019-04-29 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829570#comment-16829570
 ] 

Tan, Wangda commented on YARN-9517:
---

[~shurong.mai], thanks for putting up a patch. However, I'm not sure why you 
closed this Jira. Is the patch or fix already in the mentioned branches?

> When aggregation is not enabled, can't see the container log
> 
>
> Key: YARN-9517
> URL: https://issues.apache.org/jira/browse/YARN-9517
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.2.0, 2.3.0, 2.4.1, 2.5.2, 2.6.5, 3.2.0, 2.9.2, 2.8.5, 
> 2.7.7, 3.1.2
>Reporter: Shurong Mai
>Priority: Major
>  Labels: patch
> Attachments: YARN-9517.patch
>
>
> yarn-site.xml
> {code:java}
> 
> yarn.log-aggregation-enable
> false
> 
> {code}
>  
> When aggregation is not enabled, we click the "container log link" (in the web 
> page 
> "http://xx:19888/jobhistory/attempts/job_1556431770792_0001/m/SUCCESSFUL")
>  after a job has finished successfully.
> After we click, it jumps to a web page displaying "Aggregation is not enabled. 
> Try the nodemanager at yy:48038", and the URL is 
> "http://xx:19888/jobhistory/logs/yy:48038/container_1556431770792_0001_01_02/attempt_1556431770792_0001_m_00_0/hadoop"
> I also found this problem in all Hadoop versions 2.x.y and 3.x.y, and I have 
> submitted a patch which is simple and applies to these Hadoop versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8193) YARN RM hangs abruptly (stops allocating resources) when running successive applications.

2019-04-26 Thread Tan, Wangda (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tan, Wangda updated YARN-8193:
--
Target Version/s: 2.9.2
Priority: Blocker  (was: Critical)

> YARN RM hangs abruptly (stops allocating resources) when running successive 
> applications.
> -
>
> Key: YARN-8193
> URL: https://issues.apache.org/jira/browse/YARN-8193
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zian Chen
>Assignee: Zian Chen
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8193-branch-2-001.patch, 
> YARN-8193-branch-2.9.0-001.patch, YARN-8193.001.patch, YARN-8193.002.patch
>
>
> When running massive queries successively, at some point the RM just hangs and 
> stops allocating resources. At the point the RM hangs, YARN throws a 
> NullPointerException at RegularContainerAllocator.getLocalityWaitFactor.
> There's sufficient space given to yarn.nodemanager.local-dirs (not a node 
> health issue, RM didn't report any node being unhealthy). There is no fixed 
> trigger for this (query or operation).
> This problem goes away on restarting ResourceManager. No NM restart is 
> required. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9445) yarn.admin.acl is futile

2019-04-08 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812881#comment-16812881
 ] 

Tan, Wangda commented on YARN-9445:
---

[~shuzirra], [~snemeth],

Changing yarn.admin.acl could cause additional security issues (like allowing 
cluster ops to run jobs and consume all the resources they were previously 
disallowed from using), and to me it is an incompatible change. I suggest not 
doing that.

Changing the default values of admin.acl and the queue ACLs are also incompatible 
changes, but the latter are important since they can potentially prevent the 
cluster from being hacked. It's better to start an email thread to discuss this.

> yarn.admin.acl is futile
> 
>
> Key: YARN-9445
> URL: https://issues.apache.org/jira/browse/YARN-9445
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.3.0
>Reporter: Peter Simon
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-9445.001.patch
>
>
> * Define a queue with restrictive administerApps settings (e.g. yarn)
>  * Set yarn.admin.acl to "*".
>  * Try to submit an application as user yarn; it is denied.
> My expected behaviour would be that, since everyone is an admin, I can 
> submit to whatever pool.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9319) Fix compilation issue of handling typedef an existing name by gcc compiler

2019-02-21 Thread Tan, Wangda (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tan, Wangda updated YARN-9319:
--
Summary: Fix compilation issue of  handling typedef an existing name by gcc 
compiler  (was: Fix compilation issue of )

> Fix compilation issue of  handling typedef an existing name by gcc compiler
> ---
>
> Key: YARN-9319
> URL: https://issues.apache.org/jira/browse/YARN-9319
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
> Environment:  RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 
> 20120313 (Red Hat 4.4.7-17) (GCC)
>Reporter: Wei-Chiu Chuang
>Assignee: Zhankun Tang
>Priority: Blocker
> Attachments: YARN-9319-trunk.001.patch
>
>
> When I do: 
> mvn clean install -DskipTests -Pdist,native  -Dmaven.javadoc.skip=true
> It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc 
> version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC))
> {noformat}
> [WARNING] [ 54%] Built target test-container-executor
> [WARNING] Linking CXX static library libgtest.a
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P 
> CMakeFiles/gtest.dir/cmake_clean_target.cmake
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script 
> CMakeFiles/gtest.dir/link.txt --verbose=1
> [WARNING] /usr/bin/ar cq libgtest.a  
> CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o
> [WARNING] /usr/bin/ranlib libgtest.a
> [WARNING] make[2]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles
>   26
> [WARNING] [ 54%] Built target gtest
> [WARNING] make[1]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] In file included from 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27:
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31:
>  error: redefinition of typedef 'update_cgroups_parameters_function'
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31:
>  note: previous declaration of 'update_cgroups_parameters_function' was here
> [WARNING] make[2]: *** 
> [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o]
>  Error 1
> [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2
> [WARNING] make[1]: *** Waiting for unfinished jobs
> [WARNING] make: *** [all] Error 2
> {noformat}
> The code compiles once I revert YARN-9060.
> [~tangzhankun], [~sunilg] care to take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9319) YARN-9060 does not compile

2019-02-21 Thread Tan, Wangda (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774430#comment-16774430
 ] 

Tan, Wangda commented on YARN-9319:
---

Committing this patch now, thanks [~tangzhankun] and everybody for review.

> YARN-9060 does not compile
> --
>
> Key: YARN-9319
> URL: https://issues.apache.org/jira/browse/YARN-9319
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
> Environment:  RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 
> 20120313 (Red Hat 4.4.7-17) (GCC)
>Reporter: Wei-Chiu Chuang
>Assignee: Zhankun Tang
>Priority: Blocker
> Attachments: YARN-9319-trunk.001.patch
>
>
> When I do: 
> mvn clean install -DskipTests -Pdist,native  -Dmaven.javadoc.skip=true
> It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc 
> version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC))
> {noformat}
> [WARNING] [ 54%] Built target test-container-executor
> [WARNING] Linking CXX static library libgtest.a
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P 
> CMakeFiles/gtest.dir/cmake_clean_target.cmake
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script 
> CMakeFiles/gtest.dir/link.txt --verbose=1
> [WARNING] /usr/bin/ar cq libgtest.a  
> CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o
> [WARNING] /usr/bin/ranlib libgtest.a
> [WARNING] make[2]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles
>   26
> [WARNING] [ 54%] Built target gtest
> [WARNING] make[1]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] In file included from 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27:
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31:
>  error: redefinition of typedef 'update_cgroups_parameters_function'
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31:
>  note: previous declaration of 'update_cgroups_parameters_function' was here
> [WARNING] make[2]: *** 
> [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o]
>  Error 1
> [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2
> [WARNING] make[1]: *** Waiting for unfinished jobs
> [WARNING] make: *** [all] Error 2
> {noformat}
> The code compiles once I revert YARN-9060.
> [~tangzhankun], [~sunilg] care to take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9319) Fix compilation issue of

2019-02-21 Thread Tan, Wangda (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tan, Wangda updated YARN-9319:
--
Summary: Fix compilation issue of   (was: YARN-9060 does not compile)

> Fix compilation issue of 
> -
>
> Key: YARN-9319
> URL: https://issues.apache.org/jira/browse/YARN-9319
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
> Environment:  RHEL 6.8, CMake 3.2.0, Java 8u151, gcc version 4.4.7 
> 20120313 (Red Hat 4.4.7-17) (GCC)
>Reporter: Wei-Chiu Chuang
>Assignee: Zhankun Tang
>Priority: Blocker
> Attachments: YARN-9319-trunk.001.patch
>
>
> When I do: 
> mvn clean install -DskipTests -Pdist,native  -Dmaven.javadoc.skip=true
> It does not compile on my machine (RHEL 6.8, CMake 3.2.0, Java 8u151, gcc 
> version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC))
> {noformat}
> [WARNING] [ 54%] Built target test-container-executor
> [WARNING] Linking CXX static library libgtest.a
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -P 
> CMakeFiles/gtest.dir/cmake_clean_target.cmake
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_link_script 
> CMakeFiles/gtest.dir/link.txt --verbose=1
> [WARNING] /usr/bin/ar cq libgtest.a  
> CMakeFiles/gtest.dir/data/4/weichiu/hadoop/hadoop-common-project/hadoop-common/src/main/native/gtest/gtest-all.cc.o
> [WARNING] /usr/bin/ranlib libgtest.a
> [WARNING] make[2]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] /opt/toolchain/cmake-3.2.0/bin/cmake -E cmake_progress_report 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native/CMakeFiles
>   26
> [WARNING] [ 54%] Built target gtest
> [WARNING] make[1]: Leaving directory 
> `/data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native'
> [WARNING] In file included from 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c:27:
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/devices/devices-module.h:31:
>  error: redefinition of typedef 'update_cgroups_parameters_function'
> [WARNING] 
> /data/4/weichiu/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/modules/fpga/fpga-module.h:31:
>  note: previous declaration of 'update_cgroups_parameters_function' was here
> [WARNING] make[2]: *** 
> [CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o]
>  Error 1
> [WARNING] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2
> [WARNING] make[1]: *** Waiting for unfinished jobs
> [WARNING] make: *** [all] Error 2
> {noformat}
> The code compiles once I revert YARN-9060.
> [~tangzhankun], [~sunilg] care to take a look?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9310) Test submarine maven module build

2019-02-16 Thread Tan, Wangda (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tan, Wangda updated YARN-9310:
--
Attachment: YARN-9310.001.patch

> Test submarine maven module build
> -
>
> Key: YARN-9310
> URL: https://issues.apache.org/jira/browse/YARN-9310
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Tan, Wangda
>Priority: Major
> Attachments: YARN-9310.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9310) Test submarine maven module build

2019-02-16 Thread Tan, Wangda (JIRA)
Tan, Wangda created YARN-9310:
-

 Summary: Test submarine maven module build
 Key: YARN-9310
 URL: https://issues.apache.org/jira/browse/YARN-9310
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Tan, Wangda






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5600) Add a parameter to ContainerLaunchContext to emulate yarn.nodemanager.delete.debug-delay-sec on a per-application basis

2016-11-10 Thread Tan, Wangda (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655703#comment-15655703
 ] 

Tan, Wangda commented on YARN-5600:
---

This is a very useful feature without any doubt. Thanks to 
[~miklos.szeg...@cloudera.com] for working on this JIRA, and thanks to 
[~Naganarasimha] / [~templedf] for reviewing the patch. 

Apologies for my very late review; I only looked at the API of the patch. Have 
you considered the alternative approach of turning on debug-delay-sec by 
passing a pre-defined environment variable? The biggest benefit is that we don't 
need to update most applications to use this feature; for example, MR/Spark 
already support specifying environment variables. Making changes to all major 
applications to use this feature sounds like a big task.

As an example, LinuxDockerContainerExecutor uses this approach and specifies 
configurations by passing env vars.

In addition, it would be better to have a global max-debug-delay-sec in 
yarn-site (which could be MAX_INT by default); considering disk space and 
security, we may not want applications to occupy disk space beyond a specified 
time.
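
For illustration, a minimal sketch of what the env-var approach could look like 
from the application side (the variable name YARN_CONTAINER_DEBUG_DELAY_SEC is 
hypothetical, not an existing YARN constant):

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class DebugDelayEnvExample {
  // Hypothetical env var name used only for this sketch.
  private static final String DEBUG_DELAY_ENV = "YARN_CONTAINER_DEBUG_DELAY_SEC";

  public static ContainerLaunchContext newLaunchContext() {
    ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);
    Map<String, String> env = new HashMap<>();
    // Ask the NM to keep launch artifacts around for 10 minutes.
    env.put(DEBUG_DELAY_ENV, "600");
    clc.setEnvironment(env);
    // On the NM side, this value would be capped by a cluster-wide
    // max-debug-delay-sec from yarn-site.xml, as suggested above.
    return clc;
  }
}
{code}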

+ [~djp]

> Add a parameter to ContainerLaunchContext to emulate 
> yarn.nodemanager.delete.debug-delay-sec on a per-application basis
> ---
>
> Key: YARN-5600
> URL: https://issues.apache.org/jira/browse/YARN-5600
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha1
>Reporter: Daniel Templeton
>Assignee: Miklos Szegedi
>  Labels: oct16-medium
> Attachments: YARN-5600.000.patch, YARN-5600.001.patch, 
> YARN-5600.002.patch, YARN-5600.003.patch, YARN-5600.004.patch, 
> YARN-5600.005.patch, YARN-5600.006.patch, YARN-5600.007.patch
>
>
> To make debugging application launch failures simpler, I'd like to add a 
> parameter to the CLC to allow an application owner to request delayed 
> deletion of the application's launch artifacts.
> This JIRA solves largely the same problem as YARN-5599, but for cases where 
> ATS is not in use, e.g. branch-2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5864) Capacity Scheduler preemption for fragmented cluster

2016-11-10 Thread Tan, Wangda (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655514#comment-15655514
 ] 

Tan, Wangda commented on YARN-5864:
---

Thanks [~curino] for sharing these insightful suggestions.

The problem you mentioned is totally true: we have put a lot of effort into 
adding features for various resource constraints (such as limits, node 
partition, priority, etc.) but have paid less attention to making the semantics 
easier and more consistent.

I also agree that we need to spend some time thinking about what semantics the 
YARN scheduler should have. For example, the minimum guarantee of CS is that a 
queue should get at least its configured capacity, but a picky app could make an 
under-utilized queue wait forever for resources. And as you mentioned above, a 
non-preemptable queue can invalidate configured capacity as well.

However, I would argue that the scheduler cannot run perfectly without 
invalidating some of the constraints. It is not just a set of formulas we define 
and hand to a solver to optimize; it involves a lot of human emotion and 
preference. For example, a user may not understand, or be glad to accept, why a 
picky request cannot be allocated even though the queue/cluster has available 
capacity. And it may not be acceptable in a production cluster that a 
long-running service for realtime queries cannot be launched because we don't 
want to kill some less-important batch jobs. My point is: if we can have these 
rules defined in the docs and users can tell what happened from the UI/logs, we 
can add them.

To improve this, I think your suggestion (1) will be more helpful and 
achievable in the short term. We can definitely remove some parameters; for 
example, the existing user-limit definition is not good enough, and 
user-limit-factor can leave a queue unable to fully utilize its capacity. 
And we can better define these semantics in the docs and UI.

(2) looks beautiful, but it may not solve the root problem directly: the first 
priority is to make our users happy to accept the outcome, rather than to solve 
it beautifully in mathematics. For example, for the problem I put in the 
description of this JIRA, I don't think (2) can make the allocation without 
harming other applications. And from an implementation perspective, I'm not sure 
how to make a solver-based solution handle both fast allocation (we want to 
allocate within milliseconds for interactive queries) and good placement (such 
as gang scheduling with other constraints like anti-affinity). It seems to me 
that we would sacrifice low latency to get better placement quality with 
option (2).

bq. This opens up many abuses, one that comes to mind ...
Actually, this feature will only be used in a pretty controlled environment: 
important long-running services run in a separate queue, and the admin/user 
agrees that the queue can preempt other batch jobs to get new containers. ACLs 
will be set to keep normal users from running inside these queues; all apps 
running in the queue should be trusted apps such as YARN native services 
(Slider), Spark, etc. And we can also make sure these apps try their best to 
respect other apps.
Please advise if you think we can improve the semantics of this feature.

Thanks,

> Capacity Scheduler preemption for fragmented cluster 
> -
>
> Key: YARN-5864
> URL: https://issues.apache.org/jira/browse/YARN-5864
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-5864.poc-0.patch
>
>
> YARN-4390 added preemption for reserved containers. However, we found one case 
> where a large container cannot be allocated even if all queues are under their 
> limit.
> For example, we have:
> {code}
> Two queues, a and b, capacity 50:50 
> Two nodes: n1 and n2, each of them have 50 resource 
> Now queue-a uses 10 on n1 and 10 on n2
> queue-b asks for one single container with resource=45. 
> {code} 
> The container could be reserved on either of the hosts, but no preemption will 
> happen because both queues are under their limits (each node has only 40 free 
> resource, which is less than the requested 45). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org