[jira] [Commented] (YARN-8468) Enable the use of queue based maximum container allocation limit and implement it in FairScheduler

2018-10-16 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651405#comment-16651405
 ] 

Weiwei Yang commented on YARN-8468:
---

Still the same issue on branch-3.1, but the recent trunk build seems fine.

I gave up on committing this to branch-3.1 :(

> Enable the use of queue based maximum container allocation limit and 
> implement it in FairScheduler
> --
>
> Key: YARN-8468
> URL: https://issues.apache.org/jira/browse/YARN-8468
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler, scheduler
>Affects Versions: 3.1.0
>Reporter: Antal Bálint Steinbach
>Assignee: Antal Bálint Steinbach
>Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-8468-branch-3.1.018.patch, 
> YARN-8468-branch-3.1.019.patch, YARN-8468-branch-3.1.020.patch, 
> YARN-8468-branch-3.1.021.patch, YARN-8468-branch-3.1.022.patch, 
> YARN-8468.000.patch, YARN-8468.001.patch, YARN-8468.002.patch, 
> YARN-8468.003.patch, YARN-8468.004.patch, YARN-8468.005.patch, 
> YARN-8468.006.patch, YARN-8468.007.patch, YARN-8468.008.patch, 
> YARN-8468.009.patch, YARN-8468.010.patch, YARN-8468.011.patch, 
> YARN-8468.012.patch, YARN-8468.013.patch, YARN-8468.014.patch, 
> YARN-8468.015.patch, YARN-8468.016.patch, YARN-8468.017.patch, 
> YARN-8468.018.patch
>
>
> When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" 
> to limit the overall size of a container. This applies globally to all 
> containers, cannot be limited per queue, and is not scheduler dependent.
> The goal of this ticket is to allow this value to be set on a per-queue basis.
> The use case: a user has two pools, one for ad hoc jobs and one for enterprise 
> apps. The user wants to limit ad hoc jobs to small containers but allow 
> enterprise apps to request as many resources as needed. 
> yarn.scheduler.maximum-allocation-mb would set the default maximum container 
> size for all queues, while the per-queue maximum would be set with the 
> "maxContainerResources" queue config value.
> Suggested solution:
> All the infrastructure is already in the code. We need to do the following:
>  * add the setting to the queue properties for all queue types (parent and 
> leaf); this will also cover dynamically created queues.
>  * if it is set on the root queue it would override the scheduler setting, so 
> we should not allow that.
>  * make sure that the queue resource cap cannot be larger than the scheduler 
> max resource cap in the config.
>  * implement getMaximumResourceCapability(String queueName) in the 
> FairScheduler (a sketch follows below this list).
>  * implement getMaximumResourceCapability(String queueName) in both 
> FSParentQueue and FSLeafQueue in a similar way.
>  * expose the setting in the queue information in the RM web UI.
>  * expose the setting in the metrics etc. for the queue.
>  * enforce the use of the queue-based maximum allocation limit if it is 
> available; if not, use the general scheduler-level setting.
>  ** use it during validation and normalization of requests in 
> scheduler.allocate, app submission and resource requests.
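For illustration, a minimal sketch (not the committed patch) of the per-queue lookup with a fallback to the scheduler-wide maximum, as referenced in the bullet list above. The queue-level getter name is a placeholder; only the QueueManager lookup and the existing scheduler-wide getMaximumResourceCapability() are taken from the current code structure.

{code:java}
// Sketch only -- illustrates the suggested behaviour, not the actual patch.
// Falls back to the scheduler-wide maximum when the queue defines no cap.
public Resource getMaximumResourceCapability(String queueName) {
  FSQueue queue = queueMgr.getQueue(queueName);
  if (queue != null && queue.getMaxContainerAllocation() != null) { // placeholder getter
    return queue.getMaxContainerAllocation();
  }
  return getMaximumResourceCapability(); // yarn.scheduler.maximum-allocation-mb/-vcores
}
{code}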



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8877) Extend service spec to allow setting resource attributes

2018-10-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651500#comment-16651500
 ] 

Hadoop QA commented on YARN-8877:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
28s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 4 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
7s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 17s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
12s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  7m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  7m  
9s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
1m  8s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch 
generated 1 new + 42 unchanged - 0 fixed = 43 total (was 42) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
10m 26s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
41s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 13m 
16s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
29s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 82m 18s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8877 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944076/YARN-8877.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux b79000dfed0a 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 
07:31:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0bf8a11 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 

[jira] [Assigned] (YARN-7756) AMRMProxyService can't enable 'hadoop.security.authorization'

2018-10-16 Thread leiqiang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-7756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leiqiang reassigned YARN-7756:
--

Assignee: leiqiang

> AMRMProxyService can't enable 'hadoop.security.authorization'
> --
>
> Key: YARN-7756
> URL: https://issues.apache.org/jira/browse/YARN-7756
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.9.0, 3.0.0
>Reporter: leiqiang
>Assignee: leiqiang
>Priority: Major
> Attachments: YARN-7756.v0.patch, YARN-7756.v1.patch
>
>
> After setting hadoop.security.authorization=true, starting the AMRMProxyService 
> fails with the following error:
> {quote}org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter
>  failed in state STARTED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.security.authorize.AuthorizationException: Protocol 
> interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB is not known.
>  org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.security.authorize.AuthorizationException: Protocol 
> interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB is not known.
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:177)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:121)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:250)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>  at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:844)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>  at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>  at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1114)
>  at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>  at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1529)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1803)
>  at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1525)
>  at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1458)
>  Caused by: org.apache.hadoop.security.authorize.AuthorizationException: 
> Protocol interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB is 
> not known.
>  at sun.reflect.GeneratedConstructorAccessor14.newInstance(Unknown Source)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>  at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>  at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
>  at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>  at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>  at com.sun.proxy.$Proxy36.registerApplicationMaster(Unknown Source)
>  at 
> org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:161)
>  ... 14 more
>  Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException):
>  Protocol interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB is 
> not known.
>  at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>  at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>  at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>  at com.sun.proxy.$Proxy35.registerApplicationMaster(Unknown Source)
>  at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:107)
>  ... 21 more
> {quote}
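For context, when hadoop.security.authorization is enabled, every RPC protocol a daemon serves must be registered through a PolicyProvider, and the error above indicates that the AMRMProxy's RPC server has no such entry for ApplicationMasterProtocolPB. Below is a minimal sketch of the kind of registration involved; it is illustrative only, not the actual patch, and the ACL key is the one normally used for the AM protocol in hadoop-policy.xml.

{code:java}
// Illustrative sketch only -- a PolicyProvider that makes ApplicationMasterProtocolPB
// "known" to the authorization layer when hadoop.security.authorization=true.
import org.apache.hadoop.security.authorize.PolicyProvider;
import org.apache.hadoop.security.authorize.Service;
import org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB;

public class AMRMProxyPolicyProvider extends PolicyProvider {
  @Override
  public Service[] getServices() {
    return new Service[] {
        new Service("security.applicationmaster.protocol.acl",
            ApplicationMasterProtocolPB.class)
    };
  }
}
{code}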



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Updated] (YARN-8873) Add CSI java-based client library

2018-10-16 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8873:
--
Attachment: YARN-8873.004.PATCH

> Add CSI java-based client library
> -
>
> Key: YARN-8873
> URL: https://issues.apache.org/jira/browse/YARN-8873
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8873.001.patch, YARN-8873.002.patch, 
> YARN-8873.003.patch, YARN-8873.004.PATCH
>
>
> Build a java-based client to talk to CSI drivers, through CSI gRPC services.
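For illustration, a minimal sketch of what such a client could look like, assuming Java stubs generated from the CSI csi.proto (the IdentityGrpc/Csi class names and the endpoint are assumptions, not part of this patch; real CSI drivers usually listen on a unix domain socket rather than a TCP port):

{code:java}
// Sketch only -- probes a CSI driver's Identity service over gRPC.
// IdentityGrpc and Csi.* are assumed to be stubs generated from csi.proto.
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class CsiIdentityProbe {
  public static void main(String[] args) {
    ManagedChannel channel = ManagedChannelBuilder
        .forAddress("localhost", 10000)  // illustrative endpoint
        .usePlaintext()
        .build();
    IdentityGrpc.IdentityBlockingStub identity = IdentityGrpc.newBlockingStub(channel);
    Csi.GetPluginInfoResponse info =
        identity.getPluginInfo(Csi.GetPluginInfoRequest.newBuilder().build());
    System.out.println("CSI driver: " + info.getName());
    channel.shutdown();
  }
}
{code}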



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651181#comment-16651181
 ] 

Wangda Tan commented on YARN-8875:
--

Thanks [~liuxun323]. 

More comments: 
1) Inside Index.md, we should remove "Installation guide" / "Installation guide 
Chinese version" and link to HowToInstall.html instead. 

2) There are many links that use .md; however, since Hadoop generates html links, you 
should use .html instead, e.g.: 
{code}
[EN](InstallationGuide.md)  # you should use ...html
{code}

3) The image link doesn't work after the doc is generated. 
You should put the image under resources/images like the other images. Example: 
{code}
...
See below screenshot:
![alt text](./images/tensorboard-service.png "Tensorboard service")
...
{code}

4) Test the site. 
You can use the following command to test the site:
{code}
mvn clean site:site -Preleasedocs; mvn site:stage 
-DstagingDirectory=/tmp/hadoop-site
{code}

Once it finishes, you can open /tmp/hadoop-site/hadoop-project/index.html in 
your browser and check the doc. Submarine can be found in the left nav panel. 

> [Submarine] Add documentation for submarine installation script details
> ---
>
> Key: YARN-8875
> URL: https://issues.apache.org/jira/browse/YARN-8875
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xun Liu
>Assignee: Xun Liu
>Priority: Critical
> Attachments: YARN-8875.001.patch, YARN-8875.002.patch, 
> YARN-8875.003.patch, YARN-8875.004.patch
>
>
> YARN-8870: submarine installation guide



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8879:
---
Attachment: YARN-8879.002.patch

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8879.001.patch, YARN-8879.002.patch
>
>
> When I submitted a submarine job like this:
> {code:java}
>  ./yarn jar 
> /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  --env DOCKER_JAVA_HOME=/opt/java \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
>  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
>  --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
>  --input_path hdfs://mldev/tmp/cifar-10-data \
>  --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
>  --num_ps 1 \
>  --ps_resources memory=4G,vcores=2,gpu=0 \
>  --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
>  --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
>  --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
>  --num_workers 2 \
>  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
> --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"  {code}
>  
> The following error was thrown:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos 
> principal or keytab is missing.
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at 
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at 
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8851) [Umbrella] A new pluggable device plugin framework to ease vendor plugin development

2018-10-16 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8851:
---
Attachment: YARN-8851-WIP4-trunk.001.patch

> [Umbrella] A new pluggable device plugin framework to ease vendor plugin 
> development
> 
>
> Key: YARN-8851
> URL: https://issues.apache.org/jira/browse/YARN-8851
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8851-WIP2-trunk.001.patch, 
> YARN-8851-WIP3-trunk.001.patch, YARN-8851-WIP4-trunk.001.patch, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA devices in YARN in a native, tightly coupled 
> way. But it is difficult for a vendor to implement such a device plugin 
> because the developer needs deep knowledge of YARN internals, and this puts a 
> burden on the community to maintain both YARN core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin 
> development and provide a more flexible way to integrate with the YARN NM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651487#comment-16651487
 ] 

Hadoop QA commented on YARN-8879:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 57s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
19s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 15s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core:
 The patch generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  1s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 13m 
25s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  1m 
56s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 69m 18s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8879 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944077/YARN-8879.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux a595ece4ddb1 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0bf8a11 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/22199/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22199/testReport/ |
| Max. process+thread count | 754 (vs. ulimit of 1) |
| modules | C: 

[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651242#comment-16651242
 ] 

Sunil Govindan commented on YARN-8879:
--

[~yuan_zac] With this, we will double-validate that the principal name is not 
empty. However, what will happen when the keytab is not there?

{{kerberosPrincipal.getKeytab()}}

Do we need to validate this as well?
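For illustration, a sketch of the kind of additional check being discussed (not the actual ServiceApiUtil code); {{service}} is the submitted service spec and the getters are the YARN service API record accessors:

{code:java}
// Sketch only -- reject a spec whose principal name is set but whose keytab is
// missing, in addition to the existing empty-principal check.
KerberosPrincipal kp = service.getKerberosPrincipal();
if (kp != null
    && kp.getPrincipalName() != null && !kp.getPrincipalName().isEmpty()
    && (kp.getKeytab() == null || kp.getKeytab().isEmpty())) {
  throw new IllegalArgumentException(
      "Kerberos principal is set but no keytab is provided");
}
{code}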

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8879.001.patch
>
>
> When I submitted a submarine job like this:
> {code:java}
>  ./yarn jar 
> /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  --env DOCKER_JAVA_HOME=/opt/java \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
>  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
>  --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
>  --input_path hdfs://mldev/tmp/cifar-10-data \
>  --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
>  --num_ps 1 \
>  --ps_resources memory=4G,vcores=2,gpu=0 \
>  --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
>  --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
>  --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
>  --num_workers 2 \
>  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
> --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"  {code}
>  
> The following error was thrown:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos 
> principal or keytab is missing.
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at 
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at 
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8826) Fix lingering timeline collector after serviceStop in TimelineCollectorManager

2018-10-16 Thread Prabha Manepalli (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabha Manepalli updated YARN-8826:
---
Attachment: YARN-8826.v2.patch

> Fix lingering timeline collector after serviceStop in TimelineCollectorManager
> --
>
> Key: YARN-8826
> URL: https://issues.apache.org/jira/browse/YARN-8826
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Prabha Manepalli
>Assignee: Prabha Manepalli
>Priority: Trivial
> Attachments: YARN-8826.v1.patch, YARN-8826.v2.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8873) Add CSI java-based client library

2018-10-16 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651273#comment-16651273
 ] 

Weiwei Yang commented on YARN-8873:
---

The UT failure was not caused by this patch; see YARN-8856 for more information.

> Add CSI java-based client library
> -
>
> Key: YARN-8873
> URL: https://issues.apache.org/jira/browse/YARN-8873
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8873.001.patch, YARN-8873.002.patch, 
> YARN-8873.003.patch
>
>
> Build a java-based client to talk to CSI drivers, through CSI gRPC services.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8873) Add CSI java-based client library

2018-10-16 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8873:
--
Attachment: YARN-8873.003.patch

> Add CSI java-based client library
> -
>
> Key: YARN-8873
> URL: https://issues.apache.org/jira/browse/YARN-8873
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8873.001.patch, YARN-8873.002.patch, 
> YARN-8873.003.patch
>
>
> Build a java-based client to talk to CSI drivers, through CSI gRPC services.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651286#comment-16651286
 ] 

Hadoop QA commented on YARN-8879:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
25s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 29s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 18s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core:
 The patch generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10) {color} 
|
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 31s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
17s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 13m 
15s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 69m 54s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8879 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944062/YARN-8879.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 5e4dac877b45 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0bf8a11 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/22195/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22195/testReport/ |
| Max. process+thread count | 754 (vs. ulimit of 1) |
| modules | C: 

[jira] [Created] (YARN-8887) Support isolation in pluggable device framework

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8887:
--

 Summary: Support isolation in pluggable device framework
 Key: YARN-8887
 URL: https://issues.apache.org/jira/browse/YARN-8887
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


Device isolation needs a complete description in the API specs and a translator in 
the adapter to convert the requirements into uniform parameters passed to the 
native container-executor. It should support both the cgroups and Docker runtimes.
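As a rough illustration of the translation involved (all names below are hypothetical, not the final API), the adapter would turn the allocation decision into uniform parameters such as cgroups devices-subsystem entries or Docker --device mappings:

{code:java}
// Hypothetical sketch only -- deny access to every device that was not allocated
// to the container, expressed as cgroups devices entries that the native
// container-executor could apply. Device, getMajorNumber and getMinorNumber are
// illustrative names.
List<String> toCgroupsDeniedDevices(Set<Device> allocated, Set<Device> all) {
  List<String> denied = new ArrayList<>();
  for (Device d : all) {
    if (!allocated.contains(d)) {
      denied.add("c " + d.getMajorNumber() + ":" + d.getMinorNumber() + " rwm");
    }
  }
  return denied;
}
{code}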



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8851) [Umbrella] A new pluggable device plugin framework to ease vendor plugin development

2018-10-16 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8851:
---
Summary: [Umbrella] A new pluggable device plugin framework to ease vendor 
plugin development  (was: [Umbrella] A new device plugin framework to ease 
vendor plugin development)

> [Umbrella] A new pluggable device plugin framework to ease vendor plugin 
> development
> 
>
> Key: YARN-8851
> URL: https://issues.apache.org/jira/browse/YARN-8851
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8851-WIP2-trunk.001.patch, 
> YARN-8851-WIP3-trunk.001.patch, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf, [YARN-8851] 
> YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA devices in YARN in a native, tightly coupled 
> way. But it is difficult for a vendor to implement such a device plugin 
> because the developer needs deep knowledge of YARN internals, and this puts a 
> burden on the community to maintain both YARN core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin 
> development and provide a more flexible way to integrate with the YARN NM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8883) Provide an example of fake vendor plugin

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8883:
--

 Summary: Provide an example of fake vendor plugin
 Key: YARN-8883
 URL: https://issues.apache.org/jira/browse/YARN-8883
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8890) Port existing GPU module into pluggable device framework

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8890:
--

 Summary: Port existing GPU module into pluggable device framework
 Key: YARN-8890
 URL: https://issues.apache.org/jira/browse/YARN-8890
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


Once the pluggable device framework matures, we can port the existing GPU-related 
code into this new framework.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8880) Add configurations for pluggable plugin framework

2018-10-16 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8880:
---
Description: 
Added two configurations for the pluggable device framework.
{code:java}
<property>
  <name>yarn.nodemanager.pluggable-device-framework.enable</name>
  <value>true/false</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-classes</name>
  <value></value>
</property>
{code}

The admin needs to know the register resource name of every plugin classes 
configured. And declare them 

  was:
Added two configurations for the pluggable device framework.
{code:java}
<property>
  <name>yarn.nodemanager.pluggable-device-framework.enable</name>
  <value>true/false</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-classes</name>
  <value></value>
</property>
{code}


> Add configurations for pluggable plugin framework
> -
>
> Key: YARN-8880
> URL: https://issues.apache.org/jira/browse/YARN-8880
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Added two configurations for the pluggable device framework.
> {code:java}
> <property>
>   <name>yarn.nodemanager.pluggable-device-framework.enable</name>
>   <value>true/false</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource-plugins.pluggable-classes</name>
>   <value></value>
> </property>
> {code}
> The admin needs to know the register resource name of every plugin classes 
> configured. And declare them 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8880) Add configurations for pluggable plugin framework

2018-10-16 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8880:
---
Description: 
Added two configurations for the pluggable device framework.
{code:java}
<property>
  <name>yarn.nodemanager.pluggable-device-framework.enable</name>
  <value>true/false</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-classes</name>
  <value></value>
</property>
{code}

The admin needs to know the registered resource name of every configured plugin 
class and declare them in resource-types.xml.
Please note that the count value defined in node-resource.xml will be 
overridden by the plugin.

  was:
Added two configurations for the pluggable device framework.
{code:java}
<property>
  <name>yarn.nodemanager.pluggable-device-framework.enable</name>
  <value>true/false</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-classes</name>
  <value></value>
</property>
{code}

The admin needs to know the register resource name of every plugin classes 
configured. And declare them 


> Add configurations for pluggable plugin framework
> -
>
> Key: YARN-8880
> URL: https://issues.apache.org/jira/browse/YARN-8880
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Added two configurations for the pluggable device framework.
> {code:java}
> <property>
>   <name>yarn.nodemanager.pluggable-device-framework.enable</name>
>   <value>true/false</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource-plugins.pluggable-classes</name>
>   <value></value>
> </property>
> {code}
> The admin needs to know the registered resource name of every configured plugin 
> class and declare them in resource-types.xml.
> Please note that the count value defined in node-resource.xml will be 
> overridden by the plugin.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8891) Documentation of the pluggable device framework

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8891:
--

 Summary: Documentation of the pluggable device framework
 Key: YARN-8891
 URL: https://issues.apache.org/jira/browse/YARN-8891
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8879:
---
Attachment: (was: YARN-8879.patch)

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8879.001.patch
>
>
> When I submitted a submarine job like this:
> {code:java}
>  ./yarn jar 
> /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  --env DOCKER_JAVA_HOME=/opt/java \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
>  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
>  --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
>  --input_path hdfs://mldev/tmp/cifar-10-data \
>  --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
>  --num_ps 1 \
>  --ps_resources memory=4G,vcores=2,gpu=0 \
>  --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
>  --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
>  --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
>  --num_workers 2 \
>  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
> --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"  {code}
>  
> The following error was thrown:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos 
> principal or keytab is missing.
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at 
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at 
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8879:
---
Attachment: YARN-8879.001.patch

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8879.001.patch
>
>
> When I submitted a submarine job like this:
> {code:java}
>  ./yarn jar 
> /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  --env DOCKER_JAVA_HOME=/opt/java \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
>  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
>  --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
>  --input_path hdfs://mldev/tmp/cifar-10-data \
>  --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
>  --num_ps 1 \
>  --ps_resources memory=4G,vcores=2,gpu=0 \
>  --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
>  --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
>  --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
>  --num_workers 2 \
>  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
> --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"  {code}
>  
> The following error was thrown:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos 
> principal or keytab is missing.
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at 
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at 
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8881) Add basic pluggable device plugin framework

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8881:
--

 Summary: Add basic pluggable device plugin framework
 Key: YARN-8881
 URL: https://issues.apache.org/jira/browse/YARN-8881
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


It includes adding support in "ResourcePluginManager" to enable the framework 
and load plugin classes based on configuration, an interface for vendors to 
implement, and an adapter to decouple plugins from YARN internals.
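For illustration, a hypothetical shape of the vendor-facing interface (the real definition is in the design document and WIP patches attached to YARN-8851; all names here are placeholders):

{code:java}
// Hypothetical sketch only -- the kind of interface a vendor would implement,
// keeping YARN internals behind the adapter.
public interface DevicePlugin {
  /** Resource name to register with the NM, e.g. "cmpA.com/hdwA" (illustrative). */
  String getRegisteredResourceName();
  /** Report the devices currently present on this node. */
  Set<Device> getDevices();
  /** Callback when devices are allocated to a container. */
  void onDevicesAllocated(Set<Device> allocated);
  /** Callback when devices are released by a container. */
  void onDevicesReleased(Set<Device> released);
}
{code}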



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8826) Fix lingering timeline collector after serviceStop in TimelineCollectorManager

2018-10-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651290#comment-16651290
 ] 

Hadoop QA commented on YARN-8826:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
18s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m  7s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
21s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  9s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
24s{color} | {color:green} hadoop-yarn-server-timelineservice in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
43s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 55m  0s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8826 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944067/YARN-8826.v2.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 0a8a32731b65 3.13.0-144-generic #193-Ubuntu SMP Thu Mar 15 
17:03:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0bf8a11 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22196/testReport/ |
| Max. process+thread count | 339 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/22196/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |

[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651307#comment-16651307
 ] 

Zac Zhou commented on YARN-8879:


Thanks, [~sunilg].

There is already validation logic for kerberosPrincipal.getKeytab().

The 002 patch makes the test case validate the keytab more clearly.
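For context, a simplified sketch of the kind of validation involved (illustrative only, not the actual ServiceApiUtil code):
{code:java}
// Illustrative sketch only, not the actual ServiceApiUtil code: reject the
// submission when the Kerberos principal name or the keytab is missing.
public class KerberosSpecCheck {
  static void validateKerberosPrincipal(String principalName, String keytab) {
    if (principalName == null || principalName.isEmpty()
        || keytab == null || keytab.isEmpty()) {
      throw new IllegalArgumentException("Kerberos principal or keytab is missing.");
    }
  }

  public static void main(String[] args) {
    validateKerberosPrincipal("primary/instance@REALM",
        "file:///etc/security/keytabs/user.keytab"); // placeholder values
  }
}
{code}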

 

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8879.001.patch, YARN-8879.002.patch
>
>
> when I submitted a submarine job like this:
> {code:java}
>  ./yarn jar 
> /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  --env DOCKER_JAVA_HOME=/opt/java \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
>  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
>  --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
>  --input_path hdfs://mldev/tmp/cifar-10-data \
>  --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
>  --num_ps 1 \
>  --ps_resources memory=4G,vcores=2,gpu=0 \
>  --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
>  --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
>  --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
>  --num_workers 2 \
>  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
> --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"  {code}
>  
> The following error was thrown:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos 
> principal or keytab is missing.
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at 
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at 
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8885) Support NM APIs to query device resource allocation

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8885:
--

 Summary: Support NM APIs to query device resource allocation
 Key: YARN-8885
 URL: https://issues.apache.org/jira/browse/YARN-8885
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


Support a REST API in the NM for users to query device allocation:

*_nodemanager_address:port/ws/v1/node/resources/\{resource_name}_*
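For illustration, a minimal client-side sketch of querying such an endpoint; the host, port and resource name below are placeholders, and the response format is an assumption:
{code:java}
// Minimal sketch of a client calling the proposed NM endpoint. The host, port
// and resource name are placeholders; the actual response schema is TBD.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class DeviceAllocationQuery {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://nm-host:8042/ws/v1/node/resources/vendor.com%2Fdevice");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Accept", "application/json");
    try (BufferedReader in =
        new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON describing the per-container allocation
      }
    }
  }
}
{code}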



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8877) Extend service spec to allow setting resource attributes

2018-10-16 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-8877:
--
Attachment: YARN-8877.001.patch

> Extend service spec to allow setting resource attributes
> 
>
> Key: YARN-8877
> URL: https://issues.apache.org/jira/browse/YARN-8877
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
> Attachments: YARN-8877.001.patch
>
>
> Extend yarn native service spec to support setting resource attributes in the 
> spec file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-10-16 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651349#comment-16651349
 ] 

Weiwei Yang commented on YARN-8513:
---

Hi [~leftnoteasy]/[~cyfdecyf]/[~hustnn]/[~Tao Yang]

When this issue was created with the logs attached, it looked very suspicious that there 
is a bug in {{CS#allocateContainersToNode}} that causes it to never break out of the 
following while loop,
{code:java}
while (canAllocateMore(...)) {...}
{code}
then I suggested that [~cyfdecyf] change the property 
{{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} 
from *-1* (the default value) to *10*, which seems to work around the issue. I 
think we should change the default value to 10; this sort of unbounded greedy 
lookup is not safe. I am going to submit a patch to change the default value.

Thoughts?
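For anyone hitting this before a fix lands, a minimal sketch of the work-around described above; in practice the property is set in the scheduler configuration file, and it is shown through the Configuration API here only for illustration:
{code:java}
// Sketch of the work-around: cap per-heartbeat container assignments at 10
// instead of the unbounded default (-1). Shown via the Configuration API for
// illustration; in practice this is set in the scheduler configuration file.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MaxAssignmentsWorkaround {
  private static final String KEY =
      "yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments";

  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setInt(KEY, 10); // bound the per-node-heartbeat allocation loop
    System.out.println(KEY + " = " + conf.getInt(KEY, -1));
  }
}
{code}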

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 3.1.0, 2.9.1
> Environment: Ubuntu 14.04.5 and 16.04.4
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, 
> yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, 
> yarn3-resourcemanager.log, yarn3-top
>
>
> ResourceManager does not respond to any request when queue is near fully 
> utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM 
> restart, it can recover running jobs and start accepting new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I encounter this problem several times after upgrading to YARN 2.9.1, while 
> the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8880) Add configurations for pluggable plugin framework

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8880:
--

 Summary: Add configurations for pluggable plugin framework
 Key: YARN-8880
 URL: https://issues.apache.org/jira/browse/YARN-8880
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


Added two configurations for the pluggable device framework.
{code:java}
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-device-framework.enable</name>
  <value>true/false</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-class</name>
  <value></value>
</property>
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8884) Support monitoring of device resource through plugin API

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8884:
--

 Summary: Support monitoring of device resource through plugin API
 Key: YARN-8884
 URL: https://issues.apache.org/jira/browse/YARN-8884
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


In the current design, the device resource count is reported by the plugin when the NM 
starts, but it won't get updated even when devices become broken. We should support 
monitoring and updating the device resource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8888) Support device topology scheduling

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-:
--

 Summary: Support device topology scheduling
 Key: YARN-
 URL: https://issues.apache.org/jira/browse/YARN-
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang


An easy way for a vendor plugin to describe topology information should be 
provided in the Device spec, and the topology information will be used in the device 
local scheduler to boost performance for shared devices.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8889) Added well-defined interface in container-executor to support vendor plugins isolation request

2018-10-16 Thread Zhankun Tang (JIRA)
Zhankun Tang created YARN-8889:
--

 Summary: Added well-defined interface in container-executor to 
support vendor plugins isolation request
 Key: YARN-8889
 URL: https://issues.apache.org/jira/browse/YARN-8889
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhankun Tang


Because of the different container runtimes, the isolation request from a vendor 
device plugin may be raised before container launch (cgroups operations) or at 
container launch (Docker runtime).

An easy-to-use interface in container-executor should be provided to support 
the above requirements.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8880) Add configurations for pluggable plugin framework

2018-10-16 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8880:
---
Description: 
Added two configurations for the pluggable device framework.
{code:java}
<property>
  <name>yarn.nodemanager.pluggable-device-framework.enable</name>
  <value>true/false</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-class</name>
  <value></value>
</property>
{code}

  was:
Added two configurations for the pluggable device framework.
{code:java}
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-device-framework.enable</name>
  <value>true/false</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-class</name>
  <value></value>
</property>
{code}


> Add configurations for pluggable plugin framework
> -
>
> Key: YARN-8880
> URL: https://issues.apache.org/jira/browse/YARN-8880
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Added two configurations for the pluggable device framework.
> {code:java}
> <property>
>   <name>yarn.nodemanager.pluggable-device-framework.enable</name>
>   <value>true/false</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource-plugins.pluggable-class</name>
>   <value></value>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8880) Add configurations for pluggable plugin framework

2018-10-16 Thread Zhankun Tang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhankun Tang updated YARN-8880:
---
Description: 
Added two configurations for the pluggable device framework.
{code:java}
<property>
  <name>yarn.nodemanager.pluggable-device-framework.enable</name>
  <value>true/false</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-classes</name>
  <value></value>
</property>
{code}

  was:
Added two configurations for the pluggable device framework.
{code:java}
<property>
  <name>yarn.nodemanager.pluggable-device-framework.enable</name>
  <value>true/false</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.pluggable-class</name>
  <value></value>
</property>
{code}


> Add configurations for pluggable plugin framework
> -
>
> Key: YARN-8880
> URL: https://issues.apache.org/jira/browse/YARN-8880
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Major
>
> Added two configurations for the pluggable device framework.
> {code:java}
> <property>
>   <name>yarn.nodemanager.pluggable-device-framework.enable</name>
>   <value>true/false</value>
> </property>
> <property>
>   <name>yarn.nodemanager.resource-plugins.pluggable-classes</name>
>   <value></value>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing

2018-10-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652044#comment-16652044
 ] 

Hadoop QA commented on YARN-8798:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m  8s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
24s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 16s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine: 
The patch generated 4 new + 25 unchanged - 0 fixed = 29 total (was 25) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 34s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  0m 
41s{color} | {color:green} hadoop-yarn-submarine in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 55m 25s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8798 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944163/YARN-8798-trunk.002.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ba38624d20b6 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0c2914e |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-YARN-Build/22203/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-submarine.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22203/testReport/ |
| Max. process+thread count | 306 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine 
U: 

[jira] [Updated] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing

2018-10-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8798:
-
Target Version/s: 3.2.0
Priority: Critical  (was: Major)

> [Submarine] Job should not be submitted if "--input_path" option is missing
> ---
>
> Key: YARN-8798
> URL: https://issues.apache.org/jira/browse/YARN-8798
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Critical
> Attachments: YARN-8798-trunk.001.patch, YARN-8798-trunk.002.patch
>
>
> If a user doesn't set the "--input_path" option, the job will still be submitted. 
> Here is my command to run the job:
> {code:java}
> yarn jar 
> $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  -verbose \
>  -wait_job_finish \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
>  --name tf-job-001 \
>  --docker_image tangzhankun/tensorflow \
>  --worker_resources memory=4G,vcores=2 \
>  --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py 
> --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 
> --train-steps=5"{code}
> Due to the lack of a validity check, the job is still submitted. We should add a 
> check for this.
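A minimal sketch of the kind of check being proposed; the class, method and message are illustrative, not the actual patch:
{code:java}
// Illustrative sketch of the proposed validation, not the actual patch:
// fail fast when --input_path was not supplied instead of submitting the job.
import org.apache.commons.cli.ParseException;

public final class InputPathCheck {
  static void validateInputPath(String inputPath) throws ParseException {
    if (inputPath == null || inputPath.trim().isEmpty()) {
      throw new ParseException("--input_path is absent, please specify it");
    }
  }

  public static void main(String[] args) throws ParseException {
    validateInputPath(args.length > 0 ? args[0] : null); // throws when missing
  }
}
{code}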



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8892) YARN UI2 doc improvement to update security status

2018-10-16 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-8892:
-
Attachment: YARN-8892.001.patch

> YARN UI2 doc improvement to update security status
> --
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under a kerberized env as well. Update this in the doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8892) YARN UI2 doc improvement to update security status

2018-10-16 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-8892:
-
Fix Version/s: (was: 3.2.0)

> YARN UI2 doc improvement to update security status
> --
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under a kerberized env as well. Update this in the doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8892) YARN UI2 doc improvement to update security status

2018-10-16 Thread Sunil Govindan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunil Govindan updated YARN-8892:
-
Target Version/s: 3.2.0

> YARN UI2 doc improvement to update security status
> --
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under a kerberized env as well. Update this in the doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8862:
---
Attachment: YARN-8862-YARN-7402.v4.patch

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, 
> YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch, 
> YARN-8862-YARN-7402.v4.patch
>
>
> In Yarn Federation, we use the Yarn Registry to store the AMToken for UAMs in 
> secondary sub-clusters. Because there may be more app attempts later, 
> AMRMProxy cannot kill the UAM and delete the tokens when one local attempt 
> finishes. So, similar to the StateStore application table, we need the 
> ApplicationCleaner in GPG to clean up the finished app entries in the Yarn 
> Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7086) Release all containers aynchronously

2018-10-16 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652065#comment-16652065
 ] 

Manikandan R edited comment on YARN-7086 at 10/16/18 4:57 PM:
--

 
[~jlowe] I reduced I/O by removing unnecessary stdout printing and lowering the 
log level. With these changes, I ran the test cases again, and the measurements 
(in ms) between different runs for each case don't differ drastically. In 
addition to the three cases, since the original intent of this Jira is to 
release containers asynchronously to avoid potential deadlocks, I added a 4th 
case that releases each container asynchronously, one at a time, just to 
understand the difference between traversing the container list multiple times 
vs handling each container separately. Based on the results below, the 2nd case 
(multiple container list traversals) not only reduces performance but also 
increases the complexity of the code. With the 4th case, the code changes are 
simple and clean. Though the 4th case takes more time than the 1st and 3rd 
cases, can we pick the 4th case given that we want to release containers 
asynchronously? Thoughts? 

 
||Run||Existing code||With Patch (Async release + multiple container list traversal)||With Patch (Not Async release + multiple container list traversal)||With Patch (Async Release for each container separately)||
|1|496|1430 |444|1067|
|2|490|1604 |453 |1401|
|3|427|1133 |438|972|
|4|482|1342 |429 |1228|
|5|459|1106 |412 |1176|
|Average of 5 runs|470.8|1323|435.2|1168.8|

 


was (Author: maniraj...@gmail.com):
 
[~jlowe] Reduced I/O's by removing unnecessary stdout printing and reducing log 
level. With these changes, ran the test cases again and measurements (in ms) 
between different runs for each cases doesn't differ drastically. In addition 
to three cases, since original intent of this Jira is to release container 
asynchronously, added 4th case of releasing container asynchronously for every 
single container sequentially just to understand the difference between 
multiple container list traversal vs handling single container separately. 
Based on the below results, 2nd case - multiple container list traversal is not 
only reduce the performance but increase the complexity of the code. With 4th 
case, code changes are simple and clean. Though 4th case time taken is high 
compared to 1st & 3rd case, can we pick 4th case given that we want to release 
containers async? Thoughts? 

 
||Run||Existing code||With Patch
(Async release + multiple container list traversal)||With Patch
(Not Async release + multiple container list traversal) ||With Patch 
(Async Release for each container separately)||
|1|496|1430 |444|1067|
|2|490|1604 |453 |1401|
|3|427|1133 |438|972|
|4|482|1342 |429 |1228|
|5|459|1106 |412 |1176|
|Average of 5 runs|470.8|1323|435.2|1168.8|

 

> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-7086.001.patch, YARN-7086.002.patch, 
> YARN-7086.Perf-test-case.patch
>
>
> We have noticed in production two situations that can cause deadlocks and 
> cause scheduling of new containers to come to a halt, especially with regard 
> to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the 
> scheduler releases all its live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make 
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on if this is a good idea generally (cc 
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, In YARN-6251, we already have an asyncReleaseContainer() in the 
> AbstractYarnScheduler and a corresponding scheduler event, which is currently 
> used specifically for the container-update code paths (where the scheduler 
> releases temp containers which it creates for the update)
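For readers unfamiliar with the pattern discussed above, a toy sketch of per-container asynchronous release (the 4th case in the table earlier in this thread); all names are illustrative and this is not the actual YARN scheduler code:
{code:java}
// Toy sketch of "async release for each container separately": each release is
// handed to a single-threaded dispatcher instead of being done inline under the
// caller's lock. Illustrative only; not the actual scheduler implementation.
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncReleaseSketch {
  private final ExecutorService releaseDispatcher =
      Executors.newSingleThreadExecutor();

  void releaseContainers(List<String> containerIds) {
    for (String id : containerIds) {
      // one release event per container, processed outside the caller's lock
      releaseDispatcher.submit(() -> System.out.println("released " + id));
    }
  }

  public static void main(String[] args) throws InterruptedException {
    AsyncReleaseSketch sketch = new AsyncReleaseSketch();
    sketch.releaseContainers(Arrays.asList("container_1", "container_2"));
    sketch.releaseDispatcher.shutdown();
    sketch.releaseDispatcher.awaitTermination(5, TimeUnit.SECONDS);
  }
}
{code}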



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8875) [Submarine] Add documentation for submarine installation script details

2018-10-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8875:
-
Attachment: YARN-8875.005.patch

> [Submarine] Add documentation for submarine installation script details
> ---
>
> Key: YARN-8875
> URL: https://issues.apache.org/jira/browse/YARN-8875
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xun Liu
>Assignee: Xun Liu
>Priority: Critical
> Attachments: YARN-8875.001.patch, YARN-8875.002.patch, 
> YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch
>
>
> YARN-8870: submarine installation guide



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652101#comment-16652101
 ] 

Wangda Tan commented on YARN-8875:
--

Fixed doc issues I mentioned above. (005) 

> [Submarine] Add documentation for submarine installation script details
> ---
>
> Key: YARN-8875
> URL: https://issues.apache.org/jira/browse/YARN-8875
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xun Liu
>Assignee: Xun Liu
>Priority: Critical
> Attachments: YARN-8875.001.patch, YARN-8875.002.patch, 
> YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch
>
>
> YARN-8870: submarine installation guide



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8647) Add a flag to disable move app between queues

2018-10-16 Thread Eric Payne (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652189#comment-16652189
 ] 

Eric Payne commented on YARN-8647:
--

{quote}
{quote}Addition flag to disable required{quote}
{quote}IMO we shouldn't require another flag to disable it as we are already 
checking for all the permissions.{quote}
we want to disable the feature of move queue on cluster level instead of 
disabling for few users
{quote}
[~saruntek], I understand that it would be easier for an admin to just set the 
flag and forget it. However, I would argue that if a multi-tenant cluster has 
hundreds of users, it's time to manage the usage more closely by utilizing ACL 
permissions on each queue. In general, I would prefer to not add code to the 
scheduler when there is already a pre-designed alternative.

> Add a flag to disable move app between queues
> -
>
> Key: YARN-8647
> URL: https://issues.apache.org/jira/browse/YARN-8647
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.7.3
>Reporter: sarun singla
>Assignee: Abhishek Modi
>Priority: Critical
>
> For large clusters where we have a number of users submitting applications, we 
> can run into scenarios where app developers try to move the queues for 
> their applications using something like 
> {code:java}
> yarn application -movetoqueue <Application ID> -queue <Queue Name>{code}
> Today there is no way of disabling the feature if one does not want 
> application developers to use the feature.
> *Solution:*
> We should probably add an option to disable move queue feature from RM side 
> on the cluster level.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8481) AMRMProxyPolicies should accept heartbeat response from new/unknown subclusters

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8481:
---
Issue Type: Sub-task  (was: Bug)
Parent: YARN-5597

> AMRMProxyPolicies should accept heartbeat response from new/unknown 
> subclusters
> ---
>
> Key: YARN-8481
> URL: https://issues.apache.org/jira/browse/YARN-8481
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Minor
> Fix For: 2.10.0, 3.2.0, 2.9.2
>
> Attachments: YARN-8481.v1.patch
>
>
> Currently BroadcastAMRMProxyPolicy assumes that we only span the application 
> to the sub-clusters instructed by itself via _splitResourceRequests_. 
> However, with AMRMProxy HA, second attempts of the application might come up 
> with multiple sub-clusters initially without consulting the AMRMProxyPolicy 
> at all. This leads to exceptions in _notifyOfResponse._ It should simply 
> allow the new/unknown sub-cluster heartbeat responses. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652271#comment-16652271
 ] 

Wangda Tan edited comment on YARN-8489 at 10/16/18 7:41 PM:


[~eyang],

Basically, there are four modes in submarine for training jobs.

1) A single-node notebook runs single-node TF training: 

The user has a single-node notebook which can do whatever they want. The TF job runs 
inside the notebook and is not visible to submarine.

2) A single-node notebook launches distributed TF training: 

Even though this doesn't exist today, it could be supported in the 
future, for example by adding a submarine interpreter to Zeppelin. However, the notebook 
service and the TF jobs do not belong to the same service, so this statement is 
not true: 
{quote} It would be bad user experience, if jupyter notebook and all work 
suddenly disappear when one ps server failed.
{quote}
3) Distributed TF job w/o notebook.

4) Single-node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the service. I 
haven't heard that an open source community like Jupyter supports this (connecting 
to a running distributed TF job and using it as an executor), and I haven't seen TF 
claim to support this or plan to support it.

And even if the TF/notebook community supports this case, the notebook and executors 
should belong to two separate services, just like the relationship between Jupyter / 
Spark.


was (Author: leftnoteasy):
[~eyang],

Basically there're four models in submarine for training jobs. 

1) A single node notebook runs single node TF training: 

User has a single node notebook which can do whatever they want. TF job runs 
inside the notebook, and not visible by submarine.

2) A single node notebook launches distributed TF training: 

Even this doesn't exist today, but it could be possible to be supported in the 
future. Such as adding submarine intercepter to Zeppelin. However, the notebook 
service and TF jobs are not belong to the same service, so this statement is 
not true: 
{quote} It would be bad user experience, if jupyter notebook and all work 
suddenly disappear when one ps server failed.
{quote}
3) Distributed TF job w/o notebook.

4) Single node TF job w/o notebook.

We will not support notebook and distributed TF job running in the service. I 
don't hear open source community like jupyter has support of this (connecting 
to a running distributed TF job and use it as executor). And I didn't see TF 
claims to support this or plan to support.

And even if TF/notebook community support this case, notebook and executors 
should belong to two separate services just like relationship between Jupyter / 
Spark.

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But simply put, it means a dominant component whose final state will 
> determine the job's final state regardless of the other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master goes to 
> a final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652271#comment-16652271
 ] 

Wangda Tan edited comment on YARN-8489 at 10/16/18 7:42 PM:


[~eyang],

Basically, there are four modes in submarine for training jobs.

1) A single-node notebook runs single-node TF training: 

The user has a single-node notebook which can do whatever they want. The TF job runs 
inside the notebook and is not visible to submarine.

2) A single-node notebook launches distributed TF training: 

Even though this doesn't exist today, it could be supported in the 
future, for example by adding a submarine interpreter to Zeppelin. However, the notebook 
service and the TF jobs do not belong to the same service, so this statement is 
not true: 
{quote} It would be bad user experience, if jupyter notebook and all work 
suddenly disappear when one ps server failed.
{quote}
3) Distributed TF job w/o notebook.

4) Single-node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the same 
service. I haven't heard that an open source community like Jupyter supports this 
(connecting to a running distributed TF job and using it as an executor), and I 
haven't seen TF claim to support this or plan to support it.

And even if the TF/notebook community supports this case next year or so, the notebook 
and executors should belong to two separate services, just like the relationship 
between Jupyter / Spark.


was (Author: leftnoteasy):
[~eyang],

Basically there're four modes in submarine for training jobs. 

1) A single node notebook runs single node TF training: 

User has a single node notebook which can do whatever they want. TF job runs 
inside the notebook, and not visible by submarine.

2) A single node notebook launches distributed TF training: 

Even this doesn't exist today, but it could be possible to be supported in the 
future. Such as adding submarine intercepter to Zeppelin. However, the notebook 
service and TF jobs are not belong to the same service, so this statement is 
not true: 
{quote} It would be bad user experience, if jupyter notebook and all work 
suddenly disappear when one ps server failed.
{quote}
3) Distributed TF job w/o notebook.

4) Single node TF job w/o notebook.

We will not support notebook and distributed TF job running in the service. I 
don't hear open source community like jupyter has support of this (connecting 
to a running distributed TF job and use it as executor). And I didn't see TF 
claims to support this or plan to support.

And even if TF/notebook community support this case, notebook and executors 
should belong to two separate services just like relationship between Jupyter / 
Spark.

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But simply put, it means a dominant component whose final state will 
> determine the job's final state regardless of the other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master goes to 
> a final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8879:
-
Priority: Critical  (was: Major)

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Critical
> Attachments: YARN-8879.001.patch, YARN-8879.002.patch
>
>
> when I submitted a submarine job like this:
> {code:java}
>  ./yarn jar 
> /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  --env DOCKER_JAVA_HOME=/opt/java \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
>  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
>  --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
>  --input_path hdfs://mldev/tmp/cifar-10-data \
>  --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
>  --num_ps 1 \
>  --ps_resources memory=4G,vcores=2,gpu=0 \
>  --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
>  --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
>  --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
>  --num_workers 2 \
>  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
> --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"  {code}
>  
> The following error was thrown:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos 
> principal or keytab is missing.
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at 
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at 
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8879:
-
Target Version/s: 3.2.0

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Critical
> Attachments: YARN-8879.001.patch, YARN-8879.002.patch
>
>
> when I submitted a submarine job like this:
> {code:java}
>  ./yarn jar 
> /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  --env DOCKER_JAVA_HOME=/opt/java \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
>  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
>  --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
>  --input_path hdfs://mldev/tmp/cifar-10-data \
>  --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
>  --num_ps 1 \
>  --ps_resources memory=4G,vcores=2,gpu=0 \
>  --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
>  --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
>  --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
>  --num_workers 2 \
>  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
> --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"  {code}
>  
> The following error was thrown:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos 
> principal or keytab is missing.
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at 
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at 
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652046#comment-16652046
 ] 

Wangda Tan commented on YARN-8798:
--

+1, thanks [~tangzhankun]. [~sunilg], please let me know if you have any 
concerns about putting it into 3.2.0.

 

> [Submarine] Job should not be submitted if "--input_path" option is missing
> ---
>
> Key: YARN-8798
> URL: https://issues.apache.org/jira/browse/YARN-8798
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Critical
> Attachments: YARN-8798-trunk.001.patch, YARN-8798-trunk.002.patch
>
>
> If a user doesn't set the "--input_path" option, the job will still be submitted. 
> Here is my command to run the job:
> {code:java}
> yarn jar 
> $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  -verbose \
>  -wait_job_finish \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
>  --name tf-job-001 \
>  --docker_image tangzhankun/tensorflow \
>  --worker_resources memory=4G,vcores=2 \
>  --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py 
> --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 
> --train-steps=5"{code}
> Due to the lack of a validity check, the job is still submitted. We should add a 
> check for this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing

2018-10-16 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652060#comment-16652060
 ] 

Sunil Govindan commented on YARN-8798:
--

Thanks [~leftnoteasy]. I think it's better to put it into 3.2 as well.

> [Submarine] Job should not be submitted if "--input_path" option is missing
> ---
>
> Key: YARN-8798
> URL: https://issues.apache.org/jira/browse/YARN-8798
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Critical
> Attachments: YARN-8798-trunk.001.patch, YARN-8798-trunk.002.patch
>
>
> If a user doesn't set the "--input_path" option, the job will still be submitted. 
> Here is my command to run the job:
> {code:java}
> yarn jar 
> $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  -verbose \
>  -wait_job_finish \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
>  --name tf-job-001 \
>  --docker_image tangzhankun/tensorflow \
>  --worker_resources memory=4G,vcores=2 \
>  --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py 
> --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 
> --train-steps=5"{code}
> Due to the lack of a validity check, the job is still submitted. We should add a 
> check for this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8892) YARN UI2 doc improvement to update security status

2018-10-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652136#comment-16652136
 ] 

Hadoop QA commented on YARN-8892:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
34m  1s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 19s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 50m 23s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8892 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944171/YARN-8892.001.patch |
| Optional Tests |  dupname  asflicense  mvnsite  |
| uname | Linux 99daf898a191 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 753f149 |
| maven | version: Apache Maven 3.3.9 |
| Max. process+thread count | 307 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/22204/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> YARN UI2 doc improvement to update security status
> --
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under a kerberized env as well. Update this in the doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652163#comment-16652163
 ] 

Wangda Tan commented on YARN-8489:
--

[~eyang], 

I have thought about this, but it seems to me both existing readiness checks are 
insufficient. 

In YARN service, a dependency is for launch order as well as readiness. It has to 
be a DAG.

However, in TF for example, master and ps do not depend on each other at 
launch time, but once the master has succeeded or failed, we should give the same 
state to the job. And once a ps fails, we should mark the job as failed as well.

Maybe "dominant" is not the best field to add; for TF training use cases it 
seems sufficient. But if we want better extensibility, we could add a 
ServiceControlPlugin to the service master, for which app masters can specify their 
own implementation. That should be good for people who want to integrate with the 
service framework. 

Suggestions? [~billie.rinaldi], [~gsaha].
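
To make the idea concrete, here is a hypothetical sketch of what such a plugin hook could 
look like; none of these type or method names exist in the YARN service framework today, 
they only illustrate the extension point being proposed:

{code:java}
// Hypothetical sketch only: a hook the service master could call so that an AM-supplied
// plugin decides the service's final state. These types do not exist in YARN today.
public interface ServiceControlPlugin {

  enum ComponentFinalState { SUCCEEDED, FAILED }

  enum ServiceDecision { CONTINUE, FINISH_SUCCEEDED, FINISH_FAILED }

  // Called whenever a component reaches a final state; the plugin decides whether the
  // whole service keeps running or finishes with a given state.
  ServiceDecision onComponentFinalState(String componentName, ComponentFinalState state);
}

// Example implementation matching the TF discussion above: the "master" component
// dominates the job's final state, and any failed "ps" fails the job.
class TensorflowControlPlugin implements ServiceControlPlugin {
  @Override
  public ServiceDecision onComponentFinalState(String componentName,
                                               ComponentFinalState state) {
    if ("master".equals(componentName)) {
      return state == ComponentFinalState.SUCCEEDED
          ? ServiceDecision.FINISH_SUCCEEDED : ServiceDecision.FINISH_FAILED;
    }
    if ("ps".equals(componentName) && state == ComponentFinalState.FAILED) {
      return ServiceDecision.FINISH_FAILED;
    }
    return ServiceDecision.CONTINUE;
  }
}
{code}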

 

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But put simply, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner

2018-10-16 Thread Giovanni Matteo Fumarola (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652313#comment-16652313
 ] 

Giovanni Matteo Fumarola commented on YARN-8862:


Thanks [~botong] for the patch.

NIT: GlobalPolicyGenerator#serviceStop() is missing a null-pointer (NP) check.

Otherwise, it is +1.
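
For reference, a minimal, self-contained sketch of the kind of null guard being requested; 
the field name and structure are hypothetical and are not the actual GlobalPolicyGenerator 
code:

{code:java}
// Hypothetical illustration of a null-pointer guard in serviceStop().
// "cleanerTimer" is an invented field name, not the real GPG member.
import java.util.Timer;

public class ServiceStopGuardExample {
  private Timer cleanerTimer;               // may still be null if start() never ran

  public void serviceStop() {
    if (cleanerTimer != null) {             // NP check before touching the timer
      cleanerTimer.cancel();
      cleanerTimer = null;
    }
  }

  public static void main(String[] args) {
    new ServiceStopGuardExample().serviceStop();   // safe even when never started
  }
}
{code}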

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, 
> YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch
>
>
> In Yarn Federation, we use the Yarn Registry to keep the AMToken for UAMs in 
> secondary sub-clusters. Because there may be more app attempts later, 
> AMRMProxy cannot kill the UAM and delete the tokens when one local attempt 
> finishes. So, similar to the StateStore application table, we need the 
> ApplicationCleaner in GPG to clean up the finished app entries in the Yarn 
> Registry. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details

2018-10-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652173#comment-16652173
 ] 

Hadoop QA commented on YARN-8875:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
 9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
31m 49s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 50s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
25s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 46m 44s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8875 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944177/YARN-8875.005.patch |
| Optional Tests |  dupname  asflicense  mvnsite  |
| uname | Linux 86d6bf7e71ed 3.13.0-144-generic #193-Ubuntu SMP Thu Mar 15 
17:03:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 753f149 |
| maven | version: Apache Maven 3.3.9 |
| Max. process+thread count | 340 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine 
U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/22205/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> [Submarine] Add documentation for submarine installation script details
> ---
>
> Key: YARN-8875
> URL: https://issues.apache.org/jira/browse/YARN-8875
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xun Liu
>Assignee: Xun Liu
>Priority: Critical
> Attachments: YARN-8875.001.patch, YARN-8875.002.patch, 
> YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch
>
>
> YARN-8870: submarine installation guide



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652271#comment-16652271
 ] 

Wangda Tan commented on YARN-8489:
--

[~eyang],

Basically there are four models in submarine for training jobs. 

1) A single node notebook runs single node TF training: 

The user has a single node notebook which can do whatever they want. The TF job runs 
inside the notebook and is not visible to submarine.

2) A single node notebook launches distributed TF training: 

Even though this doesn't exist today, it could be supported in the 
future, for example by adding a submarine interceptor to Zeppelin. However, the notebook 
service and the TF jobs do not belong to the same service, so this statement is 
not true: 
{quote} It would be a bad user experience if the jupyter notebook and all work 
suddenly disappeared when one ps server failed.
{quote}
3) Distributed TF job w/o notebook.

4) Single node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the same service. I 
haven't heard that an open source community like jupyter supports this (connecting 
to a running distributed TF job and using it as an executor). And I didn't see TF 
claim to support this or plan to support it.

And even if the TF/notebook community supports this case, the notebook and executors 
should belong to two separate services, just like the relationship between Jupyter and 
Spark.

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But put simply, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8892) YARN UI2 doc improvement to update security status

2018-10-16 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652053#comment-16652053
 ] 

Sunil Govindan commented on YARN-8892:
--

cc [~leftnoteasy], could you please review?

> YARN UI2 doc improvement to update security status
> --
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under kerberized env as well. update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7086) Release all containers aynchronously

2018-10-16 Thread Manikandan R (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652065#comment-16652065
 ] 

Manikandan R commented on YARN-7086:


 
[~jlowe] Reduced I/O by removing unnecessary stdout printing and reducing the log 
level. With these changes, I ran the test cases again and the measurements (in ms) 
between different runs for each case don't differ drastically. In addition 
to the three cases, since the original intent of this Jira is to release containers 
asynchronously, I added a 4th case of releasing every single container 
asynchronously, one at a time, just to understand the difference between 
traversing a list of containers vs. handling a single container separately. 
Based on the results below, the 2nd case (multiple container list traversal) not 
only reduces performance but also increases the complexity of the code. With the 4th 
case, the code changes are simple and clean. Though the 4th case takes more time 
than the 1st and 3rd cases, can we pick the 4th case given that we want to release 
containers asynchronously? Thoughts? 

 
||Run||Existing code||With Patch (Async release + multiple container list traversal)||With Patch (Not Async release + multiple container list traversal)||With Patch (Async Release for each container separately)||
|1|496|1430 |444|1067|
|2|490|1604 |453 |1401|
|3|427|1133 |438|972|
|4|482|1342 |429 |1228|
|5|459|1106 |412 |1176|
|Average of 5 runs|470.8|1323|435.2|1168.8|
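
For illustration only (this is not the YARN scheduler code and the names are invented), the 
4th case boils down to handing each release to an executor as its own task instead of 
processing the whole container list synchronously in one loop:

{code:java}
// Generic sketch of per-container asynchronous release: one task per container,
// so a bulk release cannot stall the caller. Container ids are plain strings here.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerContainerAsyncReleaseExample {
  private final ExecutorService releaseExecutor = Executors.newSingleThreadExecutor();

  // Caller returns immediately; the release work happens on the executor thread.
  public void releaseAsync(String containerId) {
    releaseExecutor.submit(() -> doRelease(containerId));
  }

  private void doRelease(String containerId) {
    System.out.println("released " + containerId);
  }

  public static void main(String[] args) {
    PerContainerAsyncReleaseExample scheduler = new PerContainerAsyncReleaseExample();
    List.of("container_1", "container_2", "container_3").forEach(scheduler::releaseAsync);
    scheduler.releaseExecutor.shutdown();   // queued releases finish, then the JVM exits
  }
}
{code}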

 

> Release all containers aynchronously
> 
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Arun Suresh
>Assignee: Manikandan R
>Priority: Major
> Attachments: YARN-7086.001.patch, YARN-7086.002.patch, 
> YARN-7086.Perf-test-case.patch
>
>
> We have noticed in production two situations that can cause deadlocks and 
> cause scheduling of new containers to come to a halt, especially with regard 
> to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the 
> scheduler releases all its live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make 
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on if this is a good idea generally (cc 
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, In YARN-6251, we already have an asyncReleaseContainer() in the 
> AbstractYarnScheduler and a corresponding scheduler event, which is currently 
> used specifically for the container-update code paths (where the scheduler 
> releases temp containers which it creates for the update)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details

2018-10-16 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652121#comment-16652121
 ] 

Sunil Govindan commented on YARN-8875:
--

+1.

> [Submarine] Add documentation for submarine installation script details
> ---
>
> Key: YARN-8875
> URL: https://issues.apache.org/jira/browse/YARN-8875
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xun Liu
>Assignee: Xun Liu
>Priority: Critical
> Attachments: YARN-8875.001.patch, YARN-8875.002.patch, 
> YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch
>
>
> YARN-8870: submarine installation guide



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8448) AM HTTPS Support

2018-10-16 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652181#comment-16652181
 ] 

Robert Kanter commented on YARN-8448:
-

{{TestCapacityOverTimePolicy}} failure is unrelated.  I'm not sure why cetest 
failed (it doesn't give any details), and it passes on my machine.

> AM HTTPS Support
> 
>
> Key: YARN-8448
> URL: https://issues.apache.org/jira/browse/YARN-8448
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8448.001.patch, YARN-8448.002.patch, 
> YARN-8448.003.patch, YARN-8448.004.patch, YARN-8448.005.patch, 
> YARN-8448.006.patch, YARN-8448.007.patch, YARN-8448.008.patch, 
> YARN-8448.009.patch, YARN-8448.010.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652248#comment-16652248
 ] 

Eric Yang commented on YARN-8489:
-

[~leftnoteasy] {quote}master and ps do not depend on each other at launch 
time{quote}

While the launch statement is correct, it is not true for Tensorflow run 
time.  For the master (jupyter notebook) to send any workload to a parameter server, 
the parameter server must be running.  There is an implicit dependency that can be 
defined (master depends on ps) to improve usability.

{quote}And once a ps fails, we should mark the job as failed as well.{quote}

The parameter server is on the critical path, but it is not completely true that 
if one ps fails we want to abort the service.  The running job needs to be 
terminated, but mapping a Tensorflow task to a YARN container is a problematic 
design.  I am most concerned about this in the submarine implementation of 
Tensorflow.  In particular, the person sitting in front of the jupyter notebook can 
observe that a parameter server has failed, switch to other parameter servers, and 
continue to work.  It would be a bad user experience if the jupyter notebook and all 
work suddenly disappeared when one ps server failed.  It may be nice to have a 
method to clean up the service when the single critical component has failed.  
By using yarn app -destroy, this can happen at the time the user is ready to 
make a change, instead of losing all state right away just to keep the system 
clean.  Neither the dominant component logic nor the plugin approach is the right 
method to address the design problem in the submarine working model, because the AM 
state machine is currently incomplete, and any plugin to override the AM state 
machine seems like pouring gas on flames.

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But put simply, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8868) Set HTTPOnly attribute to Cookie

2018-10-16 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8868:

Attachment: YARN-8810.002.patch

> Set HTTPOnly attribute to Cookie
> 
>
> Key: YARN-8868
> URL: https://issues.apache.org/jira/browse/YARN-8868
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8868.001.patch
>
>
> 1. The program creates a cookie in Dispatcher.java at line 182, 185 and 199, 
> but fails to set the HttpOnly flag to true.
> 2. The program creates a cookie in WebAppProxyServlet.java at line 141 and 
> 388, but fails to set the HttpOnly flag to true.
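
For context, a minimal sketch of the fix being described, using the standard Servlet 3.0+ 
cookie API; the class and method names below are illustrative and are not the actual 
Dispatcher or WebAppProxyServlet code:

{code:java}
// Sketch of creating a cookie with the HttpOnly flag set before adding it to the response,
// so browser-side scripts cannot read it. Requires the Servlet 3.0+ API on the classpath.
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

public final class HttpOnlyCookieExample {
  private HttpOnlyCookieExample() { }

  public static void addCookie(HttpServletResponse resp, String name, String value) {
    Cookie cookie = new Cookie(name, value);
    cookie.setHttpOnly(true);   // the missing flag mentioned in the description
    resp.addCookie(cookie);
  }
}
{code}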



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8868) Set HTTPOnly attribute to Cookie

2018-10-16 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8868:

Attachment: (was: YARN-8810.002.patch)

> Set HTTPOnly attribute to Cookie
> 
>
> Key: YARN-8868
> URL: https://issues.apache.org/jira/browse/YARN-8868
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8868.001.patch
>
>
> 1. The program creates a cookie in Dispatcher.java at line 182, 185 and 199, 
> but fails to set the HttpOnly flag to true.
> 2. The program creates a cookie in WebAppProxyServlet.java at line 141 and 
> 388, but fails to set the HttpOnly flag to true.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8892) YARN UI2 doc improvement to update security status

2018-10-16 Thread Sunil Govindan (JIRA)
Sunil Govindan created YARN-8892:


 Summary: YARN UI2 doc improvement to update security status
 Key: YARN-8892
 URL: https://issues.apache.org/jira/browse/YARN-8892
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Sunil Govindan
Assignee: Sunil Govindan
 Fix For: 3.2.0


UI2 is now tested under kerberized env as well. update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile

2018-10-16 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8810:

Attachment: YARN-8810.002.patch

> Yarn Service: discrepancy between hashcode and equals of ConfigFile
> ---
>
> Key: YARN-8810
> URL: https://issues.apache.org/jira/browse/YARN-8810
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Minor
> Attachments: YARN-8810.001.patch, YARN-8810.002.patch
>
>
> The {{ConfigFile}} class {{equals}} method doesn't check the equality of 
> {{properties}}. The {{hashCode}} does include the {{properties}}
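
To illustrate why this matters, here is a toy reproduction (not the actual ConfigFile class) 
of an equals/hashCode mismatch, where equals() ignores a field that hashCode() includes and 
hash-based collections stop behaving as expected:

{code:java}
// Toy example of the equals/hashCode discrepancy described above.
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class ConfigFileContractExample {
  static final class ToyConfigFile {
    final String destFile;
    final Map<String, String> properties;

    ToyConfigFile(String destFile, Map<String, String> properties) {
      this.destFile = destFile;
      this.properties = properties;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof ToyConfigFile)) {
        return false;
      }
      // Bug being illustrated: properties is not compared here...
      return Objects.equals(destFile, ((ToyConfigFile) o).destFile);
    }

    @Override
    public int hashCode() {
      // ...but it is part of the hash code.
      return Objects.hash(destFile, properties);
    }
  }

  public static void main(String[] args) {
    ToyConfigFile a = new ToyConfigFile("core-site.xml", Map.of("k", "1"));
    ToyConfigFile b = new ToyConfigFile("core-site.xml", Map.of("k", "2"));
    Set<ToyConfigFile> set = new HashSet<>();
    set.add(a);
    System.out.println(a.equals(b));     // true: equals ignores properties
    System.out.println(set.contains(b)); // false: different hash codes, lookup misses
  }
}
{code}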



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652036#comment-16652036
 ] 

Wangda Tan commented on YARN-8879:
--

+1, thanks [~sunilg], please go ahead and get it committed.

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Critical
> Attachments: YARN-8879.001.patch, YARN-8879.002.patch
>
>
> when I submitted a submarine job like this:
> {code:java}
>  ./yarn jar 
> /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  --env DOCKER_JAVA_HOME=/opt/java \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
>  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
>  --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
>  --input_path hdfs://mldev/tmp/cifar-10-data \
>  --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
>  --num_ps 1 \
>  --ps_resources memory=4G,vcores=2,gpu=0 \
>  --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
>  --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
>  --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
>  --num_workers 2 \
>  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
> --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"  {code}
>  
> The following error was thrown:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos 
> principal or keytab is missing.
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at 
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at 
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job

2018-10-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652090#comment-16652090
 ] 

Hudson commented on YARN-8879:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15225 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15225/])
YARN-8879. Kerberos principal is needed when submitting a submarine job. 
(sunilg: rev 753f149fd3f5acf9a98cfc780d7899e307c19002)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/utils/ServiceApiUtil.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/utils/TestServiceApiUtil.java


> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Critical
> Fix For: 3.2.0, 3.3.0
>
> Attachments: YARN-8879.001.patch, YARN-8879.002.patch
>
>
> when I submitted a submarine job like this:
> {code:java}
>  ./yarn jar 
> /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  --env DOCKER_JAVA_HOME=/opt/java \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
>  --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
>  --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
>  --input_path hdfs://mldev/tmp/cifar-10-data \
>  --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
>  --num_ps 1 \
>  --ps_resources memory=4G,vcores=2,gpu=0 \
>  --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
>  --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
>  --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
>  --num_workers 2 \
>  --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
> --data-dir=hdfs://mldev/tmp/cifar-10-data 
> --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
> --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"  {code}
>  
> The following error was thrown:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos 
> principal or keytab is missing.
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at 
> org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at 
> org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile

2018-10-16 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652263#comment-16652263
 ] 

Chandni Singh commented on YARN-8810:
-

Added a unit test and a fix to {{DefaultComponentsFinder}} in patch 2

> Yarn Service: discrepancy between hashcode and equals of ConfigFile
> ---
>
> Key: YARN-8810
> URL: https://issues.apache.org/jira/browse/YARN-8810
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Minor
> Attachments: YARN-8810.001.patch, YARN-8810.002.patch
>
>
> The {{ConfigFile}} class {{equals}} method doesn't check the equality of 
> {{properties}}. The {{hashCode}} does include the {{properties}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8449) RM HA for AM HTTPS Support

2018-10-16 Thread Robert Kanter (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-8449:

Attachment: YARN-8449.001.patch

> RM HA for AM HTTPS Support
> --
>
> Key: YARN-8449
> URL: https://issues.apache.org/jira/browse/YARN-8449
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8449.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8870) [Submarine] Add submarine installation scripts

2018-10-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652469#comment-16652469
 ] 

Hudson commented on YARN-8870:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15231 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15231/])
YARN-8870. [Submarine] Add submarine installation scripts. (Xun Liu via 
(wangda: rev 46d6e0016610ced51a76189daeb3ad0e3dbbf94c)
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/nvidia.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/docker/docker.service
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/nvidia-docker.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/submarine/submarine.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/docker/daemon.json
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/install.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/calico/calico-node.service
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/menu.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/install.conf
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/docker.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/etcd/etcd.service
* (edit) hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/utils.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/hadoop/container-executor.cfg
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/hadoop.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/download-server.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/etcd.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/calico.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/environment.sh
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/calico/calicoctl.cfg
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/submarine.sh


> [Submarine] Add submarine installation scripts
> --
>
> Key: YARN-8870
> URL: https://issues.apache.org/jira/browse/YARN-8870
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xun Liu
>Assignee: Xun Liu
>Priority: Critical
> Attachments: YARN-8870.001.patch, YARN-8870.004.patch, 
> YARN-8870.005.patch, YARN-8870.006.patch, YARN-8870.007.patch
>
>
> In order to reduce the difficulty of deploying the Hadoop {Submarine} runtime 
> environment (DNS, Docker, GPU, network, graphics card, operating system kernel 
> modifications, and other components), I developed this installation script, 
> which provides one-click installation and can also be used to install, 
> uninstall, start, and stop individual components step by step.
>  
> design document: 
> [https://docs.google.com/document/d/1muCTGFuUXUvM4JaDYjKqX5liQEg-AsNgkxfLMIFxYHU/edit?usp=sharing]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details

2018-10-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652468#comment-16652468
 ] 

Hudson commented on YARN-8875:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15231 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15231/])
YARN-8875. [Submarine] Add documentation for submarine installation (wangda: 
rev ed08dd3b0c9cec20373e8ca4e34d6526bd759943)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/resources/images/submarine-installer.gif
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/HowToInstall.md
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/TestAndTroubleshooting.md
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptCN.md


> [Submarine] Add documentation for submarine installation script details
> ---
>
> Key: YARN-8875
> URL: https://issues.apache.org/jira/browse/YARN-8875
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Xun Liu
>Assignee: Xun Liu
>Priority: Critical
> Attachments: YARN-8875.001.patch, YARN-8875.002.patch, 
> YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch
>
>
> YARN-8870: submarine installation guide



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing

2018-10-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652470#comment-16652470
 ] 

Hudson commented on YARN-8798:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15231 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15231/])
YARN-8798. [Submarine] Job should not be submitted if --input_path (wangda: rev 
143d74775b2b62884090fdd88874134b9eab2888)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/main/java/org/apache/hadoop/yarn/submarine/client/cli/param/RunJobParameters.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/test/java/org/apache/hadoop/yarn/submarine/client/cli/TestRunJobCliParsing.java


> [Submarine] Job should not be submitted if "--input_path" option is missing
> ---
>
> Key: YARN-8798
> URL: https://issues.apache.org/jira/browse/YARN-8798
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Critical
> Attachments: YARN-8798-trunk.001.patch, YARN-8798-trunk.002.patch
>
>
> If a user doesn't set the "--input_path" option, the job will still be submitted. 
> Here is my command to run the job:
> {code:java}
> yarn jar 
> $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
>  job run \
>  -verbose \
>  -wait_job_finish \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
>  --name tf-job-001 \
>  --docker_image tangzhankun/tensorflow \
>  --worker_resources memory=4G,vcores=2 \
>  --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py 
> --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 
> --train-steps=5"{code}
> Due to the lack of a validity check, the job is still submitted. We should add a 
> check for this.
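
As a rough illustration (not the actual submarine CLI code; the names are invented), the 
missing check amounts to failing fast when the option is absent:

{code:java}
// Hypothetical sketch of the missing validation: reject the job before submission
// when --input_path was not provided.
public final class InputPathCheckExample {
  static void validateInputPath(String inputPath) {
    if (inputPath == null || inputPath.trim().isEmpty()) {
      throw new IllegalArgumentException("--input_path is required but was not provided");
    }
  }

  public static void main(String[] args) {
    validateInputPath("hdfs://mldev/tmp/cifar-10-data");   // passes
    try {
      validateInputPath(null);                             // missing option
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());   // job would not be submitted
    }
  }
}
{code}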



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8892) YARN UI2 doc changes to update security status (verified under security environment)

2018-10-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652471#comment-16652471
 ] 

Hudson commented on YARN-8892:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15231 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15231/])
YARN-8892. YARN UI2 doc changes to update security status (verified (wangda: 
rev 538250db26ce0b261bb74053348cddfc2d65cf52)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnUI2.md


> YARN UI2 doc changes to update security status (verified under security 
> environment)
> 
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Blocker
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under kerberized env as well. update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8892) YARN UI2 doc improvement to update security status

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652405#comment-16652405
 ] 

Wangda Tan commented on YARN-8892:
--

+1, committing, thanks [~sunilg].

> YARN UI2 doc improvement to update security status
> --
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Major
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under kerberized env as well. update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8892) YARN UI2 doc improvement to update security status

2018-10-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8892:
-
Priority: Blocker  (was: Major)

> YARN UI2 doc improvement to update security status
> --
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Blocker
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under kerberized env as well. update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652420#comment-16652420
 ] 

Eric Yang commented on YARN-8489:
-

[~leftnoteasy] {quote}We will not support a notebook and a distributed TF job 
running in the same service. I haven't heard that an open source community like 
jupyter supports this (connecting to a running distributed TF job and using it as 
an executor). And I didn't see TF claim to support this or plan to support 
it.{quote}

Jupyter notebook is part of the official Docker Tensorflow image, and this is 
[explained|https://www.tensorflow.org/extend/architecture] in the official 
[distributed Tensorflow|https://www.tensorflow.org/deploy/distributed] 
documentation. 

Here is an example of how to run distributed tensorflow with a Jupyter notebook 
as a YARN service:

{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal" : {
"principal_name" : "hbase/_h...@example.com",
"keytab" : "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components" :
  [
{
  "name": "jupyter",
  "number_of_containers": 1,
  "run_privileged_container": true,
  "artifact": {
"id": "tensorflow/tensorflow:1.10.1",
"type": "DOCKER"
  },
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true"
}
  },
  "restart_policy": "NEVER"
},
{
  "name": "ps",
  "number_of_containers": 1,
  "run_privileged_container": true,
  "artifact": {
"id": "tensorflow/tensorflow:1.10.1",
"type": "DOCKER"
  },
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "launch_command": "python ps.py",
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
}
  },
  "restart_policy": "NEVER"
},
{
  "name": "worker",
  "number_of_containers": 1,
  "run_privileged_container": true,
  "artifact": {
"id": "tensorflow/tensorflow:1.10.1",
"type": "DOCKER"
  },
  "launch_command": "python worker.py",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
}
  },
  "restart_policy": "NEVER"
}
  ]
}
{code}

ps.py
{code}
import tensorflow as tf

# cluster (a tf.train.ClusterSpec) and FLAGS (job_name, task_index) are assumed to
# be defined elsewhere in this sketch.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
server.join()
{code}

In jupyter notebook:
User can write code on the fly:
{code}
with tf.Session("grpc://worker7.example.com:") as sess:
  for _ in range(1):
sess.run(train_op)
{code}

Isn't this the easiest way to iterate in a notebook without going through 
ps/worker setup per iteration? The only thing the user needs to write is 
worker.py, which is use-case driven. Am I missing something?

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But put simply, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652420#comment-16652420
 ] 

Eric Yang edited comment on YARN-8489 at 10/16/18 8:49 PM:
---

[~leftnoteasy] {quote}We will not support a notebook and a distributed TF job 
running in the same service. I haven't heard that an open source community like 
jupyter supports this (connecting to a running distributed TF job and using it as 
an executor). And I didn't see TF claim to support this or plan to support 
it.{quote}

Jupyter notebook is part of the official Docker Tensorflow image, and the 
architecture is [explained|https://www.tensorflow.org/extend/architecture] in the 
official [distributed Tensorflow|https://www.tensorflow.org/deploy/distributed] 
documentation. 

Here is an example of how to run distributed tensorflow with a Jupyter notebook 
as a YARN service:

{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal" : {
"principal_name" : "hbase/_h...@example.com",
"keytab" : "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components" :
  [
{
  "name": "jupyter",
  "number_of_containers": 1,
  "run_privileged_container": true,
  "artifact": {
"id": "tensorflow/tensorflow:1.10.1",
"type": "DOCKER"
  },
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true"
}
  },
  "restart_policy": "NEVER"
},
{
  "name": "ps",
  "number_of_containers": 1,
  "run_privileged_container": true,
  "artifact": {
"id": "tensorflow/tensorflow:1.10.1",
"type": "DOCKER"
  },
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "launch_command": "python ps.py",
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
}
  },
  "restart_policy": "NEVER"
},
{
  "name": "worker",
  "number_of_containers": 1,
  "run_privileged_container": true,
  "artifact": {
"id": "tensorflow/tensorflow:1.10.1",
"type": "DOCKER"
  },
  "launch_command": "python worker.py",
  "resource": {
"cpus": 1,
"memory": "256"
  },
  "configuration": {
"env": {
  "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
}
  },
  "restart_policy": "NEVER"
}
  ]
}
{code}

ps.py
{code}
import tensorflow as tf

# cluster (a tf.train.ClusterSpec) and FLAGS (job_name, task_index) are assumed to
# be defined elsewhere in this sketch.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
server.join()
{code}

In jupyter notebook:
User can write code on the fly:
{code}
with tf.Session("grpc://worker7.example.com:") as sess:
  for _ in range(1):
sess.run(train_op)
{code}

Isn't this the easiest way to iterate in a notebook without going through 
ps/worker setup per iteration? The only thing the user needs to write is 
worker.py, which is use-case driven. Am I missing something?


[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8893:
---
Component/s: federation
 amrmproxy

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8893.v1.patch
>
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 
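
As a generic illustration of the kind of cleanup implied (this is not the actual 
FederationInterceptor or AMRMClientRelayer code), whichever component owns a thread pool 
must shut it down when the pipeline is torn down, otherwise the threads leak:

{code:java}
// Generic sketch: shut down an owned executor on close() so its threads do not leak.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RelayerShutdownExample implements AutoCloseable {
  private final ExecutorService heartbeatPool = Executors.newSingleThreadExecutor();

  public void start() {
    heartbeatPool.submit(() -> { /* periodic heartbeat work would go here */ });
  }

  @Override
  public void close() throws InterruptedException {
    heartbeatPool.shutdownNow();                        // without this, the thread leaks
    heartbeatPool.awaitTermination(5, TimeUnit.SECONDS);
  }

  public static void main(String[] args) throws Exception {
    try (RelayerShutdownExample relayer = new RelayerShutdownExample()) {
      relayer.start();
    } // close() runs here, so no non-daemon thread keeps the JVM alive
  }
}
{code}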



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8842) Update QueueMetrics with custom resource values

2018-10-16 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652443#comment-16652443
 ] 

Haibo Chen commented on YARN-8842:
--

+1 on the latest patch. I'll fix the one minor indentation checkstyle issue at 
the time of my commit.

> Update QueueMetrics with custom resource values 
> 
>
> Key: YARN-8842
> URL: https://issues.apache.org/jira/browse/YARN-8842
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8842.001.patch, YARN-8842.002.patch, 
> YARN-8842.003.patch, YARN-8842.004.patch, YARN-8842.005.patch, 
> YARN-8842.006.patch, YARN-8842.007.patch, YARN-8842.008.patch, 
> YARN-8842.009.patch, YARN-8842.010.patch, YARN-8842.011.patch, 
> YARN-8842.012.patch
>
>
> This is the 2nd dependent jira of YARN-8059.
> As updating the metrics is an independent step from handling preemption, this 
> jira only deals with the queue metrics update of custom resources.
> The following metrics should be updated: 
> * allocated resources
> * available resources
> * pending resources
> * reserved resources
> * aggregate seconds preempted



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8892) YARN UI2 doc changes to update security status (verified under security environment)

2018-10-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8892:
-
Target Version/s: 3.2.0, 3.1.2  (was: 3.2.0)

> YARN UI2 doc changes to update security status (verified under security 
> environment)
> 
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Blocker
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under kerberized env as well. update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8868) Set HTTPOnly attribute to Cookie

2018-10-16 Thread Chandni Singh (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated YARN-8868:

Attachment: old_rm_ui.png
new_rm_ui.png

> Set HTTPOnly attribute to Cookie
> 
>
> Key: YARN-8868
> URL: https://issues.apache.org/jira/browse/YARN-8868
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8868.001.patch, new_rm_ui.png, old_rm_ui.png
>
>
> 1. The program creates a cookie in Dispatcher.java at line 182, 185 and 199, 
> but fails to set the HttpOnly flag to true.
> 2. The program creates a cookie in WebAppProxyServlet.java at line 141 and 
> 388, but fails to set the HttpOnly flag to true.
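
For reference, a minimal sketch of the missing flag (the helper name and the Secure flag are assumptions, not part of the reported code):

{code:java}
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

// Sketch only: a cookie marked HttpOnly is withheld from client-side scripts,
// which is the flag the report says is missing.
public final class HttpOnlyCookieSketch {
  public static void addCookie(HttpServletResponse resp, String name, String value) {
    Cookie cookie = new Cookie(name, value);
    cookie.setHttpOnly(true);   // the missing HttpOnly flag
    cookie.setSecure(true);     // send only over HTTPS (extra hardening, an assumption)
    resp.addCookie(cookie);
  }
}
{code}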



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8868) Set HTTPOnly attribute to Cookie

2018-10-16 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652651#comment-16652651
 ] 

Chandni Singh commented on YARN-8868:
-

[~sunilg] I checked the patch in a kerberized cluster. Both the new RM UI and the old 
RM UI are visible.
Screenshots are attached.

> Set HTTPOnly attribute to Cookie
> 
>
> Key: YARN-8868
> URL: https://issues.apache.org/jira/browse/YARN-8868
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Attachments: YARN-8868.001.patch
>
>
> 1. The program creates a cookie in Dispatcher.java at line 182, 185 and 199, 
> but fails to set the HttpOnly flag to true.
> 2. The program creates a cookie in WebAppProxyServlet.java at line 141 and 
> 388, but fails to set the HttpOnly flag to true.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8448) AM HTTPS Support for AM communication with RMWeb proxy

2018-10-16 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-8448:
-
Summary: AM HTTPS Support for AM communication with RMWeb proxy  (was: AM 
HTTPS Support)

> AM HTTPS Support for AM communication with RMWeb proxy
> --
>
> Key: YARN-8448
> URL: https://issues.apache.org/jira/browse/YARN-8448
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8448.001.patch, YARN-8448.002.patch, 
> YARN-8448.003.patch, YARN-8448.004.patch, YARN-8448.005.patch, 
> YARN-8448.006.patch, YARN-8448.007.patch, YARN-8448.008.patch, 
> YARN-8448.009.patch, YARN-8448.010.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8448) AM HTTPS Support for AM communication with RMWeb proxy

2018-10-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652431#comment-16652431
 ] 

Hudson commented on YARN-8448:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #15230 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15230/])
YARN-8448. AM HTTPS Support for AM communication with RMWeb proxy. (haibochen: 
rev c2288ac45b748b4119442c46147ccc324926c340)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestProxyCA.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/LinuxContainerRuntimeConstants.java
* (edit) 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/Credentials.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/TestDockerContainerRuntime.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/ProxyCA.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerRelaunch.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContext.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/util.h
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestProxyCAManager.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/ProxyCAManager.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DefaultLinuxContainerRuntime.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/executor/ContainerStartContext.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/AMSecretKeys.java
* (edit) 
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/security/ssl/KeyStoreTestUtil.java
* (edit) 

[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652547#comment-16652547
 ] 

Eric Yang commented on YARN-8489:
-

[~leftnoteasy] The notebook can communicate with the ps or the workers via grpc in the 
same way.  The example was about grpc access to a worker rather than assuming that the 
notebook is the PS.  The PS helps build the tasks that the workers are going to execute 
more efficiently: the data scientist specifies the cluster spec in the notebook, and the 
parameter server partitions the models and tasks to increase the workers' 
effectiveness.

We have digressed from the original goal of this JIRA.  My point is that dependency 
expression and refining the YARN service state machine can achieve what you are 
proposing with an additional switch.  An additional switch may have unforeseen 
consequences for existing operations.  For example, what happens if the dominant 
component is offline during an upgrade?  Should the service terminate and clean up?  
What about flexing the dominant component down to fewer nodes?  In what order are the 
dominant component and the component dependencies evaluated?  How is the restart 
policy handled in the presence of a dominant component?  It would be helpful to draw a 
state diagram explaining the proposal to see whether this idea is worth pursuing. 

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But simply put, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8778) Add Command Line interface to invoke interactive docker shell

2018-10-16 Thread Eric Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8778:

Attachment: YARN-8778.003.patch

> Add Command Line interface to invoke interactive docker shell
> -
>
> Key: YARN-8778
> URL: https://issues.apache.org/jira/browse/YARN-8778
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zian Chen
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8778.001.patch, YARN-8778.002.patch, 
> YARN-8778.003.patch
>
>
> The CLI will be the mandatory interface we are providing for a user to use the 
> interactive docker shell feature. We will need to create a new class 
> “InteractiveDockerShellCLI” to read the command line, send it to the servlet, and 
> pass it all the way down to the docker executor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8893:
---
Issue Type: Sub-task  (was: Task)
Parent: YARN-5597

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652645#comment-16652645
 ] 

Wangda Tan commented on YARN-8513:
--

Sounds like a plan; setting the default value to 100 may make more sense. Thanks 
[~cheersyang]

> CapacityScheduler infinite loop when queue is near fully utilized
> -
>
> Key: YARN-8513
> URL: https://issues.apache.org/jira/browse/YARN-8513
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, yarn
>Affects Versions: 3.1.0, 2.9.1
> Environment: Ubuntu 14.04.5 and 16.04.4
> YARN is configured with one label and 5 queues.
>Reporter: Chen Yufei
>Priority: Major
> Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, 
> jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, 
> yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, 
> yarn3-resourcemanager.log, yarn3-top
>
>
> ResourceManager does not respond to any request when queue is near fully 
> utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM 
> restart, it can recover running jobs and start accepting new ones.
>  
> Seems like CapacityScheduler is in an infinite loop printing out the 
> following log messages (more than 25,000 lines in a second):
>  
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
> assignedContainer queue=root usedCapacity=0.99816763 
> absoluteUsedCapacity=0.99816763 used= 
> cluster=}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>  Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
>  assignedContainer application attempt=appattempt_1530619767030_1652_01 
> container=null 
> queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
>  clusterResource= type=NODE_LOCAL 
> requestedPartition=}}
>  
> I encounter this problem several times after upgrading to YARN 2.9.1, while 
> the same configuration works fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a 
> similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile

2018-10-16 Thread Chandni Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652662#comment-16652662
 ] 

Chandni Singh commented on YARN-8810:
-

[~eyang] Thanks for reviewing and merging the patch.

> Yarn Service: discrepancy between hashcode and equals of ConfigFile
> ---
>
> Key: YARN-8810
> URL: https://issues.apache.org/jira/browse/YARN-8810
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Minor
> Fix For: 3.2.0, 3.1.2, 3.3.0
>
> Attachments: YARN-8810.001.patch, YARN-8810.002.patch
>
>
> The {{ConfigFile}} class {{equals}} method doesn't check the equality of 
> {{properties}}. The {{hashCode}} does include the {{properties}}
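
To illustrate why the discrepancy matters (field names below are illustrative, not the real {{ConfigFile}} members): two objects that differ only in {{properties}} compare equal yet produce different hash codes, which breaks hash-based collections. Keeping both methods over the same fields avoids that:

{code:java}
import java.util.Map;
import java.util.Objects;

// Sketch only: equals and hashCode must consider the same set of fields.
public final class ConfigFileSketch {
  private final String srcFile;
  private final Map<String, String> properties;

  public ConfigFileSketch(String srcFile, Map<String, String> properties) {
    this.srcFile = srcFile;
    this.properties = properties;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof ConfigFileSketch)) {
      return false;
    }
    ConfigFileSketch other = (ConfigFileSketch) o;
    return Objects.equals(srcFile, other.srcFile)
        && Objects.equals(properties, other.properties);  // include properties here too
  }

  @Override
  public int hashCode() {
    return Objects.hash(srcFile, properties);
  }
}
{code}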



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8582) Documentation for AM HTTPS Support

2018-10-16 Thread Robert Kanter (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated YARN-8582:

Attachment: YARN-8582.004.patch

> Documentation for AM HTTPS Support
> --
>
> Key: YARN-8582
> URL: https://issues.apache.org/jira/browse/YARN-8582
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: docs
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8582.001.patch, YARN-8582.002.patch, 
> YARN-8582.003.patch, YARN-8582.004.patch
>
>
> Documentation for YARN-6586.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8582) Documentation for AM HTTPS Support

2018-10-16 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652681#comment-16652681
 ] 

Robert Kanter commented on YARN-8582:
-

The 004 patch:
- Updated the doc patch with the config property description as in the final YARN-8448 
patch
- Fixed a missed OFF to NONE

> Documentation for AM HTTPS Support
> --
>
> Key: YARN-8582
> URL: https://issues.apache.org/jira/browse/YARN-8582
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: docs
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8582.001.patch, YARN-8582.002.patch, 
> YARN-8582.003.patch, YARN-8582.004.patch
>
>
> Documentation for YARN-6586.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile

2018-10-16 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652386#comment-16652386
 ] 

Hadoop QA commented on YARN-8810:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
28s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 25s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
19s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
17s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 17s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 13m 
10s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 68m 35s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8810 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944194/YARN-8810.002.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 5354df747e22 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 
08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / d59ca43 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/22206/testReport/ |
| Max. process+thread count | 753 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/22206/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> Yarn Service: 

[jira] [Commented] (YARN-8448) AM HTTPS Support

2018-10-16 Thread Haibo Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652392#comment-16652392
 ] 

Haibo Chen commented on YARN-8448:
--

I ran the cetest locally and it did not fail for me either. +1 on the latest 
patch. Will check it in shortly.

> AM HTTPS Support
> 
>
> Key: YARN-8448
> URL: https://issues.apache.org/jira/browse/YARN-8448
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8448.001.patch, YARN-8448.002.patch, 
> YARN-8448.003.patch, YARN-8448.004.patch, YARN-8448.005.patch, 
> YARN-8448.006.patch, YARN-8448.007.patch, YARN-8448.008.patch, 
> YARN-8448.009.patch, YARN-8448.010.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8842) Expose metrics for custom resource types in QueueMetrics

2018-10-16 Thread Haibo Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haibo Chen updated YARN-8842:
-
Summary: Expose metrics for custom resource types in QueueMetrics  (was: 
Update QueueMetrics with custom resource values )

> Expose metrics for custom resource types in QueueMetrics
> 
>
> Key: YARN-8842
> URL: https://issues.apache.org/jira/browse/YARN-8842
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8842.001.patch, YARN-8842.002.patch, 
> YARN-8842.003.patch, YARN-8842.004.patch, YARN-8842.005.patch, 
> YARN-8842.006.patch, YARN-8842.007.patch, YARN-8842.008.patch, 
> YARN-8842.009.patch, YARN-8842.010.patch, YARN-8842.011.patch, 
> YARN-8842.012.patch
>
>
> This is the 2nd dependent jira of YARN-8059.
> As updating the metrics is an independent step from handling preemption, this 
> jira only deals with the queue metrics update of custom resources.
> The following metrics should be updated: 
> * allocated resources
> * available resources
> * pending resources
> * reserved resources
> * aggregate seconds preempted



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-16 Thread Botong Huang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Botong Huang updated YARN-8893:
---
Attachment: YARN-8893.v1.patch

> [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
> ---
>
> Key: YARN-8893
> URL: https://issues.apache.org/jira/browse/YARN-8893
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: amrmproxy, federation
>Reporter: Botong Huang
>Assignee: Botong Huang
>Priority: Major
> Attachments: YARN-8893.v1.patch
>
>
> Fix thread leak in AMRMClientRelayer and UAM client used by 
> FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8449) RM HA for AM HTTPS Support

2018-10-16 Thread Robert Kanter (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652461#comment-16652461
 ] 

Robert Kanter commented on YARN-8449:
-

The 001 patch adds the code for storing and retrieving the CA Certificate and 
Private Key in and from the {{RMStateStore}}.  The design doc had mentioned 
also storing the public key, but that's not necessary because the public key 
can be obtained from the Certificate.  The code mirrors the way existing things 
interact with the {{RMStateStore}}.  Also added/updated tests.
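
A small illustration of that point, under the assumption that the stored certificate is available in DER-encoded form (the class and method names here are illustrative):

{code:java}
import java.io.ByteArrayInputStream;
import java.security.PublicKey;
import java.security.cert.CertificateException;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;

// Sketch only: the public key can be recovered from the stored certificate,
// so it does not need to be persisted separately.
public final class PublicKeyFromCertSketch {
  public static PublicKey publicKeyOf(byte[] derEncodedCert) throws CertificateException {
    CertificateFactory cf = CertificateFactory.getInstance("X.509");
    X509Certificate cert = (X509Certificate)
        cf.generateCertificate(new ByteArrayInputStream(derEncodedCert));
    return cert.getPublicKey();
  }
}
{code}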

> RM HA for AM HTTPS Support
> --
>
> Key: YARN-8449
> URL: https://issues.apache.org/jira/browse/YARN-8449
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8449.001.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile

2018-10-16 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652631#comment-16652631
 ] 

Hudson commented on YARN-8810:
--

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15234 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/15234/])
YARN-8810.  Fixed a YARN service bug in comparing ConfigFile object. 
(eyang: rev 3bfd214a59a60263aff67850c4d646c64fd76a01)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/api/records/ConfigFile.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/UpgradeComponentsFinder.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/TestDefaultUpgradeComponentsFinder.java


> Yarn Service: discrepancy between hashcode and equals of ConfigFile
> ---
>
> Key: YARN-8810
> URL: https://issues.apache.org/jira/browse/YARN-8810
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Minor
> Fix For: 3.2.0, 3.1.2, 3.3.0
>
> Attachments: YARN-8810.001.patch, YARN-8810.002.patch
>
>
> The {{ConfigFile}} class {{equals}} method doesn't check the equality of 
> {{properties}}. The {{hashCode}} does include the {{properties}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652684#comment-16652684
 ] 

Eric Yang commented on YARN-8489:
-

{quote}If it is never, dominant field will be ignored. Otherwise dominant field 
is allowed.{quote}

If we go by what you proposed, user expectations of the dominant field and the restart 
policy will not line up.  The earlier comment proposed cleaning up the other 
components when the dominant component finishes.  The dominant component could 
be a batch job that should not be repeated.  Ignoring the field does not sound like 
the right solution here.

Changing the dependent component state to FAILED to signal the other components to 
terminate seems like a more intuitive approach to the state transition 
problem.  It ensures that restart-policy or upgrade-triggered state changes 
require no additional logic to safeguard the dominant component.

{quote}
Component.state: 
- Transition to SUCCEEDED && component.dominant == true: Set service state to 
SUCCEEDED. 
- Transition to FAILED && component.dominant == true. Set service state to 
FAILED. 
{quote}

This looks like you want the service to report a success or failure state based on the 
"important" component's status, instead of every component having to report SUCCEEDED 
before the service state becomes SUCCEEDED.  A safer approach to enable this logic is to 
have a boolean flag at the component level, e.g. "report_as_service_state": true.  This 
requires no alteration to the state transition logic, only a check at the end, as in the 
sketch below.
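
A rough sketch of that end-of-transition check (the flag, enums, and method names are assumptions, not the existing YARN service code):

{code:java}
// Sketch only: after a component reaches a terminal state, a flagged component
// decides the service state; the normal state transitions stay untouched.
public final class ServiceStateCheckSketch {

  enum ComponentState { RUNNING, SUCCEEDED, FAILED }
  enum ServiceState { RUNNING, SUCCEEDED, FAILED }

  static ServiceState serviceStateAfter(ComponentState componentState,
                                        boolean reportAsServiceState,
                                        ServiceState current) {
    if (!reportAsServiceState) {
      return current;                     // flag not set: service state unchanged
    }
    switch (componentState) {
      case SUCCEEDED:
        return ServiceState.SUCCEEDED;
      case FAILED:
        return ServiceState.FAILED;
      default:
        return current;                   // component not in a terminal state yet
    }
  }
}
{code}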

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But simply put, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client

2018-10-16 Thread Botong Huang (JIRA)
Botong Huang created YARN-8893:
--

 Summary: [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM 
client
 Key: YARN-8893
 URL: https://issues.apache.org/jira/browse/YARN-8893
 Project: Hadoop YARN
  Issue Type: Task
Reporter: Botong Huang
Assignee: Botong Huang


Fix thread leak in AMRMClientRelayer and UAM client used by 
FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8892) YARN UI2 doc changes to update security status (verified under security environment)

2018-10-16 Thread Wangda Tan (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wangda Tan updated YARN-8892:
-
Summary: YARN UI2 doc changes to update security status (verified under 
security environment)  (was: YARN UI2 doc improvement to update security status)

> YARN UI2 doc changes to update security status (verified under security 
> environment)
> 
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Sunil Govindan
>Assignee: Sunil Govindan
>Priority: Blocker
> Attachments: YARN-8892.001.patch
>
>
> UI2 is now tested under a kerberized env as well. Update this in the doc



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


