[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687457#comment-16687457
 ] 

Zac Zhou edited comment on YARN-8960 at 11/15/18 4:18 AM:
----------------------------------------------------------

Add a parameter named distribute_keytab that specifies whether to distribute 
the local keytab across the cluster. 

A submarine job can be submitted like this:
{code:java}
./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
--input_path hdfs://mldev/tmp/cifar-10-data \
--checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
--num_ps 1 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
--ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 2 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
--keytab /tmp/keytabs/hadoop.keytab \
--principal hadoop/ad...@corp.com \
--distribute_keytab
{code}
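
Because the command above passes a local keytab path via --keytab, it may be worth checking that file before submitting the job. The following is a minimal sketch only: check_keytab is a hypothetical helper and is not part of Submarine or YARN, and it assumes that a usable keytab is a readable, non-empty regular file. It does not validate the Kerberos contents of the file.

```shell
#!/bin/sh
# Hypothetical pre-flight helper; NOT part of Submarine or YARN.
# Assumption: a usable keytab is a readable, non-empty regular file.
# Returns 0 when the path passes all checks, a distinct non-zero code otherwise.
check_keytab() {
    keytab_path="$1"
    if [ ! -f "$keytab_path" ]; then
        echo "keytab not found: $keytab_path" >&2
        return 1
    fi
    if [ ! -r "$keytab_path" ]; then
        echo "keytab not readable: $keytab_path" >&2
        return 2
    fi
    if [ ! -s "$keytab_path" ]; then
        echo "keytab is empty: $keytab_path" >&2
        return 3
    fi
    return 0
}
```

One way to use it is to gate the submission, for example check_keytab /tmp/keytabs/hadoop.keytab && ./yarn jar ... job run ..., so that a missing file is reported locally rather than after the job has been submitted. If the MIT Kerberos client tools happen to be installed, klist -kt on the same path lists the principals stored in the keytab.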

was (Author: yuan_zac):
Add a parameter named distribute_keytab that specifies whether to distribute 
the local keytab across the cluster. 

A submarine job can be submitted like this:

./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
 --env DOCKER_JAVA_HOME=/opt/java \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
 --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
 --worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
 --input_path hdfs://mldev/tmp/cifar-10-data \
 --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
 --num_ps 1 \
 --ps_resources memory=4G,vcores=2,gpu=0 \
 --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
 --ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
 --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
 --num_workers 2 \
 --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
 --keytab /tmp/keytabs/hadoop.keytab \
 --principal hadoop/ad...@corp.com \
 --distribute_keytab

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8960
>                 URL: https://issues.apache.org/jira/browse/YARN-8960
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zac Zhou
>            Assignee: Zac Zhou
>            Priority: Major
>         Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch, 
> YARN-8960.006.patch
>
>
> After submitting a submarine job, we tried to get the service status using 
> the following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack trace in the ResourceManager log is:
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal specified in the persisted service definition, fail to connect to AM.
>  at org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}


