[ https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687457#comment-16687457 ]
Zac Zhou edited comment on YARN-8960 at 11/15/18 4:18 AM:
----------------------------------------------------------

Add a parameter named distribute_keytab, which specifies whether to distribute the local keytab across the cluster.

A submarine job can be submitted like this:
{code:java}
./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
 --env DOCKER_JAVA_HOME=/opt/java \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
 --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
 --worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
 --input_path hdfs://mldev/tmp/cifar-10-data \
 --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
 --num_ps 1 \
 --ps_resources memory=4G,vcores=2,gpu=0 \
 --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
 --ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
 --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
 --num_workers 2 \
 --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
 --keytab /tmp/keytabs/hadoop.keytab \
 --principal hadoop/ad...@corp.com \
 --distribute_keytab
{code}

was (Author: yuan_zac):
Add a parameter, named distribute_keytab, which can be used to specify whether to distribute local keytab across the cluster.
A submarine job can be submitted like this:
./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
 --env DOCKER_JAVA_HOME=/opt/java \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
 --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
 --worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
 --input_path hdfs://mldev/tmp/cifar-10-data \
 --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
 --num_ps 1 \
 --ps_resources memory=4G,vcores=2,gpu=0 \
 --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
 --ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
 --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
 --num_workers 2 \
 --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
 --keytab /tmp/keytabs/hadoop.keytab \
 --principal hadoop/ad...@corp.com \
 --distribute_keytab

> [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8960
>                 URL: https://issues.apache.org/jira/browse/YARN-8960
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zac Zhou
>            Assignee: Zac Zhou
>            Priority: Major
>         Attachments: YARN-8960.001.patch, YARN-8960.002.patch,
>                      YARN-8960.003.patch, YARN-8960.004.patch,
>                      YARN-8960.005.patch, YARN-8960.006.patch
>
>
> After submitting a submarine job, we tried to get the service status using the following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>
> The stack trace in the resourcemanager log is:
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>         at org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>         at org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal specified in the persisted service definition, fail to connect to AM.
>         at org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>         at org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>         at org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>         ... 68 more
> {code}
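
For readers following the stack trace, the failure mode can be modelled in a few lines. This is only a sketch and not the Hadoop source: the class PrincipalGuard, the nested ServiceSpec type, and the method resolvePrincipal are hypothetical names, and the principal user/host@EXAMPLE.COM is a made-up value. It assumes that, in a secure cluster, the status call needs a Kerberos principal recorded in the persisted service definition and fails with the message seen above when none was stored.

```java
// Hypothetical model of the guard behind the error above; NOT the Hadoop source.
public class PrincipalGuard {

    /** Minimal stand-in for the persisted service definition (hypothetical type). */
    static final class ServiceSpec {
        final String name;
        final String principal; // null or blank when no --principal was recorded

        ServiceSpec(String name, String principal) {
            this.name = name;
            this.principal = principal;
        }
    }

    /**
     * Returns the principal to use when contacting the application master.
     * In an insecure cluster no principal is needed, so null is returned.
     * In a secure cluster a missing principal is reported as an error.
     */
    static String resolvePrincipal(boolean securityEnabled, ServiceSpec spec) {
        if (!securityEnabled) {
            return null;
        }
        if (spec.principal == null || spec.principal.trim().isEmpty()) {
            throw new IllegalStateException(
                "No principal specified in the persisted service definition, "
                    + "fail to connect to AM.");
        }
        return spec.principal;
    }

    public static void main(String[] args) {
        // Service submitted with a principal: the status lookup can proceed.
        ServiceSpec withPrincipal =
            new ServiceSpec("distributed-tf-gpu", "user/host@EXAMPLE.COM");
        System.out.println(resolvePrincipal(true, withPrincipal));

        // Service submitted without a principal: the lookup fails in secure mode.
        ServiceSpec withoutPrincipal = new ServiceSpec("distributed-tf-gpu", null);
        try {
            resolvePrincipal(true, withoutPrincipal);
        } catch (IllegalStateException e) {
            System.out.println("status failed: " + e.getMessage());
        }
    }
}
```

Under that assumption, submitting the job with --keytab and --principal, as in the command in the comment, is what lets a later "yarn app -status" find a principal in the stored definition.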