Thanks a lot for your reply, Sunil.

I was trying to follow the steps from: 
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/RunningDistributedCifar10TFJobs.md

to run standalone TensorFlow using Submarine. I have installed Hadoop
3.3.0-SNAPSHOT.
However, when I run the

yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
   job run --name tf-job-001 --verbose \
   --docker_image hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
   --input_path hdfs://default/dataset/cifar-10-data \
   --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
   --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
   --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
   --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
   --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3

command, I get the following error:

2018-11-07 21:48:55,831 INFO  [main] client.AHSProxy (AHSProxy.java:createAHSProxy(42)) - Connecting to Application History server at /128.105.144.236:10200
Exception in thread "main" java.lang.IllegalArgumentException: Unacceptable no of cpus specified, either zero or negative for component master (or at the global level)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateServiceResource(ServiceApiUtil.java:457)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateComponent(ServiceApiUtil.java:306)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:237)
        at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:496)
        at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
        at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
        at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
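
For reference, the message reads to me like a per-component check that the requested CPU count is positive. Below is a minimal Python sketch of that kind of check. It is purely illustrative: parse_resources and validate_component are hypothetical names, and this is an assumption based on the error text, not the actual ServiceApiUtil code.

```python
def parse_resources(spec):
    """Parse a spec such as 'memory=8G,vcores=2,gpu=1' into a dict of strings."""
    resources = {}
    for item in spec.split(","):
        key, _, value = item.partition("=")
        resources[key.strip()] = value.strip()
    return resources


def validate_component(name, spec):
    """Raise ValueError if the component's vcores are missing, zero, or negative.

    Hypothetical stand-in for the kind of validation the stack trace suggests.
    """
    # An empty or missing spec is treated as "no resources configured".
    resources = parse_resources(spec) if spec else {}
    vcores = int(resources.get("vcores", "0"))
    if vcores <= 0:
        raise ValueError(
            "Unacceptable no of cpus specified, either zero or negative "
            "for component %s" % name
        )
    return resources
```

Under this reading, the worker spec from my command (vcores=2) would pass, while a master component with no resources set would fail in the way shown above.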

It seems that I have not configured resources for the master component
somewhere. However, I have a hard time understanding where and what to
configure. I also looked at the design document you pointed to:
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7

and it has a --master_resources flag. However, this flag is not available in
3.3.0.
Could you please advise how to proceed with this?

Thank you,
- Robert

    On Tuesday, November 6, 2018, 10:40:20 PM PST, Jonathan Hung 
<jyhung2...@gmail.com> wrote:  
 
 Hi Robert, I also encourage you to check out https://github.com/linkedin/TonY 
(TensorFlow on YARN) which is a platform built for this purpose.

Jonathan
________________________________
From: Sunil G <sun...@apache.org>
Sent: Tuesday, November 6, 2018 10:05:14 PM
To: Robert Grandl
Cc: yarn-dev@hadoop.apache.org; yarn-dev-h...@hadoop.apache.org; General
Subject: Re: Run Distributed TensorFlow on YARN

Hi Robert

The Submarine project helps to run distributed TensorFlow on top of YARN with
ease. YARN-8220 <https://issues.apache.org/jira/browse/YARN-8220> was an
early attempt to do the same with some scripts, but Submarine avoids all
such custom scripts and lets you simply run TensorFlow like a distributed
shell command line by using the Submarine jar. Please refer to the doc below
for a deep dive.
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7

Submarine will be released as part of the Hadoop 3.2.0 release, which will be
out officially very soon (in the coming weeks). You are free to use Hadoop
trunk to run it if you need it sooner.

For now you can refer to the Submarine docs in the Hadoop repo (trunk)
under
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/
or
https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown

Thanks
Sunil


On Wed, Nov 7, 2018 at 10:34 AM Robert Grandl <rgra...@yahoo.com.invalid>
wrote:

>  Hi all,
> I am wondering if there is any stable support to run distributed
> TensorFlow atop YARN at the moment.
> I found this blog post from Hortonworks. It seems this is possible
> starting with YARN 3.1.0.
> https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/
>
>
> Also I found some more recent JIRAs:
> https://issues.apache.org/jira/browse/YARN-8220
> https://issues.apache.org/jira/browse/YARN-8135
> which suggest using something called Submarine.
>
> However, I could not find any proper documentation or instructions to use
> any of these.
>
> Can someone help me with this?
> Otherwise, is there any better support to run any other machine learning
> framework with YARN?
> Thank you in advance,
> - Robert
>  
