[ 
https://issues.apache.org/jira/browse/SUBMARINE-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824087#comment-16824087
 ] 

Szilard Nemeth commented on SUBMARINE-54:
-----------------------------------------

Hi [~tangzhankun], [~sunilg]!

I played around with the single node training job on my cluster: 

1. The command to start the Submarine job was: 

{code:java}
/opt/hadoop/bin/yarn jar /home/systest/hadoop-yarn-submarine-3.3.0-SNAPSHOT.jar job run \
--name tf-job-001 --verbose --docker_image hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
--input_path hdfs://default/dataset/cifar-10-data \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--num_workers 1 --worker_resources memory=5G,vcores=2 \
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
--tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3
{code}

2. I have the following error logs from the serviceam.log file: 

{code:java}
2019-04-23 05:14:42,679 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE master-0 : 
container_e17_1556021556746_0001_01_000002] failed without retry, 
exitStatus=ContainerStatus: [ContainerId: 
container_e17_1556021556746_0001_01_000002, ExecutionType: GUARANTEED, State: 
COMPLETE, Capability: <memory:5120, vCores:2>, Diagnostics: [2019-04-23 
05:14:42.182]Exception from container-launch.
Container id: container_e17_1556021556746_0001_01_000002
Exit code: 1

[2019-04-23 05:14:42.229]Container exited with a non-zero exit code 1. Error 
file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr.txt :
./run-PRIMARY_WORKER.sh: line 9: /hadoop-3.1.0/bin/hadoop: No such file or 
directory
./run-PRIMARY_WORKER.sh: line 16: cd: 
/test/models/tutorials/image/cifar10_estimator: No such file or directory


[2019-04-23 05:14:42.229]Container exited with a non-zero exit code 1. Error 
file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr.txt :
./run-PRIMARY_WORKER.sh: line 9: /hadoop-3.1.0/bin/hadoop: No such file or 
directory
./run-PRIMARY_WORKER.sh: line 16: cd: 
/test/models/tutorials/image/cifar10_estimator: No such file or directory


, ExitStatus: 1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]

{code}

Could this be caused by missing Docker configuration? There's no clear 
indication of that.
Also, I don't understand what this value is for:

{code:java}
DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 
{code}
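
To narrow down the two "No such file or directory" errors above, here is a minimal sketch of a check that could be run inside a container started from the job's Docker image. The helper name missing_paths is hypothetical, and how the script gets into the container (for example through docker run with a mounted file) is an assumption about the local setup, not something Submarine provides.

```shell
# Hypothetical helper: print each given path that does not exist on the
# filesystem where this script runs (intended to be run inside the container).
missing_paths() {
    for p in "$@"; do
        [ -e "$p" ] || printf '%s\n' "$p"
    done
}

# Example call with the two paths reported in stderr.txt above
# (left commented out so that sourcing this file has no side effects):
# missing_paths /hadoop-3.1.0/bin/hadoop /test/models/tutorials/image/cifar10_estimator
```

If both paths are printed, the image probably does not ship Hadoop under /hadoop-3.1.0 or the TensorFlow models checkout under /test/models, which would point at the image contents rather than at the YARN Docker configuration.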



By the way, it's not easy to configure YARN to run Submarine on a cluster 
every time we make a change in the code.
As a more long-term solution, could we develop an integration test 
environment that covers basic cases, such as verifying that single node and 
distributed TF training jobs start up without issues? 
I would happily work on this, but not right now of course.
Also, testing on a cluster would take more time, as I need to configure Docker 
on the hosts. 
I guess we can't proceed with SUBMARINE-52 until I have a working end-to-end 
Submarine job.
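
As a rough illustration of the kind of startup check such an integration test could make, the sketch below scans a service AM log for the "failed without retry" message seen in the serviceam.log excerpt above. The function name component_failed is hypothetical, and treating that one message as the failure signal is an assumption based only on this log; a real test would likely need more signals.

```shell
# Hypothetical check: succeed (exit status 0) if the given service AM log
# contains a component instance that "failed without retry", as in the
# excerpt above; return a non-zero status otherwise.
component_failed() {
    grep -q 'failed without retry' "$1"
}

# Example call (the log path is a placeholder):
# if component_failed /path/to/serviceam.log; then
#     echo "training job did not start cleanly" >&2
# fi
```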

[~sunilg]: What's your opinion?

Thanks! 


> Add test coverage for YarnServiceJobSubmitter and make it ready for extension 
> for PyTorch
> -----------------------------------------------------------------------------------------
>
>                 Key: SUBMARINE-54
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-54
>             Project: Hadoop Submarine
>          Issue Type: Sub-task
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: SUBMARINE-54.001.patch, SUBMARINE-54.002.patch, 
> SUBMARINE-54.003.patch, SUBMARINE-54.004.patch, SUBMARINE-54.005.patch, 
> SUBMARINE-54.006.patch, SUBMARINE-54.007.patch, SUBMARINE-54.008.patch, 
> SUBMARINE-54.009.patch, SUBMARINE-54.009.patch
>
>
> This crucial class has no associated test yet. We need to improve this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
