[ https://issues.apache.org/jira/browse/SUBMARINE-54?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824087#comment-16824087 ]
Szilard Nemeth commented on SUBMARINE-54:
-----------------------------------------

Hi [~tangzhankun], [~sunilg]!

I played around with the single-node training job on my cluster:

1. The command to start the Submarine job was:
{code:java}
/opt/hadoop/bin/yarn jar /home/systest/hadoop-yarn-submarine-3.3.0-SNAPSHOT.jar job run \
 --name tf-job-001 --verbose --docker_image hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
 --input_path hdfs://default/dataset/cifar-10-data \
 --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
 --num_workers 1 --worker_resources memory=5G,vcores=2 \
 --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
 --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3
{code}

2. I have the following error logs from the serviceam.log file:
{code:java}
2019-04-23 05:14:42,679 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE master-0 : container_e17_1556021556746_0001_01_000002] failed without retry, exitStatus=ContainerStatus: [ContainerId: container_e17_1556021556746_0001_01_000002, ExecutionType: GUARANTEED, State: COMPLETE, Capability: <memory:5120, vCores:2>, Diagnostics: [2019-04-23 05:14:42.182]Exception from container-launch.
Container id: container_e17_1556021556746_0001_01_000002
Exit code: 1

[2019-04-23 05:14:42.229]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr.txt :
./run-PRIMARY_WORKER.sh: line 9: /hadoop-3.1.0/bin/hadoop: No such file or directory
./run-PRIMARY_WORKER.sh: line 16: cd: /test/models/tutorials/image/cifar10_estimator: No such file or directory

[2019-04-23 05:14:42.229]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr.txt :
./run-PRIMARY_WORKER.sh: line 9: /hadoop-3.1.0/bin/hadoop: No such file or directory
./run-PRIMARY_WORKER.sh: line 16: cd: /test/models/tutorials/image/cifar10_estimator: No such file or directory

, ExitStatus: 1, IP: null, Host: null, ExposedPorts: , ContainerSubState: DONE]
{code}

Could this be caused by missing Docker configuration? The logs give no clear indication of that. Both "No such file or directory" errors refer to paths that the launch script expects to find inside the container: /hadoop-3.1.0/bin/hadoop and /test/models/tutorials/image/cifar10_estimator.

Also, I don't understand what this value is for:
{code:java}
DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0
{code}

By the way, it is not easy to configure YARN to run Submarine on a cluster every time we make a change in the code. As a longer-term solution, could we develop an integration test environment that covers the basic cases, such as verifying that single-node and distributed TF training jobs start up without issues? I would happily work on this, but not right now, of course.

Also, testing on a cluster would take more time, as I need to configure Docker on the hosts. I guess we can't proceed with SUBMARINE-52 until I have a working end-to-end Submarine job.

[~sunilg]: What's your opinion?

Thanks!

> Add test coverage for YarnServiceJobSubmitter and make it ready for extension
> for PyTorch
> -----------------------------------------------------------------------------------------
>
>                 Key: SUBMARINE-54
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-54
>             Project: Hadoop Submarine
>          Issue Type: Sub-task
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: SUBMARINE-54.001.patch, SUBMARINE-54.002.patch,
> SUBMARINE-54.003.patch, SUBMARINE-54.004.patch, SUBMARINE-54.005.patch,
> SUBMARINE-54.006.patch, SUBMARINE-54.007.patch, SUBMARINE-54.008.patch,
> SUBMARINE-54.009.patch, SUBMARINE-54.009.patch
>
>
> This crucial class has no associated test yet. We need to improve this.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)