idealboy commented on issue #7412: About van when using distribute training
URL: https://github.com/apache/incubator-mxnet/issues/7412#issuecomment-321456391

My environment settings for MXNet dist-sync on two machines are below. Both machines are reachable over ssh.

(1) 10.15.240.189:

    export DMLC_NUM_WORKER=1
    export MXNET_GPU_WORKER_NTHREADS=8
    export MXNET_CPU_WORKER_NTHREADS=8
    export MXNET_CPU_PRIORITY_NTHREADS=8
    export MXNET_GPU_COPY_NTHREADS=2
    export DMLC_NUM_SERVER=1
    export DMLC_PS_ROOT_URI=10.15.240.189
    export DMLC_PS_ROOT_PORT=3000
    export DMLC_ROLE=scheduler
    export DMLC_INTERFACE="eth0"

(2) 10.155.133.82:

    export DMLC_ROLE=worker
    export DMLC_WORKER_NUM=1
    export DMLC_SERVER_NUM=1
    export DMLC_PS_ROOT_URI=10.15.240.189
    export DMLC_PS_ROOT_PORT=8000
    export DMLC_ROLE=worker
    export DMLC_INTERFACE="eth0"

The command I run is:

    python ../../tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python train_mnist.py --network lenet --gpus 0 --kv-store dist_sync

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
With regards, Apache Git Services
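
The two sets of DMLC_* exports quoted in the comment can be compared side by side with a short script. This is only a sketch: `diff_env` and the two dictionaries are hypothetical names, not part of MXNet or ps-lite, and the values are transcribed from the comment (the MXNET_* thread settings are left out because they appear on one machine only).

```python
# Hypothetical helper, not part of MXNet or ps-lite: it compares the
# DMLC_* settings exported on the two machines in the comment.

# Values transcribed from the exports on 10.15.240.189.
machine_1 = {
    "DMLC_NUM_WORKER": "1",
    "DMLC_NUM_SERVER": "1",
    "DMLC_PS_ROOT_URI": "10.15.240.189",
    "DMLC_PS_ROOT_PORT": "3000",
    "DMLC_ROLE": "scheduler",
    "DMLC_INTERFACE": "eth0",
}

# Values transcribed from the exports on 10.155.133.82.
machine_2 = {
    "DMLC_ROLE": "worker",
    "DMLC_WORKER_NUM": "1",
    "DMLC_SERVER_NUM": "1",
    "DMLC_PS_ROOT_URI": "10.15.240.189",
    "DMLC_PS_ROOT_PORT": "8000",
    "DMLC_INTERFACE": "eth0",
}


def diff_env(a, b, ignore=("DMLC_ROLE",)):
    """Return (value mismatches, keys only in a, keys only in b).

    Keys listed in `ignore` are expected to differ between nodes and are
    skipped when comparing values.
    """
    shared = (set(a) & set(b)) - set(ignore)
    mismatched = {k: (a[k], b[k]) for k in sorted(shared) if a[k] != b[k]}
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    return mismatched, only_a, only_b


mismatched, only_1, only_2 = diff_env(machine_1, machine_2)
print("different values:", mismatched)
print("only on machine 1:", only_1)
print("only on machine 2:", only_2)
```

With the values as posted, the script reports that DMLC_PS_ROOT_PORT differs (3000 on the first machine, 8000 on the second) and that the worker and server counts are exported under different variable names on each machine. Whether either difference is related to the van problem in this issue is not established here.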