idealboy commented on issue #7396: **!!!Error with dist-sync on two
machines, Thank you
URL:
https://github.com/apache/incubator-mxnet/issues/7396#issuecomment-321313876
This is the output with "export PS_VERBOSE=2" set:
[00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={
cmd=ADD_NODE, node={ role=server, ip=10.15.240.189, port=52099, is_recovery=0 }
}
[00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={
cmd=ADD_NODE, node={ role=worker, ip=10.15.240.189, port=39902, is_recovery=0 }
}
[00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={
cmd=ADD_NODE, node={ role=server, ip=10.15.133.82, port=35715, is_recovery=0 } }
[00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={
cmd=ADD_NODE, node={ role=worker, ip=10.15.133.82, port=37993, is_recovery=0 } }
[00:38:29] src/van.cc:235: assign rank=9 to node role=worker,
ip=10.15.240.189, port=39902, is_recovery=0
[00:38:29] src/van.cc:235: assign rank=8 to node role=server,
ip=10.15.240.189, port=52099, is_recovery=0
[00:38:29] src/van.cc:235: assign rank=10 to node role=server,
ip=10.15.133.82, port=35715, is_recovery=0
[00:38:29] src/van.cc:235: assign rank=11 to node role=worker,
ip=10.15.133.82, port=37993, is_recovery=0
[00:38:29] src/van.cc:136: ? => 9. Meta: request=0, timestamp=0, control={
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902,
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0
role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker,
id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1,
ip=10.15.240.189, port=9118, is_recovery=0 } }
[00:38:29] src/van.cc:136: ? => 11. Meta: request=0, timestamp=1, control={
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902,
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0
role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker,
id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1,
ip=10.15.240.189, port=9118, is_recovery=0 } }
[00:38:29] src/van.cc:136: ? => 8. Meta: request=0, timestamp=2, control={
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902,
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0
role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker,
id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1,
ip=10.15.240.189, port=9118, is_recovery=0 } }
[00:38:29] src/van.cc:136: ? => 10. Meta: request=0, timestamp=3, control={
cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902,
is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0
role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker,
id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1,
ip=10.15.240.189, port=9118, is_recovery=0 } }
[00:38:29] src/van.cc:251: the scheduler is connected to 2 workers and 2
servers
[00:38:29] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=1, control={
cmd=BARRIER, barrier_group=7 }
[00:38:29] src/van.cc:291: Barrier count for 7 : 1
[00:38:29] src/van.cc:161: 8 => 1. Meta: request=1, timestamp=1, control={
cmd=BARRIER, barrier_group=7 }
[00:38:29] src/van.cc:291: Barrier count for 7 : 2
[00:38:29] src/van.cc:136: [00:38:29] src/van.cc:161: 1 => 1. Meta:
request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
[00:38:29] src/van.cc:291: Barrier count for 7 : 3
? => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7
}
Now it seems the program is blocked somewhere, but I don't know how to
debug this problem.
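One way to narrow this down is to compare the ranks the scheduler assigned
with the node ids that actually sent a BARRIER request. The sketch below
does that for a PS_VERBOSE=2 scheduler log. The function name
barrier_report and the regular expressions are illustrative, not part of
MXNet or ps-lite, and the sample text is an excerpt of the log above. It
assumes, as I understand ps-lite, that the barrier is released only after
every node in the group has reported.

```python
import re

# Excerpt of the scheduler log above (line wrapping kept as in the message).
SAMPLE_LOG = """
[00:38:29] src/van.cc:235: assign rank=9 to node role=worker,
ip=10.15.240.189, port=39902, is_recovery=0
[00:38:29] src/van.cc:235: assign rank=8 to node role=server,
ip=10.15.240.189, port=52099, is_recovery=0
[00:38:29] src/van.cc:235: assign rank=10 to node role=server,
ip=10.15.133.82, port=35715, is_recovery=0
[00:38:29] src/van.cc:235: assign rank=11 to node role=worker,
ip=10.15.133.82, port=37993, is_recovery=0
[00:38:29] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=1, control={
cmd=BARRIER, barrier_group=7 }
[00:38:29] src/van.cc:161: 8 => 1. Meta: request=1, timestamp=1, control={
cmd=BARRIER, barrier_group=7 }
[00:38:29] src/van.cc:136: [00:38:29] src/van.cc:161: 1 => 1. Meta:
request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
"""

def barrier_report(log_text):
    """Return (assigned, arrived, missing) for a scheduler log.

    assigned: {node_id: (role, ip)} taken from the "assign rank=" lines
    arrived:  set of node ids that sent a BARRIER request to node 1
    missing:  assigned nodes that never sent a BARRIER request
    """
    # Undo line wrapping so each log record can be matched as one string.
    flat = " ".join(log_text.split())
    assigned = {
        int(rank): (role, ip)
        for rank, role, ip in re.findall(
            r"assign rank=(\d+) to node role=(\w+), ip=([\d.]+)", flat)
    }
    arrived = {
        int(sender)
        for sender in re.findall(
            r"(\d+) => 1\. Meta: request=1, timestamp=\d+, "
            r"control=\{ cmd=BARRIER", flat)
    }
    missing = {nid: info for nid, info in assigned.items()
               if nid not in arrived}
    return assigned, arrived, missing

assigned, arrived, missing = barrier_report(SAMPLE_LOG)
for node_id, (role, ip) in sorted(missing.items()):
    print(f"no BARRIER seen from node {node_id} ({role} on {ip})")
```

On the excerpt above, the only nodes without a BARRIER request are the
server and worker on 10.15.133.82, so that machine is probably the place
to look first.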
When I run "python train_mnist.py --network lenet --gpus 0" on
10.15.133.82 in /tmp/mxnet, it begins to train normally.
But when I run "python train_mnist.py --network lenet --gpus 0" on
10.15.240.189 in /tmp/mxnet, an error occurs: src/van.cc:76: Check failed:
(my_node_.port) != (-1) bind failed
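The "bind failed" check suggests that the node could not bind a local
port. One possible cause, which is an assumption and not something the
log confirms, is that a port is still held by a process left over from
the earlier distributed run, for example the scheduler port 9118 shown
in the log. A minimal sketch for checking this on 10.15.240.189 follows.
The helper can_bind is hypothetical and uses only Python's standard
socket module.

```python
import socket

def can_bind(port, host=""):
    """Return True if a TCP socket can currently bind (host, port) here."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
        return True

# 9118 is the scheduler port that appears in the log above.
print("port 9118 bindable:", can_bind(9118))
```

If this reports that the port is not bindable, a stale scheduler, server,
or worker process may still be running on that machine.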
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
With regards,
Apache Git Services