[GitHub] idealboy commented on issue #7396: **********!!!Error with dist-sync on two machines, Thank you

2017-08-09 Thread git
idealboy commented on issue #7396: **!!!Error with dist-sync on two 
machines, Thank you
URL: 
https://github.com/apache/incubator-mxnet/issues/7396#issuecomment-321455058
 
 
   The program is looping in the Start function in van.cc when calling mx.kvstore.create:
   
   while (!ready_) { LOG(INFO) << "wait ready"; }
   
   Why?
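
One thing worth ruling out for a loop like this is basic network reachability. As an assumption based on reading ps-lite, not something confirmed in this thread: the `ready_` flag seems to be set only once the node has heard back from the scheduler, so a blocked port between the two machines could keep the loop spinning. The sketch below is a plain TCP check; `can_connect` is a hypothetical helper name, not part of MXNet or ps-lite.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

For example, running `can_connect("<scheduler ip>", <scheduler port>)` on each worker and server machine, and the reverse check from the scheduler machine toward the ports the other nodes registered, would show whether a firewall is in the way.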
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] idealboy commented on issue #7396: **********!!!Error with dist-sync on two machines, Thank you

2017-08-09 Thread git
idealboy commented on issue #7396: **!!!Error with dist-sync on two 
machines, Thank you
URL: 
https://github.com/apache/incubator-mxnet/issues/7396#issuecomment-321313876
 
 
   This is the output after "export PS_VERBOSE=2":
   
   [00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.15.240.189, port=52099, is_recovery=0 } }
   [00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.15.240.189, port=39902, is_recovery=0 } }
   [00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.15.133.82, port=35715, is_recovery=0 } }
   [00:38:29] src/van.cc:161: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, ip=10.15.133.82, port=37993, is_recovery=0 } }
   [00:38:29] src/van.cc:235: assign rank=9 to node role=worker, ip=10.15.240.189, port=39902, is_recovery=0
   [00:38:29] src/van.cc:235: assign rank=8 to node role=server, ip=10.15.240.189, port=52099, is_recovery=0
   [00:38:29] src/van.cc:235: assign rank=10 to node role=server, ip=10.15.133.82, port=35715, is_recovery=0
   [00:38:29] src/van.cc:235: assign rank=11 to node role=worker, ip=10.15.133.82, port=37993, is_recovery=0
   [00:38:29] src/van.cc:136: ? => 9. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, ip=10.15.240.189, port=9118, is_recovery=0 } }
   [00:38:29] src/van.cc:136: ? => 11. Meta: request=0, timestamp=1, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, ip=10.15.240.189, port=9118, is_recovery=0 } }
   [00:38:29] src/van.cc:136: ? => 8. Meta: request=0, timestamp=2, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, ip=10.15.240.189, port=9118, is_recovery=0 } }
   [00:38:29] src/van.cc:136: ? => 10. Meta: request=0, timestamp=3, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.15.240.189, port=39902, is_recovery=0 role=server, id=8, ip=10.15.240.189, port=52099, is_recovery=0 role=server, id=10, ip=10.15.133.82, port=35715, is_recovery=0 role=worker, id=11, ip=10.15.133.82, port=37993, is_recovery=0 role=scheduler, id=1, ip=10.15.240.189, port=9118, is_recovery=0 } }
   [00:38:29] src/van.cc:251: the scheduler is connected to 2 workers and 2 servers
   [00:38:29] src/van.cc:161: 9 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
   [00:38:29] src/van.cc:291: Barrier count for 7 : 1
   [00:38:29] src/van.cc:161: 8 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }
   [00:38:29] src/van.cc:291: Barrier count for 7 : 2
   [00:38:29] src/van.cc:136: [00:38:29] src/van.cc:161: 1 => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
   [00:38:29] src/van.cc:291: Barrier count for 7 : 3
   ? => 1. Meta: request=1, timestamp=4, control={ cmd=BARRIER, barrier_group=7 }
   
   
   Now it seems the program is blocked somewhere, but I don't know how to 
debug this problem.
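
One way to narrow this down is to compare the nodes the scheduler assigned an id to with the nodes whose BARRIER request it logged. The sketch below does that for a PS_VERBOSE=2 log like the one above; `pending_barrier_nodes` is a hypothetical helper, and the regular expressions are written against the line shapes seen in this log only, so they may need adjusting for other ps-lite versions.

```python
import re

# "assign rank=9 to node role=worker, ip=10.15.240.189, ..."
ASSIGN = re.compile(r"assign rank=(\d+) to node role=(\w+),\s+ip=([\d.]+)")
# "9 => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, ..."
BARRIER = re.compile(r"(\d+) => 1\. Meta: request=1, [^}]*?cmd=BARRIER")


def pending_barrier_nodes(log_text):
    """Return {node_id: (role, ip)} for assigned nodes with no BARRIER request logged."""
    # Collapse all whitespace so that mail-wrapped log lines match too.
    text = " ".join(log_text.split())
    assigned = {int(n): (role, ip) for n, role, ip in ASSIGN.findall(text)}
    arrived = {int(n) for n in BARRIER.findall(text)}
    return {n: info for n, info in assigned.items() if n not in arrived}
```

Applied to the log above, this reports nodes 10 and 11, both on 10.15.133.82: the scheduler logged BARRIER requests only from nodes 9 and 8 (plus its own), which matches the count stopping at 3. That would be consistent with the nodes on the second machine never completing the handshake with the scheduler, though the log alone does not prove the cause.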
   
   When I run "python train_mnist.py --network lenet --gpus 0" on 
10.15.133.82 in /tmp/mxnet, it begins to train normally.
   
   But when I run "python train_mnist.py --network lenet --gpus 0" on 10.15.240.189 
in /tmp/mxnet, an error occurs: src/van.cc:76: Check failed: (my_node_.port) != 
(-1) bind failed
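
A "bind failed" check like this usually means the process could not bind the TCP port it wanted. One possible explanation, stated here as an assumption and not confirmed in this thread, is that the DMLC_* environment variables from the distributed run were still set in that shell while the hung scheduler on 10.15.240.189 was still holding its port (9118 in the log). A minimal sketch to test whether a port is free on the local machine; `can_bind` is a hypothetical helper:

```python
import socket


def can_bind(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use" or a permission error.
            return False
    return True
```

If `can_bind(9118)` returns False on 10.15.240.189, looking for a leftover process on that port and clearing the DMLC_* variables before the single-machine run would be a reasonable next step.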
   
   
 
