GoodJoey opened a new issue #10017: When doing distributed training, bandwidth becomes the bottleneck. URL: https://github.com/apache/incubator-mxnet/issues/10017
Is there any experience or are there any magic solutions for doing distributed training with MXNet? For example, is there a way to force each worker to communicate with the parameter server on the same node to save bandwidth? Or something like ring-allreduce (Horovod)?
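For context, a minimal sketch of how MXNet's parameter-server-based distributed training is usually set up, assuming the standard `dist_sync` kvstore; the gradient-compression settings and the commented launch command are illustrative, not a fix for the locality question asked above:

```python
# Sketch: distributed training via MXNet's parameter-server kvstore.
import mxnet as mx

# Each worker pushes gradients to the parameter servers and pulls
# back updated weights over the network, which is where bandwidth
# becomes the bottleneck.
kv = mx.kv.create('dist_sync')

# Optional: 2-bit gradient compression reduces push traffic at some
# cost in accuracy; the threshold is a tunable quantization parameter.
kv.set_gradient_compression({'type': '2bit', 'threshold': 0.5})

# The module/trainer is then bound to this kvstore, e.g.:
# mod.fit(train_iter, kvstore=kv, num_epoch=10)
```

A job like this is typically launched with the repository's launcher script, e.g. `python tools/launch.py -n 4 -s 4 -H hosts python train.py --kv-store dist_sync`, where `-n` and `-s` set the number of workers and servers and `hosts` lists the machines; whether servers end up co-located with workers depends on that host assignment.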