[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-15 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-389170346 @eric-haibin-lin Hi Haibin, I have finished the code modifications according to your comments.

[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-10 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-387983559 @eric-haibin-lin Currently, dist_sync_kvstore.py is added in the nightly test-all.sh, but it is under MXNet

[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-08 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-387378979 @rahul003 I have finished the code modifications according to your comments. I added mpich as the default MPI and I

[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-04 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386539894 @rahul003 For GPU, I agree with your comment, but the majority of the code in this PR is the infrastructure

[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-04 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386539894 @rahul003 For GPU, I agree with your comment. Currently we leave a placeholder for GPU for future

[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-03 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386497664 @rahul003 For mpich, if you directly install the Ubuntu package of mpich, its header files and library files are

[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-03 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386484809 @rahul003 The build instructions are as follows: USE_DIST_KVSTORE = 1 USE_MPI_DIST_KVSTORE = 1
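
A minimal usage sketch to go with the flags above, assuming MXNet was built from source with USE_DIST_KVSTORE = 1 and USE_MPI_DIST_KVSTORE = 1 as quoted in the comment. The 'dist_sync' kvstore type and the launcher are assumptions from stock MXNet, not something this thread confirms; the MPI AllReduce path in the PR may register a different type name.

    import mxnet as mx

    # Sketch only: assumes an MXNet build with USE_DIST_KVSTORE=1 (and
    # USE_MPI_DIST_KVSTORE=1 for this PR) and that the script is started
    # through a distributed launcher (e.g. tools/launch.py or mpirun).
    kv = mx.kvstore.create('dist_sync')  # stock type; the PR may expose another
    print('kvstore type:', kv.type)
    print('rank %d of %d workers' % (kv.rank, kv.num_workers))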

[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-03 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386487375 @rahul003 Local Batch Size: 64 means that every node's batch size is 64, so the global batch size is 64 * 8 = 512.
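
For clarity, the arithmetic behind that statement as a small Python sketch (the figures 64 and 8 come from the comment above; nothing else is assumed):

    # 8 nodes, each computing gradients on a local batch of 64 samples,
    # together process a global (effective) batch of 512 per iteration.
    local_batch_size = 64
    num_nodes = 8
    global_batch_size = local_batch_size * num_nodes
    assert global_batch_size == 512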

[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-03 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386484809 @rahul003 The build instructions are in the design doc. USE_DIST_KVSTORE = 1 USE_MPI_DIST_KVSTORE
