threeleafzerg commented on issue #10696: [MXNET-366] Extend MXNet Distributed
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-389170346
@eric-haibin-lin
Hi Haibin, I have finished the code modifications according to your comments.
threeleafzerg commented on issue #10696: [MXNET-366] Extend MXNet Distributed
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-387983559
@eric-haibin-lin Currently, in the nightly test-all.sh, dist_sync_kvstore.py is
added, but it's under MXNet
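For context, the push/pull pattern that a distributed kvstore test such as
dist_sync_kvstore.py exercises looks roughly like the following minimal sketch.
The key name and shapes here are illustrative, not taken from the real test,
and a 'dist_sync' store only works when the process is started through MXNet's
distributed launcher:

    # Rough sketch of the push/pull pattern a dist kvstore test exercises.
    # Key 3 and the 2x3 shape are illustrative, not from the actual test.
    # A 'dist_sync' store requires the process to be launched by MXNet's
    # distributed launcher (tools/launch.py) with scheduler/server roles.
    import mxnet as mx

    kv = mx.kv.create('dist_sync')       # distributed synchronous kvstore
    shape = (2, 3)
    kv.init(3, mx.nd.ones(shape))        # initialize key 3 on the servers

    kv.push(3, mx.nd.ones(shape) * 2)    # each worker pushes its gradient
    out = mx.nd.zeros(shape)
    kv.pull(3, out=out)                  # pull back the aggregated result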
threeleafzerg commented on issue #10696: [MXNET-366] Extend MXNet Distributed
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-387378979
@rahul003 I have finished the code modifications according to your comments. I
added MPICH as the default MPI and I
threeleafzerg commented on issue #10696: [MXNET-366] Extend MXNet Distributed
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386539894
@rahul003
For GPU, I agree with your comment. But the majority of the code in this PR is
the infrastructure. Currently we leave a placeholder for GPU for future
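For readers unfamiliar with the mechanism, the AllReduce pattern this
infrastructure is built around looks roughly like the sketch below, written
with mpi4py purely for illustration; the PR itself implements this inside
MXNet's C++ kvstore, and the array contents here are made up:

    # Gradient averaging via MPI AllReduce, sketched with mpi4py for
    # illustration only; the PR does this in MXNet's C++ kvstore.
    # Run with e.g.: mpirun -np 8 python allreduce_sketch.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    grad = np.ones(4, dtype=np.float32) * comm.Get_rank()  # fake local gradient

    summed = np.empty_like(grad)
    comm.Allreduce(grad, summed, op=MPI.SUM)  # every rank receives the sum
    avg = summed / comm.Get_size()            # average across all workers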
threeleafzerg commented on issue #10696: [MXNET-366] Extend MXNet Distributed
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386497664
@rahul003 For MPICH, if you directly install the Ubuntu package of mpich, its
header files and lib files are
threeleafzerg commented on issue #10696: [MXNET-366] Extend MXNet Distributed
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386484809
@rahul003
The build instruction is in the design doc; it is as follows:
USE_DIST_KVSTORE = 1
USE_MPI_DIST_KVSTORE = 1
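These read as make-time configuration flags; in MXNet's make-based build, such
flags are typically set in config.mk or passed on the make command line before
rebuilding, with USE_MPI_DIST_KVSTORE presumably being the flag this PR
introduces.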
threeleafzerg commented on issue #10696: [MXNET-366] Extend MXNet Distributed
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386487375
@rahul003 Local Batch Size: 64 means every node's batch size is 64, so the
global batch size is 64 * 8 = 512.
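As a one-line check of that arithmetic (the node count of 8 is taken from the
comment above):

    # Global batch size = per-node batch size x number of nodes.
    local_batch_size = 64
    num_nodes = 8
    global_batch_size = local_batch_size * num_nodes   # 512
    assert global_batch_size == 512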