[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-15 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-389170346
 
 
   @eric-haibin-lin  
   Hi Haibin, I have finished the code modifications according to your comments. If you have any questions, please let me know. Thanks!
   Note: following our internal (Intel) code review, I made the following renames:
   dist_sync_mpi -> dist_sync_allreduce
   mpi_collectives -> collectives
   MPI_Wrapper -> COLL_Wrapper
   The reason is that the collectives can be implemented with libraries other than MPI (e.g. the NCCL library).
   The corresponding design doc has been updated:
   https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094
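   For reference, a minimal usage sketch of the renamed kvstore type through the standard mx.kvstore.create API. The type string dist_sync_allreduce comes from the renaming above; the launch details and build requirements are generic assumptions, not code taken from this PR:

   import mxnet as mx

   # The new kvstore type name introduced by this PR; assumes an MXNet build with
   # the allreduce collectives enabled, launched across workers (e.g. via mpirun).
   kv = mx.kvstore.create('dist_sync_allreduce')

   # Standard kvstore attributes: the kvstore type, this worker's rank, and the worker count.
   print(kv.type, kv.rank, kv.num_workers)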
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-10 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-387983559
 
 
   @eric-haibin-lin  Currently, dist_sync_kvstore.py is included in the nightly test-all.sh, but only under the MXNet GPU build. We will add dist_sync_mpi_kvstore.py there and run it in CI; in nightly, this test script will not be enabled until GPU is supported.


[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-08 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-387378979
 
 
   @rahul003 I have finished the code modifications according to your comments. I added MPICH as the default MPI, tried it, and it works fine. Let me know if there are any questions.


[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-04 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386539894
 
 
   @rahul003 
   For GPU, I agree with your comment. But the majority of the code in this PR is the infrastructure for adding allreduce into MXNet, which is shared by both CPU and GPU. Currently we leave a placeholder for GPU as a future extension. We did not run into any issue on GPU; we enabled CPU first simply because we currently have many CPU multi-node environments. We can discuss further how to add the GPU extension. @pengzhao-intel Patric will shed more light on it.
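   As a toy, backend-agnostic illustration of the allreduce semantics this infrastructure provides (plain NumPy, not code from this PR): every worker contributes its local gradient and every worker receives the element-wise sum, whether the collective is backed by MPI or NCCL.

   import numpy as np

   # Hypothetical local gradients held by 4 workers for one parameter tensor.
   worker_grads = [np.array([1.0, 2.0]),
                   np.array([3.0, 4.0]),
                   np.array([5.0, 6.0]),
                   np.array([7.0, 8.0])]

   # After an allreduce(sum), every worker holds the same reduced tensor.
   reduced = np.sum(worker_grads, axis=0)   # array([16., 20.])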
   
   For ResNet-50, the local batch size is 64, so the global batch size is 64 * 8 = 512 (8 machines). Yes, we trained entirely on CPU.
   
   In general, allreduce performance should be similar for OpenMPI and MPICH. Intel MPI has better allreduce performance, but it is not free software, although its runtime part is free. I agree with you that we should select OpenMPI as the default MPI if no one objects. (We will download the OpenMPI archive as a 3rd-party dependency and compile it.)
   
   For proto3, I tested the original kvstore type dist_sync and it works fine with PS-Lite. Moreover, we just use protobuf 3.5.1; PS-Lite still uses proto2 (we just need to specify its version explicitly).


[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-03 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386497664
 
 
   @rahul003  For MPICH, if you install the Ubuntu package directly, its header files and library files are not under the same subfolder. I suggest downloading MPICH and compiling/building it yourself (https://www.mpich.org/downloads/).
   The built MPICH has the same release directory layout as OpenMPI and Intel MPI:
   [zhouhaiy@mlt-ace build]$ pwd
   /home/zhouhaiy/mpich2/build
   [zhouhaiy@mlt-ace build]$ ls
   bin  include  lib  share
   I have already tried it, and MXNet builds fine with MPICH2.
   


[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-03 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386484809
 
 
   @rahul003 
   The build instructions are as follows:
   USE_DIST_KVSTORE = 1
   USE_MPI_DIST_KVSTORE = 1
   MPI_ROOT=/usr/lib/openmpi
   We let the end user select which MPI to use (OpenMPI, MPICH, or Intel MPI). That's why we don't include the MPI source as a 3rd-party library. You can check Horovod; they use the same approach: https://github.com/uber/horovod#install
   So the end user needs to install MPI separately.
   Can you try the latest OpenMPI? We tried both OpenMPI and Intel MPI; their release directory structure looks like the following:
   /home/zhouhaiy/openmpi/build
   [zhouhaiy@mlt-ace build]$ ls
   bin  etc  include  lib  share
   It looks like the MPICH release directory layout is not the same as OpenMPI's; I will check.
   
   Alternatively, we can also use the following logic: if the MPI_ROOT environment variable is set, we use the MPI library it points to; otherwise we download an open-source 3rd-party MPI source package, compile it, and have MXNet depend on it.
   
   Which one do you prefer? We need consensus.


[GitHub] threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce

2018-05-03 Thread GitBox
threeleafzerg commented on issue #10696: [MXNET-366]Extend MXNet Distributed 
Training by MPI AllReduce
URL: https://github.com/apache/incubator-mxnet/pull/10696#issuecomment-386487375
 
 
   @rahul003 "Local batch size: 64" means every node's batch size is 64, so the global batch size is 64 * 8 = 512.
   Currently, the results are based on CPU. For ResNet-50 we measured a scaling efficiency close to 99%. Our current implementation covers CPU and leaves a placeholder for GPU.
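   A minimal sketch of that local/global batch size relationship using the standard kvstore attributes (the dist_sync_allreduce type string comes from this PR; the rest is generic usage, not code from the PR):

   import mxnet as mx

   # Each of the 8 machines runs the same script, launched e.g. via mpirun.
   kv = mx.kvstore.create('dist_sync_allreduce')

   local_batch_size = 64
   # With kv.num_workers == 8 this gives 64 * 8 = 512, the global batch size quoted above.
   global_batch_size = local_batch_size * kv.num_workers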

