Re: Horovod-MXNet Integration
Congrats on the Horovod integration, everyone. That's really great to hear.

On Wed, Jan 30, 2019 at 10:08 AM Lin Yuan wrote:
Re: Horovod-MXNet Integration
Hi Yuan,

Thanks for your interest. We have just added MXNet support to Horovod and are working on performance tuning and adding more examples. We are definitely interested in further extending its support with Kubeflow.

Let's set up some time to have a more detailed discussion.

Best,
Lin

On Wed, Jan 30, 2019 at 7:42 AM Yuan Tang wrote:
Re: Horovod-MXNet Integration
Hi,

It's great to see the MXNet-Horovod integration get merged:
https://github.com/uber/horovod/pull/542

Is there any future plan for this? I've been working on Kubeflow's MPI-Operator (https://github.com/kubeflow/mpi-operator) lately, and it would be interesting to see an example of using Horovod + MXNet + Kubeflow via the MPI Operator. Feel free to reach out (@terrytangyuan <https://github.com/terrytangyuan>) if you encounter any issues.

Best,
Yuan

On Fri, Nov 2, 2018 at 6:51 PM Lin Yuan wrote:
Re: Horovod-MXNet Integration
Hi Mu,

Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on releasing the MXNet-Horovod integration in production. We have made some changes on both the MXNet and Horovod sides. The changes on the MXNet side have mostly been merged, and we are working to merge code into the Horovod repo. We will send a design doc to you for review again next week.

Thanks for your feedback,
Lin

On Wed, Oct 31, 2018 at 12:03 PM Mu Li wrote:
Re: Horovod-MXNet Integration
Thanks for your contribution, Carl.

I remember I left a comment on the proposal, but today I found it had disappeared. My suggestion is to try our best not to change the existing API. The reason is that we would need to change all trainers on the frontend that use the existing kvstore APIs, which may cause confusion for users.

The current proposal wants to add the following 4 APIs to kvstore:

- kv.pushpull
- kv.broadcast
- kv.local_rank
- kv.num_local_workers

Pushpull can be done with a sequential push and pull: you can make push a no-op and put all the work of pushpull into pull. Broadcast can be implemented with pull.

What are local workers? GPUs in a single machine? If so, we can query that directly.

On Fri, Sep 14, 2018 at 4:46 PM Carl Yang wrote:
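Mu's point that the proposed composite ops can be layered on the existing push/pull pair can be sketched with a toy in-memory kvstore. This is a single-process simulation of the semantics only; the class and method names are illustrative, not the real MXNet KVStore API:

```python
# Toy sketch of Mu's argument: pushpull can be a sequential push + pull,
# and broadcast can be expressed as one push followed by pulls. This
# simulates semantics only; it is NOT the real MXNet KVStore interface.

class ToyKVStore:
    def __init__(self):
        self._store = {}

    def push(self, key, values):
        # Server-side aggregation: sum all values pushed for this key.
        self._store[key] = sum(values)

    def pull(self, key):
        # Return the aggregated value to the calling worker.
        return self._store[key]

    def pushpull(self, key, values):
        # The proposed composite op, built from a sequential push + pull.
        self.push(key, values)
        return self.pull(key)

    def broadcast(self, key, root_value):
        # Broadcast: the root pushes once; every worker pulls that value.
        self.push(key, [root_value])
        return self.pull(key)


kv = ToyKVStore()
print(kv.pushpull("grad0", [1, 2, 3]))  # aggregated gradient: 6
print(kv.broadcast("param0", 42))       # every worker sees 42
```

In a real implementation the interesting part is what Mu hints at: where the aggregation work actually runs (no-op push, all work in pull) without changing the frontend trainer code.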
Horovod-MXNet Integration
Hi,

Currently, distributed training in MXNet can only be done using the parameter server. Horovod is an open-source distributed training framework that has shown a 2x speedup compared to TensorFlow using the parameter server. We propose to add Horovod support to MXNet. This will help our users achieve the goal of linear scalability to 256 GPUs and beyond. Design proposal on cwiki:

https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration

Please feel free to let me know if you have any suggestions or feedback.

Regards,
Carl
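For context on the scalability claim: Horovod replaces the central parameter server with ring-allreduce, in which each of n workers exchanges gradient chunks only with its ring neighbors, so per-worker bandwidth stays roughly constant as n grows. Below is a dependency-free, serial toy simulation of the two phases (reduce-scatter, then allgather); it is illustrative only, not Horovod's actual implementation, which chunks large tensors and overlaps communication with compute:

```python
# Toy serial simulation of ring-allreduce (sum) for n workers, each holding
# a gradient vector split into n chunks. Phase 1 (reduce-scatter): after
# n-1 ring steps, worker w holds the complete sum of chunk (w + 1) % n.
# Phase 2 (allgather): each completed chunk circulates to every worker.
# Illustrative only -- not Horovod's actual implementation.

def ring_allreduce(worker_chunks):
    n = len(worker_chunks)
    data = [list(chunks) for chunks in worker_chunks]

    # Reduce-scatter: in step s, worker w sends chunk (w - s) % n to its
    # right neighbor, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(w, (w - step) % n) for w in range(n)]
        vals = [data[w][c] for w, c in sends]   # snapshot before writing
        for (w, c), v in zip(sends, vals):
            data[(w + 1) % n][c] += v

    # Allgather: in step s, worker w forwards chunk (w + 1 - s) % n, and
    # its right neighbor overwrites its stale copy with the full sum.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n) for w in range(n)]
        vals = [data[w][c] for w, c in sends]
        for (w, c), v in zip(sends, vals):
            data[(w + 1) % n][c] = v
    return data


# Two workers with gradients [1, 2] and [3, 4]: both end with the sum [4, 6].
print(ring_allreduce([[1, 2], [3, 4]]))
```

Each worker sends and receives 2(n-1) chunks of size 1/n of the gradient, so total traffic per worker is about 2x the gradient size regardless of n, which is the basis of the near-linear scaling Horovod reports.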