Re: Horovod-MXNet Integration
Congrats on the Horovod integration, everyone. That's really great to hear.

On Wed, Jan 30, 2019 at 10:08 AM Lin Yuan wrote:
Re: Horovod-MXNet Integration
Hi Yuan,

Thanks for your interest. We have just added MXNet support to Horovod and are working on performance tuning and adding more examples. We are definitely interested in further extending its support with Kubeflow.

Let's set up some time to have a more detailed discussion.

Best,
Lin

On Wed, Jan 30, 2019 at 7:42 AM Yuan Tang wrote:
Re: Horovod-MXNet Integration
Hi,

It's great to see the MXNet-Horovod integration get merged:
https://github.com/uber/horovod/pull/542

Is there any future plan for this? I've been working on Kubeflow's MPI-Operator (https://github.com/kubeflow/mpi-operator) lately, and it would be interesting to see an example of using Horovod + MXNet + Kubeflow via the MPI Operator. Feel free to reach out (@terrytangyuan <https://github.com/terrytangyuan>) if you encounter any issues.

Best,
Yuan

On Fri, Nov 2, 2018 at 6:51 PM Lin Yuan wrote:
Re: Horovod-MXNet Integration
Hi Mu,

Darren (@yuxihu <https://github.com/yuxihu>) and I have been working on releasing the MXNet-Horovod integration in production. We have made some changes on both the MXNet and Horovod sides. The changes on the MXNet side have mostly been merged, and we are working to merge code into the Horovod repo. We will send a design doc to you for review again next week.

Thanks for your feedback,
Lin

On Wed, Oct 31, 2018 at 12:03 PM Mu Li wrote:
Re: Horovod-MXNet Integration
Thanks for your contribution, Carl.

I remember I left a comment on the proposal, but today I found it had disappeared. My suggestion is to try our best not to change the existing API. The reason is that we would need to change all trainers on the frontend that use the existing kvstore APIs, which may cause confusion for users.

The current proposal wants to add the following 4 APIs to kvstore:

- kv.pushpull
- kv.broadcast
- kv.local_rank
- kv.num_local_workers

Pushpull can be done with a sequential push and pull: you can make push a no-op and put all the work of pushpull into pull. Broadcast can be implemented with pull.

What are local workers? GPUs in a single machine? If so, we can query that directly.

On Fri, Sep 14, 2018 at 4:46 PM Carl Yang wrote:
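Mu's point that the proposed composite ops can be layered on the existing push/pull pair can be sketched with a toy in-memory kvstore. This is a single-process simulation of the semantics only; the class and method names are illustrative, not the real MXNet KVStore API:

```python
# Toy sketch of Mu's argument: pushpull can be a sequential push + pull,
# and broadcast can be expressed as one push followed by pulls. This
# simulates semantics only; it is NOT the real MXNet KVStore interface.

class ToyKVStore:
    def __init__(self):
        self._store = {}

    def push(self, key, values):
        # Server-side aggregation: sum all values pushed for this key.
        self._store[key] = sum(values)

    def pull(self, key):
        # Return the aggregated value to the calling worker.
        return self._store[key]

    def pushpull(self, key, values):
        # The proposed composite op, built from a sequential push + pull.
        self.push(key, values)
        return self.pull(key)

    def broadcast(self, key, root_value):
        # Broadcast: the root pushes once; every worker pulls that value.
        self.push(key, [root_value])
        return self.pull(key)


kv = ToyKVStore()
print(kv.pushpull("grad0", [1, 2, 3]))  # aggregated gradient: 6
print(kv.broadcast("param0", 42))       # every worker sees 42
```

In a real implementation the interesting part is what Mu hints at: where the aggregation work actually runs (no-op push, all work in pull) without changing the frontend trainer code.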
Horovod-MXNet Integration
Hi,

Currently, distributed training in MXNet can only be done using the parameter server. Horovod is an open-source distributed training framework that has shown a 2x speedup compared to TensorFlow using the parameter server. We propose to add Horovod support to MXNet. This will help our users achieve the goal of linear scalability to 256 GPUs and beyond. Design proposal on cwiki:

https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration

Please feel free to let me know if you have any suggestions or feedback.

Regards,
Carl
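For context on the scalability claim: Horovod replaces the central parameter server with ring-allreduce, in which each of n workers exchanges gradient chunks only with its ring neighbors, so per-worker bandwidth stays roughly constant as n grows. Below is a dependency-free, serial toy simulation of the two phases (reduce-scatter, then allgather); it is illustrative only, not Horovod's actual implementation, which chunks large tensors and overlaps communication with compute:

```python
# Toy serial simulation of ring-allreduce (sum) for n workers, each holding
# a gradient vector split into n chunks. Phase 1 (reduce-scatter): after
# n-1 ring steps, worker w holds the complete sum of chunk (w + 1) % n.
# Phase 2 (allgather): each completed chunk circulates to every worker.
# Illustrative only -- not Horovod's actual implementation.

def ring_allreduce(worker_chunks):
    n = len(worker_chunks)
    data = [list(chunks) for chunks in worker_chunks]

    # Reduce-scatter: in step s, worker w sends chunk (w - s) % n to its
    # right neighbor, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(w, (w - step) % n) for w in range(n)]
        vals = [data[w][c] for w, c in sends]   # snapshot before writing
        for (w, c), v in zip(sends, vals):
            data[(w + 1) % n][c] += v

    # Allgather: in step s, worker w forwards chunk (w + 1 - s) % n, and
    # its right neighbor overwrites its stale copy with the full sum.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n) for w in range(n)]
        vals = [data[w][c] for w, c in sends]
        for (w, c), v in zip(sends, vals):
            data[(w + 1) % n][c] = v
    return data


# Two workers with gradients [1, 2] and [3, 4]: both end with the sum [4, 6].
print(ring_allreduce([[1, 2], [3, 4]]))
```

Each worker sends and receives 2(n-1) chunks of size 1/n of the gradient, so total traffic per worker is about 2x the gradient size regardless of n, which is the basis of the near-linear scaling Horovod reports.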