Re: [apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

2019-12-07 Thread Haibin Lin
I do expect the API to change in the future. Currently @szhengac, @zhongyuchen and I are exploring APIs for gradient compression with a few algorithms, and we may bring the best practices back to MXNet.
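Gradient compression trades precision for communication bandwidth. As an illustration of one common algorithm family mentioned in this thread, here is a minimal top-k sparsification sketch; the helper names are hypothetical and not part of any API under discussion:

```python
def topk_compress(grad, k):
    """Keep only the k largest-magnitude entries of a flat gradient.

    Returns (indices, values); the dropped entries are treated as zero
    and are typically accumulated locally as residual error.
    """
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    idx = sorted(ranked[:k])
    return idx, [grad[i] for i in idx]

def topk_decompress(idx, vals, size):
    """Reconstruct a dense gradient from the sparse (indices, values) pair."""
    out = [0.0] * size
    for i, v in zip(idx, vals):
        out[i] = v
    return out
```

Only the (indices, values) pair needs to cross the network, which is where the bandwidth saving comes from.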

Re: [apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

2019-11-12 Thread Leonard Lausen
Would it make sense to add optional support for sparse ndarrays and gradient compression in `AbstractKVStore`? You mentioned that not all frameworks support them. Do you expect the API to change in the future?
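One way to make such support optional is capability-query methods with conservative defaults, so callers can fall back to dense, uncompressed communication. A sketch under the assumption that `AbstractKVStore` is an abstract base class; the method names here are illustrative, not the RFC's actual interface:

```python
from abc import ABC, abstractmethod

class AbstractKVStore(ABC):
    """Sketch of a backend-agnostic kvstore interface (names illustrative)."""

    @abstractmethod
    def pushpull(self, key, value, out):
        """Aggregate `value` across workers and write the result to `out`."""

    # Optional capabilities: backends that lack them report False,
    # so callers know to fall back to dense, uncompressed tensors.
    def supports_sparse(self):
        return False

    def supports_gradient_compression(self):
        return False
```

A backend such as the native parameter-server kvstore would override these to return True, while an allreduce-only backend could leave the defaults.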

Re: [apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

2019-11-12 Thread Haibin Lin
I did mean use cases 2, 3 and 4. Initialization is done in the constructor `kv.__init__()`; for Horovod it could simply be a `hvd.init()` call. I have not discussed problem 1 in much detail. Horovod uses mpirun to set up connections and launch processes, while byteps/p3 and native kvstore
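Hiding backend initialization inside the constructor could look like the sketch below. The registry and the `UnifiedKVStore` name are hypothetical (this is not the RFC's API); the point is that the Horovod backend's init function would wrap `hvd.init()`, while the native kvstore's would connect to its parameter servers:

```python
# Registry mapping a backend name to its process-group initializer.
_BACKENDS = {}

def register_backend(name, init_fn):
    """Associate a backend name with its one-time initialization routine."""
    _BACKENDS[name] = init_fn

class UnifiedKVStore:
    def __init__(self, backend):
        # For Horovod this init_fn would call hvd.init(); for the native
        # kvstore it would establish parameter-server connections instead.
        self.backend = backend
        _BACKENDS[backend]()

# Stand-in for the real hvd.init() call, for illustration only.
register_backend("horovod", lambda: None)
```

The user-facing code then stays identical across backends: construct the kvstore, and initialization happens behind the scenes.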

Re: [apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

2019-11-12 Thread Lin Yuan
In the Limitations section, I suppose you meant 'use cases 1, 3, 4', right? -- View it on GitHub: https://github.com/apache/incubator-mxnet/issues/16795#issuecomment-553085374

[apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

2019-11-12 Thread Haibin Lin
## Background

Data parallel training is the most common distributed training technique when it comes to multiple GPUs or multiple hosts. Currently, several communication backends provide functionalities for communicating tensors across devices/hosts for data parallel training. For MXNet
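At its core, data parallel training averages per-worker gradients every step so all replicas apply the same update. A minimal pure-Python simulation of that allreduce-average step, with worker gradients as flat lists (illustrative only, no real communication):

```python
def allreduce_average(worker_grads):
    """Average per-worker gradients elementwise, as an allreduce would.

    worker_grads: a list of equal-length gradient lists, one per worker.
    Returns the averaged gradient that every worker would receive.
    """
    n = len(worker_grads)
    width = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(width)]
```

The various backends (parameter servers, ring allreduce, etc.) differ in how this averaging is carried out over the network, not in the result it produces.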