Re: [apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

2019-11-12 Thread Leonard Lausen
Would it make sense to add optional support for sparse ndarrays and gradient compression in `AbstractKVStore`? You mentioned that not all frameworks support them. Do you expect the API to change in the future?
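A minimal sketch of one way such optional support could be surfaced, assuming an abstract interface along the lines discussed in the RFC; the capability-flag names and the `pushpull` method below are illustrative assumptions, not the proposed API:

```python
from abc import ABC, abstractmethod


class AbstractKVStore(ABC):
    """Illustrative only: optional features exposed as capability flags."""

    # Backends override these when they support the corresponding feature.
    supports_sparse = False                # row_sparse gradients
    supports_gradient_compression = False  # e.g. 2-bit compression

    @abstractmethod
    def pushpull(self, key, value, out=None):
        """Aggregate `value` across workers and write the result to `out`."""


class NativeKVStore(AbstractKVStore):
    # The native MXNet kvstore supports both features today; a backend that
    # lacks them would simply leave the flags at False.
    supports_sparse = True
    supports_gradient_compression = True

    def pushpull(self, key, value, out=None):
        raise NotImplementedError("delegates to mx.kv.KVStore in practice")
```

Callers could then check the flags and fall back to dense, uncompressed updates when a backend lacks a feature.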

Re: [apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

2019-11-12 Thread Haibin Lin
I did mean use cases 2, 3, 4. Initialization is done in the constructor `kv.__init__()`; for Horovod it could simply be an `hvd.init()` call. I have not discussed problem 1 in much detail. Horovod uses mpirun to set up connections and launch processes, while BytePS/P3 and the native kvstore …
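A rough sketch of that constructor-based initialization; the `HorovodKVStore` class name and the `pushpull` signature are assumptions for illustration, not part of the RFC text:

```python
import horovod.mxnet as hvd


class HorovodKVStore:
    def __init__(self):
        # Connection setup lives in the constructor; for Horovod this is just
        # hvd.init(), while the worker processes themselves come from mpirun.
        hvd.init()
        self.rank = hvd.rank()           # this worker's id
        self.num_workers = hvd.size()    # total number of workers

    def pushpull(self, key, value, out=None):
        # Horovod aggregates via allreduce rather than explicit push/pull.
        result = hvd.allreduce(value, name=str(key))
        if out is not None:
            result.copyto(out)
        return result
```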

Re: [apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

2019-11-12 Thread Lin Yuan
In the Limitations section, I suppose you meant 'use cases 1, 3, 4', right? (https://github.com/apache/incubator-mxnet/issues/16795#issuecomment-553085374)

[apache/incubator-mxnet] [RFC] Unified API for Distributed Data Parallel Training (#16795)

2019-11-12 Thread Haibin Lin
## Background
Data parallel training is the most common distributed training technique when training on multiple GPUs or multiple hosts. Currently, several communication backends provide functionality for communicating tensors across devices/hosts for data parallel training. For MXNet …
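For context, a minimal example of the gradient-aggregation pattern this RFC targets, written against MXNet's existing kvstore API (one of the backends to be unified); the `'local'` store is used here so the snippet runs on a single machine:

```python
import mxnet as mx

kv = mx.kv.create('local')          # 'dist_sync' when running across hosts
shape = (2, 3)
kv.init('weight', mx.nd.zeros(shape))

# Each device computes a gradient on its shard of the batch ...
grads = [mx.nd.ones(shape, ctx=mx.cpu(i)) for i in range(4)]

# ... then gradients are summed across devices (push) and the aggregated
# result is read back (pull) before the optimizer update.
kv.push('weight', grads)
agg = mx.nd.zeros(shape)
kv.pull('weight', out=agg)
print(agg.asnumpy())                # every entry is 4.0 (sum over 4 devices)
```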

[apache/incubator-mxnet] [Numpy] [WIP] [RFC] Sample_n op for DeepNumpy (#16793)

2019-11-12 Thread Xi Wang
## Description
The current design of DeepNumpy's random module follows native NumPy in its interpretation of the parameter `size`. More specifically, `size` indicates the final output size of the sampling operation. Parameter tensors, if narrower or smaller than `size`, will be …
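A small NumPy example of the `size` semantics described above, i.e. the baseline behaviour the proposed `sample_n` op is being contrasted with:

```python
import numpy as np

loc = np.array([0.0, 10.0, 20.0])    # parameter tensor of shape (3,)

# `size` fixes the final output shape; the smaller parameter tensor `loc`
# is broadcast against it, so each column uses its own mean.
out = np.random.normal(loc=loc, scale=1.0, size=(4, 3))
print(out.shape)                      # (4, 3)
```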