[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-04-10 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-380004688 The amalgamation build failed after I updated mshadow. I've updated the amalgamation makefile to fix the build.

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-04-09 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-379848578 @eric-haibin-lin @piiswrong Is this good to be merged?

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-04-06 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-379178912 I added the nightly tests we had for distributed kvstore as integration tests to the CI. And the build has passed.

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-04-05 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-378971955 Thanks. Do you know why the integration test doesn't show up in the CI steps?

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-04-05 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-378954521 So that means we can't merge it in for this release? This is an important feature which users are eager to use. We have

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-04-04 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-378779102 An update: the USE_DIST_KVSTORE flag has been added to the build, and this PR passes those builds on CPU and GPU.

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-04-04 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-378678940 I've been running into Scala issues when I introduce the dist flag to the builds. With @Roshrini's help, the build phase

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-31 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-377736427 Added multi-precision mode support. When the optimizer's multi-precision field is True, the server maintains weights in
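As a rough sketch of how this mode would be driven from the Python side (the kvstore type, learning rate, and momentum below are illustrative assumptions, not values taken from this PR):

```python
# Hedged sketch: fp16 gradients with an fp32 master copy kept by the optimizer on the server.
import mxnet as mx

kv = mx.kvstore.create('dist_sync')            # assumes a distributed job started via the usual launcher
opt = mx.optimizer.SGD(learning_rate=0.1,      # illustrative hyperparameters
                       momentum=0.9,
                       multi_precision=True)   # keep fp32 master weights alongside fp16 weights
kv.set_optimizer(opt)                          # run the optimizer on the server side
```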

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-29 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-377029019 @mli The behavior is as follows: if the grads pushed to kvstore are of type `dtype` - send grads in `dtype` form
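A minimal illustration of that push/pull path (a local kvstore is used here so the snippet runs without a launcher; the key name is made up):

```python
import mxnet as mx
import numpy as np

kv = mx.kvstore.create('local')                           # 'dist_sync' in an actual multi-node run
shape = (4, 4)

kv.init('weight', mx.nd.zeros(shape, dtype=np.float16))   # key registered as fp16
kv.push('weight', mx.nd.ones(shape, dtype=np.float16))    # grads pushed in their own dtype

out = mx.nd.zeros(shape, dtype=np.float16)
kv.pull('weight', out=out)                                 # pulled values come back in fp16 as well
```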

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-29 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-377372041 @mli It looks like CopyFromTo already uses multiple threads, doesn't it?

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-29 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-377372041 @mli So if we cast to fp32, then we cast all received updates, as well as cast the weight after updating it, so
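For clarity, the cast-to-fp32 idea being discussed looks roughly like the NDArray sketch below (this is not the PR's actual server-side C++ path; the SGD step and learning rate are placeholders):

```python
import mxnet as mx
import numpy as np

lr = 0.1                                         # placeholder learning rate
weight16 = mx.nd.ones((4, 4), dtype=np.float16)  # weight as stored/received in fp16
grad16 = mx.nd.ones((4, 4), dtype=np.float16) * 0.01

weight32 = weight16.astype(np.float32)           # cast the stored weight up to fp32
grad32 = grad16.astype(np.float32)               # cast the received update up to fp32
weight32 = weight32 - lr * grad32                # apply the update in fp32
weight16 = weight32.astype(np.float16)           # cast the weight back to fp16 before sending it out
```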

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-27 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-376699363 For reference: I'm adding the USE_DIST_KVSTORE flag to CI in the below PR

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-23 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-375835656 @piiswrong Could you please review?

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-23 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-375804869 Thanks @cjolivier01 and @haojin2, I reduced some multiplications, added const in more places, and changed a couple of

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-23 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-375784506 Yeah we should turn it on for all builds. @marcoabreu Would submitting a regular PR (from a non-committer) use that

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-23 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-375781478 @marcoabreu I realized now that the reason CI passed earlier is that USE_DIST_KVSTORE is not on in any CI build

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-23 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-375587770 @mli Please also review the change to the char parameter server that we discussed. And for reference, see the results section.

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-23 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-375587044 I've changed my approach and added a results section. Reviewers, if you have already seen the code, please re-review it :)

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-23 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-375578807 @marcoabreu I had noticed a slowdown, which I have now fixed, and reopened the PR.

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-21 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-374994401 @marcoabreu I see something weird in the CI build for this. I made some changes to address things raised by lint.

[GitHub] rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training

2018-03-21 Thread GitBox
rahul003 commented on issue #10183: [MXNET-120] Float16 support for distributed training URL: https://github.com/apache/incubator-mxnet/pull/10183#issuecomment-374844876 @solin319 This allows us to send some keys in fp16 and some in fp32. By using a parameter server with type char, we also
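For example, with this change a single kvstore can hold keys of different dtypes side by side. A hedged sketch (a local kvstore and made-up key names, just to keep it self-contained):

```python
import mxnet as mx
import numpy as np

kv = mx.kvstore.create('local')                # 'dist_sync' in a real multi-machine job

# one key registered in fp16, another in fp32, in the same store
kv.init('conv_weight', mx.nd.zeros((16, 3, 3, 3), dtype=np.float16))
kv.init('bn_gamma', mx.nd.ones((16,), dtype=np.float32))

kv.push('conv_weight', mx.nd.ones((16, 3, 3, 3), dtype=np.float16))
kv.push('bn_gamma', mx.nd.ones((16,), dtype=np.float32))
```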