rahul003 opened a new pull request #8342: [WIP] 2bit gradient compression
URL: https://github.com/apache/incubator-mxnet/pull/8342

## Description ##

Implements 2-bit gradient compression by quantizing each value in the gradient array to 2 bits, using two user-specified thresholds: one for positive and one for negative values.

@eric-haibin-lin @piiswrong @reminisce @anirudh2290 @bhavinthaker @madjam @cjolivier01 Please review. This is a work in progress. I'm currently running this with different kinds of models to gather performance results.

### Important files to review

Operator
- two_bit_quantize-inl.h
- two_bit_quantize.cc

KVStore local
- comm.h

KVStore dist
- kvstore_dist.h
- kvstore_dist_server.h

Documentation about gradient compression
- kvstore.py
- two_bit_quantize.cc

## Checklist ##

### Essentials ###
- [ ] Passed code style checking (`make lint`)
- [ ] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage
- [ ] For user-facing API changes, the API doc string has been updated
- [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

### Changes ###
- [ ] Two-bit quantize and dequantize operators
- [ ] Reduce operation in kvstore_local / comm.h
- [ ] Distributed kvstore changes at worker and server
- [ ] Tests for the operator, local kvstore, and distributed kvstore, with both predefined and random data. The results have been compared against a reference implementation of the same logic in Python.
- [ ] API changes for KVStore, Module and Trainer in Python

## Comments ##

### Problem

When training large-scale deep learning models, especially with distributed training, communication becomes a bottleneck for networks whose computation cost is low relative to their communication cost.

### Approach

We can address this by quantizing the gradients before sending them and dequantizing them at the receiver's end.
The sender retains the quantization error and adds it to the gradient in the next iteration, effectively delaying small updates to individual positions in the gradient. This PR currently implements 2-bit quantization.

### Two bit quantization

Use two thresholds to quantize the data: one positive and one negative. Any value greater than or equal to the positive threshold is set to one code (say 01), any value less than or equal to the negative threshold is set to a second code (say 10), and all other values are set to a third code (00). We need three codes to represent the data in this fashion, and hence two bits. We understand this wastes one of the four possible 2-bit codes, but reclaiming it is an optimization left for later, as it complicates the operators. The quantization error is stored as a residual and carried over to the next iteration, where it is added to the gradient before quantizing. An example below with thresholds of -2.0 and 2.0:

![Quantization at work](https://i.imgur.com/AtBVg92.png)

### Format of compressed gradient

The first two elements are the thresholds used for quantization. The third element is the size of the original array. These values are required to dequantize the gradient. Every element from the 4th onward holds compressed gradient data, with each such element representing up to 16 elements of the original array. For the example above, we get

```compr = [ -2.0, 2.0, 8, 6.1215606E-28]```

Note that the binary representation of the last element is

```00 01 00 10 01 00 00 10 0000000000000000```

### Local kvstore

When using the local kvstore, gradient compression only happens when device communication is used. When gradients are pushed, quantization and dequantization happen before the gradients are summed up (Reduce). Example: say we have 4 GPUs, and the gradients are being summed up on GPU0.
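As a sanity check on the scheme above, the quantization, residual update, and bit-packing can be sketched in NumPy. This is an illustrative reimplementation, not the PR's C++ operators; the function names here are hypothetical, and the code assignment (01 = positive, 10 = negative, 00 = neither) follows the description above.

```python
import numpy as np

def dequantize_2bit(codes, neg_threshold, pos_threshold):
    """Expand 2-bit codes back to floats: each code becomes one of the
    two thresholds, or zero."""
    out = np.zeros(codes.shape, dtype=np.float32)
    out[codes == 0b01] = pos_threshold
    out[codes == 0b10] = neg_threshold
    return out

def quantize_2bit(grad, residual, neg_threshold, pos_threshold):
    """Return the 2-bit codes for grad + residual and update the
    residual in place with the quantization error."""
    adjusted = grad + residual                # add carried-over error
    codes = np.zeros(adjusted.shape, dtype=np.uint8)
    codes[adjusted >= pos_threshold] = 0b01
    codes[adjusted <= neg_threshold] = 0b10
    # residual = whatever the quantized representation fails to capture
    residual[:] = adjusted - dequantize_2bit(codes, neg_threshold, pos_threshold)
    return codes

def pack_codes(codes):
    """Pack 2-bit codes into 32-bit words, 16 codes per word, most
    significant bits first."""
    words = []
    for start in range(0, len(codes), 16):
        word = 0
        for i, c in enumerate(codes[start:start + 16]):
            word |= int(c) << (30 - 2 * i)
        words.append(word)
    return words

# Example matching the figure: 8 elements, thresholds -2.0 and 2.0.
residual = np.zeros(8, dtype=np.float32)
grad = np.array([0.3, 2.5, -1.0, -3.1, 2.0, 0.0, 1.9, -2.0], dtype=np.float32)
codes = quantize_2bit(grad, residual, -2.0, 2.0)
# codes -> [0, 1, 0, 2, 1, 0, 0, 2]; pack_codes(codes) -> [0x12420000]
```

Reinterpreting the packed word `0x12420000` as a float32 gives `6.1215606e-28`, which matches the compressed-gradient example above.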
Each device quantizes its gradients, then sends the quantized gradient to GPU0, which dequantizes the data before merging it with the values from the other GPUs. Note that there is no strict need to quantize the gradient from GPU0 itself, but it is still done so that there is no bias towards the samples that were processed by GPU0. **Please let me know if this is not a good idea.**

### Dist kvstore

When the set_compress method of kvstore is called, each worker stores the compression parameters, and one worker sends them to all servers. From then on, each value is quantized before being pushed to a server. The server dequantizes the data and stores it as an array of the original size. When values are pulled from the server, it returns an array of the original size. The same applies when each server handles a shard of the data.

### Usage

The reason I used a dictionary, compress_params, for the arguments was to ensure uniformity when we extend this to other quantization techniques, since each technique would take a different number and type of parameters.

#### Operators
```
compr = mx.nd.contrib.create_2bit(grad)
mx.nd.contrib.quantize_2bit(grad, residual, compr, neg_threshold, pos_threshold)
mx.nd.contrib.dequantize_2bit(compr, decompr)
```

#### KVStore
```
kv = mx.kv.create('dist_sync')
kv.set_compress({'compress': '2bit', 'pos_threshold': 0.5, 'neg_threshold': -0.5})
```

#### Module
```
mod = mx.mod.Module(net, compress_params={'compress': '2bit', 'pos_threshold': 0.5, 'neg_threshold': -0.5})
```

#### Gluon Trainer
```
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1},
                        compress_params={'compress': '2bit', 'pos_threshold': 0.5, 'neg_threshold': -0.5})
```

### Questions

1. [Refer to the local kvstore section above] Should I skip quantization and dequantization for the gradient on the GPU that performs the reduction?
2.
On running nvprof for the GPU operator with large arrays, most of the time is taken by MapPlanKernel and cudaStreamSynchronize, as shown below. For small arrays, the quantize and dequantize operations are on top. Any suggestions to reduce the time taken by these operations in the operator?
```
Time(%)      Time     Calls       Avg       Min       Max  Name
 48.17%  11.9536s     50002  239.06us  223.23us  357.70us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, unsigned int, mshadow::Shape<int=2>, int=1)
...
==24796== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 63.81%  26.3938s    200002  131.97us  4.2860us  6.0096ms  cudaStreamSynchronize
```
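To make the local-kvstore flow in question 1 concrete, here is a small NumPy simulation of the reduce with error feedback across four devices. This is illustrative only; the thresholds and variable names are not from the PR (the ±0.5 thresholds mirror the set_compress example above). The invariant worth noting: the reduced value plus the sum of all residuals equals the exact gradient sum, so information is delayed rather than lost.

```python
import numpy as np

NEG, POS = -0.5, 0.5  # hypothetical thresholds, as in the set_compress example

def quantize_dequantize(adjusted):
    """Dequantized view of the 2-bit codes: each value becomes one of
    the two thresholds, or zero."""
    out = np.zeros_like(adjusted)
    out[adjusted >= POS] = POS
    out[adjusted <= NEG] = NEG
    return out

np.random.seed(0)
# One gradient per GPU, plus a per-GPU residual carried across iterations.
grads = [np.random.randn(4).astype(np.float32) * 0.3 for _ in range(4)]
residuals = [np.zeros(4, dtype=np.float32) for _ in range(4)]

# Reduce on "GPU0": every device, including GPU0 itself (to avoid biasing
# its samples), quantizes grad + residual and keeps the quantization
# error; the dequantized values are summed on GPU0.
reduced = np.zeros(4, dtype=np.float32)
for g, r in zip(grads, residuals):
    adjusted = g + r
    q = quantize_dequantize(adjusted)
    r[:] = adjusted - q          # error feedback for the next iteration
    reduced += q
```

Because each residual is exactly `adjusted - q`, summing `reduced` with all residuals recovers the exact gradient sum, up to float32 rounding.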