crazy-cat opened a new pull request #7393: add depthwise convolution's gpu version optimization
URL: https://github.com/apache/incubator-mxnet/pull/7393
 
 
   
   As cuDNN is not optimized for depthwise convolution, we optimized the GPU implementation of 2D depthwise convolution.
   The training results are as follows:
   **cuDNN version: MobileNet training on ImageNet**
   ```
   cd example/image-classification/;
   python train_imagenet.py --network mobilenet --gpus=0 --data-train=./train_480_q95.rec --data-nthreads 8
   
   INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, 
data_nthreads=8, data_train='./train_480_q95.rec', data_val=None, 
disp_batches=20, dtype='float32', gpus='0', image_shape='3,224,224', 
kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, 
lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, 
max_random_l=50, max_random_rotate_angle=10, max_random_s=50, 
max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, 
model_prefix=None, mom=0.9, monitor=0, network='mobilenet', num_classes=1000, 
num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', 
pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', 
test_io=0, top_k=0, wd=0.0001)
   [10:03:41] src/io/iter_image_recordio_2.cc:135: ImageRecordIOParser2: 
./train_480_q95.rec, use 7 threads for decoding..
   [10:03:45] src/operator/././cudnn_algoreg-inl.h:65: Running performance 
tests to find the best convolution algorithm, this can take a while... (setting 
env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   INFO:root:Epoch[0] Batch [20]        Speed: 133.85 samples/sec       
accuracy=0.000744
   INFO:root:Epoch[0] Batch [40]        Speed: 135.98 samples/sec       
accuracy=0.001953
   INFO:root:Epoch[0] Batch [60]        Speed: 135.47 samples/sec       
accuracy=0.000391
   INFO:root:Epoch[0] Batch [80]        Speed: 132.32 samples/sec       
accuracy=0.001563
   INFO:root:Epoch[0] Batch [100]       Speed: 134.01 samples/sec       
accuracy=0.001953
   ```
   
   **Our version: MobileNet training on ImageNet**
   
   ```
   cd example/image-classification/;
   python train_imagenet.py --network mobilenet --gpus=0 --data-train=./train_480_q95.rec --data-nthreads 8
   
   INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, 
data_nthreads=8, data_train='./train_480_q95.rec', data_val=None, 
disp_batches=20, dtype='float32', gpus='0', image_shape='3,224,224', 
kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, 
lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, 
max_random_l=50, max_random_rotate_angle=10, max_random_s=50, 
max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, 
model_prefix=None, mom=0.9, monitor=0, network='mobilenet', num_classes=1000, 
num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', 
pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', 
test_io=0, top_k=0, wd=0.0001)
   [09:59:19] src/io/iter_image_recordio_2.cc:135: ImageRecordIOParser2: 
./train_480_q95.rec, use 7 threads for decoding..
   [09:59:25] src/operator/././cudnn_algoreg-inl.h:65: Running performance 
tests to find the best convolution algorithm, this can take a while... (setting 
env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   INFO:root:Epoch[0] Batch [20]        Speed: 476.02 samples/sec       
accuracy=0.000372
   INFO:root:Epoch[0] Batch [40]        Speed: 489.77 samples/sec       
accuracy=0.001563
   INFO:root:Epoch[0] Batch [60]        Speed: 495.26 samples/sec       
accuracy=0.000781
   INFO:root:Epoch[0] Batch [80]        Speed: 494.94 samples/sec       
accuracy=0.001563
   INFO:root:Epoch[0] Batch [100]       Speed: 494.81 samples/sec       
accuracy=0.002734
   ```
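   
   For reference, the depthwise convolutions above are ordinary grouped `mx.sym.Convolution` layers whose `num_group` equals the number of input channels (and `num_filter`), which is the pattern the optimized kernel targets. A minimal sketch of MobileNet's depthwise-separable block (the function and layer names here are illustrative, not part of this PR):
   ```
   import mxnet as mx
   
   def depthwise_separable(data, in_channels, out_channels, stride=(1, 1), name='ds'):
       # Depthwise 3x3: one filter per channel (num_group == num_filter == in_channels).
       dw = mx.sym.Convolution(data=data, num_filter=in_channels, kernel=(3, 3),
                               stride=stride, pad=(1, 1), num_group=in_channels,
                               no_bias=True, name='%s_dw' % name)
       dw = mx.sym.Activation(data=mx.sym.BatchNorm(data=dw, name='%s_dw_bn' % name),
                              act_type='relu', name='%s_dw_relu' % name)
       # Pointwise 1x1: mixes channels; num_group=1, so cuDNN handles it as usual.
       pw = mx.sym.Convolution(data=dw, num_filter=out_channels, kernel=(1, 1),
                               no_bias=True, name='%s_pw' % name)
       return mx.sym.Activation(data=mx.sym.BatchNorm(data=pw, name='%s_pw_bn' % name),
                                act_type='relu', name='%s_pw_relu' % name)
   
   block = depthwise_separable(mx.sym.Variable('data'), in_channels=32, out_channels=64)
   ```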
   
   The default depthwise convolution goes through the optimized version; you can **set depthwise_conv_off to True in symbols/mobilenet.py** if you want to use the cuDNN version:
   ```
   ...
       conv = mx.sym.Convolution(data=data, num_filter=num_filter, kernel=kernel,
                                 num_group=num_group, stride=stride, pad=pad,
                                 no_bias=True, depthwise_conv_off=True,
                                 name='%s%s_conv2d' % (name, suffix))
   ...
   ```
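   
   To compare the two paths on a single layer outside of full training, a rough micro-benchmark could look like the sketch below. The `depthwise_conv_off` flag is the one added by this PR as shown above; the shapes, layer name, and iteration count are arbitrary, and everything else is standard MXNet symbol/executor API.
   ```
   import time
   import numpy as np
   import mxnet as mx
   
   def time_depthwise(conv_off, channels=256, shape=(32, 256, 56, 56), iters=50):
       data = mx.sym.Variable('data')
       # 3x3 depthwise conv: num_group == num_filter == input channels.
       # depthwise_conv_off=True falls back to the cuDNN path.
       conv = mx.sym.Convolution(data=data, num_filter=channels, kernel=(3, 3),
                                 pad=(1, 1), num_group=channels, no_bias=True,
                                 depthwise_conv_off=conv_off, name='dw')
       exe = conv.simple_bind(ctx=mx.gpu(0), data=shape)
       exe.arg_dict['data'][:] = np.random.uniform(-1, 1, shape).astype('float32')
       exe.arg_dict['dw_weight'][:] = np.random.uniform(-1, 1, (channels, 1, 3, 3)).astype('float32')
       exe.forward(is_train=False)   # warm-up (cuDNN autotune may run here on the fallback path)
       mx.nd.waitall()
       start = time.time()
       for _ in range(iters):
           exe.forward(is_train=False)
       mx.nd.waitall()
       return (time.time() - start) * 1000.0 / iters
   
   print('optimized kernel: %.2f ms/forward' % time_depthwise(False))
   print('cuDNN fallback  : %.2f ms/forward' % time_depthwise(True))
   ```
   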
   **Hardware:**
   `TITAN X (Pascal) + Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz * 16 + 128 GB memory`
   **Software:**
   `CUDA 8.0 + cuDNN 5.1`
   As shown above, we get about a 3-4x speedup over the cuDNN version. For correctness testing, we compared the output of every depthwise layer against the cuDNN convolution version; a minimal sketch of such a per-layer check is shown below.
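   
   The following is only an illustrative sketch of that kind of check, assuming the `depthwise_conv_off` flag from this PR; the shapes, tolerance, and layer name are arbitrary, and the rest is standard MXNet API.
   ```
   import numpy as np
   import mxnet as mx
   
   channels, shape = 64, (8, 64, 28, 28)
   x = np.random.uniform(-1, 1, shape).astype('float32')
   w = np.random.uniform(-1, 1, (channels, 1, 3, 3)).astype('float32')
   
   outs = []
   for conv_off in (False, True):   # optimized kernel vs. cuDNN path
       conv = mx.sym.Convolution(data=mx.sym.Variable('data'), num_filter=channels,
                                 kernel=(3, 3), pad=(1, 1), num_group=channels,
                                 no_bias=True, depthwise_conv_off=conv_off, name='dw')
       exe = conv.simple_bind(ctx=mx.gpu(0), data=shape)
       exe.arg_dict['data'][:] = x        # same input for both paths
       exe.arg_dict['dw_weight'][:] = w   # same weights for both paths
       exe.forward(is_train=False)
       outs.append(exe.outputs[0].asnumpy())
   
   print('max abs diff:', np.abs(outs[0] - outs[1]).max())
   assert np.allclose(outs[0], outs[1], atol=1e-4)
   ```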
   
 