crazy-cat opened a new pull request #7393: add depthwise convolution's gpu version optimization
URL: https://github.com/apache/incubator-mxnet/pull/7393

Since cuDNN is not optimized for depthwise convolution, we optimized the GPU version of depthwise 2D convolution. The effect on training speed is as follows:

**cuDNN version: MobileNet training on ImageNet**
```
cd example/image-classification/; python train_imagenet.py --network mobilenet --gpus=0 --data-train=./train_480_q95.rec --data-nthreads 8
INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, data_nthreads=8, data_train='./train_480_q95.rec', data_val=None, disp_batches=20, dtype='float32', gpus='0', image_shape='3,224,224', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='mobilenet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[10:03:41] src/io/iter_image_recordio_2.cc:135: ImageRecordIOParser2: ./train_480_q95.rec, use 7 threads for decoding..
[10:03:45] src/operator/././cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while...
(setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20] Speed: 133.85 samples/sec accuracy=0.000744
INFO:root:Epoch[0] Batch [40] Speed: 135.98 samples/sec accuracy=0.001953
INFO:root:Epoch[0] Batch [60] Speed: 135.47 samples/sec accuracy=0.000391
INFO:root:Epoch[0] Batch [80] Speed: 132.32 samples/sec accuracy=0.001563
INFO:root:Epoch[0] Batch [100] Speed: 134.01 samples/sec accuracy=0.001953
```

**Our version: MobileNet training on ImageNet**
```
cd example/image-classification/; python train_imagenet.py --network mobilenet --gpus=0 --data-train=./train_480_q95.rec --data-nthreads 8
INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, data_nthreads=8, data_train='./train_480_q95.rec', data_val=None, disp_batches=20, dtype='float32', gpus='0', image_shape='3,224,224', kv_store='device', load_epoch=None, lr=0.1, lr_factor=0.1, lr_step_epochs='30,60', max_random_aspect_ratio=0.25, max_random_h=36, max_random_l=50, max_random_rotate_angle=10, max_random_s=50, max_random_scale=1, max_random_shear_ratio=0.1, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='mobilenet', num_classes=1000, num_epochs=80, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, random_crop=1, random_mirror=1, rgb_mean='123.68,116.779,103.939', test_io=0, top_k=0, wd=0.0001)
[09:59:19] src/io/iter_image_recordio_2.cc:135: ImageRecordIOParser2: ./train_480_q95.rec, use 7 threads for decoding..
[09:59:25] src/operator/././cudnn_algoreg-inl.h:65: Running performance tests to find the best convolution algorithm, this can take a while...
(setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
INFO:root:Epoch[0] Batch [20] Speed: 476.02 samples/sec accuracy=0.000372
INFO:root:Epoch[0] Batch [40] Speed: 489.77 samples/sec accuracy=0.001563
INFO:root:Epoch[0] Batch [60] Speed: 495.26 samples/sec accuracy=0.000781
INFO:root:Epoch[0] Batch [80] Speed: 494.94 samples/sec accuracy=0.001563
INFO:root:Epoch[0] Batch [100] Speed: 494.81 samples/sec accuracy=0.002734
```

By default, depthwise convolution uses the optimized version. If you want to use the cuDNN version instead, **change depthwise_conv_off to True in symbols/mobilenet.py**:
```
...
conv = mx.sym.Convolution(data=data, num_filter=num_filter, kernel=kernel, num_group=num_group, stride=stride, pad=pad, no_bias=True, depthwise_conv_off=True, name='%s%s_conv2d' %(name, suffix))
...
```

**Hardware :**
`TITAN X (Pascal) + Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz * 16 + 128GMem`

**Software :**
`cuda8.0 + cudnn5.1`

As shown above, we get about a 3-4x speedup over the cuDNN version (roughly 490 vs. 134 samples/sec in the logs above).

For testing, we compared the result of every depthwise layer with that of the conv version.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards, Apache Git Services
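
For reference, the speedup quoted in the pull request can be checked directly from the `Speed:` lines of the two training logs. The following is only a minimal sketch: the helper `mean_speed` and the variable names are illustrative and are not part of the PR or of MXNet; the log lines are copied verbatim from the two runs in the message.

```python
import re
from statistics import mean

# "Speed:" lines copied from the cuDNN run in the PR description.
CUDNN_LOG = """\
INFO:root:Epoch[0] Batch [20] Speed: 133.85 samples/sec accuracy=0.000744
INFO:root:Epoch[0] Batch [40] Speed: 135.98 samples/sec accuracy=0.001953
INFO:root:Epoch[0] Batch [60] Speed: 135.47 samples/sec accuracy=0.000391
INFO:root:Epoch[0] Batch [80] Speed: 132.32 samples/sec accuracy=0.001563
INFO:root:Epoch[0] Batch [100] Speed: 134.01 samples/sec accuracy=0.001953
"""

# "Speed:" lines copied from the optimized depthwise run in the PR description.
OURS_LOG = """\
INFO:root:Epoch[0] Batch [20] Speed: 476.02 samples/sec accuracy=0.000372
INFO:root:Epoch[0] Batch [40] Speed: 489.77 samples/sec accuracy=0.001563
INFO:root:Epoch[0] Batch [60] Speed: 495.26 samples/sec accuracy=0.000781
INFO:root:Epoch[0] Batch [80] Speed: 494.94 samples/sec accuracy=0.001563
INFO:root:Epoch[0] Batch [100] Speed: 494.81 samples/sec accuracy=0.002734
"""

# Matches the throughput figure in lines such as "Speed: 133.85 samples/sec".
SPEED_RE = re.compile(r"Speed:\s*([0-9.]+)\s*samples/sec")


def mean_speed(log_text):
    """Return the mean of all 'Speed: N samples/sec' values found in a log."""
    speeds = [float(value) for value in SPEED_RE.findall(log_text)]
    return mean(speeds)


cudnn_speed = mean_speed(CUDNN_LOG)
ours_speed = mean_speed(OURS_LOG)
speedup = ours_speed / cudnn_speed

print("cuDNN version: %.2f samples/sec" % cudnn_speed)
print("our version:   %.2f samples/sec" % ours_speed)
print("speedup:       %.2fx" % speedup)
```

Averaged over the first 100 batches shown in the logs, this gives a ratio of about 3.65, which is consistent with the "3-4 times" figure. It covers only those five reported intervals on the hardware listed in the message, so it should be read as an indication rather than a full benchmark.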