Hi, Chris!
Stas here - I've gathered that performance data.
Sure thing, I may well be wrong, but please elaborate a bit on what we are missing.
Rest assured, there was never any intentional misdirection.
Thanks a lot for being constructive.
> Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp,
> depending which one is linked in).
We never considered turning MKL off. We are on the same page here - MKL is
crucial for performance.
Why should we? There is a GOMP-linked version of MKL that we can use.
What we did was measure whether using the compiler's default OpenMP
implementation, instead of the OpenMP runtime vendored as a source-code
dependency, makes anything slower.
We found the impact to be hardly measurable: the difference between GOMP and
Intel OMP is under 5% on our benchmarks, and most of the time less than that.
We simply suggest simplifying the mxnet build by removing an unnecessary
dependency.
Along the way we discovered, for example, the following remarkable issue:
https://github.com/apache/incubator-mxnet/issues/14087
Best Regards
Stas
On 18.06.19, 18:24, "Chris Olivier" wrote:
I am very reluctant to feed the trolls again, and this will be the last
time I address Pedro or Anton on the subject, but I think the numbers
being presented are incorrect (either because the builders do not really
understand what they are building, or possibly intentional misdirection):
Turning Intel OMP on and off (and MKL as well, since it tends to pull in
omp, depending which one is linked in).
There is a HUGE difference. This is consistent with my experience before
when it was added.
default mnist:
python ../example/image-classification/train_mnist.py
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
gpus=None, image_shape='1, 28, 28', initializer='default',
kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
monitor=0, network='mlp', num_classes=10, num_epochs=20,
num_examples=6, num_layers=None, optimizer='sgd',
profile_server_suffix='', profile_worker_suffix='', save_period=1,
test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
INTEL OMP:
ldd libmxnet.so | grep omp
libomp.so =>
/home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
(0x7f978fde7000)
INFO:root:Epoch[0] Batch [0-100] Speed: 31548.09 samples/sec
accuracy=0.780012
INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec
accuracy=0.920469
INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec
accuracy=0.928281
INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec
accuracy=0.942813
INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec
accuracy=0.938750
INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec
accuracy=0.946562
INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec
accuracy=0.953281
INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec
accuracy=0.951562
INFO:root:Epoch[0] Batch [800-900] Speed: 13959.88 samples/sec
accuracy=0.957500
INFO:root:Epoch[0] Train-accuracy=0.925423
INFO:root:Epoch[0] Time cost=3.806
INFO:root:Epoch[0] Validation-accuracy=0.962580
INFO:root:Epoch[1] Batch [0-100] Speed: 24560.21 samples/sec
accuracy=0.968131
INFO:root:Epoch[1] Batch [100-200] Speed: 23457.03 samples/sec
accuracy=0.966250
LIBGOMP:
ldd libmxnet.so | grep omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x7f25c25dd000)
INFO:root:Epoch[0] Batch [0-100] Speed: 1731.01 samples/sec
accuracy=0.782488
INFO:root:Epoch[0] Batch [100-200] Speed: 3551.32 samples/sec
accuracy=0.907813
INFO:root:Epoch[0] Batch [200-300] Speed: 1991.00 samples/sec
accuracy=0.927188
INFO:root:Epoch[0] Batch [300-400] Speed: 2175.45 samples/sec
accuracy=0.937969
INFO:root:Epoch[0] Batch [400-500] Speed: 1644.95 samples/sec
accuracy=0.942187
INFO:root:Epoch[0] Batch [500-600] Speed: 6444.58 samples/sec
accuracy=0.950156
INFO:root:Epoch[0] Batch [600-700] Speed: 7842.16 samples/sec
accuracy=0.947969
INFO:root:Epoch[0] Batch [700-800] Speed: 9412.07 samples/sec
accuracy=0.953750
INFO:root:Epoch[0] Batch [800-900] Speed: 12707.58 samples/sec
accuracy=0.953125
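(Editorial aside: the epoch-0 throughput figures pasted above can be averaged to
put a number on the gap. The snippet below is a sketch added for illustration;
the values are copied verbatim from the two logs, and a mean of nine noisy batch
speeds is only a rough summary, not a rigorous benchmark.)

```python
# Sketch: compare mean epoch-0 throughput of the two runs quoted above.
from statistics import mean

# "Speed: N samples/sec" values copied from the Intel OMP epoch-0 log.
intel_omp = [31548.09, 16073.21, 19075.91, 23211.36, 22139.79,
             23225.52, 19547.41, 24111.73, 13959.88]

# Same values copied from the libgomp epoch-0 log.
libgomp = [1731.01, 3551.32, 1991.00, 2175.45, 1644.95,
           6444.58, 7842.16, 9412.07, 12707.58]

ratio = mean(intel_omp) / mean(libgomp)
print(f"Intel OMP mean: {mean(intel_omp):.0f} samples/sec")
print(f"libgomp mean:   {mean(libgomp):.0f} samples/sec")
print(f"ratio: {ratio:.1f}x")  # roughly 4x on these particular numbers
```

Note the libgomp run also climbs steadily from ~1.7k to ~12.7k samples/sec
within the epoch, so warm-up effects may be inflating the gap.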
That being said, there are other issues beyond speed. The DEFAULT build from
the makefile (not CMake) uses Intel OMP with MKL (as I showed before), and
mysteriously it has no issues? This seems highly suspicious. All I see is a
lot of hand-waving and conjecture.