Re: OMP

2019-06-19 Thread Tsukrov, Stanislav
Hi, Chris!

Stas here - I've gathered that performance data.
Sure, I can be wrong, but please elaborate a bit on what we are missing.
Rest assured, intentional misdirection was never the case.

Thanks a lot for being constructive. 

> Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp, 
> depending which one is linked in).

We never considered turning MKL off. We are on the same page here: MKL is 
crucial for performance. 
Why would we? There is a GOMP-linked version of MKL that we can use.
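
As an aside, if MXNet is linked against MKL's single dynamic library
(libmkl_rt), the threading backend can even be selected at run time. A
sketch; this applies only when mkl_rt is the linked interface, and the
smoke test is illustrative:

export MKL_THREADING_LAYER=GNU   # MKL threads via GOMP instead of Intel OMP
python -c 'import mxnet'         # verify the library still loads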

What we did was measure whether using the compiler's default OpenMP 
implementation, instead of the bundled source distribution of OpenMP, makes 
anything slower.
We found the impact to be hardly measurable: the difference between GOMP 
and iOMP is <5% on our benchmarks, and most of the time less than that. 
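
For anyone who wants to reproduce such an A/B comparison without
rebuilding, here is one sketch (the libomp path is illustrative): the
LLVM/Intel runtime exports the GOMP_* compatibility symbols, so it can be
preloaded over a libgomp-linked build; the reverse direction is not
possible.

# baseline: OpenMP resolved from the runtime libmxnet.so was linked with
python ../example/image-classification/train_mnist.py
# same build, but OpenMP symbols resolved from libomp via preload
LD_PRELOAD=/usr/lib/llvm-8/lib/libomp.so \
python ../example/image-classification/train_mnist.py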

We simply suggest simplifying the build of MXNet by removing the 
unnecessary dependency.

In the course of that work we discovered, for example, the following 
amazing issue:
https://github.com/apache/incubator-mxnet/issues/14087

Best Regards

Stas

On 18.06.19, 18:24, "Chris Olivier"  wrote:

I am very reluctant to feed the trolls again, and this will be the last
time I address Pedro or Anton on the subject, but since I think the numbers
being presented are incorrect (either by the builders not really
understanding what they are building, or possibly intentional misdirection):

Turning Intel OMP on and off (and MKL as well, since it tends to pull in
omp, depending on which one is linked in) makes a HUGE difference.  This is
consistent with my experience before, when it was added.


default mnist:

python ../example/image-classification/train_mnist.py
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
gpus=None, image_shape='1, 28, 28', initializer='default',
kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
monitor=0, network='mlp', num_classes=10, num_epochs=20,
num_examples=6, num_layers=None, optimizer='sgd',
profile_server_suffix='', profile_worker_suffix='', save_period=1,
test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)

INTEL OMP:

ldd libmxnet.so | grep omp
libomp.so =>
/home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
(0x7f978fde7000)

INFO:root:Epoch[0] Batch [0-100]  Speed: 31548.09 samples/sec
accuracy=0.780012
INFO:root:Epoch[0] Batch [100-200]  Speed: 16073.21 samples/sec
accuracy=0.920469
INFO:root:Epoch[0] Batch [200-300]  Speed: 19075.91 samples/sec
accuracy=0.928281
INFO:root:Epoch[0] Batch [300-400]  Speed: 23211.36 samples/sec
accuracy=0.942813
INFO:root:Epoch[0] Batch [400-500]  Speed: 22139.79 samples/sec
accuracy=0.938750
INFO:root:Epoch[0] Batch [500-600]  Speed: 23225.52 samples/sec
accuracy=0.946562
INFO:root:Epoch[0] Batch [600-700]  Speed: 19547.41 samples/sec
accuracy=0.953281
INFO:root:Epoch[0] Batch [700-800]  Speed: 24111.73 samples/sec
accuracy=0.951562
INFO:root:Epoch[0] Batch [800-900]  Speed: 13959.88 samples/sec
accuracy=0.957500
INFO:root:Epoch[0] Train-accuracy=0.925423
INFO:root:Epoch[0] Time cost=3.806
INFO:root:Epoch[0] Validation-accuracy=0.962580
INFO:root:Epoch[1] Batch [0-100]  Speed: 24560.21 samples/sec
accuracy=0.968131
INFO:root:Epoch[1] Batch [100-200]  Speed: 23457.03 samples/sec
accuracy=0.966250


LIBGOMP:

ldd libmxnet.so | grep omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x7f25c25dd000)

INFO:root:Epoch[0] Batch [0-100]  Speed: 1731.01 samples/sec
 accuracy=0.782488
INFO:root:Epoch[0] Batch [100-200]  Speed: 3551.32 samples/sec
 accuracy=0.907813
INFO:root:Epoch[0] Batch [200-300]  Speed: 1991.00 samples/sec
 accuracy=0.927188
INFO:root:Epoch[0] Batch [300-400]  Speed: 2175.45 samples/sec
 accuracy=0.937969
INFO:root:Epoch[0] Batch [400-500]  Speed: 1644.95 samples/sec
 accuracy=0.942187
INFO:root:Epoch[0] Batch [500-600]  Speed: 6444.58 samples/sec
 accuracy=0.950156
INFO:root:Epoch[0] Batch [600-700]  Speed: 7842.16 samples/sec
 accuracy=0.947969
INFO:root:Epoch[0] Batch [700-800]  Speed: 9412.07 samples/sec
 accuracy=0.953750
INFO:root:Epoch[0] Batch [800-900]  Speed: 12707.58 samples/sec
accuracy=0.953125

That being said, there are other issues beyond speed.  The DEFAULT build
from the makefile (not CMake) uses Intel OMP MKL (I showed this before) and
mysteriously it has no issues?  This seems highly suspicious.  All I see is
a lot of hand-waving and conjecture.

Re: Embedded World 2019 Robotics Demo

2019-02-28 Thread Tsukrov, Stanislav
Guys, that's amazing!

Stas

On 28.02.19, 09:19, "Isabel Drost-Fromm"  wrote:

Hi,

First of all, thank you for the summary; that sounds awesome. It would be 
great to hear more stories like this shared here, at least the ones that 
can be shared.


On 28 February 2019 08:03:38 CET, Thomas DELTEIL wrote:
>To answer your question Isabel, this project was a joint cooperation
>between a few MXNet team members at AWS, including Anton, Pavel and
>myself, and some employees at the Qt (the C++ library) company, in
>their industrial automation department.

Thanks for the information; just the names of those individuals who deserve 
credit would have been more than enough. Any chance of drawing any of those 
who preferred to remain unnamed here into the project*? 

Isabel


* I guess I'm old school, but in my experience giving credit publicly is a 
great way to increase contributor motivation. Sorry for sounding like your 
notorious volunteer / casual contributor recruiter ;)
-- 
This message was sent from my Android device with K-9 Mail.





Re: Benchmarking MXNet with different compilers and different OpenMP implementations (results)

2019-02-14 Thread Tsukrov, Stanislav
Thanks, Aaron, for the feedback.

> As for your next steps, would you propose that cmake be brought up to parity? 
Yes. SSE2 in the CMake build vs SSE3 in the Makefile build is a minor 
example without high impact. There are others.
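
A quick, illustrative way to spot such divergences is to grep both build
systems for the relevant flags (the exact file layout may differ):

grep -rn -- '-msse' Makefile CMakeLists.txt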

> It seems strange that it causes slowness and if so, it shouldn't be 
> recommended for now.
There are some issues in the CMake build files that should be fixed. Some 
of them were worked around for the benchmark.

Best Regards

Stas

On 14.02.19, 14:09, "Anton Chernov"  wrote:

Thank you, Aaron, for your interest in the topic.

My main previous proposal still stands: remove the bundled OpenMP submodule
and use the OpenMP implementation provided by the environment [1]. This
might lead to performance degradation in cases where an old OpenMP library
is used or thread affinity isn't set properly, but that would be a problem
of the environment, not of MXNet.
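
For example, affinity for an environment-provided OpenMP can be set
through the standard OpenMP 4.0 environment variables; a sketch of what
would then be the user's responsibility rather than MXNet's:

export OMP_NUM_THREADS=$(nproc)   # one thread per logical core
export OMP_PROC_BIND=true         # pin threads to their places
export OMP_PLACES=cores           # one place per physical core
python ../example/image-classification/train_mnist.py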

I described some alternative solutions in [1] as part of this [2] thread.
Tricking the linker with symlinks should, in both cases, make it possible
to avoid having multiple OpenMP implementations linked into MXNet
simultaneously. The Windows questions would still remain open.
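
To illustrate the symlink idea (a sketch, not the exact steps from [1];
the libomp path is hypothetical): since the LLVM/Intel runtime also
exports the GOMP_* compatibility ABI, a gcc-built libmxnet.so can be
pointed at a single runtime by shadowing libgomp:

mkdir -p /tmp/omp-shim
ln -s /path/to/libomp.so /tmp/omp-shim/libgomp.so.1
LD_LIBRARY_PATH=/tmp/omp-shim python -c 'import mxnet'   # verify it loads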

Best
Anton

[1] https://github.com/apache/incubator-mxnet/pull/12160
[2]

https://lists.apache.org/thread.html/007d8db15a1782e1b20896a4050b62710d4ff0908c67b94af7cb0f8b@%3Cdev.mxnet.apache.org%3E
[3]

https://lists.apache.org/thread.html/4827f0f742b6e7e070da350ea81226d059401527f3072ce8b33c1fdf@%3Cdev.mxnet.apache.org%3E


On Tue, 12 Feb 2019 at 16:39, Aaron Markham wrote:

> This is really great research. I've often wondered what the difference
> really is, and why it has to be so complicated. It seems the answer is
> there isn't much difference and it shouldn't be as complex.
> As for your next steps, would you propose that cmake be brought up to
> parity? It seems strange that it causes slowness and if so, it shouldn't
> be recommended for now.
> Also, testing for Windows compilers might be quite important, as install
> stats suggest a significant portion of Windows users. Wouldn't this nudge
> the decision of what to use as a rule going forward?
> I ran into this submodule openmp issue on Windows myself. How does that
> get fixed? Do we have to repackage all of the submodules to make sure they
> use the recommended implementation, or do they use what the system expects?
>
> Cheers,
> Aaron
>
> On Tue, Feb 12, 2019, 04:37 Anton Chernov  wrote:
>
> > Dear MXNet community,
> >
> > Due to multiple problems related to OpenMP and the stale proposed change
> > [1], we have been working on gathering performance data on the impact of
> > using different OpenMP implementations with MXNet (great thanks to
> > Stanislav Tsukrov for the hard work). The results can be found here [2].
> >
> > As a short summary of the investigation: the difference between different
> > compilers is insignificant. Native OpenMP implementations (more or less
> > recent) perform equally (<5% difference). See more details in the
> > document.
> >
> > Please review the document and share your thoughts on the topic.
> >
> > Thanks!
> >
> > Best
> > Anton
> >
> > [1]
> >
> >
> 
https://lists.apache.org/thread.html/4827f0f742b6e7e070da350ea81226d059401527f3072ce8b33c1fdf@
> > 
> > [2] https://cwiki.apache.org/confluence/x/2wclBg
> >
>