Re: OMP

2019-06-19 Thread kellen sunderland
"if you’re linking in two then you’re doing something wrong." Correct,
that's one thing I believe we've got consensus on.  So let's call that out
as a bug to be fixed.

Let's move forward with some reproducible numbers and then discuss the pros
/ cons of which particular OMP implementation we should use.
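As a first step toward those reproducible numbers, here is a minimal sketch of the check each build variant could run (it assumes libmxnet.so sits in the current directory; adjust the path for your build tree):

#!/usr/bin/env bash
# Sketch: list and count the OpenMP runtimes libmxnet.so resolves at load time.
set -euo pipefail
LIB=${1:-./libmxnet.so}

# Show which OpenMP runtimes the dynamic linker resolves.
ldd "$LIB" | grep -E 'lib(gomp|iomp5|omp)\.so' || true

# Flag the bug we agreed on: more than one runtime linked in.
COUNT=$(ldd "$LIB" | grep -cE 'lib(gomp|iomp5|omp)\.so' || true)
if [ "$COUNT" -gt 1 ]; then
  echo "WARNING: $COUNT OpenMP runtimes linked into $LIB" >&2
  exit 1
fi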

On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy 
wrote:

> Hi Chris
>
> I would ask you to have a bit of patience and help us with your
> experience in this matter. Nobody is ignoring anything, I think we are
> individually gathering feedback and trying to understand the multiple
> contributions made to this topic, including yours, then go step by
> step, understand what is going on and run experiments and report back
> to the list or the corresponding github item. It was suggested by
> Kellen to prepare some containers, this takes effort.
>
> Regarding your final comment, most of us also have many other things
> to do and responsibilities even if our daytime jobs might involve
> MXNet in some form or another. I think that's part of the privilege
> and responsibility of working closely with an open source project and
> the magic of collaboration across organizations. Let's all be patient
> and take some time to understand and reason about this topic which is
> not simple. Since we decided to step back and gather more data let's
> take time and do it properly.
>
> Personally I hope to find time to look again into this issue before
> the end of the week.
>
> Thanks.
>
> Pedro.
>
> On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier 
> wrote:
> >
> > if you’re linking in two then you’re doing something wrong. You can see
> by
> > my email yesterday that only one is linked in. This is also the case with
> > the mkl version built by the Makefile — only the Intel OMP library is
> used
> > (no libgomp).
> >
> > That being said, do you have clear evidence that using Intel OMP is both
> > problematic and the situation isn’t fixable?  The burden of proof is on
> the
> > ones requesting the change — it is not my responsibility to justify the
> > current state.  There must be something “terrible” and unfixable to
> justify
> > a change.  I have seen no proof of this in all this time.
> >
> > On a side note, I mentioned a couple of things in my email yesterday that
> > still are not being responded to (they were also ignored in the last
> > incarnation of this “discussion” — I have enough experience in this matter
> to
> > assume “discussion” is a waste of my time, seeing as I am not paid to
> > “work on” mxnet like y’all are).
> >
> > -C
> >
> >
> >
> >
> >
> >
> > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > I've also quite often seen two versions of OpenMP linked.  I think we
> can
> > > all agree we probably want to avoid linking in two libraries that do
> > > effectively the same thing.
> > >
> > > The performance questions should be fairly straightforward to
> demonstrate
> > > right?  Could we just collaborate on a few minimal Dockerfiles that
> show
> > > (or don't show) Intel OpenMP performance speedups with the workloads
> Chris
> > > is referencing?
> > >
> > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > > stanislav.tsuk...@gmail.com> wrote:
> > >
> > > > Hi, Chris!
> > > >
> > > > Stas here - I've gathered that performance data.
> > > > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > > > missing.
> > > > Be assured, intentional misdirection was never the case.
> > > >
> > > > Thanks a lot for being constructive.
> > > >
> > > > > Turning Intel OMP on and off (and MKL as well, since it tends to
> pull
> > > in
> > > > omp, depending which one is linked in).
> > > >
> > > > We never ever considered turning MKL off. We are on the same page
> here -
> > > > MKL is crucial for the performance.
> > > > Why should we? There's a GOMP-linked version of MKL that we can use.
> > > >
> > > > What we did: we measured whether using the compiler's default OpenMP
> > > > implementation, instead of the referenced source-code distribution of
> OpenMP,
> > > > makes anything slower.
> > > > We found the impact to be hardly measurable.
> > > > The difference between GOMP and iOMP is <5% on our benchmarks, most
> of
> > > the
> > > > time less than that.
> > > >
> > > > We simply suggest simplifying the build of mxnet by removing the
> > > > unnecessary dependency.
> > > >
> > > > During that work we discovered, for example, the following amazing issue:
> > > > https://github.com/apache/incubator-mxnet/issues/14087
> > > >
> > > > Best Regards
> > > >
> > > > Stas
> > > >
> > > > On 18.06.19, 18:24, "Chris Olivier"  wrote:
> > > >
> > > > I am very reluctant to feed the trolls again, and this will be
> the
> > > last
> > > > time I address Pedro or Anton on the subject, but since I think
> the
> > > > numbers
> > > > being presented are incorrect (either by the builders not really
> > > > understanding what they are building, or possibly 

Re: OMP

2019-06-19 Thread Pedro Larroy
Hi Chris

I would ask you to have a bit of patience and help us with your
experience in this matter. Nobody is ignoring anything, I think we are
individually gathering feedback and trying to understand the multiple
contributions made to this topic, including yours, then go step by
step, understand what is going on and run experiments and report back
to the list or the corresponding github item. It was suggested by
Kellen to prepare some containers, this takes effort.

Regarding your final comment, most of us also have many other things
to do and responsibilities even if our daytime jobs might involve
MXNet in some form or another. I think that's part of the privilege
and responsibility of working closely with an open source project and
the magic of collaboration across organizations. Let's all be patient
and take some time to understand and reason about this topic which is
not simple. Since we decided to step back and gather more data let's
take time and do it properly.

Personally I hope to find time to look again into this issue before
the end of the week.

Thanks.

Pedro.

On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier  wrote:
>
> if you’re linking in two then you’re doing something wrong. You can see by
> my email yesterday that only one is linked in. This is also the case with
> the mkl version built by the Makefile — only the Intel OMP library is used
> (no libgomp).
>
> That being said, do you have clear evidence that using Intel OMP is both
> problematic and the situation isn’t fixable?  The burden of proof is on the
> ones requesting the change — it is not my responsibility to justify the
> current state.  There must be something “terrible” and unfixable to justify
> a change.  I have seen no proof of this in all this time.
>
> On a side note, I mentioned a couple of things in my email yesterday that
> still are not being responded to (they were also ignored in the last
> incarnation of this “discussion” — I have enough experience in this matter to
> assume “discussion” is a waste of my time, seeing as I am not paid to
> “work on” mxnet like y’all are).
>
> -C
>
>
>
>
>
>
> On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > I've also quite often seen two versions of OpenMP linked.  I think we can
> > all agree we probably want to avoid linking in two libraries that do
> > effectively the same thing.
> >
> > The performance questions should be fairly straightforward to demonstrate
> > right?  Could we just collaborate on a few minimal Dockerfiles that show
> > (or don't show) Intel OpenMP performance speedups with the workloads Chris
> > is referencing?
> >
> > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> > stanislav.tsuk...@gmail.com> wrote:
> >
> > > Hi, Chris!
> > >
> > > Stas here - I've gathered that performance data.
> > > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > > missing.
> > > Be assured, intentional misdirection was never the case.
> > >
> > > Thanks a lot for being constructive.
> > >
> > > > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> > in
> > > omp, depending which one is linked in).
> > >
> > > We never ever considered turning MKL off. We are on the same page here -
> > > MKL is crucial for the performance.
> > > Why should we? There's a GOMP-linked version of MKL that we can use.
> > >
> > > What we did: we measured whether using the compiler's default OpenMP
> > > implementation, instead of the referenced source-code distribution of OpenMP,
> > > makes anything slower.
> > > We found the impact to be hardly measurable.
> > > The difference between GOMP and iOMP is <5% on our benchmarks, most of
> > the
> > > time less than that.
> > >
> > > We simply suggest simplifying the build of mxnet by removing the
> > > unnecessary dependency.
> > >
> > > During that work we discovered, for example, the following amazing issue:
> > > https://github.com/apache/incubator-mxnet/issues/14087
> > >
> > > Best Regards
> > >
> > > Stas
> > >
> > > On 18.06.19, 18:24, "Chris Olivier"  wrote:
> > >
> > > I am very reluctant to feed the trolls again, and this will be the
> > last
> > > time I address Pedro or Anton on the subject, but since I think the
> > > numbers
> > > being presented are incorrect (either by the builders not really
> > > understanding what they are building, or possibly intentional
> > > misdirection):
> > >
> > > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> > > in
> > > omp, depending which one is linked in).
> > > There is a HUGE difference.  This is consistent with my experience
> > > before
> > > when it was added.
> > >
> > >
> > > default mnist:
> > >
> > > python ../example/image-classification/train_mnist.py
> > > INFO:root:start with arguments Namespace(add_stn=False,
> > batch_size=64,
> > > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> > > gpus=None, image_shape='1, 28, 

Re: OMP

2019-06-19 Thread Chris Olivier
if you’re linking in two then you’re doing something wrong. You can see by
my email yesterday that only one is linked in. This is also the case with
the mkl version built by the Makefile — only the Intel OMP library is used
(no libgomp).

That being said, do you have clear evidence that using Intel OMP is both
problematic and the situation isn’t fixable?  The burden of proof is on the
ones requesting the change — it is not my responsibility to justify the
current state.  There must be something “terrible” and unfixable to justify
a change.  I have seen no proof of this in all this time.

On a side note, I mentioned a couple of things in my email yesterday that
still are not being responded to (they were also ignored in the last
incarnation of this “discussion” — I have enough experience in this matter to
assume “discussion” is a waste of my time, seeing as I am not paid to
“work on” mxnet like y’all are).

-C






On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> I've also quite often seen two versions of OpenMP linked.  I think we can
> all agree we probably want to avoid linking in two libraries that do
> effectively the same thing.
>
> The performance questions should be fairly straightforward to demonstrate
> right?  Could we just collaborate on a few minimal Dockerfiles that show
> (or don't show) Intel OpenMP performance speedups with the workloads Chris
> is referencing?
>
> On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> stanislav.tsuk...@gmail.com> wrote:
>
> > Hi, Chris!
> >
> > Stas here - I've gathered that performance data.
> > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > missing.
> > Be assured, intentional misdirection was never the case.
> >
> > Thanks a lot for being constructive.
> >
> > > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> in
> > omp, depending which one is linked in).
> >
> > We never ever considered turning MKL off. We are on the same page here -
> > MKL is crucial for the performance.
> > Why should we? There's a GOMP-linked version of MKL that we can use.
> >
> > What we did: we measured whether using the compiler's default OpenMP
> > implementation, instead of the referenced source-code distribution of OpenMP,
> > makes anything slower.
> > We found the impact to be hardly measurable.
> > The difference between GOMP and iOMP is <5% on our benchmarks, most of
> the
> > time less than that.
> >
> > We simply suggest simplifying the build of mxnet by removing the
> > unnecessary dependency.
> >
> > During that work we discovered, for example, the following amazing issue:
> > https://github.com/apache/incubator-mxnet/issues/14087
> >
> > Best Regards
> >
> > Stas
> >
> > On 18.06.19, 18:24, "Chris Olivier"  wrote:
> >
> > I am very reluctant to feed the trolls again, and this will be the
> last
> > time I address Pedro or Anton on the subject, but since I think the
> > numbers
> > being presented are incorrect (either by the builders not really
> > understanding what they are building, or possibly intentional
> > misdirection):
> >
> > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> > in
> > omp, depending which one is linked in).
> > There is a HUGE difference.  This is consistent with my experience
> > before
> > when it was added.
> >
> >
> > default mnist:
> >
> > python ../example/image-classification/train_mnist.py
> > INFO:root:start with arguments Namespace(add_stn=False,
> batch_size=64,
> > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> > gpus=None, image_shape='1, 28, 28', initializer='default',
> > kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
> > lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
> > monitor=0, network='mlp', num_classes=10, num_epochs=20,
> > num_examples=6, num_layers=None, optimizer='sgd',
> > profile_server_suffix='', profile_worker_suffix='', save_period=1,
> > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear',
> > wd=0.0001)
> >
> > INTEL OMP:
> >
> > ldd libmxnet.so | grep omp
> > libomp.so =>
> > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > (0x7f978fde7000)
> >
> > :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec
> > accuracy=0.780012
> > INFO:root:Epoch[0] Batch [100-200]  Speed: 16073.21 samples/sec
> > accuracy=0.920469
> > INFO:root:Epoch[0] Batch [200-300]  Speed: 19075.91 samples/sec
> > accuracy=0.928281
> > INFO:root:Epoch[0] Batch [300-400]  Speed: 23211.36 samples/sec
> > accuracy=0.942813
> > INFO:root:Epoch[0] Batch [400-500]  Speed: 22139.79 samples/sec
> > accuracy=0.938750
> > INFO:root:Epoch[0] Batch [500-600]  Speed: 23225.52 samples/sec
> > accuracy=0.946562
> > INFO:root:Epoch[0] Batch [600-700]  Speed: 

Re: OMP

2019-06-19 Thread Pedro Larroy
+1 Would be best to have a controlled environment so we can reason
about how MXNet is being built and what libraries are linked. I'm
happy to help here. I would think docker won't have a big impact on
the measurement or distort the results much.
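A rough sketch of what that controlled run could look like, assuming we settle on two images (the tags, Dockerfiles and the --num-epochs flag below are placeholders, not an agreed setup):

# Build one pinned environment per OpenMP variant and run the same benchmark.
docker build -t mxnet-bench:gomp -f Dockerfile.gomp .
docker build -t mxnet-bench:iomp -f Dockerfile.iomp .

for variant in gomp iomp; do
  docker run --rm "mxnet-bench:$variant" \
    python example/image-classification/train_mnist.py --num-epochs 1
done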


On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland
 wrote:
>
> I've also quite often seen two versions of OpenMP linked.  I think we can
> all agree we probably want to avoid linking in two libraries that do
> effectively the same thing.
>
> The performance questions should be fairly straightforward to demonstrate
> right?  Could we just collaborate on a few minimal Dockerfiles that show
> (or don't show) Intel OpenMP performance speedups with the workloads Chris
> is referencing?
>
> On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
> stanislav.tsuk...@gmail.com> wrote:
>
> > Hi, Chris!
> >
> > Stas here - I've gathered that performance data.
> > Sure thing, I can be wrong, but please elaborate a bit on what we are
> > missing.
> > Be assured, intentional misdirection was never the case.
> >
> > Thanks a lot for being constructive.
> >
> > > Turning Intel OMP on and off (and MKL as well, since it tends to pull in
> > omp, depending which one is linked in).
> >
> > We never ever considered turning MKL off. We are on the same page here -
> > MKL is crucial for the performance.
> > Why should we? There's a GOMP-linked version of MKL that we can use.
> >
> > What we did: we measured whether using the compiler's default OpenMP
> > implementation, instead of the referenced source-code distribution of OpenMP,
> > makes anything slower.
> > We found the impact to be hardly measurable.
> > The difference between GOMP and iOMP is <5% on our benchmarks, most of the
> > time less than that.
> >
> > We simply suggest simplifying the build of mxnet by removing the
> > unnecessary dependency.
> >
> > During that work we discovered, for example, the following amazing issue:
> > https://github.com/apache/incubator-mxnet/issues/14087
> >
> > Best Regards
> >
> > Stas
> >
> > On 18.06.19, 18:24, "Chris Olivier"  wrote:
> >
> > I am very reluctant to feed the trolls again, and this will be the last
> > time I address Pedro or Anton on the subject, but since I think the
> > numbers
> > being presented are incorrect (either by the builders not really
> > understanding what they are building, or possibly intentional
> > misdirection):
> >
> > Turning Intel OMP on and off (and MKL as well, since it tends to pull
> > in
> > omp, depending which one is linked in).
> > There is a HUGE difference.  This is consistent with my experience
> > before
> > when it was added.
> >
> >
> > default mnist:
> >
> > python ../example/image-classification/train_mnist.py
> > INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
> > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> > gpus=None, image_shape='1, 28, 28', initializer='default',
> > kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
> > lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
> > monitor=0, network='mlp', num_classes=10, num_epochs=20,
> > num_examples=6, num_layers=None, optimizer='sgd',
> > profile_server_suffix='', profile_worker_suffix='', save_period=1,
> > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear',
> > wd=0.0001)
> >
> > INTEL OMP:
> >
> > ldd libmxnet.so | grep omp
> > libomp.so =>
> > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> > (0x7f978fde7000)
> >
> > :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec
> > accuracy=0.780012
> > INFO:root:Epoch[0] Batch [100-200]  Speed: 16073.21 samples/sec
> > accuracy=0.920469
> > INFO:root:Epoch[0] Batch [200-300]  Speed: 19075.91 samples/sec
> > accuracy=0.928281
> > INFO:root:Epoch[0] Batch [300-400]  Speed: 23211.36 samples/sec
> > accuracy=0.942813
> > INFO:root:Epoch[0] Batch [400-500]  Speed: 22139.79 samples/sec
> > accuracy=0.938750
> > INFO:root:Epoch[0] Batch [500-600]  Speed: 23225.52 samples/sec
> > accuracy=0.946562
> > INFO:root:Epoch[0] Batch [600-700]  Speed: 19547.41 samples/sec
> > accuracy=0.953281
> > INFO:root:Epoch[0] Batch [700-800]  Speed: 24111.73 samples/sec
> > accuracy=0.951562
> > INFO:root:Epoch[0] Batch [800-900]  Speed: 13959.88 samples/sec
> > accuracy=0.957500
> > INFO:root:Epoch[0] Train-accuracy=0.925423
> > INFO:root:Epoch[0] Time cost=3.806
> > INFO:root:Epoch[0] Validation-accuracy=0.962580
> > INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec
> > accuracy=0.968131
> > INFO:root:Epoch[1] Batch [100-200]  Speed: 23457.03 samples/sec
> > accuracy=0.966250
> >
> >
> > LIBGOMP:
> >
> > ldd libmxnet.so | grep omp
> > libgomp.so.1 => 

Re: OMP

2019-06-19 Thread kellen sunderland
I've also quite often seen two versions of OpenMP linked.  I think we can
all agree we probably want to avoid linking in two libraries that do
effectively the same thing.

The performance questions should be fairly straightforward to demonstrate
right?  Could we just collaborate on a few minimal Dockerfiles that show
(or don't show) Intel OpenMP performance speedups with the workloads Chris
is referencing?
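Something like the following could seed those Dockerfiles. It is only a sketch of the comparison I have in mind; the clone URL is the real repo, but the build flags and output paths are assumptions to be checked against the current CMakeLists:

# Variant A: current master (bundled 3rdparty/openmp), CPU-only for simplicity.
git clone --recursive https://github.com/apache/incubator-mxnet mxnet
cd mxnet && mkdir -p build && cd build
cmake -DUSE_CUDA=OFF .. && make -j"$(nproc)"

# Confirm which OpenMP runtime actually got linked before benchmarking
# (the library location may differ in your build tree).
ldd libmxnet.so | grep -E 'lib(gomp|iomp5|omp)\.so'

# Variant B would be the same steps on the branch that drops the bundled
# OpenMP, followed by an identical train_mnist.py run in both containers.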

On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav <
stanislav.tsuk...@gmail.com> wrote:

> Hi, Chris!
>
> Stas here - I've gathered that performance data.
> Sure thing, I can be wrong, but please elaborate a bit on what we are
> missing.
> Be assured, intentional misdirection was never the case.
>
> Thanks a lot for being constructive.
>
> > Turning Intel OMP on and off (and MKL as well, since it tends to pull in
> omp, depending which one is linked in).
>
> We never ever considered turning MKL off. We are on the same page here -
> MKL is crucial for the performance.
> Why should we? There's a GOMP-linked version of MKL that we can use.
>
> What we did: we measured whether using the compiler's default OpenMP
> implementation, instead of the referenced source-code distribution of OpenMP,
> makes anything slower.
> We found the impact to be hardly measurable.
> The difference between GOMP and iOMP is <5% on our benchmarks, most of the
> time less than that.
>
> We simply suggest simplifying the build of mxnet by removing the
> unnecessary dependency.
>
> During that work we discovered, for example, the following amazing issue:
> https://github.com/apache/incubator-mxnet/issues/14087
>
> Best Regards
>
> Stas
>
> On 18.06.19, 18:24, "Chris Olivier"  wrote:
>
> I am very reluctant to feed the trolls again, and this will be the last
> time I address Pedro or Anton on the subject, but since I think the
> numbers
> being presented are incorrect (either by the builders not really
> understanding what they are building, or possibly intentional
> misdirection):
>
> Turning Intel OMP on and off (and MKL as well, since it tends to pull
> in
> omp, depending which one is linked in).
> There is a HUGE difference.  This is consistent with my experience
> before
> when it was added.
>
>
> default mnist:
>
> python ../example/image-classification/train_mnist.py
> INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
> disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
> gpus=None, image_shape='1, 28, 28', initializer='default',
> kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
> lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
> monitor=0, network='mlp', num_classes=10, num_epochs=20,
> num_examples=6, num_layers=None, optimizer='sgd',
> profile_server_suffix='', profile_worker_suffix='', save_period=1,
> test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear',
> wd=0.0001)
>
> INTEL OMP:
>
> ldd libmxnet.so | grep omp
> libomp.so =>
> /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
> (0x7f978fde7000)
>
> :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec
> accuracy=0.780012
> INFO:root:Epoch[0] Batch [100-200]  Speed: 16073.21 samples/sec
> accuracy=0.920469
> INFO:root:Epoch[0] Batch [200-300]  Speed: 19075.91 samples/sec
> accuracy=0.928281
> INFO:root:Epoch[0] Batch [300-400]  Speed: 23211.36 samples/sec
> accuracy=0.942813
> INFO:root:Epoch[0] Batch [400-500]  Speed: 22139.79 samples/sec
> accuracy=0.938750
> INFO:root:Epoch[0] Batch [500-600]  Speed: 23225.52 samples/sec
> accuracy=0.946562
> INFO:root:Epoch[0] Batch [600-700]  Speed: 19547.41 samples/sec
> accuracy=0.953281
> INFO:root:Epoch[0] Batch [700-800]  Speed: 24111.73 samples/sec
> accuracy=0.951562
> INFO:root:Epoch[0] Batch [800-900]  Speed: 13959.88 samples/sec
> accuracy=0.957500
> INFO:root:Epoch[0] Train-accuracy=0.925423
> INFO:root:Epoch[0] Time cost=3.806
> INFO:root:Epoch[0] Validation-accuracy=0.962580
> INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec
> accuracy=0.968131
> INFO:root:Epoch[1] Batch [100-200]  Speed: 23457.03 samples/sec
> accuracy=0.966250
>
>
> LIBGOMP:
>
> ldd libmxnet.so | grep omp
> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> (0x7f25c25dd000)
>
> INFO:root:Epoch[0] Batch [0-100]Speed: 1731.01 samples/sec
>  accuracy=0.782488
> INFO:root:Epoch[0] Batch [100-200]  Speed: 3551.32 samples/sec
>  accuracy=0.907813
> INFO:root:Epoch[0] Batch [200-300]  Speed: 1991.00 samples/sec
>  accuracy=0.927188
> INFO:root:Epoch[0] Batch [300-400]  Speed: 2175.45 samples/sec
>  accuracy=0.937969
> INFO:root:Epoch[0] Batch [400-500]  Speed: 1644.95 samples/sec
>  

Re: [VOTE] Remove conflicting OpenMP from CMake builds

2019-06-19 Thread Pedro Larroy
Let's cancel this vote for now and move the OMP conversation to a
discuss thread as suggested here in order not to antagonize anyone.
Let's find the best solution; this is not about one-upping anyone but
about gathering facts and documenting what's going on.

Chris, would you be so kind as to reopen the PR so we can keep gathering
facts to understand and reproduce your results so we can have a
constructive discussion?  I think this would be a compromise to move
forward in a non-confrontational way.

Thank you.

Pedro.

On Tue, Jun 18, 2019 at 1:35 PM Pedro Larroy
 wrote:
>
> Good suggestion. I thought the issue had been discussed at length, but I
> agree it might be better to bring such topics to the list as a DISCUSS
> thread first instead of a direct vote. Will do so in the future.
>
> Thanks.
>
>
> On Tue, Jun 18, 2019 at 12:56 PM Tianqi Chen  wrote:
> >
> > In a situation like this, I would suggest a DISCUSS thread is more appropriate
> > before the voting one,
> > as the tension can generally be high in a voting thread and it is hard to
> > make a technical decision without prior discussion and the pros/cons being
> > listed.
> >
> > Tianqi
> >
> > On Fri, Jun 14, 2019 at 4:59 PM Pedro Larroy 
> > wrote:
> >
> > > Hi all
> > >
> > > This is a 5-day vote to act on and wrap up an outstanding PR that removes
> > > linkage against the OpenMP bundled in 3rdparty and uses the system-
> > > provided one, which might resolve a number of difficult-to-debug issues
> > > and possible undefined behaviour.
> > >
> > > https://github.com/apache/incubator-mxnet/pull/12160
> > >
> > > See the comments in the thread for more details, but in short, linking
> > > with multiple OpenMP versions seems to lead to undefined behaviour,
> > > plus it would simplify things by not having to deal with a custom OpenMP
> > > version and relying on the platform-provided one instead.
> > >
> > > This is expected to simplify builds and address a number of problems.
> > > It doesn't seem to cause any performance degradation (the Gluon tests
> > > run almost 4x faster on my 64-core machine).
> > >
> > > There has been an in-depth study of the performance implications by
> > > contributors like Stanislav Tsukrov and Anton Chernov.  All the
> > > concerns and comments from the reviewers have been addressed, and we
> > > can't keep asking open-ended questions to block PRs. Reviewers are
> > > expected to be proactive and responsive to contributors so we keep
> > > encouraging active contributors.
> > >
> > > please vote to merge this PR accordingly:
> > >
> > > +1 = approve
> > > +0 = no opinion
> > > -1 = disapprove (provide reason)
> > >
> > > If we observe regressions reported by any internal performance systems
> > > or by contributors, the PR can be reverted easily. So it's not a one-
> > > way door. But it will be useful to try this in master for a while.
> > >
> > > Thank you.
> > >
> > > Pedro.
> > >


Re: CUDA / CUDNN support revisited

2019-06-19 Thread kellen sunderland
Just double-checked: CUDA 9, 10 and 10.1 all support SM3, so actually I
don't believe there's any need to drop SMs.

On Wed, Jun 19, 2019 at 9:56 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> I think where we're all going to have agreement is that we shouldn't have
> code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
> than 6.  We can go ahead and remove any code that targets those old
> versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6.  I'd
> suggest we also add some logging for users with prior versions, letting them
> know they can still use MXNet 1.4.
>
> Where things get interesting is CUDA 9 / cuDNN 6 support.  I was
> originally a proponent of the N and N-1 route for simplicity.  Looking back
> at the choice, one complication I see is that there are competing concerns
> between semver library compatibility and feature releases on NVIDIA's
> part.  NVIDIA is releasing new libraries with a lot of new features on a
> regular basis, which is good, but for compatibility reasons they've begun
> to bump major versions less often, which is probably also good.  For
> example if memory serves correctly cuDNN used to get an MV bump every 6
> months or so, but now the N-1 MV (6) was released in April of 2017.  As a
> project maintainer I would certainly like to drop support for library
> versions that are 2 years old in my latest release.  Supporting a 2-year-
> wide range of dependency libraries in CI, for example, is going to be a
> burden.
>
> From the MXNet users' perspective obviously having to update dependencies
> is a pain, but updating these libs is likely to give significant
> performance increases (occasional perf regressions aside).  I think a
> consistent thread I've heard from users is that training takes too long,
> inference costs too much, and they want their DL framework to abstract the
> complexity of using custom hardware like TCs or AVX without them having to put
> in a lot of effort.  Another consideration is that using old versions of
> MXNet is actually quite easy and convenient thanks to (IMO) some solid
> release practices and naming conventions.
>
> Given how easy it is to use old MXNet versions I think it's reasonable to
> target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
> versions).
>
> On Wed, Jun 19, 2019 at 4:01 AM Marco de Abreu 
> wrote:
>
>> Good points, Anirudh. Generally I would understand N as being the major
>> versions. That is, we would maintain CUDA 9 and 10.1 in your given example
>> and
>> drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would only
>> be
>> dropped when 11 is released and tested.
>>
>> At the same time, we would always only support the latest compatible
>> cudnn version. Or is there any reason somebody would use an old cudnn
>> version?
>>
>> Wdyt?
>>
>> -Marco
>>
>> Anirudh Subramanian  schrieb am Mi., 19. Juni
>> 2019,
>> 01:47:
>>
>> > +1, Agree this should be done for both CUDA and CUDNN versions. At max
>> CUDA
>> > Version N and CUDA Version N - 1 should be supported in CI.
>> >
>> > My question is what happens when we are at a position where we are on
>> a
>> > CUDA version N and removed support for CUDA version N - 1. Within a
>> short
>> > time Nvidia comes up with a CUDA patch version N + 1, where some
>> perf
>> > regressions and some bugs have been fixed. Should we just move to N + 1,
>> > since version N will have all these issues for users and may also slow
>> us
>> > down on CI.
>> >
>> > I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
>> > causing intermittent CI failures:
>> > https://github.com/apache/incubator-mxnet/issues/15273 . There is
>> already
>> > a
>> > PR to bump up Nvidia version to 10.1 (
>> > https://github.com/apache/incubator-mxnet/pull/14986/files).
>> >
>> > I think for situations where there is a quick follow up release like
>> 10.1
>> > and MXNet users are impacted by certain issues, we should just bump up
>> the
>> > version and stop support for 10.0.
>> > Would like to hear more from Nvidia folks (on this particular case of
>> CUDA
>> > 10.0 vs CUDA 10.1 and what are the recommendations for existing
>> customers).
>> >
>> > Anirudh
>> >
>> > On Mon, Jun 3, 2019 at 4:21 PM Dick Carter 
>> wrote:
>> >
>> > > Actually, I tried to say that support *doesn't necessarily* include
>> N-1.
>> > > I'm proposing that the supported versions are 1) covered by CI and 2)
>> > have
>> > > been available in a usable form long enough that a semi-motivated user
>> > has
>> > > been able to transition to it.  That might mean only N (e.g. per my
>> > > proposal, only cuDNN v7).
>> > >
>> > > Regarding precedent for N / N-1,  when a new CUDA version comes out,
>> > users
>> > > will transition to it at their own pace, thereby creating an N / N-1
>> > support
>> > > situation for some period.
>> > >
>> > >
>> > > On 2019/06/03 22:43:20, Pedro Larroy 
>> > > wrote:
>> > > > Your proposal of having support for N 

Re: CUDA / CUDNN support revisited

2019-06-19 Thread kellen sunderland
I think where we're all going to have agreement is that we shouldn't have
code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
than 6.  We can go ahead and remove any code that targets those old
versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6.  I'd
suggest we also add some logging for users with prior versions, letting them
know they can still use MXNet 1.4.

Where things get interesting is CUDA 9 / cuDNN 6 support.  I was originally
a proponent of the N and N-1 route for simplicity.  Looking back at the
choice, one complication I see is that there are competing concerns between
semver library compatibility and feature releases on NVIDIA's part.  NVIDIA
is releasing new libraries with a lot of new features on a regular basis,
which is good, but for compatibility reasons they've begun to bump major
versions less often, which is probably also good.  For example if
memory serves correctly cuDNN used to get an MV bump every 6 months or so,
but now the N-1 MV (6) was released in April of 2017.  As a project
maintainer I would certainly like to drop support for library versions that
are 2 years old in my latest release.  Supporting a 2-year-wide range of
dependency libraries in CI, for example, is going to be a burden.

From the MXNet users' perspective obviously having to update dependencies
is a pain, but updating these libs is likely to give significant
performance increases (occasional perf regressions aside).  I think a
consistent thread I've heard from users is that training takes too long,
inference costs too much, and they want their DL framework to abstract the
complexity of using custom hardware like TCs or AVX without them having to put
in a lot of effort.  Another consideration is that using old versions of
MXNet is actually quite easy and convenient thanks to (IMO) some solid
release practices and naming conventions.

Given how easy it is to use old MXNet versions I think it's reasonable to
target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
versions).
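For anyone checking what versions they are actually on, a quick sketch (the header and toolkit paths vary per install, so treat them as assumptions):

# CUDA toolkit version as reported by the compiler.
nvcc --version | grep release

# cuDNN version macros (some installs keep them in cudnn_version.h instead).
grep -E '#define CUDNN_(MAJOR|MINOR|PATCHLEVEL) ' /usr/include/cudnn.h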

On Wed, Jun 19, 2019 at 4:01 AM Marco de Abreu 
wrote:

> Good points, Anirudh. Generally I would understand N as being the major
> versions. That is, we would maintain CUDA 9 and 10.1 in your given example and
> drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would only be
> dropped when 11 is released and tested.
>
> At the same time, we would always only support the latest compatible
> cudnn version. Or is there any reason somebody would use an old cudnn
> version?
>
> Wdyt?
>
> -Marco
>
> Anirudh Subramanian  schrieb am Mi., 19. Juni 2019,
> 01:47:
>
> > +1, Agree this should be done for both CUDA and CUDNN versions. At max
> CUDA
> > Version N and CUDA Version N - 1 should be supported in CI.
> >
> > My question is what happens when we are at a position where we are on a
> > CUDA version N and removed support for CUDA version N - 1. Within a short
> > time Nvidia comes up with a CUDA patch version N + 1, where some
> perf
> > regressions and some bugs have been fixed. Should we just move to N + 1,
> > since version N will have all these issues for users and may also slow us
> > down on CI.
> >
> > I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
> > causing intermittent CI failures:
> > https://github.com/apache/incubator-mxnet/issues/15273 . There is
> already
> > a
> > PR to bump up Nvidia version to 10.1 (
> > https://github.com/apache/incubator-mxnet/pull/14986/files).
> >
> > I think for situations where there is a quick follow up release like 10.1
> > and MXNet users are impacted by certain issues, we should just bump up
> the
> > version and stop support for 10.0.
> > Would like to hear more from Nvidia folks (on this particular case of
> CUDA
> > 10.0 vs CUDA 10.1 and what are the recommendations for existing
> customers).
> >
> > Anirudh
> >
> > On Mon, Jun 3, 2019 at 4:21 PM Dick Carter  wrote:
> >
> > > Actually, I tried to say that support *doesn't necessarily* include
> N-1.
> > > I'm proposing that the supported versions are 1) covered by CI and 2)
> > have
> > > been available in a usable form long enough that a semi-motivated user
> > has
> > > been able to transition to it.  That might mean only N (e.g. per my
> > > proposal, only cuDNN v7).
> > >
> > > Regarding precedent for N / N-1,  when a new CUDA version comes out,
> > users
> > > will transition to it at their own pace, thereby creating an N / N-1
> > support
> > > situation for some period.
> > >
> > >
> > > On 2019/06/03 22:43:20, Pedro Larroy 
> > > wrote:
> > > > Your proposal of having support for N and N-1 makes a lot of sense to
> > > > me. Are there use cases for supporting older CUDA versions?
> > > >
> > > >
> > > > Thanks.
> > > >
> > > > On Mon, Jun 3, 2019 at 3:06 PM Dick Carter 
> > wrote:
> > > > >
> > > > > I'd like to revisit the discussion of:
> > >
> >
> 

Re: OMP

2019-06-19 Thread Tsukrov, Stanislav
Hi, Chris!

Stas here - I've gathered that performance data.
Sure thing, I can be wrong, but please elaborate a bit on what we are missing.
Be assured, intentional misdirection was never the case.

Thanks a lot for being constructive. 

> Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp, 
> depending which one is linked in).

We never ever considered turning MKL off. We are on the same page here - MKL is 
crucial for the performance. 
Why should we? There's a GOMP-linked version of MKL that we can use.
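
For reference, the threading layer is selected purely at link time. A sketch of the two link lines, with library names as given by Intel's MKL link-line advisor and MKLROOT assumed to point at an MKL install:

# GNU OpenMP threading layer (no Intel OMP runtime involved):
g++ example.o -L"${MKLROOT}/lib/intel64" \
    -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

# Intel OpenMP threading layer:
g++ example.o -L"${MKLROOT}/lib/intel64" \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl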

What we did: we measured whether using the compiler's default OpenMP implementation
instead of the referenced source-code distribution of OpenMP makes anything slower.
We found the impact to be hardly measurable.
The difference between GOMP and iOMP is <5% on our benchmarks, most of the time 
less than that. 

We simply suggest simplifying the build of mxnet by removing the unnecessary
dependency.

During that work we discovered, for example, the following amazing issue:
https://github.com/apache/incubator-mxnet/issues/14087

Best Regards

Stas

On 18.06.19, 18:24, "Chris Olivier"  wrote:

I am very reluctant to feed the trolls again, and this will be the last
time I address Pedro or Anton on the subject, but since I think the numbers
being presented are incorrect (either by the builders not really
understanding what they are building, or possibly intentional misdirection):

Turning Intel OMP on and off (and MKL as well, since it tends to pull in
omp, depending which one is linked in).
There is a HUGE difference.  This is consistent with my experience before
when it was added.


default mnist:

python ../example/image-classification/train_mnist.py
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64,
disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none',
gpus=None, image_shape='1, 28, 28', initializer='default',
kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1,
lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9,
monitor=0, network='mlp', num_classes=10, num_epochs=20,
num_examples=6, num_layers=None, optimizer='sgd',
profile_server_suffix='', profile_worker_suffix='', save_period=1,
test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)

INTEL OMP:

ldd libmxnet.so | grep omp
libomp.so =>
/home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so
(0x7f978fde7000)

:root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec
accuracy=0.780012
INFO:root:Epoch[0] Batch [100-200]  Speed: 16073.21 samples/sec
accuracy=0.920469
INFO:root:Epoch[0] Batch [200-300]  Speed: 19075.91 samples/sec
accuracy=0.928281
INFO:root:Epoch[0] Batch [300-400]  Speed: 23211.36 samples/sec
accuracy=0.942813
INFO:root:Epoch[0] Batch [400-500]  Speed: 22139.79 samples/sec
accuracy=0.938750
INFO:root:Epoch[0] Batch [500-600]  Speed: 23225.52 samples/sec
accuracy=0.946562
INFO:root:Epoch[0] Batch [600-700]  Speed: 19547.41 samples/sec
accuracy=0.953281
INFO:root:Epoch[0] Batch [700-800]  Speed: 24111.73 samples/sec
accuracy=0.951562
INFO:root:Epoch[0] Batch [800-900]  Speed: 13959.88 samples/sec
accuracy=0.957500
INFO:root:Epoch[0] Train-accuracy=0.925423
INFO:root:Epoch[0] Time cost=3.806
INFO:root:Epoch[0] Validation-accuracy=0.962580
INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec
accuracy=0.968131
INFO:root:Epoch[1] Batch [100-200]  Speed: 23457.03 samples/sec
accuracy=0.966250


LIBGOMP:

ldd libmxnet.so | grep omp
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x7f25c25dd000)

INFO:root:Epoch[0] Batch [0-100]Speed: 1731.01 samples/sec
 accuracy=0.782488
INFO:root:Epoch[0] Batch [100-200]  Speed: 3551.32 samples/sec
 accuracy=0.907813
INFO:root:Epoch[0] Batch [200-300]  Speed: 1991.00 samples/sec
 accuracy=0.927188
INFO:root:Epoch[0] Batch [300-400]  Speed: 2175.45 samples/sec
 accuracy=0.937969
INFO:root:Epoch[0] Batch [400-500]  Speed: 1644.95 samples/sec
 accuracy=0.942187
INFO:root:Epoch[0] Batch [500-600]  Speed: 6444.58 samples/sec
 accuracy=0.950156
INFO:root:Epoch[0] Batch [600-700]  Speed: 7842.16 samples/sec
 accuracy=0.947969
INFO:root:Epoch[0] Batch [700-800]  Speed: 9412.07 samples/sec
 accuracy=0.953750
INFO:root:Epoch[0] Batch [800-900]  Speed: 12707.58 samples/sec
accuracy=0.953125

That being said, there are other issues beyond speed.  The DEFAULT build from
makefile (not CMake) uses Intel OMP mkl (I showed before) and mysteriously
it has no issues?  This seems highly suspicious.  All I see is a lot of
hand-waving and conjecture 

Re: CUDA / CUDNN support revisited

2019-06-19 Thread Marco de Abreu
Good points, Anirudh. Generally I would understand N as being the major
versions. That is, we would maintain CUDA 9 and 10.1 in your given example and
drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would only be
dropped when 11 is released and tested.

At the same time, we would always only support the latest compatible
cudnn version. Or is there any reason somebody would use an old cudnn
version?

Wdyt?

-Marco

Anirudh Subramanian  schrieb am Mi., 19. Juni 2019,
01:47:

> +1, Agree this should be done for both CUDA and CUDNN versions. At max CUDA
> Version N and CUDA Version N - 1 should be supported in CI.
>
> My question is what happens when we are at a position where we are on a
> CUDA version N and removed support for CUDA version N - 1. Within a short
> time Nvidia comes up with a CUDA patch version N + 1, where some perf
> regressions and some bugs have been fixed. Should we just move to N + 1,
> since version N will have all these issues for users and may also slow us
> down on CI.
>
> I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
> causing intermittent CI failures:
> https://github.com/apache/incubator-mxnet/issues/15273 . There is already
> a
> PR to bump up Nvidia version to 10.1 (
> https://github.com/apache/incubator-mxnet/pull/14986/files).
>
> I think for situations where there is a quick follow up release like 10.1
> and MXNet users are impacted by certain issues, we should just bump up the
> version and stop support for 10.0.
> Would like to hear more from Nvidia folks (on this particular case of CUDA
> 10.0 vs CUDA 10.1 and what are the recommendations for existing customers).
>
> Anirudh
>
> On Mon, Jun 3, 2019 at 4:21 PM Dick Carter  wrote:
>
> > Actually, I tried to say that support *doesn't necessarily* include N-1.
> > I'm proposing that the supported versions are 1) covered by CI and 2)
> have
> > been available in a usable form long enough that a semi-motivated user
> has
> > been able to transition to it.  That might mean only N (e.g. per my
> > proposal, only cuDNN v7).
> >
> > Regarding precedent for N / N-1,  when a new CUDA version comes out,
> users
> > > will transition to it at their own pace, thereby creating an N / N-1
> support
> > situation for some period.
> >
> >
> > On 2019/06/03 22:43:20, Pedro Larroy 
> > wrote:
> > > Your proposal of having support for N and N-1 makes a lot of sense to
> > > me. Are there use cases for supporting older CUDA versions?
> > >
> > >
> > > Thanks.
> > >
> > > On Mon, Jun 3, 2019 at 3:06 PM Dick Carter 
> wrote:
> > > >
> > > > I'd like to revisit the discussion of:
> >
> https://lists.apache.org/thread.html/27b84e4fc0e0728f2e4ad8b6827d7f996635021a5a4d47b5d3f4dbfb@%3Cdev.mxnet.apache.org%3E
> > now that a year has passed.
> > > >
> > > > My motivation is:
> > > >
> > > > 1.  There's a lot of hard-to-read  '#if CUDNN_MAJOR' code referencing
> > cuDNN versions back as far as v4(!?).  We need to clean this out before
> it
> > hampers our ability to nimbly move the codebase forward.
> > > >
> > > > 2.  There seems to be a difference of opinion on whether we should be
> > supporting version 'N-1' (e.g. cuDNN6).  Our current MXNet 1.5 candidate
> > does not compile against cuDNN v6, so this should either be fixed or be
> > stated up front to the user community.  The breaking PR was
> > https://github.com/apache/incubator-mxnet/pull/14476.
> > > >
> > > > Having read the prior discussion, my take on it is:
> > > >
> > > > - Users should be given an ample time period (1 year?) to move to a
> > new CUDA/cuDNN version once it becomes 'usable.'
> > > >
> > > > - We should not claim to support a given version if it is no longer
> > part of the MXNet CI.  Users should be warned of an impending dropping of
> > this 'testing support.'
> > > >
> > > > So these statements do not necessarily promise 'N-1' support.  I
> could
> > see a transitioning of the CI from CUDA9-only -> CUDA9&10 -> CUDA10 only.
> > Some period before CUDA9 is dropped from CI, the user community is
> warned.
> > After that time, CUDA10 might be the only version tested by CI, and hence
> > the only version supported (until the next CUDA version came around).
> > > >
> > > > Let me propose as a 'strawman' that we claim to support CUDA version
> 9
> > and 10, with cuDNN version 7 only.  Those versions have been out for over
> > 1.5 years.  So no CUDA 8 or cuDNN v6 support- over 1.5 years old with no
> > coverage by our CI.
> > > >
> > > > -Dick
> > >
> >
>