Re: OMP

2019-06-24 Thread kellen sunderland
I remember at the time we also had a read through of this blog post, but to
us the code looked like it was following the advice:
https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
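
(For context, the pattern that post recommends is roughly the following; this
is a simplified sketch and not MXNet code: every thread that touches the CUDA
runtime sets its device explicitly before doing anything else, rather than
assuming the spawning thread's selection carries over.)

// Sketch: set the CUDA device explicitly in every thread that uses the
// runtime.  Build (example): nvcc -std=c++11 per_thread_device.cu
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>
#include <vector>

void worker(int device_id) {
  // Select the device before any other CUDA runtime call in this thread.
  cudaError_t err = cudaSetDevice(device_id);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaSetDevice(%d): %s\n", device_id,
                 cudaGetErrorString(err));
    return;
  }
  void* buf = nullptr;
  if (cudaMalloc(&buf, 1 << 20) == cudaSuccess) {  // allocates on this device
    cudaFree(buf);
  }
}

int main() {
  int n = 0;
  if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) return 0;
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) threads.emplace_back(worker, i % n);
  for (auto& t : threads) t.join();
  return 0;
}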

On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> I remember this hang as well, it was pretty hard to reproduce IIRC.  I
> believe the stacks for the hang are here:
> https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and
> the trick was we could only debug it up to the point that we hit:
>
> #0  0x7fec6df1ba4f in futex_wait (private=0, expected=1,
> futex_word=0x7fec60843758)
> at ../sysdeps/unix/sysv/linux/futex-internal.h:61
> #1  futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758)
> at ../sysdeps/nptl/futex-internal.h:135
> #2  __pthread_once_slow (once_control=0x7fec60843758,
> init_routine=0x7fec605f38f0)
> at pthread_once.c:105
> ...
> #6  0x7fec6061c577 in cudaSetDevice () from
> /usr/local/cuda/lib64/libcudart.so.9.0
>
> Because the code in libcudart is obviously closed source, we couldn't
> dig into what threading work was going on when we called cudaSetDevice.
>
> On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy 
> wrote:
>
>> If you check initialize.cc, we explicitly disable that behaviour in
>> pthread_at_fork, which seems to cause thread contention during
>> multiprocessing. Why do we need this major advantage for the library
>> if that's the case?
>>
>> Related PRs:
>>
>> https://github.com/apache/incubator-mxnet/pull/10820
>> https://github.com/apache/incubator-mxnet/issues/14396
>>
>> The original code was authored in this PR:
>>
>> https://github.com/apache/incubator-mxnet/pull/8677
>>
>> I actually remember this fix; it was done during a release because the
>> cuda runtime was forking and the engine was being re-entered. If that
>> situation is not happening anymore, it might not be needed any longer.
>> I don't think we ever understood why there was a fork inside cuda, so
>> the code has grown around a fix for an issue whose root cause was not
>> understood, plus the side effects which this fix caused afterwards.
>>
>> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
>> the link above, no libgomp.
>>
>> I didn't try the Make build.
>>
>> I would refactor the code linked above and stop using pthread_at_fork,
>> since OMP assumes it won't be initialized twice, but this needs to be
>> very well tested to make sure it doesn't cause bugs or undo the fixes
>> done in the linked PRs above.
>>
>> Pedro.
>>
>> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier 
>> wrote:
>> >
>> > one major advantage of intel/llvm omp is that it spawns a new thread
>> pool
>> > after fork if a thread pool was already created. this is so that omp
>> can be
>> > used in the forked processes. libgomp doesn’t do this so it’ll just
>> lock up
>> > if you try to do omp in the forked process.
>> >
>> > is your build linking libgomp as well?
>> >
>> > standard mkl build (from Makefile) uses same omp library. are there
>> > problems with that build?
>> >
>> > what changes need to be made to make the assertion not fire?
>> >
>> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> > wrote:
>> >
>> > > There's an assertion which is easily reproducible, and also there's a
>> > > crash with a core dump; the latter is not easy for me to reproduce in
>> > > different environments. I have also seen mxnet getting stuck with this
>> > > build configuration when running unit tests, making no progress and
>> > > using no CPU at all.
>> > >
>> > > In my view, the root cause of the assertion is that we are re-entering
>> > > OMP initialization when spawning threads in the following code through
>> > > pthread_at_fork:
>> > >
>> > > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
>> > >
>> > > This causes double initialization of the OMP engine, which triggers the
>> > > assertion you are asking about, and I suspect some additional overhead.
>> > > That's the shady forking part you are asking about.
>> > >
>> > > A question for you: What is the cause of runtime differences between
>> > > OMP runtimes? Shouldn't the implementation overhead diminish as
>> > > threads run longer?
>> > >
>> > > Pedro.
>> > >
>> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier 
>> > > wrote:
>> > > >
>> > > > What’s the reason for the assertion failure? btw classifying an
>> > > > assertion failure as a “crash” is debatable. As I stated in the
>> > > > original issue a long time ago, it’s possible something shady is
>> > > > being done when forking that should be fixed. The assertion should
>> > > > be root caused.
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
>> > > pedro.larroy.li...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Added a dockerfile, and reports of a crash on my local machine
>> > > > > when running 

Re: OMP

2019-06-24 Thread kellen sunderland
I remember this hang as well, it was pretty hard to reproduce IIRC.  I
believe the stacks for the hang are here:
https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and
the trick was we could only debug it up to the point that we hit:

#0  0x7fec6df1ba4f in futex_wait (private=0, expected=1,
futex_word=0x7fec60843758)
at ../sysdeps/unix/sysv/linux/futex-internal.h:61
#1  futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758)
at ../sysdeps/nptl/futex-internal.h:135
#2  __pthread_once_slow (once_control=0x7fec60843758,
init_routine=0x7fec605f38f0)
at pthread_once.c:105
...
#6  0x7fec6061c577 in cudaSetDevice () from
/usr/local/cuda/lib64/libcudart.so.9.0

Because the code in libcudart is obviously closed source, we couldn't dig
into what threading work was going on when we called cudaSetDevice.
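
For what it's worth, a futex_wait inside __pthread_once_slow usually means the
thread is waiting for initialization state that some other thread was supposed
to complete. The classic way such waits become permanent hangs around fork()
is sketched below; this illustrates the general hazard only (using a plain
mutex rather than pthread_once) and is not what libcudart actually does:

// Contrived sketch: fork() while another thread holds a lock.  The child
// inherits the locked state but not the owning thread, so its first attempt
// to take the lock blocks forever.  Build: g++ -std=c++11 -pthread hang.cc
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void* holder(void*) {
  pthread_mutex_lock(&lock);
  sleep(2);                     // hold the lock across the parent's fork()
  pthread_mutex_unlock(&lock);
  return nullptr;
}

int main() {
  pthread_t t;
  pthread_create(&t, nullptr, holder, nullptr);
  usleep(100 * 1000);           // make sure the helper thread owns the lock
  pid_t pid = fork();
  if (pid == 0) {
    pthread_mutex_lock(&lock);  // child: blocks forever, nobody can unlock it
    std::printf("child: acquired the lock (never printed)\n");
    _exit(0);
  }
  pthread_join(t, nullptr);
  waitpid(pid, nullptr, 0);     // hangs here waiting for the stuck child
  return 0;
}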

On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy 
wrote:

> If you check initialize.cc, we explicitly disable that behaviour in
> pthread_at_fork, which seems to cause thread contention during
> multiprocessing. Why do we need this major advantage for the library
> if that's the case?
>
> Related PRs:
>
> https://github.com/apache/incubator-mxnet/pull/10820
> https://github.com/apache/incubator-mxnet/issues/14396
>
> The original code was authored in this PR:
>
> https://github.com/apache/incubator-mxnet/pull/8677
>
> I actually remember this fix; it was done during a release because the
> cuda runtime was forking and the engine was being re-entered. If that
> situation is not happening anymore, it might not be needed any longer.
> I don't think we ever understood why there was a fork inside cuda, so
> the code has grown around a fix for an issue whose root cause was not
> understood, plus the side effects which this fix caused afterwards.
>
> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
> the link above, no libgomp.
>
> I didn't try the Make build.
>
> I would refactor the code linked above and stop using pthread_at_fork,
> since OMP assumes it won't be initialized twice, but this needs to be
> very well tested to make sure it doesn't cause bugs or undo the fixes
> done in the linked PRs above.
>
> Pedro.
>
> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier 
> wrote:
> >
> > one major advantage of intel/llvm omp is that it spawns a new thread pool
> > after fork if a thread pool was already created. this is so that omp can
> be
> > used in the forked processes. libgomp doesn’t do this so it’ll just lock
> up
> > if you try to do omp in the forked process.
> >
> > is your build linking libgomp as well?
> >
> > standard mkl build (from Makefile) uses same omp library. are there
> > problems with that build?
> >
> > what changes need to be made to make the assertion not fire?
> >
> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > There's an assertion which is easily reproducible, and also there's a
> > > crash with a core dump; the latter is not easy for me to reproduce in
> > > different environments. I have also seen mxnet getting stuck with this
> > > build configuration when running unit tests, making no progress and
> > > using no CPU at all.
> > >
> > > In my view, the root cause of the assertion is that we are re-entering
> > > OMP initialization when spawning threads in the following code through
> > > pthread_at_fork:
> > >
> > > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> > >
> > > This causes double initialization of the OMP engine, which triggers the
> > > assertion you are asking about, and I suspect some additional overhead.
> > > That's the shady forking part you are asking about.
> > >
> > > A question for you: What is the cause of runtime differences between
> > > OMP runtimes? Shouldn't the implementation overhead diminish as
> > > threads run longer?
> > >
> > > Pedro.
> > >
> > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier 
> > > wrote:
> > > >
> > > > What’s the reason for the assertion failure? btw classifying an
> > > > assertion failure as a “crash” is debatable. As I stated in the
> > > > original issue a long time ago, it’s possible something shady is
> > > > being done when forking that should be fixed. The assertion should
> > > > be root caused.
> > > >
> > > >
> > > >
> > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> > > pedro.larroy.li...@gmail.com>
> > > > wrote:
> > > >
> > > > > Added a dockerfile, and reports of a crash on my local machine
> > > > > when running MKL+OMP+DEBUG; with Anton's branch the crash happened
> > > > > as well. I couldn't reproduce the crash on my EC2 machine.
> > > > > Added the backtrace of the crash as well.
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/issues/10856
> > > > >
> > > > > Dockerfile here:
> > > > >
> > > > > https://github.com/larroy/mxnet_omp
> > > > >
> > > > > Kind regards.
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco 

Re: OMP

2019-06-24 Thread Pedro Larroy
If you check initialize.cc, we explicitly disable that behaviour in
pthread_at_fork, which seems to cause thread contention during
multiprocessing. Why do we need this major advantage for the library
if that's the case?
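
(For readers who haven't opened it: the mechanism in question is a
pthread_atfork registration, roughly of the shape sketched below. This is a
simplified illustration of the pattern, not the actual initialize.cc code;
Engine::Stop/Start are placeholders for whatever the real handlers do.)

// Simplified sketch of the pthread_atfork pattern being discussed: handlers
// that quiesce the engine's threads before fork() and restart them afterwards
// in both parent and child.  Stop()/Start() are placeholders, not real APIs.
#include <pthread.h>
#include <unistd.h>

namespace engine {
void Stop()  { /* shut down worker threads / OMP state before the fork */ }
void Start() { /* recreate worker threads after the fork */ }
}  // namespace engine

namespace {
struct AtForkInstaller {
  AtForkInstaller() {
    pthread_atfork(
        []() { engine::Stop(); },    // prepare: runs in the parent before fork
        []() { engine::Start(); },   // parent:  runs in the parent after fork
        []() { engine::Start(); });  // child:   runs in the child after fork
  }
} installer;  // registered once when the library is loaded
}  // namespace

int main() {
  // Any fork() from here on triggers Stop() and then Start() on both sides.
  if (fork() == 0) _exit(0);
  return 0;
}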

Related PRs:

https://github.com/apache/incubator-mxnet/pull/10820
https://github.com/apache/incubator-mxnet/issues/14396

The original code was authored in this PR:

https://github.com/apache/incubator-mxnet/pull/8677

I actually remember this fix; it was done during a release because the
cuda runtime was forking and the engine was being re-entered. If that
situation is not happening anymore, it might not be needed any longer.
I don't think we ever understood why there was a fork inside cuda, so
the code has grown around a fix for an issue whose root cause was not
understood, plus the side effects which this fix caused afterwards.

My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in
the link above, no libgomp.

I didn't try the Make build.

I would refactor the code linked above and stop using pthread_at_fork,
since OMP assumes it won't be initialized twice, but this needs to be
very well tested to make sure it doesn't cause bugs or undo the fixes
done in the linked PRs above.

Pedro.

On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier  wrote:
>
> one major advantage of intel/llvm omp is that it spawns a new thread pool
> after fork if a thread pool was already created. this is so that omp can be
> used in the forked processes. libgomp doesn’t do this so it’ll just lock up
> if you try to do omp in the forked process.
>
> is your build linking libgomp as well?
>
> standard mkl build (from Makefile) uses same omp library. are there
> problems with that build?
>
> what changes need to be made to make the assertion not fire?
>
> On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy 
> wrote:
>
> > There's an assertion which is easily reproducible, and also there's a
> > crash with a core dump; the latter is not easy for me to reproduce in
> > different environments. I have also seen mxnet getting stuck with this
> > build configuration when running unit tests, making no progress and
> > using no CPU at all.
> >
> > In my view, the root cause of the assertion is that we are re-entering
> > OMP initialization when spawning threads in the following code through
> > pthread_at_fork:
> >
> > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
> >
> > This causes double initialization of the OMP engine, which triggers the
> > assertion you are asking about, and I suspect some additional overhead.
> > That's the shady forking part you are asking about.
> >
> > A question for you: What is the cause of runtime differences between
> > OMP runtimes? Shouldn't the implementation overhead diminish as
> > threads run longer?
> >
> > Pedro.
> >
> > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier 
> > wrote:
> > >
> > > What’s the reason for the assertion failure? btw classifying an
> > > assertion failure as a “crash” is debatable. As I stated in the
> > > original issue a long time ago, it’s possible something shady is
> > > being done when forking that should be fixed. The assertion should
> > > be root caused.
> > >
> > >
> > >
> > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> > pedro.larroy.li...@gmail.com>
> > > wrote:
> > >
> > > > Added a dockerfile, and reports of a crash on my local machine when
> > > > running MKL+OMP+DEBUG; with Anton's branch the crash happened as well.
> > > > I couldn't reproduce the crash on my EC2 machine.
> > > > Added the backtrace of the crash as well.
> > > >
> > > > https://github.com/apache/incubator-mxnet/issues/10856
> > > >
> > > > Dockerfile here:
> > > >
> > > > https://github.com/larroy/mxnet_omp
> > > >
> > > > Kind regards.
> > > >
> > > > Pedro.
> > > >
> > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> > marco.g.ab...@gmail.com>
> > > > wrote:
> > > > >
> > > > > As already proposed, I think the easiest way to get a common
> > > > understanding
> > > > > is if we start with a few docker containers. Pedro, would it be
> > possible
> > > > > for you to wrap your benchmarks into a few containers that will
> > produce
> > > > > your shown results? That way, we can avoid possible
> > misunderstandings and
> > > > > also pinpoint the exact parts where people disagree or misunderstood
> > each
> > > > > other.
> > > > >
> > > > > -Marco
> > > > >
> > > > > Pedro Larroy  schrieb am Do., 20. Juni
> > > > 2019,
> > > > > 21:47:
> > > > >
> > > > > > I can confirm that we are linking with two versions of omp. I'm
> > > > > > gaining more clarity on this topic, but I still have questions;
> > > > > > the facts I have so far are the following:
> > > > > >
> > > > > > * #1: We are linking with two versions of omp, intel's omp and llvm
> > > > > > openmp when building with MKL enabled.
> > > > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes
> > with
> > > > > > MKL), LLVM OpenMP (3rdparty/openmp), 

Re: OMP

2019-06-24 Thread Chris Olivier
one major advantage of intel/llvm omp is that it spawns a new thread pool
after fork if a thread pool was already created. this is so that omp can be
used in the forked processes. libgomp doesn’t do this so it’ll just lock up
if you try to do omp in the forked process.
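
A minimal way to see that difference between runtimes is something like the
sketch below (an illustration only; actual behaviour depends on the OpenMP
runtime and version): warm up a thread pool in the parent, fork, then use
omp again in the child.

// Sketch: exercise OpenMP in a forked child after the parent has already
// created a thread pool.  Whether the child completes depends on how the
// linked OpenMP runtime handles fork().  Build: g++ -std=c++11 -fopenmp x.cc
#include <omp.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

static double burn() {
  double acc = 0.0;
  #pragma omp parallel for reduction(+:acc)
  for (int i = 0; i < 1000000; ++i) acc += i * 1e-6;
  return acc;
}

int main() {
  burn();                        // parent: force the OMP thread pool to exist
  pid_t pid = fork();
  if (pid == 0) {
    // May hang on runtimes that do not rebuild their pool after fork.
    std::printf("child: omp result %f\n", burn());
    _exit(0);
  }
  int status = 0;
  waitpid(pid, &status, 0);      // if the child hangs, so does this wait
  std::printf("parent: child exited with status %d\n", status);
  return 0;
}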

is your build linking libgomp as well?

standard mkl build (from Makefile) uses same omp library. are there
problems with that build?

what changes need to be made to make the assertion not fire?

On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy 
wrote:

> There's an assertion which is easily reproducible, and also there's a
> crash with a core dump; the latter is not easy for me to reproduce in
> different environments. I have also seen mxnet getting stuck with this
> build configuration when running unit tests, making no progress and
> using no CPU at all.
>
> In my view, the root cause of the assertion is that we are re-entering
> OMP initialization when spawning threads in the following code through
> pthread_at_fork:
>
> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58
>
> This causes double initialization of the OMP engine, which triggers the
> assertion you are asking about, and I suspect some additional overhead.
> That's the shady forking part you are asking about.
>
> A question for you: What is the cause of runtime differences between
> OMP runtimes? Shouldn't the implementation overhead diminish as
> threads run longer?
>
> Pedro.
>
> On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier 
> wrote:
> >
> > What’s the reason for the assertion failure? btw classifying an
> > assertion failure as a “crash” is debatable. As I stated in the
> > original issue a long time ago, it’s possible something shady is
> > being done when forking that should be fixed. The assertion should
> > be root caused.
> >
> >
> >
> > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com>
> > wrote:
> >
> > > Added a dockerfile, and reports of a crash on my local machine when
> > > running MKL+OMP+DEBUG; with Anton's branch the crash happened as well.
> > > I couldn't reproduce the crash on my EC2 machine.
> > > Added the backtrace of the crash as well.
> > >
> > > https://github.com/apache/incubator-mxnet/issues/10856
> > >
> > > Dockerfile here:
> > >
> > > https://github.com/larroy/mxnet_omp
> > >
> > > Kind regards.
> > >
> > > Pedro.
> > >
> > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu <
> marco.g.ab...@gmail.com>
> > > wrote:
> > > >
> > > > As already proposed, I think the easiest way to get a common
> > > understanding
> > > > is if we start with a few docker containers. Pedro, would it be
> possible
> > > > for you to wrap your benchmarks into a few containers that will
> produce
> > > > your shown results? That way, we can avoid possible
> misunderstandings and
> > > > also pinpoint the exact parts where people disagree or misunderstood
> each
> > > > other.
> > > >
> > > > -Marco
> > > >
> > > > Pedro Larroy  schrieb am Do., 20. Juni
> > > 2019,
> > > > 21:47:
> > > >
> > > > > I can confirm that we are linking with two versions of omp. I'm
> > > > > gaining more clarity on this topic, but I still have questions;
> > > > > the facts I have so far are the following:
> > > > >
> > > > > * #1: We are linking with two versions of omp, intel's omp and llvm
> > > > > openmp when building with MKL enabled.
> > > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes
> with
> > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > > > > one is used on the PR proposed by Anton).
> > > > >
> > > > > Questions:
> > > > >
> > > > >  * #1 Is it ok to have two versions of openmp linked at the same
> time?
> > > > >  * #2 Which implementation of OMP gives the best performance?  (See
> > > > > total training time of my measurement for a partial answer)
> > > > >  * #3 Should we have a build flag so we can choose the OMP version
> at
> > > > > runtime?
> > > > >  * #4 Which Compiler and build flags did Chris use to get 10x
> slowdown?
> > > > >  * #5 @Stas: is there a script to replicate your benchmarks
> easily? If
> > > > > so could you provide a link?  I think we would need to reproduce
> your
> > > > > benchmarks and verify which versions are being linked. It's
> possible
> > > > > that while compiling with MKL intel's omp was pulled in instead of
> > > > > GNU OpenMP.
> > > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> > > > > update the subrepo regularly?
> > > > >
> > > > > My conclusion so far:
> > > > >
> > > > >  * #1 We should avoid linking two versions of omp if possible and
> > > > > allow users to choose one in the build as we do for BLAS.
> > > > >  * #2 For performance reasons and more control across different
> > > > > compiler versions it indeed seems to make sense to keep the LLVM
> > > > > OpenMP version in 3rdparty for now. So unless some more data is
> > > > > gathered, it makes sense not to remove it as of now.
> > > > >  

Re: OMP

2019-06-24 Thread Pedro Larroy
There's an assertion which is easily reproducible, and also there's a
crash with a core dump; the latter is not easy for me to reproduce in
different environments. I have also seen mxnet getting stuck with this
build configuration when running unit tests, making no progress and
using no CPU at all.

In my view, the root cause of the assertion is that we are re-entering
OMP initialization when spawning threads in the following code through
pthread_at_fork:

https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58

This causes double initialization of the OMP engine, which triggers the
assertion you are asking about, and I suspect some additional overhead.
That's the shady forking part you are asking about.

A question for you: What is the cause of runtime differences between
OMP runtimes? Shouldn't the implementation overhead diminish as
threads run longer?

Pedro.

On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier  wrote:
>
> What’s the reason for the assertion failure? btw classifying an
> assertion failure as a “crash” is debatable. As I stated in the
> original issue a long time ago, it’s possible something shady is
> being done when forking that should be fixed. The assertion should
> be root caused.
>
>
>
> On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy 
> wrote:
>
> > Added a dockerfile, and reports of a crash on my local machine when
> > running MKL+OMP+DEBUG; with Anton's branch the crash happened as well.
> > I couldn't reproduce the crash on my EC2 machine.
> > Added the backtrace of the crash as well.
> >
> > https://github.com/apache/incubator-mxnet/issues/10856
> >
> > Dockerfile here:
> >
> > https://github.com/larroy/mxnet_omp
> >
> > Kind regards.
> >
> > Pedro.
> >
> > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu 
> > wrote:
> > >
> > > As already proposed, I think the easiest way to get a common
> > understanding
> > > is if we start with a few docker containers. Pedro, would it be possible
> > > for you to wrap your benchmarks into a few containers that will produce
> > > your shown results? That way, we can avoid possible misunderstandings and
> > > also pinpoint the exact parts where people disagree or misunderstood each
> > > other.
> > >
> > > -Marco
> > >
> > > Pedro Larroy  schrieb am Do., 20. Juni
> > 2019,
> > > 21:47:
> > >
> > > > I can confirm that we are linking with two versions of omp. I'm
> > > > gaining more clarity on this topic, but I still have questions;
> > > > the facts I have so far are the following:
> > > >
> > > > * #1: We are linking with two versions of omp, intel's omp and llvm
> > > > openmp when building with MKL enabled.
> > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes with
> > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > > > one is used on the PR proposed by Anton).
> > > >
> > > > Questions:
> > > >
> > > >  * #1 Is it ok to have two versions of openmp linked at the same time?
> > > >  * #2 Which implementation of OMP gives the best performance?  (See
> > > > total training time of my measurement for a partial answer)
> > > >  * #3 Should we have a build flag so we can choose the OMP version at
> > > > runtime?
> > > >  * #4 Which Compiler and build flags did Chris use to get 10x slowdown?
> > > >  * #5 @Stas: is there a script to replicate your benchmarks easily? If
> > > > so could you provide a link?  I think we would need to reproduce your
> > > > benchmarks and verify which versions are being linked. It's possible
> > > > that while compiling with MKL intel's omp was pulled in instead of
> > > > GNU OpenMP.
> > > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> > > > update the subrepo regularly?
> > > >
> > > > My conclusion so far:
> > > >
> > > >  * #1 We should avoid linking two versions of omp if possible and
> > > > allow users to choose one in the build as we do for BLAS.
> > > >  * #2 For performance reasons and more control across different
> > > > compiler versions it indeed seems to make sense to keep the LLVM
> > > > OpenMP version in 3rdparty for now. So unless some more data is
> > > > gathered, it makes sense not to remove it as of now.
> > > >  * #3 We should provide build options to choose which openmp library
> > > > is to be used from the three options available, including libgomp.
> > > >  * #4 By refining the build we could also enable OpenMP on Mac
> > > > without additional contortions (it doesn't work as of today):
> > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > > >  * #5 We should add different omp versions to our benchmarks and track
> > > > the performance, so this data is available for prescribing the best
> > > > build options and for binary releases.
> > > >
> > > > This is also an interesting related gh issue posted in the mkl-dnn
> > > > repository:  https://github.com/intel/mkl-dnn/issues/230
> > > >
> > > >
> > > > I don't observe the order of magnitude divergence reported by Chris in
> > > > vanilla Ubuntu 

Re: OMP

2019-06-24 Thread Chris Olivier
What’s the reason for the assertion failure? btw classifying an assertion
failure as a “crash” is debatable. As I stated in the original issue a long
time ago, it’s possible something shady is being done when forking that
should be fixed. The assertion should be root caused.



On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy 
wrote:

> Added a dockerfile, and reports of a crash on my local machine when
> running MKL+OMP+DEBUG; with Anton's branch the crash happened as well.
> I couldn't reproduce the crash on my EC2 machine.
> Added the backtrace of the crash as well.
>
> https://github.com/apache/incubator-mxnet/issues/10856
>
> Dockerfile here:
>
> https://github.com/larroy/mxnet_omp
>
> Kind regards.
>
> Pedro.
>
> On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu 
> wrote:
> >
> > As already proposed, I think the easiest way to get a common
> understanding
> > is if we start with a few docker containers. Pedro, would it be possible
> > for you to wrap your benchmarks into a few containers that will produce
> > your shown results? That way, we can avoid possible misunderstandings and
> > also pinpoint the exact parts where people disagree or misunderstood each
> > other.
> >
> > -Marco
> >
> > Pedro Larroy  schrieb am Do., 20. Juni
> 2019,
> > 21:47:
> >
> > > I can confirm that we are linking with two versions of omp. I'm
> > > gaining more clarity on this topic, but I still have questions;
> > > the facts I have so far are the following:
> > >
> > > * #1: We are linking with two versions of omp, intel's omp and llvm
> > > openmp when building with MKL enabled.
> > > * #2: We have 3 different possible OMP versions: Intel OMP (comes with
> > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > > one is used on the PR proposed by Anton).
> > >
> > > Questions:
> > >
> > >  * #1 Is it ok to have two versions of openmp linked at the same time?
> > >  * #2 Which implementation of OMP gives the best performance?  (See
> > > total training time of my measurement for a partial answer)
> > >  * #3 Should we have a build flag so we can choose the OMP version at
> > > runtime?
> > >  * #4 Which Compiler and build flags did Chris use to get 10x slowdown?
> > >  * #5 @Stas: is there a script to replicate your benchmarks easily? If
> > > so could you provide a link?  I think we would need to reproduce your
> > > benchmarks and verify which versions are being linked. It's possible
> > > that while compiling with MKL intel's omp was pulled in instead of
> > > GNU OpenMP.
> > >  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> > > update the subrepo regularly?
> > >
> > > My conclusion so far:
> > >
> > >  * #1 We should avoid linking two versions of omp if possible and
> > > allow users to choose one in the build as we do for BLAS.
> > >  * #2 For performance reasons and more control across different
> > > compiler versions it indeed seems to make sense to keep the LLVM
> > > OpenMP version in 3rdparty for now. So unless some more data is
> > > gathered, it makes sense not to remove it as of now.
> > >  * #3 We should provide build options to choose which openmp library
> > > is to be used from the three options available, including libgomp.
> > >  * #4 By refining the build we could also enable OpenMP on Mac
> > > without additional contortions (it doesn't work as of today):
> > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> > >  * #5 We should add different omp versions to our benchmarks and track
> > > the performance, so this data is available for prescribing the best
> > > build options and for binary releases.
> > >
> > > This is also an interesting related gh issue posted in the mkl-dnn
> > > repository:  https://github.com/intel/mkl-dnn/issues/230
> > >
> > >
> > > I don't observe the order-of-magnitude divergence reported by Chris
> > > on vanilla Ubuntu 18.04 in samples/s, but the full training indeed
> > > finishes faster with the OMP from 3rdparty (LLVM openmp) than with
> > > libgomp.
> > >
> > > There are also differences in training time when using MKL and the ,
> > > it's actually a bit slower; I don't know if it's related to OMP.
> > >
> > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> > >
> > > Anton's branch:  g...@github.com:lebeg/incubator-mxnet.git   branch
> 'omp'
> > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > > build/libmxnet.so |grep -i omp
> > > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > > (0x7fd99a51d000)
> > >
> > > time python train_mnist.py
> > >
> > > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > > INFO:root:Epoch[19] Batch [0-100]   Speed: 41617.00 samples/sec
> > >  accuracy=1.00
> > > INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec
> > >  accuracy=0.999531
> > > INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec
> > >  accuracy=0.999687
> > > INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec
> > >  accuracy=1.00
> > > 

Re: OMP

2019-06-24 Thread Pedro Larroy
Added a dockerfile, and reports of a crash on my local machine when
running MKL+OMP+DEBUG; with Anton's branch the crash happened as well.
I couldn't reproduce the crash on my EC2 machine.
Added the backtrace of the crash as well.

https://github.com/apache/incubator-mxnet/issues/10856

Dockerfile here:

https://github.com/larroy/mxnet_omp

Kind regards.

Pedro.

On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu  wrote:
>
> As already proposed, I think the easiest way to get a common understanding
> is if we start with a few docker containers. Pedro, would it be possible
> for you to wrap your benchmarks into a few containers that will produce
> your shown results? That way, we can avoid possible misunderstandings and
> also pinpoint the exact parts where people disagree or misunderstood each
> other.
>
> -Marco
>
> Pedro Larroy  schrieb am Do., 20. Juni 2019,
> 21:47:
>
> > I can confirm that we are linking with two versions of omp. I'm
> > gaining more clarity on this topic, but I still have questions;
> > the facts I have so far are the following:
> >
> > * #1: We are linking with two versions of omp, intel's omp and llvm
> > openmp when building with MKL enabled.
> > * #2: We have 3 different possible OMP versions: Intel OMP (comes with
> > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This
> > one is used on the PR proposed by Anton).
> >
> > Questions:
> >
> >  * #1 Is it ok to have two versions of openmp linked at the same time?
> >  * #2 Which implementation of OMP gives the best performance?  (See
> > total training time of my measurement for a partial answer)
> >  * #3 Should we have a build flag so we can choose the OMP version at
> > runtime?
> >  * #4 Which Compiler and build flags did Chris use to get 10x slowdown?
> >  * #5 @Stas: is there a script to replicate your benchmarks easily? If
> > so could you provide a link?  I think we would need to reproduce your
> > benchmarks and verify which versions are being linked. It's possible
> > that while compiling with MKL intel's omp was pulled in instead of
> > GNU OpenMP.
> >  * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we
> > update the subrepo regularly?
> >
> > My conclusion so far:
> >
> >  * #1 We should avoid linking two versions of omp if possible and
> > allow users to choose one in the build as we do for BLAS.
> >  * #2 For performance reasons and more control across different
> > compiler versions it indeed seems to make sense to keep the LLVM
> > OpenMP version in 3rdparty for now. So unless some more data is
> > gathered, it makes sense not to remove it as of now.
> >  * #3 We should provide build options to choose which openmp library
> > is to be used from the three options available, including libgomp.
> >  * #4 By refining the build we could also enable OpenMP on Mac
> > without additional contortions (it doesn't work as of today):
> > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/
> >  * #5 We should add different omp versions to our benchmarks and track
> > the performance, so this data is available for prescribing the best
> > build options and for binary releases.
> >
> > This is also an interesting related gh issue posted in the mkl-dnn
> > repository:  https://github.com/intel/mkl-dnn/issues/230
> >
> >
> > I don't observe the order-of-magnitude divergence reported by Chris
> > on vanilla Ubuntu 18.04 in samples/s, but the full training indeed
> > finishes faster with the OMP from 3rdparty (LLVM openmp) than with
> > libgomp.
> >
> > There are also differences in training time when using MKL and the ,
> > it's actually a bit slower; I don't know if it's related to OMP.
> >
> > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
> >
> > Anton's branch:  g...@github.com:lebeg/incubator-mxnet.git   branch 'omp'
> > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd
> > build/libmxnet.so |grep -i omp
> > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> > (0x7fd99a51d000)
> >
> > time python train_mnist.py
> >
> > INFO:root:Epoch[18] Validation-accuracy=0.984176
> > INFO:root:Epoch[19] Batch [0-100]   Speed: 41617.00 samples/sec
> >  accuracy=1.00
> > INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec
> >  accuracy=0.999531
> > INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec
> >  accuracy=0.999687
> > INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec
> >  accuracy=1.00
> > INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec
> >  accuracy=0.999687
> > INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 samples/sec
> >  accuracy=0.999687
> > INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 samples/sec
> >  accuracy=0.999375
> > INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 samples/sec
> >  accuracy=0.999844
> > INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 samples/sec
> >  accuracy=0.999531
> > INFO:root:Epoch[19] Train-accuracy=0.999717
> > INFO:root:Epoch[19] Time cost=1.219
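
One more way to double-check which OpenMP runtimes actually end up loaded
(complementing the ldd check above) is to ask the running process itself. A
sketch for Linux/glibc; the libmxnet.so path is only an example:

// Sketch (Linux/glibc): list loaded shared objects whose names look like an
// OpenMP runtime, to spot e.g. libgomp and libomp/libiomp5 mapped into the
// same process.  Build: g++ -std=c++11 check_omp.cc -ldl
#include <dlfcn.h>
#include <link.h>
#include <cstdio>
#include <cstring>

static int report(struct dl_phdr_info* info, size_t, void*) {
  const char* name = info->dlpi_name;
  if (name && (std::strstr(name, "gomp") || std::strstr(name, "iomp") ||
               std::strstr(name, "libomp"))) {
    std::printf("loaded OpenMP runtime: %s\n", name);
  }
  return 0;  // keep iterating over the remaining loaded objects
}

int main(int argc, char** argv) {
  // Load the library under test first; the default path is just an example.
  const char* path = argc > 1 ? argv[1] : "./build/libmxnet.so";
  if (!dlopen(path, RTLD_NOW | RTLD_GLOBAL)) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
  }
  dl_iterate_phdr(report, nullptr);
  return 0;
}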