Re: OMP
1) I don't think the problem is related to environment variables, but it's good to know that you agree we can remove the modification to OMP_NUM_THREADS, which was creating random crashes with low probability.

To answer your question, I suggest using a debugger and reasoning about initialization, static construction and thread creation via pthread_atfork. I think we are mixing all of those difficult and subtle actions, each with side effects, and we have the perfect storm. One possibility is that two threads can initialize OpenMP at the same time; or at least, if OpenMP initialization and the operator tuning code run at the same time, you end up inside the OpenMP code before or during its initialization. __kmp_team_pool and other volatile variables used inside the OMP runtime are changed by different threads. I verified this myself by modifying OpenMP and putting a write watchpoint on that memory region: they were indeed changing value during __kmp_do_serial_initialize, producing the assert described in this issue: https://github.com/apache/incubator-mxnet/issues/10856 Read below for a more detailed explanation of this process, with backtraces included.

omp_get_num_procs and other OpenMP functions are called concurrently from different places, such as operator tuning and the static initialization of OpenMP here: https://github.com/apache/incubator-mxnet/blob/master/src/engine/openmp.cc#L37 While static initialization itself is thread safe, the constructor of a statically initialized object might not be. Operator tuning is called during (static) library initialization, and the C++ standard leaves the order of static initialization across translation units implementation defined.

2) Is what I described above correct, or could it cause problems? At the very least it triggers an assertion inside OpenMP, so it seems to violate an invariant that the OpenMP developers expect to hold, which is a bit of a concern for me. This explains the assertion inside OpenMP. Let me know if you have any other questions or if you think this is incorrect. If I'm not mistaken you contributed some parts of this code.

3) Why do the pip packages use libgomp, while builds from source use LLVM OpenMP? (I asked this question on the 1.5 release thread.)

Below are some stack traces that justify the above observations and reasoning, captured with a debugger.

Kind regards.
__kmp_allocate_thread kmp_runtime.cpp:4153
__kmp_allocate_team kmp_runtime.cpp:4965
__kmp_fork_call kmp_runtime.cpp:1991
__kmp_GOMP_fork_call kmp_gsupport.cpp:290
__kmp_api_GOMP_parallel kmp_gsupport.cpp:1080
mxnet::op::OperatorTune::GetOMPLoopOverhead operator_tune-inl.h:342
mxnet::op::OperatorTune::GetOMPLoopOverhead operator_tune-inl.h:370
mxnet::op::OperatorTune::Initialize operator_tune-inl.h:174
mxnet::op::OperatorTune::TuneAll operator_tune-inl.h:220
mxnet::op::OperatorTune::OperatorTune operator_tune-inl.h:116
mxnet::op::UnaryOpTune::UnaryOpTune operator_tune-inl.h:534
mxnet::op::BinaryOpTune::BinaryOpTune operator_tune-inl.h:724
__static_initialization_and_destruction_0 operator_tune.cc:369
_GLOBAL__sub_I_operator_tune.cc(void) operator_tune.cc:378
call_init 0x7f8f4e41d733
_dl_init 0x7f8f4e41d733
dl_open_worker 0x7f8f4e4221ff
__GI__dl_catch_exception 0x7f8f4e1832df
_dl_open 0x7f8f4e4217ca
dlopen_doit 0x7f8f4dbf9f96
__GI__dl_catch_exception 0x7f8f4e1832df
__GI__dl_catch_error 0x7f8f4e18336f
_dlerror_run 0x7f8f4dbfa735
__dlopen 0x7f8f4dbfa051
0x7f8f4b3eacda
0x00502d6f
_PyEval_EvalFrameDefault 0x00506859
0x00504c28
_PyFunction_FastCallDict 0x00501ba7
0x00591461
0x0054b813
0x00555421
_PyObject_FastCallKeywords 0x005a730c
0x00503073
_PyEval_EvalFrameDefault 0x00506859
0x00502209
0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
0x00504c28
0x00511d78
PyCFunction_Call 0x0056617e
_PyEval_EvalFrameDefault 0x0050bb66
0x00504c28
0x00502540
0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
0x00502209
0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
0x00502209
0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
0x00502209
0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
_PyFunction_FastCallDict 0x00501945
_PyObject_FastCallDict 0x005a36f1
_PyObject_CallMethodIdObjArgs 0x0059662e
PyImport_ImportModuleLevelObject 0x004ee84d
_PyEval_EvalFrameDefault 0x0050896c
0x00504c28
0x00511d78
PyCFunction_Call 0x0056617e
_PyEval_EvalFrameDefault 0x0050bb66
0x00504c28
0x00502540
0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
0x00502209
0x00502f3d
_PyEval_EvalFrameDefault 0x00506859
0x00502209
0x00502f3d
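A minimal sketch of the race described above (hypothetical code, not MXNet's; whether runtime state is actually corrupted depends on the runtime's internal locking). One call path reaches the runtime through a parallel region started from a static initializer, mirroring the OperatorTune path in the trace; the other reaches it through omp_get_num_procs(), mirroring the engine::OpenMP constructor. Both can arrive while the one-time serial initialization is still in progress.

// race_sketch.cc -- hypothetical illustration, not MXNet code.
// Build: g++ -fopenmp -pthread race_sketch.cc
#include <omp.h>
#include <cstdio>
#include <thread>

namespace {

int TimedParallelRegion() {
  // A concurrent caller of an OpenMP entry point, standing in for the
  // engine::OpenMP singleton constructor or an atfork-spawned worker.
  std::thread other([] { (void)omp_get_num_procs(); });

  int sum = 0;
  // The first parallel region drives the runtime's one-time (serial)
  // initialization -- the __kmp_do_serial_initialize path in LLVM/Intel OMP.
  #pragma omp parallel for reduction(+ : sum)
  for (int i = 0; i < 1000; ++i) sum += i;

  other.join();
  return sum;
}

// Static initialization, analogous to the OperatorTune object constructed
// while dlopen() runs libmxnet.so's static initializers in the trace above.
const int kTuneResult = TimedParallelRegion();

}  // namespace

int main() {
  std::printf("tune=%d max_threads=%d\n", kTuneResult, omp_get_max_threads());
  return 0;
}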
Re: OMP
1) I don't see how that code could cause reentrancy problems in omp. It doesn't make any OMP calls at all. Still doesn't look related to me. Setting an environment variable probably doesn't even do anything, because: a) It probably doesn't check the environment variable except at initial startup b) Even if it did, whether this code ran before or after the OMP init code would be nondeterministic c) It for sure doesn't check the environment variable every time it hits an omp region. That would be ridiculously expensive and checking the OMP source code, it doesn't.. You can't affect the OMP behavior at arbitrary points in time by setting the "OMP_NUM_THREADS" environment variable. On Tue, Jun 25, 2019 at 1:20 PM Pedro Larroy wrote: > Nobody claimed that the original lockup has to do with OMP, but the > fix caused re-entrancy into OMP initialization as explained below. So > I agree with your statement that the bug that using pthread_atfork was > fixing is not related with OMP, but the fix is causing interactions > with OMP as described above. > > Pedro. > > On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier > wrote: > > > > The call stacks there are mostly associated with the execution engine > > threads, which are not OMP threads. That lockup doesn't look to me to be > > related to OMP -- the execution engine uses its own thread pool logic > -- > > I'm pretty familiar with that part of the code. Unless I am missing one > -- > > can you point to the one that looks OMP-related? > > > > > > On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy < > pedro.larroy.li...@gmail.com> > > wrote: > > > > > Thanks for digging that out Kellen. That's good info so maybe it would > > > be good to rework the fix with the info you provided and remove the > > > pthread_atfork handlers. > > > Do you think setting the device would avoid the problem seen on the > > > backtrace you provided? specifically here: > > > > > > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24 > > > > > > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland > > > wrote: > > > > > > > > I remember at the time we also had a read through of this blog post, > but > > > to > > > > use the code looked like it was following the advice: > > > > > > > > https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/ > > > > > > > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland < > > > > kellen.sunderl...@gmail.com> wrote: > > > > > > > > > I remember this hang as well, it was pretty hard to reproduce > IIRC. I > > > > > believe the stacks for the hang are here: > > > > > > > > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 > > > and > > > > > the trick was we could only debug it up to the point that we hit: > > > > > > > > > > #0 0x7fec6df1ba4f in futex_wait (private=0, expected=1, > > > > > futex_word=0x7fec60843758) > > > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61 > > > > > #1 futex_wait_simple (private=0, expected=1, > > > futex_word=0x7fec60843758) > > > > > at ../sysdeps/nptl/futex-internal.h:135 > > > > > #2 __pthread_once_slow (once_control=0x7fec60843758, > > > > > init_routine=0x7fec605f38f0) > > > > > at pthread_once.c:105 > > > > > ... > > > > > #6 0x7fec6061c577 in cudaSetDevice () from > > > > > /usr/local/cuda/lib64/libcudart.so.9.0 > > > > > > > > > > because the code in libcudart is obviously closed source we > couldn't > > > dig > > > > > into what threading work was going on when we called cudaSetDevice. 
> > > > > > > > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy < > > > pedro.larroy.li...@gmail.com> > > > > > wrote: > > > > > > > > > >> If you check initialize.cc we seem to be explicitly disabling that > > > > >> behaviour in pthread_at_fork which seems to cause thread > contention > > > > >> during multiprocessing. Why do we need this major advantage for > the > > > > >> library if that's the case? > > > > >> > > > > >> Related PRs: > > > > >> > > > > >> https://github.com/apache/incubator-mxnet/
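A small sketch of the point above (hypothetical code; the once-at-startup reading of OMP_NUM_THREADS is the usual behavior of mainstream OpenMP runtimes, not anything MXNet-specific): once the runtime has initialized, rewriting the environment variable is not expected to do anything, whereas omp_set_num_threads() is the supported runtime control.

// env_sketch.cc -- hypothetical illustration of OMP_NUM_THREADS handling.
// Build: g++ -fopenmp env_sketch.cc
#include <omp.h>
#include <cstdio>
#include <cstdlib>

int main() {
  // First region: the runtime initializes and reads OMP_NUM_THREADS here.
  #pragma omp parallel
  { }

  // Rewriting the variable afterwards should be a no-op: the runtime does
  // not re-read the environment on every parallel region.
  setenv("OMP_NUM_THREADS", "1", /*overwrite=*/1);
  std::printf("after setenv:              max_threads=%d\n", omp_get_max_threads());

  // The supported way to change the team size at runtime is the API
  // (or a num_threads clause on the region itself).
  omp_set_num_threads(2);
  std::printf("after omp_set_num_threads: max_threads=%d\n", omp_get_max_threads());
  return 0;
}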
Re: OMP
Nobody claimed that the original lockup has to do with OMP, but the fix caused re-entrancy into OMP initialization as explained below. So I agree with your statement that the bug that using pthread_atfork was fixing is not related with OMP, but the fix is causing interactions with OMP as described above. Pedro. On Tue, Jun 25, 2019 at 12:33 PM Chris Olivier wrote: > > The call stacks there are mostly associated with the execution engine > threads, which are not OMP threads. That lockup doesn't look to me to be > related to OMP -- the execution engine uses its own thread pool logic -- > I'm pretty familiar with that part of the code. Unless I am missing one -- > can you point to the one that looks OMP-related? > > > On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy > wrote: > > > Thanks for digging that out Kellen. That's good info so maybe it would > > be good to rework the fix with the info you provided and remove the > > pthread_atfork handlers. > > Do you think setting the device would avoid the problem seen on the > > backtrace you provided? specifically here: > > > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24 > > > > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland > > wrote: > > > > > > I remember at the time we also had a read through of this blog post, but > > to > > > use the code looked like it was following the advice: > > > > > https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/ > > > > > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland < > > > kellen.sunderl...@gmail.com> wrote: > > > > > > > I remember this hang as well, it was pretty hard to reproduce IIRC. I > > > > believe the stacks for the hang are here: > > > > > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 > > and > > > > the trick was we could only debug it up to the point that we hit: > > > > > > > > #0 0x7fec6df1ba4f in futex_wait (private=0, expected=1, > > > > futex_word=0x7fec60843758) > > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61 > > > > #1 futex_wait_simple (private=0, expected=1, > > futex_word=0x7fec60843758) > > > > at ../sysdeps/nptl/futex-internal.h:135 > > > > #2 __pthread_once_slow (once_control=0x7fec60843758, > > > > init_routine=0x7fec605f38f0) > > > > at pthread_once.c:105 > > > > ... > > > > #6 0x7fec6061c577 in cudaSetDevice () from > > > > /usr/local/cuda/lib64/libcudart.so.9.0 > > > > > > > > because the code in libcudart is obviously closed source we couldn't > > dig > > > > into what threading work was going on when we called cudaSetDevice. > > > > > > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy < > > pedro.larroy.li...@gmail.com> > > > > wrote: > > > > > > > >> If you check initialize.cc we seem to be explicitly disabling that > > > >> behaviour in pthread_at_fork which seems to cause thread contention > > > >> during multiprocessing. Why do we need this major advantage for the > > > >> library if that's the case? > > > >> > > > >> Related PRs: > > > >> > > > >> https://github.com/apache/incubator-mxnet/pull/10820 > > > >> https://github.com/apache/incubator-mxnet/issues/14396 > > > >> > > > >> The original code was authored in this PR: > > > >> > > > >> https://github.com/apache/incubator-mxnet/pull/8677 > > > >> > > > >> I actually remember this fix, it was done during a release as the cuda > > > >> runtime was forking and the engine was being re-entered. If that > > > >> situation is now happening anymore it might not be needed any longer. 
> > > >> I don't think we know the cause why there was a fork inside cuda, so > > > >> the code has grown around a fix for an issue which its root cause was > > > >> not understood, and side effects which this fix caused afterwards. > > > >> > > > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in > > > >> the link above, no libgomp. > > > >> > > > >> I didn't try the Make build. > > > >> > > > >> I would refactor the code linked above and stop using pthread_at_fork, > > > >&g
Re: OMP
The call stacks there are mostly associated with the execution engine threads, which are not OMP threads. That lockup doesn't look to me to be related to OMP -- the execution engine uses its own thread pool logic -- I'm pretty familiar with that part of the code. Unless I am missing one -- can you point to the one that looks OMP-related? On Tue, Jun 25, 2019 at 10:35 AM Pedro Larroy wrote: > Thanks for digging that out Kellen. That's good info so maybe it would > be good to rework the fix with the info you provided and remove the > pthread_atfork handlers. > Do you think setting the device would avoid the problem seen on the > backtrace you provided? specifically here: > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24 > > On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland > wrote: > > > > I remember at the time we also had a read through of this blog post, but > to > > use the code looked like it was following the advice: > > > https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/ > > > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland < > > kellen.sunderl...@gmail.com> wrote: > > > > > I remember this hang as well, it was pretty hard to reproduce IIRC. I > > > believe the stacks for the hang are here: > > > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 > and > > > the trick was we could only debug it up to the point that we hit: > > > > > > #0 0x7fec6df1ba4f in futex_wait (private=0, expected=1, > > > futex_word=0x7fec60843758) > > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61 > > > #1 futex_wait_simple (private=0, expected=1, > futex_word=0x7fec60843758) > > > at ../sysdeps/nptl/futex-internal.h:135 > > > #2 __pthread_once_slow (once_control=0x7fec60843758, > > > init_routine=0x7fec605f38f0) > > > at pthread_once.c:105 > > > ... > > > #6 0x7fec6061c577 in cudaSetDevice () from > > > /usr/local/cuda/lib64/libcudart.so.9.0 > > > > > > because the code in libcudart is obviously closed source we couldn't > dig > > > into what threading work was going on when we called cudaSetDevice. > > > > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy < > pedro.larroy.li...@gmail.com> > > > wrote: > > > > > >> If you check initialize.cc we seem to be explicitly disabling that > > >> behaviour in pthread_at_fork which seems to cause thread contention > > >> during multiprocessing. Why do we need this major advantage for the > > >> library if that's the case? > > >> > > >> Related PRs: > > >> > > >> https://github.com/apache/incubator-mxnet/pull/10820 > > >> https://github.com/apache/incubator-mxnet/issues/14396 > > >> > > >> The original code was authored in this PR: > > >> > > >> https://github.com/apache/incubator-mxnet/pull/8677 > > >> > > >> I actually remember this fix, it was done during a release as the cuda > > >> runtime was forking and the engine was being re-entered. If that > > >> situation is now happening anymore it might not be needed any longer. > > >> I don't think we know the cause why there was a fork inside cuda, so > > >> the code has grown around a fix for an issue which its root cause was > > >> not understood, and side effects which this fix caused afterwards. > > >> > > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in > > >> the link above, no libgomp. > > >> > > >> I didn't try the Make build. 
> > >> > > >> I would refactor the code linked above and stop using pthread_at_fork, > > >> since OMP assumes it won't be initialized twice, but needs to be very > > >> well tested to make sure it doesn't cause bugs or affect the fixes > > >> done on the linked PRs above. > > >> > > >> Pedro. > > >> > > >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier > > >> wrote: > > >> > > > >> > one major advantage of intel/llvm omp is that it spawns a new thread > > >> pool > > >> > after fork if a thread pool was already created. this is so that omp > > >> can be > > >> > used in the forked processes. libgomp doesn’t do this so it’ll just > > >> lock up > > >> > if you try to do omp in the forked pr
Re: OMP
That doesnt look like it has anything to do with omp On Mon, Jun 24, 2019 at 6:40 PM kellen sunderland < kellen.sunderl...@gmail.com> wrote: > I remember this hang as well, it was pretty hard to reproduce IIRC. I > believe the stacks for the hang are here: > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 > and > the trick was we could only debug it up to the point that we hit: > > #0 0x7fec6df1ba4f in futex_wait (private=0, expected=1, > futex_word=0x7fec60843758) > at ../sysdeps/unix/sysv/linux/futex-internal.h:61 > #1 futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758) > at ../sysdeps/nptl/futex-internal.h:135 > #2 __pthread_once_slow (once_control=0x7fec60843758, > init_routine=0x7fec605f38f0) > at pthread_once.c:105 > ... > #6 0x7fec6061c577 in cudaSetDevice () from > /usr/local/cuda/lib64/libcudart.so.9.0 > > because the code in libcudart is obviously closed source we couldn't dig > into what threading work was going on when we called cudaSetDevice. > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy > > wrote: > > > If you check initialize.cc we seem to be explicitly disabling that > > behaviour in pthread_at_fork which seems to cause thread contention > > during multiprocessing. Why do we need this major advantage for the > > library if that's the case? > > > > Related PRs: > > > > https://github.com/apache/incubator-mxnet/pull/10820 > > https://github.com/apache/incubator-mxnet/issues/14396 > > > > The original code was authored in this PR: > > > > https://github.com/apache/incubator-mxnet/pull/8677 > > > > I actually remember this fix, it was done during a release as the cuda > > runtime was forking and the engine was being re-entered. If that > > situation is now happening anymore it might not be needed any longer. > > I don't think we know the cause why there was a fork inside cuda, so > > the code has grown around a fix for an issue which its root cause was > > not understood, and side effects which this fix caused afterwards. > > > > My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in > > the link above, no libgomp. > > > > I didn't try the Make build. > > > > I would refactor the code linked above and stop using pthread_at_fork, > > since OMP assumes it won't be initialized twice, but needs to be very > > well tested to make sure it doesn't cause bugs or affect the fixes > > done on the linked PRs above. > > > > Pedro. > > > > On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier > > wrote: > > > > > > one major advantage of intel/llvm omp is that it spawns a new thread > pool > > > after fork if a thread pool was already created. this is so that omp > can > > be > > > used in the forked processes. libgomp doesn’t do this so it’ll just > lock > > up > > > if you try to do omp in the forked process. > > > > > > is your build linking libgomp as well? > > > > > > standard mkl build (from Makefile) uses same omp library. are there > > > problems with that build? > > > > > > what changes need to be made to make the assertion not fire? > > > > > > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy < > > pedro.larroy.li...@gmail.com> > > > wrote: > > > > > > > There's an assertion which is easily reproducible, and also there's a > > > > crash including core dump, the latter is not easy to reproduce for me > > > > in different environments. I have also seen mxnet getting stuck > > > > without progressing with this build configuration and using no CPU at > > > > all when running unit tests. 
> > > > > > > > In my view, the root cause of the assertion is that we are > re-entering > > > > OMP initialization when spawning threads on the following code > through > > > > pthread_at_fork > > > > > > > > > > > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58 > > > > > > > > This causes double initialization of the OMP engine, including the > > > > assertion which you are asking about, and I suspect some additional > > > > overhead. That's the shady forking part you are asking for. > > > > > > > > A question for you: What is the cause of runtime differences between > > > > OMP runtimes? Shouldn't the implementation overhead diminish as > > > > threads run longer? > > > > > > > > Pedro. >
Re: OMP
Thanks for digging that out Kellen. That's good info so maybe it would be good to rework the fix with the info you provided and remove the pthread_atfork handlers. Do you think setting the device would avoid the problem seen on the backtrace you provided? specifically here: https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600#file-hang_bt-L24 On Mon, Jun 24, 2019 at 6:43 PM kellen sunderland wrote: > > I remember at the time we also had a read through of this blog post, but to > use the code looked like it was following the advice: > https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/ > > On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland < > kellen.sunderl...@gmail.com> wrote: > > > I remember this hang as well, it was pretty hard to reproduce IIRC. I > > believe the stacks for the hang are here: > > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 > > and > > the trick was we could only debug it up to the point that we hit: > > > > #0 0x7fec6df1ba4f in futex_wait (private=0, expected=1, > > futex_word=0x7fec60843758) > > at ../sysdeps/unix/sysv/linux/futex-internal.h:61 > > #1 futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758) > > at ../sysdeps/nptl/futex-internal.h:135 > > #2 __pthread_once_slow (once_control=0x7fec60843758, > > init_routine=0x7fec605f38f0) > > at pthread_once.c:105 > > ... > > #6 0x7fec6061c577 in cudaSetDevice () from > > /usr/local/cuda/lib64/libcudart.so.9.0 > > > > because the code in libcudart is obviously closed source we couldn't dig > > into what threading work was going on when we called cudaSetDevice. > > > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy > > wrote: > > > >> If you check initialize.cc we seem to be explicitly disabling that > >> behaviour in pthread_at_fork which seems to cause thread contention > >> during multiprocessing. Why do we need this major advantage for the > >> library if that's the case? > >> > >> Related PRs: > >> > >> https://github.com/apache/incubator-mxnet/pull/10820 > >> https://github.com/apache/incubator-mxnet/issues/14396 > >> > >> The original code was authored in this PR: > >> > >> https://github.com/apache/incubator-mxnet/pull/8677 > >> > >> I actually remember this fix, it was done during a release as the cuda > >> runtime was forking and the engine was being re-entered. If that > >> situation is now happening anymore it might not be needed any longer. > >> I don't think we know the cause why there was a fork inside cuda, so > >> the code has grown around a fix for an issue which its root cause was > >> not understood, and side effects which this fix caused afterwards. > >> > >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in > >> the link above, no libgomp. > >> > >> I didn't try the Make build. > >> > >> I would refactor the code linked above and stop using pthread_at_fork, > >> since OMP assumes it won't be initialized twice, but needs to be very > >> well tested to make sure it doesn't cause bugs or affect the fixes > >> done on the linked PRs above. > >> > >> Pedro. > >> > >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier > >> wrote: > >> > > >> > one major advantage of intel/llvm omp is that it spawns a new thread > >> pool > >> > after fork if a thread pool was already created. this is so that omp > >> can be > >> > used in the forked processes. libgomp doesn’t do this so it’ll just > >> lock up > >> > if you try to do omp in the forked process. 
> >> > > >> > is your build linking libgomp as well? > >> > > >> > standard mkl build (from Makefile) uses same omp library. are there > >> > problems with that build? > >> > > >> > what changes need to be made to make the assertion not fire? > >> > > >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy < > >> pedro.larroy.li...@gmail.com> > >> > wrote: > >> > > >> > > There's an assertion which is easily reproducible, and also there's a > >> > > crash including core dump, the latter is not easy to reproduce for me > >> > > in different environments. I have also seen mxnet getting stuck > >> > > without progressing with this build configuration and using no
Re: OMP
I remember at the time we also had a read through of this blog post, but to use the code looked like it was following the advice: https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/ On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland < kellen.sunderl...@gmail.com> wrote: > I remember this hang as well, it was pretty hard to reproduce IIRC. I > believe the stacks for the hang are here: > https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and > the trick was we could only debug it up to the point that we hit: > > #0 0x7fec6df1ba4f in futex_wait (private=0, expected=1, > futex_word=0x7fec60843758) > at ../sysdeps/unix/sysv/linux/futex-internal.h:61 > #1 futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758) > at ../sysdeps/nptl/futex-internal.h:135 > #2 __pthread_once_slow (once_control=0x7fec60843758, > init_routine=0x7fec605f38f0) > at pthread_once.c:105 > ... > #6 0x7fec6061c577 in cudaSetDevice () from > /usr/local/cuda/lib64/libcudart.so.9.0 > > because the code in libcudart is obviously closed source we couldn't dig > into what threading work was going on when we called cudaSetDevice. > > On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy > wrote: > >> If you check initialize.cc we seem to be explicitly disabling that >> behaviour in pthread_at_fork which seems to cause thread contention >> during multiprocessing. Why do we need this major advantage for the >> library if that's the case? >> >> Related PRs: >> >> https://github.com/apache/incubator-mxnet/pull/10820 >> https://github.com/apache/incubator-mxnet/issues/14396 >> >> The original code was authored in this PR: >> >> https://github.com/apache/incubator-mxnet/pull/8677 >> >> I actually remember this fix, it was done during a release as the cuda >> runtime was forking and the engine was being re-entered. If that >> situation is now happening anymore it might not be needed any longer. >> I don't think we know the cause why there was a fork inside cuda, so >> the code has grown around a fix for an issue which its root cause was >> not understood, and side effects which this fix caused afterwards. >> >> My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in >> the link above, no libgomp. >> >> I didn't try the Make build. >> >> I would refactor the code linked above and stop using pthread_at_fork, >> since OMP assumes it won't be initialized twice, but needs to be very >> well tested to make sure it doesn't cause bugs or affect the fixes >> done on the linked PRs above. >> >> Pedro. >> >> On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier >> wrote: >> > >> > one major advantage of intel/llvm omp is that it spawns a new thread >> pool >> > after fork if a thread pool was already created. this is so that omp >> can be >> > used in the forked processes. libgomp doesn’t do this so it’ll just >> lock up >> > if you try to do omp in the forked process. >> > >> > is your build linking libgomp as well? >> > >> > standard mkl build (from Makefile) uses same omp library. are there >> > problems with that build? >> > >> > what changes need to be made to make the assertion not fire? >> > >> > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy < >> pedro.larroy.li...@gmail.com> >> > wrote: >> > >> > > There's an assertion which is easily reproducible, and also there's a >> > > crash including core dump, the latter is not easy to reproduce for me >> > > in different environments. 
I have also seen mxnet getting stuck >> > > without progressing with this build configuration and using no CPU at >> > > all when running unit tests. >> > > >> > > In my view, the root cause of the assertion is that we are re-entering >> > > OMP initialization when spawning threads on the following code through >> > > pthread_at_fork >> > > >> > > >> https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58 >> > > >> > > This causes double initialization of the OMP engine, including the >> > > assertion which you are asking about, and I suspect some additional >> > > overhead. That's the shady forking part you are asking for. >> > > >> > > A question for you: What is the cause of runtime differences between >> > > OMP runtimes? Shouldn't the implementation overhead diminish as >> > > threa
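For reference, a minimal sketch of the advice in the NVIDIA post linked above (hypothetical code, not the MXNet call sites): every host thread explicitly selects its device before issuing any CUDA work, rather than relying on whichever device the thread happens to inherit.

// set_device_sketch.cc -- hypothetical illustration of the linked advice.
// Build: nvcc set_device_sketch.cc   (or g++ ... -lcudart)
#include <cuda_runtime.h>
#include <thread>
#include <vector>

static void Worker(int device) {
  cudaSetDevice(device);        // bind this thread to its device first
  void* buf = nullptr;
  cudaMalloc(&buf, 1 << 20);    // subsequent calls in this thread target `device`
  cudaFree(buf);
}

int main() {
  int count = 0;
  cudaGetDeviceCount(&count);
  std::vector<std::thread> workers;
  for (int d = 0; d < count; ++d) workers.emplace_back(Worker, d);
  for (auto& t : workers) t.join();
  return 0;
}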
Re: OMP
I remember this hang as well, it was pretty hard to reproduce IIRC. I believe the stacks for the hang are here: https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and the trick was we could only debug it up to the point that we hit: #0 0x7fec6df1ba4f in futex_wait (private=0, expected=1, futex_word=0x7fec60843758) at ../sysdeps/unix/sysv/linux/futex-internal.h:61 #1 futex_wait_simple (private=0, expected=1, futex_word=0x7fec60843758) at ../sysdeps/nptl/futex-internal.h:135 #2 __pthread_once_slow (once_control=0x7fec60843758, init_routine=0x7fec605f38f0) at pthread_once.c:105 ... #6 0x7fec6061c577 in cudaSetDevice () from /usr/local/cuda/lib64/libcudart.so.9.0 because the code in libcudart is obviously closed source we couldn't dig into what threading work was going on when we called cudaSetDevice. On Mon, Jun 24, 2019 at 6:13 PM Pedro Larroy wrote: > If you check initialize.cc we seem to be explicitly disabling that > behaviour in pthread_at_fork which seems to cause thread contention > during multiprocessing. Why do we need this major advantage for the > library if that's the case? > > Related PRs: > > https://github.com/apache/incubator-mxnet/pull/10820 > https://github.com/apache/incubator-mxnet/issues/14396 > > The original code was authored in this PR: > > https://github.com/apache/incubator-mxnet/pull/8677 > > I actually remember this fix, it was done during a release as the cuda > runtime was forking and the engine was being re-entered. If that > situation is now happening anymore it might not be needed any longer. > I don't think we know the cause why there was a fork inside cuda, so > the code has grown around a fix for an issue which its root cause was > not understood, and side effects which this fix caused afterwards. > > My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in > the link above, no libgomp. > > I didn't try the Make build. > > I would refactor the code linked above and stop using pthread_at_fork, > since OMP assumes it won't be initialized twice, but needs to be very > well tested to make sure it doesn't cause bugs or affect the fixes > done on the linked PRs above. > > Pedro. > > On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier > wrote: > > > > one major advantage of intel/llvm omp is that it spawns a new thread pool > > after fork if a thread pool was already created. this is so that omp can > be > > used in the forked processes. libgomp doesn’t do this so it’ll just lock > up > > if you try to do omp in the forked process. > > > > is your build linking libgomp as well? > > > > standard mkl build (from Makefile) uses same omp library. are there > > problems with that build? > > > > what changes need to be made to make the assertion not fire? > > > > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy < > pedro.larroy.li...@gmail.com> > > wrote: > > > > > There's an assertion which is easily reproducible, and also there's a > > > crash including core dump, the latter is not easy to reproduce for me > > > in different environments. I have also seen mxnet getting stuck > > > without progressing with this build configuration and using no CPU at > > > all when running unit tests. 
> > > > > > In my view, the root cause of the assertion is that we are re-entering > > > OMP initialization when spawning threads on the following code through > > > pthread_at_fork > > > > > > > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58 > > > > > > This causes double initialization of the OMP engine, including the > > > assertion which you are asking about, and I suspect some additional > > > overhead. That's the shady forking part you are asking for. > > > > > > A question for you: What is the cause of runtime differences between > > > OMP runtimes? Shouldn't the implementation overhead diminish as > > > threads run longer? > > > > > > Pedro. > > > > > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier > > > wrote: > > > > > > > > What’s the reason for the assertion failure? btw classifying an > assertion > > > > failure a “crash” is debatable. As I stated in the original issue a > long > > > > time ago, it’s possible something shady is being done with when > forking > > > > that should be fixed. The assertion should be root caused. > > > > > > > > > > > > > > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy < > > > pedro.larroy.li...@gmail.com> > > >
Re: OMP
If you check initialize.cc we seem to be explicitly disabling that behaviour in pthread_at_fork which seems to cause thread contention during multiprocessing. Why do we need this major advantage for the library if that's the case? Related PRs: https://github.com/apache/incubator-mxnet/pull/10820 https://github.com/apache/incubator-mxnet/issues/14396 The original code was authored in this PR: https://github.com/apache/incubator-mxnet/pull/8677 I actually remember this fix, it was done during a release as the cuda runtime was forking and the engine was being re-entered. If that situation is now happening anymore it might not be needed any longer. I don't think we know the cause why there was a fork inside cuda, so the code has grown around a fix for an issue which its root cause was not understood, and side effects which this fix caused afterwards. My build uses MKL+LLVM OMP+DEBUG as seen in the container provided in the link above, no libgomp. I didn't try the Make build. I would refactor the code linked above and stop using pthread_at_fork, since OMP assumes it won't be initialized twice, but needs to be very well tested to make sure it doesn't cause bugs or affect the fixes done on the linked PRs above. Pedro. On Mon, Jun 24, 2019 at 5:38 PM Chris Olivier wrote: > > one major advantage of intel/llvm omp is that it spawns a new thread pool > after fork if a thread pool was already created. this is so that omp can be > used in the forked processes. libgomp doesn’t do this so it’ll just lock up > if you try to do omp in the forked process. > > is your build linking libgomp as well? > > standard mkl build (from Makefile) uses same omp library. are there > problems with that build? > > what changes need to be made to make the assertion not fire? > > On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy > wrote: > > > There's an assertion which is easily reproducible, and also there's a > > crash including core dump, the latter is not easy to reproduce for me > > in different environments. I have also seen mxnet getting stuck > > without progressing with this build configuration and using no CPU at > > all when running unit tests. > > > > In my view, the root cause of the assertion is that we are re-entering > > OMP initialization when spawning threads on the following code through > > pthread_at_fork > > > > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58 > > > > This causes double initialization of the OMP engine, including the > > assertion which you are asking about, and I suspect some additional > > overhead. That's the shady forking part you are asking for. > > > > A question for you: What is the cause of runtime differences between > > OMP runtimes? Shouldn't the implementation overhead diminish as > > threads run longer? > > > > Pedro. > > > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier > > wrote: > > > > > > What’s the reason for the assertion failure? btw classifying an assertion > > > failure a “crash” is debatable. As I stated in the original issue a long > > > time ago, it’s possible something shady is being done with when forking > > > that should be fixed. The assertion should be root caused. > > > > > > > > > > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy < > > pedro.larroy.li...@gmail.com> > > > wrote: > > > > > > > Added a dockerfile, and reports of a crash in my local machine when > > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well. 
> > > > I couldn't reproduce the crash on my EC2 machine: > > > > Added the backtrace of the crash as well. > > > > > > > > https://github.com/apache/incubator-mxnet/issues/10856 > > > > > > > > Dockerfile here: > > > > > > > > https://github.com/larroy/mxnet_omp > > > > > > > > Kind regards. > > > > > > > > Pedro. > > > > > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu < > > marco.g.ab...@gmail.com> > > > > wrote: > > > > > > > > > > As already proposed, I think the easiest way to get a common > > > > understanding > > > > > is if we start with a few docker containers. Pedro, would it be > > possible > > > > > for you to wrap your benchmarks into a few containers that will > > produce > > > > > your shown results? That way, we can avoid possible > > misunderstandings and > > > > > also pinpoint the exact parts where people disagree or misunderstood > >
Re: OMP
one major advantage of intel/llvm omp is that it spawns a new thread pool after fork if a thread pool was already created. this is so that omp can be used in the forked processes. libgomp doesn’t do this so it’ll just lock up if you try to do omp in the forked process. is your build linking libgomp as well? standard mkl build (from Makefile) uses same omp library. are there problems with that build? what changes need to be made to make the assertion not fire? On Mon, Jun 24, 2019 at 5:32 PM Pedro Larroy wrote: > There's an assertion which is easily reproducible, and also there's a > crash including core dump, the latter is not easy to reproduce for me > in different environments. I have also seen mxnet getting stuck > without progressing with this build configuration and using no CPU at > all when running unit tests. > > In my view, the root cause of the assertion is that we are re-entering > OMP initialization when spawning threads on the following code through > pthread_at_fork > > https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58 > > This causes double initialization of the OMP engine, including the > assertion which you are asking about, and I suspect some additional > overhead. That's the shady forking part you are asking for. > > A question for you: What is the cause of runtime differences between > OMP runtimes? Shouldn't the implementation overhead diminish as > threads run longer? > > Pedro. > > On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier > wrote: > > > > What’s the reason for the assertion failure? btw classifying an assertion > > failure a “crash” is debatable. As I stated in the original issue a long > > time ago, it’s possible something shady is being done with when forking > > that should be fixed. The assertion should be root caused. > > > > > > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy < > pedro.larroy.li...@gmail.com> > > wrote: > > > > > Added a dockerfile, and reports of a crash in my local machine when > > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well. > > > I couldn't reproduce the crash on my EC2 machine: > > > Added the backtrace of the crash as well. > > > > > > https://github.com/apache/incubator-mxnet/issues/10856 > > > > > > Dockerfile here: > > > > > > https://github.com/larroy/mxnet_omp > > > > > > Kind regards. > > > > > > Pedro. > > > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu < > marco.g.ab...@gmail.com> > > > wrote: > > > > > > > > As already proposed, I think the easiest way to get a common > > > understanding > > > > is if we start with a few docker containers. Pedro, would it be > possible > > > > for you to wrap your benchmarks into a few containers that will > produce > > > > your shown results? That way, we can avoid possible > misunderstandings and > > > > also pinpoint the exact parts where people disagree or misunderstood > each > > > > other. > > > > > > > > -Marco > > > > > > > > Pedro Larroy schrieb am Do., 20. Juni > > > 2019, > > > > 21:47: > > > > > > > > > I can confirm that we are linking with two versions of omp, I'm > > > > > gaining more clarity into this topic, but I have still questions, > the > > > > > facts that I got so far are the folllowing: > > > > > > > > > > * #1: We are linking with two versions of omp, intel's omp and llvm > > > > > openmp when building with MKL enabled. 
> > > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes > with > > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This > > > > > one is used on the PR proposed by Anton). > > > > > > > > > > Questions: > > > > > > > > > > * #1 Is it ok to have two versions of openmp linked at the same > time? > > > > > * #2 Which implementation of OMP gives the best performance? (See > > > > > total training time of my measurement for a partial answer) > > > > > * #3 Should we have a build flag so we can choose the OMP version > at > > > > > runtime? > > > > > * #4 Which Compiler and build flags did Chris use to get 10x > slowdown? > > > > > * #5 @Stas: is there a script to replicate your benchmarks > easily? If > > > > > so could you provide a link? I th
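A small sketch of the behavior being described (hypothetical code; whether the child hangs is runtime-dependent, which is exactly the point): use OpenMP, fork, then use OpenMP again in the child. LLVM/Intel OMP is expected to rebuild its worker pool for the child; with libgomp the parent's workers simply do not exist in the child, and the second region can lock up as described above.

// fork_omp_sketch.cc -- hypothetical illustration, not MXNet code.
// Build: g++ -fopenmp fork_omp_sketch.cc
#include <omp.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

static void Region(const char* who) {
  #pragma omp parallel num_threads(4)
  {
    #pragma omp single
    std::printf("%s: team of %d threads\n", who, omp_get_num_threads());
  }
}

int main() {
  Region("parent before fork");   // the runtime's thread pool is created here

  pid_t pid = fork();
  if (pid == 0) {
    // With libgomp this region may never complete; LLVM/Intel OMP respawns
    // the pool for the child process.
    Region("child after fork");
    _exit(0);
  }
  waitpid(pid, nullptr, 0);
  Region("parent after fork");
  return 0;
}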
Re: OMP
There's an assertion which is easily reproducible, and also there's a crash including core dump, the latter is not easy to reproduce for me in different environments. I have also seen mxnet getting stuck without progressing with this build configuration and using no CPU at all when running unit tests. In my view, the root cause of the assertion is that we are re-entering OMP initialization when spawning threads on the following code through pthread_at_fork https://github.com/apache/incubator-mxnet/blob/master/src/initialize.cc#L58 This causes double initialization of the OMP engine, including the assertion which you are asking about, and I suspect some additional overhead. That's the shady forking part you are asking for. A question for you: What is the cause of runtime differences between OMP runtimes? Shouldn't the implementation overhead diminish as threads run longer? Pedro. On Mon, Jun 24, 2019 at 5:10 PM Chris Olivier wrote: > > What’s the reason for the assertion failure? btw classifying an assertion > failure a “crash” is debatable. As I stated in the original issue a long > time ago, it’s possible something shady is being done with when forking > that should be fixed. The assertion should be root caused. > > > > On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy > wrote: > > > Added a dockerfile, and reports of a crash in my local machine when > > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well. > > I couldn't reproduce the crash on my EC2 machine: > > Added the backtrace of the crash as well. > > > > https://github.com/apache/incubator-mxnet/issues/10856 > > > > Dockerfile here: > > > > https://github.com/larroy/mxnet_omp > > > > Kind regards. > > > > Pedro. > > > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu > > wrote: > > > > > > As already proposed, I think the easiest way to get a common > > understanding > > > is if we start with a few docker containers. Pedro, would it be possible > > > for you to wrap your benchmarks into a few containers that will produce > > > your shown results? That way, we can avoid possible misunderstandings and > > > also pinpoint the exact parts where people disagree or misunderstood each > > > other. > > > > > > -Marco > > > > > > Pedro Larroy schrieb am Do., 20. Juni > > 2019, > > > 21:47: > > > > > > > I can confirm that we are linking with two versions of omp, I'm > > > > gaining more clarity into this topic, but I have still questions, the > > > > facts that I got so far are the folllowing: > > > > > > > > * #1: We are linking with two versions of omp, intel's omp and llvm > > > > openmp when building with MKL enabled. > > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes with > > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This > > > > one is used on the PR proposed by Anton). > > > > > > > > Questions: > > > > > > > > * #1 Is it ok to have two versions of openmp linked at the same time? > > > > * #2 Which implementation of OMP gives the best performance? (See > > > > total training time of my measurement for a partial answer) > > > > * #3 Should we have a build flag so we can choose the OMP version at > > > > runtime? > > > > * #4 Which Compiler and build flags did Chris use to get 10x slowdown? > > > > * #5 @Stas: is there a script to replicate your benchmarks easily? If > > > > so could you provide a link? I think we would need to reproduce your > > > > benchmarks and verify which versions are being linked. 
It's possible > > > > that while compiling with MKL intel's omp was pulled in instead of > > > > GNU OpenMP. > > > > * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we > > > > update the subrepo regularly? > > > > > > > > My conclusion so far: > > > > > > > > * #1 We should avoid linking two versions of omp if possible and > > > > allow users to choose one in the build as we do for BLAS. > > > > * #2 For performance reasons and more control vs different compiler > > > > versions seems it makes indeed sense to keep the LLVM OpenMP version > > > > in 3rdparty for now. So unless some more data is gathered, it makes > > > > sense not to remove it as of now. > > > > * #3 We should provide build options to choose which openmp library > > > > is to be used from the three optio
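A hypothetical sketch of the pattern being described (not the actual initialize.cc code; the real handler restarts the execution engine rather than a bare thread): a pthread_atfork child handler that respawns worker threads which immediately enter OpenMP regions, so every fork() re-runs OpenMP setup in the child, concurrently with whatever else the child is initializing.

// atfork_sketch.cc -- hypothetical illustration, not the MXNet handler.
// Build: g++ -fopenmp -pthread atfork_sketch.cc
#include <omp.h>
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <thread>

static void StartWorker() {
  std::thread([] {
    #pragma omp parallel
    {
      #pragma omp single
      std::printf("pid %d: OMP team of %d threads\n",
                  static_cast<int>(getpid()), omp_get_num_threads());
    }
  }).detach();
}

static void AtforkChild() {
  // Runs in the child immediately after fork(); respawning workers that use
  // OpenMP forces the runtime to initialize again in the child process.
  StartWorker();
}

namespace {
struct RegisterHandlers {
  RegisterHandlers() {
    pthread_atfork(/*prepare=*/nullptr, /*parent=*/nullptr, AtforkChild);
    StartWorker();  // analogous to work started during library initialization
  }
} g_handlers;
}  // namespace

int main() {
  pid_t pid = fork();            // AtforkChild() fires in the child
  if (pid == 0) {
    usleep(200 * 1000);          // give the respawned worker time to run
    _exit(0);
  }
  waitpid(pid, nullptr, 0);
  return 0;
}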
Re: OMP
What’s the reason for the assertion failure? btw classifying an assertion failure a “crash” is debatable. As I stated in the original issue a long time ago, it’s possible something shady is being done with when forking that should be fixed. The assertion should be root caused. On Mon, Jun 24, 2019 at 1:22 PM Pedro Larroy wrote: > Added a dockerfile, and reports of a crash in my local machine when > running MKL+OMP+DEBUG, with Anton's branch the crash happened as well. > I couldn't reproduce the crash on my EC2 machine: > Added the backtrace of the crash as well. > > https://github.com/apache/incubator-mxnet/issues/10856 > > Dockerfile here: > > https://github.com/larroy/mxnet_omp > > Kind regards. > > Pedro. > > On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu > wrote: > > > > As already proposed, I think the easiest way to get a common > understanding > > is if we start with a few docker containers. Pedro, would it be possible > > for you to wrap your benchmarks into a few containers that will produce > > your shown results? That way, we can avoid possible misunderstandings and > > also pinpoint the exact parts where people disagree or misunderstood each > > other. > > > > -Marco > > > > Pedro Larroy schrieb am Do., 20. Juni > 2019, > > 21:47: > > > > > I can confirm that we are linking with two versions of omp, I'm > > > gaining more clarity into this topic, but I have still questions, the > > > facts that I got so far are the folllowing: > > > > > > * #1: We are linking with two versions of omp, intel's omp and llvm > > > openmp when building with MKL enabled. > > > * #2: We have 3 different possible OMP versions: Intel OMP (comes with > > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This > > > one is used on the PR proposed by Anton). > > > > > > Questions: > > > > > > * #1 Is it ok to have two versions of openmp linked at the same time? > > > * #2 Which implementation of OMP gives the best performance? (See > > > total training time of my measurement for a partial answer) > > > * #3 Should we have a build flag so we can choose the OMP version at > > > runtime? > > > * #4 Which Compiler and build flags did Chris use to get 10x slowdown? > > > * #5 @Stas: is there a script to replicate your benchmarks easily? If > > > so could you provide a link? I think we would need to reproduce your > > > benchmarks and verify which versions are being linked. It's possible > > > that while compiling with MKL intel's omp was pulled in instead of > > > GNU OpenMP. > > > * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we > > > update the subrepo regularly? > > > > > > My conclusion so far: > > > > > > * #1 We should avoid linking two versions of omp if possible and > > > allow users to choose one in the build as we do for BLAS. > > > * #2 For performance reasons and more control vs different compiler > > > versions seems it makes indeed sense to keep the LLVM OpenMP version > > > in 3rdparty for now. So unless some more data is gathered, it makes > > > sense not to remove it as of now. > > > * #3 We should provide build options to choose which openmp library > > > is to be used from the three options available, including libgomp. 
> > > * #4 Refining the build we could also enable OpenMP in mac without > > > additional contortions (doesn't work as of today): > > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/ > > > * #5 We should add different omp versions to our benchmarks and track > > > the performance, so this data is available for prescribing the best > > > build options and for binary releases. > > > > > > This is also an interesting related gh issue posted in the mkl-dnn > > > repository: https://github.com/intel/mkl-dnn/issues/230 > > > > > > > > > I don't observe the order of magnitude divergence reported by Chris in > > > vanilla Ubuntu 18.04 in samples / s but the full training finishes > > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs libgomp. > > > > > > There's also differences in training time when using MKL and the , > > > it's actually a bit slower, I don't know if it's related to OMP. > > > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) > > > > > > Anton's branch: g...@github.com:lebeg/incubator-mxnet.git branch > 'omp' > > > (py3_venv) piotr@ec2 c
Re: OMP
Added a dockerfile, and reports of a crash in my local machine when running MKL+OMP+DEBUG, with Anton's branch the crash happened as well. I couldn't reproduce the crash on my EC2 machine: Added the backtrace of the crash as well. https://github.com/apache/incubator-mxnet/issues/10856 Dockerfile here: https://github.com/larroy/mxnet_omp Kind regards. Pedro. On Thu, Jun 20, 2019 at 5:29 PM Marco de Abreu wrote: > > As already proposed, I think the easiest way to get a common understanding > is if we start with a few docker containers. Pedro, would it be possible > for you to wrap your benchmarks into a few containers that will produce > your shown results? That way, we can avoid possible misunderstandings and > also pinpoint the exact parts where people disagree or misunderstood each > other. > > -Marco > > Pedro Larroy schrieb am Do., 20. Juni 2019, > 21:47: > > > I can confirm that we are linking with two versions of omp, I'm > > gaining more clarity into this topic, but I have still questions, the > > facts that I got so far are the folllowing: > > > > * #1: We are linking with two versions of omp, intel's omp and llvm > > openmp when building with MKL enabled. > > * #2: We have 3 different possible OMP versions: Intel OMP (comes with > > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This > > one is used on the PR proposed by Anton). > > > > Questions: > > > > * #1 Is it ok to have two versions of openmp linked at the same time? > > * #2 Which implementation of OMP gives the best performance? (See > > total training time of my measurement for a partial answer) > > * #3 Should we have a build flag so we can choose the OMP version at > > runtime? > > * #4 Which Compiler and build flags did Chris use to get 10x slowdown? > > * #5 @Stas: is there a script to replicate your benchmarks easily? If > > so could you provide a link? I think we would need to reproduce your > > benchmarks and verify which versions are being linked. It's possible > > that while compiling with MKL intel's omp was pulled in instead of > > GNU OpenMP. > > * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we > > update the subrepo regularly? > > > > My conclusion so far: > > > > * #1 We should avoid linking two versions of omp if possible and > > allow users to choose one in the build as we do for BLAS. > > * #2 For performance reasons and more control vs different compiler > > versions seems it makes indeed sense to keep the LLVM OpenMP version > > in 3rdparty for now. So unless some more data is gathered, it makes > > sense not to remove it as of now. > > * #3 We should provide build options to choose which openmp library > > is to be used from the three options available, including libgomp. > > * #4 Refining the build we could also enable OpenMP in mac without > > additional contortions (doesn't work as of today): > > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/ > > * #5 We should add different omp versions to our benchmarks and track > > the performance, so this data is available for prescribing the best > > build options and for binary releases. > > > > This is also an interesting related gh issue posted in the mkl-dnn > > repository: https://github.com/intel/mkl-dnn/issues/230 > > > > > > I don't observe the order of magnitude divergence reported by Chris in > > vanilla Ubuntu 18.04 in samples / s but the full training finishes > > indeed faster with the OMP from 3rdparty (LLVM openmp) vs libgomp. 
> > > > There's also differences in training time when using MKL and the , > > it's actually a bit slower, I don't know if it's related to OMP. > > > > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) > > > > Anton's branch: g...@github.com:lebeg/incubator-mxnet.git branch 'omp' > > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd > > build/libmxnet.so |grep -i omp > > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 > > (0x7fd99a51d000) > > > > time python train_mnist.py > > > > INFO:root:Epoch[18] Validation-accuracy=0.984176 > > INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 samples/sec > > accuracy=1.00 > > INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec > > accuracy=0.999531 > > INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec > > accuracy=0.999687 > > INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec > > accuracy=1.00 > > INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec &
Re: OMP
As already proposed, I think the easiest way to get a common understanding is if we start with a few docker containers. Pedro, would it be possible for you to wrap your benchmarks into a few containers that will produce your shown results? That way, we can avoid possible misunderstandings and also pinpoint the exact parts where people disagree or misunderstood each other. -Marco Pedro Larroy schrieb am Do., 20. Juni 2019, 21:47: > I can confirm that we are linking with two versions of omp, I'm > gaining more clarity into this topic, but I have still questions, the > facts that I got so far are the folllowing: > > * #1: We are linking with two versions of omp, intel's omp and llvm > openmp when building with MKL enabled. > * #2: We have 3 different possible OMP versions: Intel OMP (comes with > MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (This > one is used on the PR proposed by Anton). > > Questions: > > * #1 Is it ok to have two versions of openmp linked at the same time? > * #2 Which implementation of OMP gives the best performance? (See > total training time of my measurement for a partial answer) > * #3 Should we have a build flag so we can choose the OMP version at > runtime? > * #4 Which Compiler and build flags did Chris use to get 10x slowdown? > * #5 @Stas: is there a script to replicate your benchmarks easily? If > so could you provide a link? I think we would need to reproduce your > benchmarks and verify which versions are being linked. It's possible > that while compiling with MKL intel's omp was pulled in instead of > GNU OpenMP. > * #6 @Chris: how to maintain the copy of LLVM's Openmp? Should we > update the subrepo regularly? > > My conclusion so far: > > * #1 We should avoid linking two versions of omp if possible and > allow users to choose one in the build as we do for BLAS. > * #2 For performance reasons and more control vs different compiler > versions seems it makes indeed sense to keep the LLVM OpenMP version > in 3rdparty for now. So unless some more data is gathered, it makes > sense not to remove it as of now. > * #3 We should provide build options to choose which openmp library > is to be used from the three options available, including libgomp. > * #4 Refining the build we could also enable OpenMP in mac without > additional contortions (doesn't work as of today): > https://iscinumpy.gitlab.io/post/omp-on-high-sierra/ > * #5 We should add different omp versions to our benchmarks and track > the performance, so this data is available for prescribing the best > build options and for binary releases. > > This is also an interesting related gh issue posted in the mkl-dnn > repository: https://github.com/intel/mkl-dnn/issues/230 > > > I don't observe the order of magnitude divergence reported by Chris in > vanilla Ubuntu 18.04 in samples / s but the full training finishes > indeed faster with the OMP from 3rdparty (LLVM openmp) vs libgomp. > > There's also differences in training time when using MKL and the , > it's actually a bit slower, I don't know if it's related to OMP. 
> > gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) > > Anton's branch: g...@github.com:lebeg/incubator-mxnet.git branch 'omp' > (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd > build/libmxnet.so |grep -i omp > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 > (0x7fd99a51d000) > > time python train_mnist.py > > INFO:root:Epoch[18] Validation-accuracy=0.984176 > INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 samples/sec > accuracy=1.00 > INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec > accuracy=0.999531 > INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec > accuracy=0.999687 > INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec > accuracy=1.00 > INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec > accuracy=0.999687 > INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 samples/sec > accuracy=0.999687 > INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 samples/sec > accuracy=0.999375 > INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 samples/sec > accuracy=0.999844 > INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 samples/sec > accuracy=0.999531 > INFO:root:Epoch[19] Train-accuracy=0.999717 > INFO:root:Epoch[19] Time cost=1.219 > INFO:root:Epoch[19] Validation-accuracy=0.983977 > 1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata > 1146052maxresident)k > 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps > > Master, MKL ON: > > (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd > ../../build/libmxne
Re: OMP
I can confirm that we are linking with two versions of omp; I'm gaining more clarity on this topic, but I still have questions. The facts I have so far are the following: * #1: We are linking with two versions of omp, Intel's OMP and LLVM OpenMP, when building with MKL enabled. * #2: We have 3 different possible OMP versions: Intel OMP (comes with MKL), LLVM OpenMP (3rdparty/openmp), libgomp (comes with gcc) (this one is used in the PR proposed by Anton). Questions: * #1 Is it OK to have two versions of OpenMP linked at the same time? * #2 Which implementation of OMP gives the best performance? (See the total training time of my measurement for a partial answer) * #3 Should we have a build flag so we can choose the OMP version at runtime? * #4 Which compiler and build flags did Chris use to get the 10x slowdown? * #5 @Stas: is there a script to replicate your benchmarks easily? If so, could you provide a link? I think we would need to reproduce your benchmarks and verify which versions are being linked (a small verification sketch follows below, after my benchmark output). It's possible that while compiling with MKL, Intel's OMP was pulled in instead of GNU OpenMP. * #6 @Chris: how should we maintain the copy of LLVM's OpenMP? Should we update the subrepo regularly? My conclusions so far: * #1 We should avoid linking two versions of omp if possible and allow users to choose one in the build as we do for BLAS. * #2 For performance reasons, and for more control across different compiler versions, it does seem to make sense to keep the LLVM OpenMP version in 3rdparty for now. So unless more data is gathered, it makes sense not to remove it as of now. * #3 We should provide build options to choose which OpenMP library is to be used from the three options available, including libgomp. * #4 By refining the build we could also enable OpenMP on Mac without additional contortions (it doesn't work as of today): https://iscinumpy.gitlab.io/post/omp-on-high-sierra/ * #5 We should add the different OMP versions to our benchmarks and track the performance, so this data is available for prescribing the best build options and for binary releases. There is also an interesting related GitHub issue in the mkl-dnn repository: https://github.com/intel/mkl-dnn/issues/230 I don't observe the order-of-magnitude divergence in samples/s reported by Chris on vanilla Ubuntu 18.04, but the full training does finish faster with the OMP from 3rdparty (LLVM OpenMP) vs libgomp. There are also differences in training time when using MKL; it's actually a bit slower, I don't know if it's related to OMP. 
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) Anton's branch: g...@github.com:lebeg/incubator-mxnet.git branch 'omp' (py3_venv) piotr@ec2 cpu:0: ~/mxnet_openmp [omp]> ldd build/libmxnet.so |grep -i omp libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x7fd99a51d000) time python train_mnist.py INFO:root:Epoch[18] Validation-accuracy=0.984176 INFO:root:Epoch[19] Batch [0-100] Speed: 41617.00 samples/sec accuracy=1.00 INFO:root:Epoch[19] Batch [100-200] Speed: 47990.69 samples/sec accuracy=0.999531 INFO:root:Epoch[19] Batch [200-300] Speed: 47517.01 samples/sec accuracy=0.999687 INFO:root:Epoch[19] Batch [300-400] Speed: 47430.53 samples/sec accuracy=1.00 INFO:root:Epoch[19] Batch [400-500] Speed: 47649.77 samples/sec accuracy=0.999687 INFO:root:Epoch[19] Batch [500-600] Speed: 51708.12 samples/sec accuracy=0.999687 INFO:root:Epoch[19] Batch [600-700] Speed: 57228.63 samples/sec accuracy=0.999375 INFO:root:Epoch[19] Batch [700-800] Speed: 50887.85 samples/sec accuracy=0.999844 INFO:root:Epoch[19] Batch [800-900] Speed: 53947.98 samples/sec accuracy=0.999531 INFO:root:Epoch[19] Train-accuracy=0.999717 INFO:root:Epoch[19] Time cost=1.219 INFO:root:Epoch[19] Validation-accuracy=0.983977 1011.98user 26.78system 0:31.54elapsed 3292%CPU (0avgtext+0avgdata 1146052maxresident)k 0inputs+0outputs (0major+3496364minor)pagefaults 0swaps Master, MKL ON: (py3_venv) piotr@ec2 cpu:1: ~/m/e/image-classification [master]> ldd ../../build/libmxnet.so | grep -i omp libomp.so => /home/piotr/mxnet_master/build/3rdparty/openmp/runtime/src/libomp.so (0x7f05ba38f000) libiomp5.so => /home/piotr/mxnet_master/build/mklml/mklml_lnx_2019.0.5.20190502/lib/libiomp5.so (0x7f05b09f4000) INFO:root:Epoch[18] Validation-accuracy=0.982484 INFO:root:Epoch[19] Batch [0-100] Speed: 36651.63 samples/sec accuracy=0.999691 INFO:root:Epoch[19] Batch [100-200] Speed: 45093.98 samples/sec accuracy=0.999844 INFO:root:Epoch[19] Batch [200-300] Speed: 45146.84 samples/sec accuracy=0.999687 INFO:root:Epoch[19] Batch [300-400] Speed: 45119.90 samples/sec accuracy=0.999687 INFO:root:Epoch[19] Batch [400-500] Speed: 44998.96 samples/sec accuracy=0.999531 INFO:root:Epoch[19] Batch [500-600] Speed: 45072.25 samples/sec accuracy=0.999844 INFO:root:Epoch[19] Batch [600-700] Speed: 44969.79 sample
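As a follow-up to question #5 above, here is a small, self-contained C++ sketch (illustrative only; not part of MXNet and not Stas's benchmark script, and the file and function names are made up) that reports which OpenMP runtimes are actually mapped into a process. Unlike running ldd on the .so, this looks at what is loaded in the live process, including libraries that arrive indirectly, so it can confirm whether a benchmarked setup really ended up with one runtime or with two.

// check_omp_runtimes.cc -- illustrative sketch (not part of MXNet): report
// every shared object mapped into this process whose name looks like an
// OpenMP runtime (libgomp, libomp, libiomp5).
#include <dlfcn.h>   // dlopen, dlerror
#include <link.h>    // dl_iterate_phdr
#include <cstdio>
#include <cstring>

static int callback(struct dl_phdr_info* info, size_t /*size*/, void* data) {
  int* count = static_cast<int*>(data);
  const char* name = info->dlpi_name;
  if (name && (std::strstr(name, "libgomp") || std::strstr(name, "libomp") ||
               std::strstr(name, "libiomp"))) {
    std::printf("OpenMP runtime mapped: %s\n", name);
    ++*count;
  }
  return 0;  // 0 = keep iterating over the remaining shared objects
}

int main(int argc, char** argv) {
  // Optionally load the library under test first, e.g.:
  //   ./check_omp /path/to/libmxnet.so
  if (argc > 1 && dlopen(argv[1], RTLD_NOW | RTLD_GLOBAL) == nullptr) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 2;
  }
  int count = 0;
  dl_iterate_phdr(callback, &count);
  if (count > 1)
    std::printf("WARNING: %d OpenMP runtimes are loaded in this process\n", count);
  return count > 1 ? 1 : 0;
}

Built with something like "g++ -std=c++11 check_omp_runtimes.cc -ldl -o check_omp", running "./check_omp /path/to/libmxnet.so" would print each OpenMP runtime found and warn if more than one is loaded.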
Re: OMP
"if you’re linking in two then you’re doing something wrong." Correct, that's one thing I believe we've got consensus on. So let's call that out as a bug to be fixed. Let's move forward with some reproducible numbers and then discuss the pros / cons of which particular OMP implementation we should use. On Wed, Jun 19, 2019 at 3:06 PM Pedro Larroy wrote: > Hi Chris > > I would ask you to have a bit of patience and help us with your > experience in this matter. Nobody is ignoring anything, I think we are > individually gathering feedbacks and trying to understand the multiple > contributions done to this topic including yours, then go step by > step, understand what is going on and run experiments and report back > to the list or the corresponding github item. It was suggested by > Kellen to prepare some containers, this takes effort. > > Regarding your final comment, most of us also have many other things > to do and responsibilities even if our daytime jobs might involve > MXNet in some form or another. I think that's part of the privilege > and responsibility of working close with an open source project and > the magic of collaboration across organizations. Let's all be patient > and take some time to understand and reason about this topic which is > not simple. Since we decided to step back and gather more data let's > take time and do it properly. > > Personally I hope to find time to look again into this issue before > the end of the week. > > Thanks. > > Pedro. > > On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier > wrote: > > > > if you’re linking in two then you’re doing something wrong. You can see > by > > my email yesterday that only one is linked in. This is also the case with > > the mkl version built by the Makefile — only the Intel OMP library is > used > > (no libgomp). > > > > That being said, Do you have clear evidence that using Intel OMP is both > > problematic and the situation isn’t fixable? The burden of proof is on > the > > ones requesting the change — it is not my responsibility to justify the > > current state. There must be something “terrible” and unfixable to > justify > > a change. I have seen no proof of this in all this time. > > > > On a side note, I mentioned a couple of things in my email yesterday that > > still are not being responded to (they were also ignored in the last > > incarnation of this “discussion” — I have much experience in this matter > to > > assume “discussion” is a waste of my time, seeing and I am not paid to > > “work on” mxnet like y’all are). > > > > -C > > > > > > > > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland < > > kellen.sunderl...@gmail.com> wrote: > > > > > I've also quite often seen two versions of OpenMP linked. I think we > can > > > all agree we probably want to avoid linking in two libraries that do > > > effectively the same thing. > > > > > > The performance questions should be fairly straight forward to > demonstrate > > > right? Could we just collaborate on a few minimal Dockerfiles that > show > > > (or don't show) Intel OpenMP performance speedups with the workloads > Chris > > > is referencing? > > > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav < > > > stanislav.tsuk...@gmail.com> wrote: > > > > > > > Hi, Chris! > > > > > > > > Stas here - I've gathered that performance data. > > > > Sure thing, I can be wrong, but please elaborate a bit on what we are > > > > missing. > > > > Be assured, intentional misdirection was never a case. > > > > > > > > Thanks a lot for being constructive. 
> > > > > > > > > Turning Intel OMP on and off (and MKL as well, since it tends to > pull > > > in > > > > omp, depending which one is linked in). > > > > > > > > We never ever considered turning MKL off. We are on the same page > here - > > > > MKL is crucial for the performance. > > > > Why should we? There's a GOMP-linked version of MKL, that we can use. > > > > > > > > What we did - we measured, if using compilers default OpenMP > > > > implementation instead of referenced source code distribution of > OpenMP > > > > makes anything slower. > > > > We have found the impact to be hardly measurable. > > > > The difference between GOMP and iOMP is <5% on our benchmarks, most > of > > > the > > > > time less than t
Re: OMP
Hi Chris I would ask you to have a bit of patience and help us with your experience in this matter. Nobody is ignoring anything, I think we are individually gathering feedbacks and trying to understand the multiple contributions done to this topic including yours, then go step by step, understand what is going on and run experiments and report back to the list or the corresponding github item. It was suggested by Kellen to prepare some containers, this takes effort. Regarding your final comment, most of us also have many other things to do and responsibilities even if our daytime jobs might involve MXNet in some form or another. I think that's part of the privilege and responsibility of working close with an open source project and the magic of collaboration across organizations. Let's all be patient and take some time to understand and reason about this topic which is not simple. Since we decided to step back and gather more data let's take time and do it properly. Personally I hope to find time to look again into this issue before the end of the week. Thanks. Pedro. On Wed, Jun 19, 2019 at 2:43 PM Chris Olivier wrote: > > if you’re linking in two then you’re doing something wrong. You can see by > my email yesterday that only one is linked in. This is also the case with > the mkl version built by the Makefile — only the Intel OMP library is used > (no libgomp). > > That being said, Do you have clear evidence that using Intel OMP is both > problematic and the situation isn’t fixable? The burden of proof is on the > ones requesting the change — it is not my responsibility to justify the > current state. There must be something “terrible” and unfixable to justify > a change. I have seen no proof of this in all this time. > > On a side note, I mentioned a couple of things in my email yesterday that > still are not being responded to (they were also ignored in the last > incarnation of this “discussion” — I have much experience in this matter to > assume “discussion” is a waste of my time, seeing and I am not paid to > “work on” mxnet like y’all are). > > -C > > > > > > > On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland < > kellen.sunderl...@gmail.com> wrote: > > > I've also quite often seen two versions of OpenMP linked. I think we can > > all agree we probably want to avoid linking in two libraries that do > > effectively the same thing. > > > > The performance questions should be fairly straight forward to demonstrate > > right? Could we just collaborate on a few minimal Dockerfiles that show > > (or don't show) Intel OpenMP performance speedups with the workloads Chris > > is referencing? > > > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav < > > stanislav.tsuk...@gmail.com> wrote: > > > > > Hi, Chris! > > > > > > Stas here - I've gathered that performance data. > > > Sure thing, I can be wrong, but please elaborate a bit on what we are > > > missing. > > > Be assured, intentional misdirection was never a case. > > > > > > Thanks a lot for being constructive. > > > > > > > Turning Intel OMP on and off (and MKL as well, since it tends to pull > > in > > > omp, depending which one is linked in). > > > > > > We never ever considered turning MKL off. We are on the same page here - > > > MKL is crucial for the performance. > > > Why should we? There's a GOMP-linked version of MKL, that we can use. > > > > > > What we did - we measured, if using compilers default OpenMP > > > implementation instead of referenced source code distribution of OpenMP > > > makes anything slower. 
> > > We have found the impact to be hardly measurable. > > > The difference between GOMP and iOMP is <5% on our benchmarks, most of > > the > > > time less than that. > > > > > > We just suggest to simplify the build of mxnet, by removing the > > > unnecessary dependency. > > > > > > During that we discovered for example the following amazing issue: > > > https://github.com/apache/incubator-mxnet/issues/14087 > > > > > > Best Regards > > > > > > Stas > > > > > > On 18.06.19, 18:24, "Chris Olivier" wrote: > > > > > > I am very reluctant to feed the trolls again, and this will be teh > > last > > > time I address Pedro or Anton on the subject, but since I think the > > > numbers > > > being presented are incorrect (either by te builders not really > > > understanding what they are building, or possibly intentional > > > misdirection): > &g
Re: OMP
if you’re linking in two then you’re doing something wrong. You can see by my email yesterday that only one is linked in. This is also the case with the mkl version built by the Makefile — only the Intel OMP library is used (no libgomp). That being said, Do you have clear evidence that using Intel OMP is both problematic and the situation isn’t fixable? The burden of proof is on the ones requesting the change — it is not my responsibility to justify the current state. There must be something “terrible” and unfixable to justify a change. I have seen no proof of this in all this time. On a side note, I mentioned a couple of things in my email yesterday that still are not being responded to (they were also ignored in the last incarnation of this “discussion” — I have much experience in this matter to assume “discussion” is a waste of my time, seeing and I am not paid to “work on” mxnet like y’all are). -C On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland < kellen.sunderl...@gmail.com> wrote: > I've also quite often seen two versions of OpenMP linked. I think we can > all agree we probably want to avoid linking in two libraries that do > effectively the same thing. > > The performance questions should be fairly straight forward to demonstrate > right? Could we just collaborate on a few minimal Dockerfiles that show > (or don't show) Intel OpenMP performance speedups with the workloads Chris > is referencing? > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav < > stanislav.tsuk...@gmail.com> wrote: > > > Hi, Chris! > > > > Stas here - I've gathered that performance data. > > Sure thing, I can be wrong, but please elaborate a bit on what we are > > missing. > > Be assured, intentional misdirection was never a case. > > > > Thanks a lot for being constructive. > > > > > Turning Intel OMP on and off (and MKL as well, since it tends to pull > in > > omp, depending which one is linked in). > > > > We never ever considered turning MKL off. We are on the same page here - > > MKL is crucial for the performance. > > Why should we? There's a GOMP-linked version of MKL, that we can use. > > > > What we did - we measured, if using compilers default OpenMP > > implementation instead of referenced source code distribution of OpenMP > > makes anything slower. > > We have found the impact to be hardly measurable. > > The difference between GOMP and iOMP is <5% on our benchmarks, most of > the > > time less than that. > > > > We just suggest to simplify the build of mxnet, by removing the > > unnecessary dependency. > > > > During that we discovered for example the following amazing issue: > > https://github.com/apache/incubator-mxnet/issues/14087 > > > > Best Regards > > > > Stas > > > > On 18.06.19, 18:24, "Chris Olivier" wrote: > > > > I am very reluctant to feed the trolls again, and this will be teh > last > > time I address Pedro or Anton on the subject, but since I think the > > numbers > > being presented are incorrect (either by te builders not really > > understanding what they are building, or possibly intentional > > misdirection): > > > > Turning Intel OMP on and off (and MKL as well, since it tends to pull > > in > > omp, depending which one is linked in). > > There is a HUGE difference. This is consistent with my experience > > before > > when it was added. 
> > > > > > default mnist: > > > > python ../example/image-classification/train_mnist.py > > INFO:root:start with arguments Namespace(add_stn=False, > batch_size=64, > > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', > > gpus=None, image_shape='1, 28, 28', initializer='default', > > kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, > > lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, > > monitor=0, network='mlp', num_classes=10, num_epochs=20, > > num_examples=6, num_layers=None, optimizer='sgd', > > profile_server_suffix='', profile_worker_suffix='', save_period=1, > > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', > > wd=0.0001) > > > > INTEL OMP: > > > > ldd libmxnet.so | grep omp > > libomp.so => > > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so > > (0x7f978fde7000) > > > > :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec > > accuracy=0.780012 > > INFO:root:Epoch[0] Batch [100-20
Re: OMP
+1 Would be best to have a controlled environment so we can reason about how MXNet is being built and what libraries are linked. I'm happy to help here. I would think docker won't have a big impact on the measurement or distort the results much. On Wed, Jun 19, 2019 at 10:28 AM kellen sunderland wrote: > > I've also quite often seen two versions of OpenMP linked. I think we can > all agree we probably want to avoid linking in two libraries that do > effectively the same thing. > > The performance questions should be fairly straight forward to demonstrate > right? Could we just collaborate on a few minimal Dockerfiles that show > (or don't show) Intel OpenMP performance speedups with the workloads Chris > is referencing? > > On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav < > stanislav.tsuk...@gmail.com> wrote: > > > Hi, Chris! > > > > Stas here - I've gathered that performance data. > > Sure thing, I can be wrong, but please elaborate a bit on what we are > > missing. > > Be assured, intentional misdirection was never a case. > > > > Thanks a lot for being constructive. > > > > > Turning Intel OMP on and off (and MKL as well, since it tends to pull in > > omp, depending which one is linked in). > > > > We never ever considered turning MKL off. We are on the same page here - > > MKL is crucial for the performance. > > Why should we? There's a GOMP-linked version of MKL, that we can use. > > > > What we did - we measured, if using compilers default OpenMP > > implementation instead of referenced source code distribution of OpenMP > > makes anything slower. > > We have found the impact to be hardly measurable. > > The difference between GOMP and iOMP is <5% on our benchmarks, most of the > > time less than that. > > > > We just suggest to simplify the build of mxnet, by removing the > > unnecessary dependency. > > > > During that we discovered for example the following amazing issue: > > https://github.com/apache/incubator-mxnet/issues/14087 > > > > Best Regards > > > > Stas > > > > On 18.06.19, 18:24, "Chris Olivier" wrote: > > > > I am very reluctant to feed the trolls again, and this will be teh last > > time I address Pedro or Anton on the subject, but since I think the > > numbers > > being presented are incorrect (either by te builders not really > > understanding what they are building, or possibly intentional > > misdirection): > > > > Turning Intel OMP on and off (and MKL as well, since it tends to pull > > in > > omp, depending which one is linked in). > > There is a HUGE difference. This is consistent with my experience > > before > > when it was added. 
> > > > > > default mnist: > > > > python ../example/image-classification/train_mnist.py > > INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, > > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', > > gpus=None, image_shape='1, 28, 28', initializer='default', > > kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, > > lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, > > monitor=0, network='mlp', num_classes=10, num_epochs=20, > > num_examples=6, num_layers=None, optimizer='sgd', > > profile_server_suffix='', profile_worker_suffix='', save_period=1, > > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', > > wd=0.0001) > > > > INTEL OMP: > > > > ldd libmxnet.so | grep omp > > libomp.so => > > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so > > (0x7f978fde7000) > > > > :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec > > accuracy=0.780012 > > INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec > > accuracy=0.920469 > > INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec > > accuracy=0.928281 > > INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec > > accuracy=0.942813 > > INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec > > accuracy=0.938750 > > INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec > > accuracy=0.946562 > > INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec > > accuracy=0.953281 > > INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec > >
Re: OMP
I've also quite often seen two versions of OpenMP linked. I think we can all agree we probably want to avoid linking in two libraries that do effectively the same thing. The performance questions should be fairly straight forward to demonstrate right? Could we just collaborate on a few minimal Dockerfiles that show (or don't show) Intel OpenMP performance speedups with the workloads Chris is referencing? On Wed, Jun 19, 2019 at 4:44 AM Tsukrov, Stanislav < stanislav.tsuk...@gmail.com> wrote: > Hi, Chris! > > Stas here - I've gathered that performance data. > Sure thing, I can be wrong, but please elaborate a bit on what we are > missing. > Be assured, intentional misdirection was never a case. > > Thanks a lot for being constructive. > > > Turning Intel OMP on and off (and MKL as well, since it tends to pull in > omp, depending which one is linked in). > > We never ever considered turning MKL off. We are on the same page here - > MKL is crucial for the performance. > Why should we? There's a GOMP-linked version of MKL, that we can use. > > What we did - we measured, if using compilers default OpenMP > implementation instead of referenced source code distribution of OpenMP > makes anything slower. > We have found the impact to be hardly measurable. > The difference between GOMP and iOMP is <5% on our benchmarks, most of the > time less than that. > > We just suggest to simplify the build of mxnet, by removing the > unnecessary dependency. > > During that we discovered for example the following amazing issue: > https://github.com/apache/incubator-mxnet/issues/14087 > > Best Regards > > Stas > > On 18.06.19, 18:24, "Chris Olivier" wrote: > > I am very reluctant to feed the trolls again, and this will be teh last > time I address Pedro or Anton on the subject, but since I think the > numbers > being presented are incorrect (either by te builders not really > understanding what they are building, or possibly intentional > misdirection): > > Turning Intel OMP on and off (and MKL as well, since it tends to pull > in > omp, depending which one is linked in). > There is a HUGE difference. This is consistent with my experience > before > when it was added. 
> > > default mnist: > > python ../example/image-classification/train_mnist.py > INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', > gpus=None, image_shape='1, 28, 28', initializer='default', > kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, > lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, > monitor=0, network='mlp', num_classes=10, num_epochs=20, > num_examples=6, num_layers=None, optimizer='sgd', > profile_server_suffix='', profile_worker_suffix='', save_period=1, > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', > wd=0.0001) > > INTEL OMP: > > ldd libmxnet.so | grep omp > libomp.so => > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so > (0x7f978fde7000) > > :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec > accuracy=0.780012 > INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec > accuracy=0.920469 > INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec > accuracy=0.928281 > INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec > accuracy=0.942813 > INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec > accuracy=0.938750 > INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec > accuracy=0.946562 > INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec > accuracy=0.953281 > INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec > accuracy=0.951562 > INFO:root:Epoch[0] Batch [800-900] Speed: 13959.88 samples/sec > accuracy=0.957500 > INFO:root:Epoch[0] Train-accuracy=0.925423 > INFO:root:Epoch[0] Time cost=3.806 > INFO:root:Epoch[0] Validation-accuracy=0.962580 > INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec > accuracy=0.968131 > INFO:root:Epoch[1] Batch [100-200] Speed: 23457.03 samples/sec > accuracy=0.966250 > > > LIBGOMP: > > ldd libmxnet.so | grep omp > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 > (0x7f25c25dd000) > > INFO:root:Epoch[0] Batch [0-100]Speed: 1731.01 samples/sec > accuracy=0.782488 > INFO:root:Epoch[0] Batch [1
Re: OMP
Hi, Chris! Stas here - I've gathered that performance data. Sure thing, I can be wrong, but please elaborate a bit on what we are missing. Be assured, intentional misdirection was never a case. Thanks a lot for being constructive. > Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp, > depending which one is linked in). We never ever considered turning MKL off. We are on the same page here - MKL is crucial for the performance. Why should we? There's a GOMP-linked version of MKL, that we can use. What we did - we measured, if using compilers default OpenMP implementation instead of referenced source code distribution of OpenMP makes anything slower. We have found the impact to be hardly measurable. The difference between GOMP and iOMP is <5% on our benchmarks, most of the time less than that. We just suggest to simplify the build of mxnet, by removing the unnecessary dependency. During that we discovered for example the following amazing issue: https://github.com/apache/incubator-mxnet/issues/14087 Best Regards Stas On 18.06.19, 18:24, "Chris Olivier" wrote: I am very reluctant to feed the trolls again, and this will be teh last time I address Pedro or Anton on the subject, but since I think the numbers being presented are incorrect (either by te builders not really understanding what they are building, or possibly intentional misdirection): Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp, depending which one is linked in). There is a HUGE difference. This is consistent with my experience before when it was added. default mnist: python ../example/image-classification/train_mnist.py INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=6, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001) INTEL OMP: ldd libmxnet.so | grep omp libomp.so => /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so (0x7f978fde7000) :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec accuracy=0.780012 INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec accuracy=0.920469 INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec accuracy=0.928281 INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec accuracy=0.942813 INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec accuracy=0.938750 INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec accuracy=0.946562 INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec accuracy=0.953281 INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec accuracy=0.951562 INFO:root:Epoch[0] Batch [800-900] Speed: 13959.88 samples/sec accuracy=0.957500 INFO:root:Epoch[0] Train-accuracy=0.925423 INFO:root:Epoch[0] Time cost=3.806 INFO:root:Epoch[0] Validation-accuracy=0.962580 INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec accuracy=0.968131 INFO:root:Epoch[1] Batch [100-200] Speed: 23457.03 samples/sec accuracy=0.966250 LIBGOMP: ldd libmxnet.so | grep omp libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x7f25c25dd000) INFO:root:Epoch[0] Batch [0-100]Speed: 1731.01 
samples/sec accuracy=0.782488 INFO:root:Epoch[0] Batch [100-200] Speed: 3551.32 samples/sec accuracy=0.907813 INFO:root:Epoch[0] Batch [200-300] Speed: 1991.00 samples/sec accuracy=0.927188 INFO:root:Epoch[0] Batch [300-400] Speed: 2175.45 samples/sec accuracy=0.937969 INFO:root:Epoch[0] Batch [400-500] Speed: 1644.95 samples/sec accuracy=0.942187 INFO:root:Epoch[0] Batch [500-600] Speed: 6444.58 samples/sec accuracy=0.950156 INFO:root:Epoch[0] Batch [600-700] Speed: 7842.16 samples/sec accuracy=0.947969 INFO:root:Epoch[0] Batch [700-800] Speed: 9412.07 samples/sec accuracy=0.953750 INFO:root:Epoch[0] Batch [800-900] Speed: 12707.58 samples/sec accuracy=0.953125 That being said, there's other issued beyond speed. The DEFAULT build from makefile (not CMake) uses Intel OMP mkl (I showed before) and mysteriously it has no issues? This seems highly suspicious. All I see is a lot of h
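To complement the end-to-end train_mnist.py numbers discussed above, here is a minimal, self-contained micro-benchmark sketch (illustrative only, not the benchmark script Stas used) that times the entry/exit cost of a short OpenMP parallel-for region, which is where libgomp, LLVM OpenMP and Intel OMP are most likely to differ for the small loops MXNet operators tend to run. Compiling the same file once per runtime and comparing the printed number would give a data point independent of MXNet's data pipeline.

// omp_overhead.cc -- illustrative sketch: average cost of a short parallel-for
// region. Build e.g. with "g++ -O2 -fopenmp omp_overhead.cc -o omp_overhead"
// (libgomp), or point the link step at libomp/libiomp5 to compare runtimes.
#include <omp.h>
#include <chrono>
#include <cstdio>

int main() {
  const int kRepeats = 10000;  // number of parallel regions to time
  const int kWork = 1000;      // tiny per-region loop, so overhead dominates
  long long total = 0;

  // Warm up once so worker-thread creation is not part of the measurement.
  #pragma omp parallel
  { }

  auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < kRepeats; ++r) {
    #pragma omp parallel for reduction(+ : total)
    for (int i = 0; i < kWork; ++i) total += i;
  }
  auto t1 = std::chrono::steady_clock::now();

  double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / kRepeats;
  std::printf("threads=%d  avg parallel-for cost=%.2f us  (checksum=%lld)\n",
              omp_get_max_threads(), us, total);
  return 0;
}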
Re: OMP
First of all, thanks for following up on this topic and not sweeping the problem under the rug. You might very well be right and have some numbers which corroborate your findings; this might be something to celebrate. Before continuing our technical discussion I would like to take a step back and remind you of the code of conduct, since I think the way you are handling the communication about this issue is not conducive to a healthy community. It is also not a good leadership example from a respected engineer and Apache PMC member. We are all trying to do the best we can for the project and not everyone is an expert on everything. There are technical decisions made long ago, sometimes lacking proper documentation and justification, which, even if they are right, constitute technical debt, as it takes a big effort to reverse-engineer or deep-dive to understand all the ramifications, which are non-obvious. I called a vote to clarify the issue and have an opportunity to move forward a long-standing problem that remains unaddressed and unclear; this is not trolling, and it is nothing personal against anyone or their work. I actually just know the basics of OpenMP, so this is hardly about ego; it's also not my contribution. I tried to help by providing some of the requested benchmarks since I felt the original contributors had given up trying to help. After we provided info and benchmarks one after another, you closed the PR in a way that was not well understood. If there's a flaw in the benchmark you are right to point it out. If someone doesn't have the time or willingness to coach contributors, properly explain why a PR is not doing the right thing, or document their technical contributions in a way that we can all align behind and understand the tradeoffs, they shouldn't be exercising the power to close PRs. Please take some time to read the code of conduct: https://www.apache.org/foundation/policies/conduct There are also other materials about building healthy communities: https://www.jonobacon.com/books/artofcommunity/ Since we don't all share your particular sense of humor, I would suggest being prudent, polite and patient when explaining your technical decisions, refraining from name-calling and ad-hominem attacks, and assuming good intentions. I suggested to you before in a private channel to have your findings and benchmarks documented in the wiki so we can have constructive conversations and help contributors improve the existing issues with OpenMP. People come and go in projects, so you can't assume that everyone knows the reasons why something was done a certain way two years ago; also, the reasons might change with time. Pedro. On Tue, Jun 18, 2019 at 9:24 AM Chris Olivier wrote: > > I am very reluctant to feed the trolls again, and this will be teh last > time I address Pedro or Anton on the subject, but since I think the numbers > being presented are incorrect (either by te builders not really > understanding what they are building, or possibly intentional misdirection): > > Turning Intel OMP on and off (and MKL as well, since it tends to pull in > omp, depending which one is linked in). > There is a HUGE difference. This is consistent with my experience before > when it was added. 
> > > default mnist: > > python ../example/image-classification/train_mnist.py > INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', > gpus=None, image_shape='1, 28, 28', initializer='default', > kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, > lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, > monitor=0, network='mlp', num_classes=10, num_epochs=20, > num_examples=6, num_layers=None, optimizer='sgd', > profile_server_suffix='', profile_worker_suffix='', save_period=1, > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001) > > INTEL OMP: > > ldd libmxnet.so | grep omp > libomp.so => > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so > (0x7f978fde7000) > > :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec > accuracy=0.780012 > INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec > accuracy=0.920469 > INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec > accuracy=0.928281 > INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec > accuracy=0.942813 > INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec > accuracy=0.938750 > INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec > accuracy=0.946562 > INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec > accuracy=0.953281 > INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec > accuracy=0.951562 > INFO:root:Epoch[0]
Re: OMP
Hi Chris, It's not clear why you think the numbers are wrong. It seems Stas has taken a lot of effort to perform the benchmarks and comprehensively write down the methodology and results. Of course, no one is above making mistakes. Therefore, it would be great if you could shine some light on what you find objectionable and maybe add some suggestions for experiments or improvements. Perhaps you could try to rerun the benchmarks yourself and reach out if there are any steps that are missing or unclear. I work with Stas and he's a very talented engineer and his integrity is above reproach. So, you don't need to fear any "political" motivations behind his effort. I feel this level of antagonism doesn't help the community at all. Perhaps we could keep the conversation around the methodology and the results so we can bring this story to a conclusion. Cheers, Per On Tue., 18 Jun. 2019, 6:24 pm Chris Olivier, wrote: > I am very reluctant to feed the trolls again, and this will be teh last > time I address Pedro or Anton on the subject, but since I think the numbers > being presented are incorrect (either by te builders not really > understanding what they are building, or possibly intentional > misdirection): > > Turning Intel OMP on and off (and MKL as well, since it tends to pull in > omp, depending which one is linked in). > There is a HUGE difference. This is consistent with my experience before > when it was added. > > > default mnist: > > python ../example/image-classification/train_mnist.py > INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, > disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', > gpus=None, image_shape='1, 28, 28', initializer='default', > kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, > lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, > monitor=0, network='mlp', num_classes=10, num_epochs=20, > num_examples=6, num_layers=None, optimizer='sgd', > profile_server_suffix='', profile_worker_suffix='', save_period=1, > test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001) > > INTEL OMP: > > ldd libmxnet.so | grep omp > libomp.so => > /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so > (0x7f978fde7000) > > :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec > accuracy=0.780012 > INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec > accuracy=0.920469 > INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec > accuracy=0.928281 > INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec > accuracy=0.942813 > INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec > accuracy=0.938750 > INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec > accuracy=0.946562 > INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec > accuracy=0.953281 > INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec > accuracy=0.951562 > INFO:root:Epoch[0] Batch [800-900] Speed: 13959.88 samples/sec > accuracy=0.957500 > INFO:root:Epoch[0] Train-accuracy=0.925423 > INFO:root:Epoch[0] Time cost=3.806 > INFO:root:Epoch[0] Validation-accuracy=0.962580 > INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec > accuracy=0.968131 > INFO:root:Epoch[1] Batch [100-200] Speed: 23457.03 samples/sec > accuracy=0.966250 > > > LIBGOMP: > > ldd libmxnet.so | grep omp > libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 > (0x7f25c25dd000) > > INFO:root:Epoch[0] Batch [0-100]Speed: 1731.01 samples/sec > accuracy=0.782488 > INFO:root:Epoch[0] 
Batch [100-200] Speed: 3551.32 samples/sec > accuracy=0.907813 > INFO:root:Epoch[0] Batch [200-300] Speed: 1991.00 samples/sec > accuracy=0.927188 > INFO:root:Epoch[0] Batch [300-400] Speed: 2175.45 samples/sec > accuracy=0.937969 > INFO:root:Epoch[0] Batch [400-500] Speed: 1644.95 samples/sec > accuracy=0.942187 > INFO:root:Epoch[0] Batch [500-600] Speed: 6444.58 samples/sec > accuracy=0.950156 > INFO:root:Epoch[0] Batch [600-700] Speed: 7842.16 samples/sec > accuracy=0.947969 > INFO:root:Epoch[0] Batch [700-800] Speed: 9412.07 samples/sec > accuracy=0.953750 > INFO:root:Epoch[0] Batch [800-900] Speed: 12707.58 samples/sec > accuracy=0.953125 > > That being said, there's other issued beyond speed. The DEFAULT build from > makefile (not CMake) uses Intel OMP mkl (I showed before) and mysteriously > it has no issues? This seems highly suspicious. All I see is a lot of > hand-waving and conjecture and pointing to StackOverflow posts made by > people who may be of questionable pedigree to be
OMP
I am very reluctant to feed the trolls again, and this will be the last time I address Pedro or Anton on the subject, but since I think the numbers being presented are incorrect (either by the builders not really understanding what they are building, or possibly intentional misdirection): Turning Intel OMP on and off (and MKL as well, since it tends to pull in omp, depending which one is linked in). There is a HUGE difference. This is consistent with my experience before when it was added. default mnist: python ../example/image-classification/train_mnist.py INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, disp_batches=100, dtype='float32', gc_threshold=0.5, gc_type='none', gpus=None, image_shape='1, 28, 28', initializer='default', kv_store='device', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=20, num_examples=6, num_layers=None, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001) INTEL OMP: ldd libmxnet.so | grep omp libomp.so => /home/chris/src/mxnet/cmake_omp/3rdparty/openmp/runtime/src/libomp.so (0x7f978fde7000) :root:Epoch[0] Batch [0-100]Speed: 31548.09 samples/sec accuracy=0.780012 INFO:root:Epoch[0] Batch [100-200] Speed: 16073.21 samples/sec accuracy=0.920469 INFO:root:Epoch[0] Batch [200-300] Speed: 19075.91 samples/sec accuracy=0.928281 INFO:root:Epoch[0] Batch [300-400] Speed: 23211.36 samples/sec accuracy=0.942813 INFO:root:Epoch[0] Batch [400-500] Speed: 22139.79 samples/sec accuracy=0.938750 INFO:root:Epoch[0] Batch [500-600] Speed: 23225.52 samples/sec accuracy=0.946562 INFO:root:Epoch[0] Batch [600-700] Speed: 19547.41 samples/sec accuracy=0.953281 INFO:root:Epoch[0] Batch [700-800] Speed: 24111.73 samples/sec accuracy=0.951562 INFO:root:Epoch[0] Batch [800-900] Speed: 13959.88 samples/sec accuracy=0.957500 INFO:root:Epoch[0] Train-accuracy=0.925423 INFO:root:Epoch[0] Time cost=3.806 INFO:root:Epoch[0] Validation-accuracy=0.962580 INFO:root:Epoch[1] Batch [0-100]Speed: 24560.21 samples/sec accuracy=0.968131 INFO:root:Epoch[1] Batch [100-200] Speed: 23457.03 samples/sec accuracy=0.966250 LIBGOMP: ldd libmxnet.so | grep omp libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x7f25c25dd000) INFO:root:Epoch[0] Batch [0-100]Speed: 1731.01 samples/sec accuracy=0.782488 INFO:root:Epoch[0] Batch [100-200] Speed: 3551.32 samples/sec accuracy=0.907813 INFO:root:Epoch[0] Batch [200-300] Speed: 1991.00 samples/sec accuracy=0.927188 INFO:root:Epoch[0] Batch [300-400] Speed: 2175.45 samples/sec accuracy=0.937969 INFO:root:Epoch[0] Batch [400-500] Speed: 1644.95 samples/sec accuracy=0.942187 INFO:root:Epoch[0] Batch [500-600] Speed: 6444.58 samples/sec accuracy=0.950156 INFO:root:Epoch[0] Batch [600-700] Speed: 7842.16 samples/sec accuracy=0.947969 INFO:root:Epoch[0] Batch [700-800] Speed: 9412.07 samples/sec accuracy=0.953750 INFO:root:Epoch[0] Batch [800-900] Speed: 12707.58 samples/sec accuracy=0.953125 That being said, there are other issues beyond speed. The DEFAULT build from the makefile (not CMake) uses Intel OMP mkl (I showed before) and mysteriously it has no issues? This seems highly suspicious. All I see is a lot of hand-waving and conjecture and pointing to StackOverflow posts made by people who may be of questionable pedigree to begin with. This smells of a Pedro-ego-fight rather than one of purely technical merit. 
Also, if one knows how OMP works, they would be very suspicious of the "intermittent hangs" claim -- that's probably just broken race conditions elsewhere until proven otherwise. It'd tend to freeze on the first use if something is wrong (try using libgomp after a fork and see), since worker threads wouldn't be assigned/joined properly. Intel OMP is faster, but also has other advantages, such as allowing OMP after a fork. I actually raised a lot of issues and asked for clarification in the original PRs way back when, but they were all just ignored. -Chris
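For anyone who wants to see the after-fork behaviour mentioned above for themselves, here is a minimal, standalone sketch (illustrative only, not taken from any MXNet PR): it initializes OpenMP in the parent, calls fork(), and then opens a parallel region in the child. Only the forking thread survives into the child, so the runtime's worker-thread bookkeeping is stale there; with libgomp this pattern has historically been prone to hanging, whereas LLVM/Intel OMP installs pthread_atfork handlers to reset its state in the child.

// omp_after_fork.cc -- illustrative sketch: exercise an OpenMP parallel region
// in a child process after fork(). Build with "g++ -O2 -fopenmp
// omp_after_fork.cc" and swap the linked OpenMP runtime to compare behaviour.
#include <omp.h>
#include <cstdio>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  // Touch OpenMP in the parent so the runtime and its worker threads exist
  // before the fork.
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      std::printf("parent: %d threads\n", omp_get_num_threads());
  }

  pid_t pid = fork();
  if (pid == 0) {
    // The child inherits the runtime's state but not its worker threads.
    // Whether this parallel region works, crashes or hangs depends on the
    // runtime; libgomp has commonly hung right here.
    #pragma omp parallel
    {
      if (omp_get_thread_num() == 0)
        std::printf("child: %d threads\n", omp_get_num_threads());
    }
    std::_Exit(0);
  }

  int status = 0;
  waitpid(pid, &status, 0);
  std::printf("child finished (status=%d)\n", status);
  return 0;
}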