Re: OMP

2019-06-24 Thread kellen sunderland
I remember at the time we also had a read through of this blog post, but to us the code looked like it was following the advice: https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/ On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <

Re: OMP

2019-06-24 Thread kellen sunderland
I remember this hang as well; it was pretty hard to reproduce, IIRC. I believe the stacks for the hang are here: https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and the trick was that we could only debug it up to the point that we hit: #0 0x7fec6df1ba4f in futex_wait

Re: OMP

2019-06-24 Thread Pedro Larroy
If you check initialize.cc, we seem to be explicitly disabling that behaviour in pthread_atfork, which seems to cause thread contention during multiprocessing. Why do we need this major advantage for the library if that's the case? Related PRs: https://github.com/apache/incubator-mxnet/pull/10820
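For readers following the fork discussion, here is a minimal sketch of how fork handlers behave. It is written in Python with os.register_at_fork, which plays a role similar to POSIX pthread_atfork; it is not the code in initialize.cc. The names events and fork_and_report are hypothetical, made up for this illustration. The sketch assumes a Unix system and Python 3.7 or later, and recent Python versions may print a DeprecationWarning about forking a multi-threaded process.

```python
import os
import threading

# Hypothetical log of which fork handlers have run in this process.
events = []

# Analogous to pthread_atfork(prepare, parent, child).
os.register_at_fork(
    before=lambda: events.append("prepare"),
    after_in_parent=lambda: events.append("parent"),
    after_in_child=lambda: events.append("child"),
)


def fork_and_report():
    """Fork while a worker thread is alive; return what the child observed."""
    events.clear()
    stop = threading.Event()
    # Stand-in for a library's worker thread (e.g. one member of a thread pool).
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here; the worker was not copied.
        os.close(read_fd)
        report = "{}|{}".format(",".join(events), threading.active_count())
        os.write(write_fd, report.encode())
        os._exit(0)

    # Parent: collect the child's report, then shut the worker down.
    os.close(write_fd)
    child_report = os.read(read_fd, 1024).decode()
    os.close(read_fd)
    os.waitpid(pid, 0)
    worker_alive_in_parent = worker.is_alive()
    stop.set()
    worker.join()
    return child_report, worker_alive_in_parent


if __name__ == "__main__":
    print(fork_and_report())
```

The point of the sketch is that the child sees the prepare and child handlers but none of the parent's worker threads, so any thread pool a library wants in the child has to be rebuilt (or deliberately disabled) in the child handler.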

Re: OMP

2019-06-24 Thread Chris Olivier
One major advantage of Intel/LLVM OMP is that it spawns a new thread pool after fork if a thread pool was already created. This is so that OMP can be used in the forked processes. libgomp doesn’t do this, so it’ll just lock up if you try to do OMP in the forked process. is your build linking

Re: OMP

2019-06-24 Thread Pedro Larroy
There's an assertion which is easily reproducible, and there's also a crash including a core dump; the latter is not easy for me to reproduce in different environments. I have also seen mxnet getting stuck without progressing with this build configuration and using no CPU at all when running unit

Re: OMP

2019-06-24 Thread Chris Olivier
What’s the reason for the assertion failure? btw classifying an assertion failure as a “crash” is debatable. As I stated in the original issue a long time ago, it’s possible something shady is being done when forking that should be fixed. The assertion should be root caused. On Mon, Jun 24,

Re: OMP

2019-06-24 Thread Pedro Larroy
Added a dockerfile, and reports of a crash on my local machine when running MKL+OMP+DEBUG; with Anton's branch the crash happened as well. I couldn't reproduce the crash on my EC2 machine. Added the backtrace of the crash as well. https://github.com/apache/incubator-mxnet/issues/10856 Dockerfile