I remember at the time we also had a read through of this blog post, but to
us the code looked like it was following the advice:
https://devblogs.nvidia.com/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
On Mon, Jun 24, 2019 at 6:39 PM kellen sunderland <
I remember this hang as well, it was pretty hard to reproduce IIRC. I
believe the stacks for the hang are here:
https://gist.github.com/KellenSunderland/893d11165e19d1efcf5c0fe8e8584600 and
the trick was we could only debug it up to the point that we hit:
#0 0x7fec6df1ba4f in futex_wait
If you check initialize.cc we seem to be explicitly disabling that
behaviour in pthread_atfork, which seems to cause thread contention
during multiprocessing. Why do we need this major advantage for the
library if that's the case?
Related PRs:
https://github.com/apache/incubator-mxnet/pull/10820
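For readers unfamiliar with at-fork handlers, here is a minimal sketch of the mechanism in Python, using os.register_at_fork as a stand-in for pthread_atfork. It is not the code in initialize.cc. The handler name, the helper function, and the policy of forcing OMP_NUM_THREADS=1 in the child are assumptions made up for illustration, and it needs a POSIX system with fork().

```python
import os


def _limit_omp_in_child():
    # Hypothetical policy, for illustration only: pin the forked child to a
    # single OpenMP thread. A native library would call the OpenMP API here;
    # the environment variable is just a visible stand-in.
    os.environ["OMP_NUM_THREADS"] = "1"


# Runs in the child right after every os.fork(), much like the "child"
# callback passed to pthread_atfork.
os.register_at_fork(after_in_child=_limit_omp_in_child)


def omp_threads_seen_by_child():
    """Fork once and return the OMP_NUM_THREADS value the child observes."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report the setting through the pipe and exit immediately.
        os.close(read_fd)
        os.write(write_fd, os.environ.get("OMP_NUM_THREADS", "unset").encode())
        os._exit(0)
    # Parent: read until the child closes its end of the pipe.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        value = reader.read().decode()
    os.waitpid(pid, 0)
    return value
```

The handler only runs in the child, so the parent's environment is left untouched. That is the same shape as the concern above: whatever the child callback disables applies to every forked worker.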
One major advantage of intel/llvm omp is that it spawns a new thread pool
after fork if a thread pool was already created. This is so that omp can be
used in the forked processes. libgomp doesn’t do this, so it’ll just lock up
if you try to do omp in the forked process.
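The underlying issue is that fork() only carries the calling thread into the child, so any worker pool created before the fork is gone there. A small sketch of that in Python follows. It uses plain Python threads, not OpenMP; the function name is made up for illustration, and it needs a POSIX system with fork(). Recent Python versions may print a warning about forking a multi-threaded process.

```python
import os
import threading


def thread_counts_across_fork():
    """Return (threads alive in parent, threads alive in child) around fork()."""
    stop = threading.Event()
    # Stand-in for a worker pool thread that was created before the fork.
    worker = threading.Thread(target=stop.wait, daemon=True)
    worker.start()
    parent_count = threading.active_count()

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() exists here.
        os.close(read_fd)
        os.write(write_fd, str(threading.active_count()).encode())
        os._exit(0)
    # Parent: collect the child's count, then shut the worker down.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        child_count = int(reader.read())
    os.waitpid(pid, 0)
    stop.set()
    worker.join()
    return parent_count, child_count
```

A runtime that still waits on its pre-fork workers in the child waits on threads that no longer exist. That is why a runtime either has to rebuild its pool after fork or avoid using it in the child.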
is your build linking
There's an assertion which is easily reproducible, and also there's a
crash including core dump, the latter is not easy to reproduce for me
in different environments. I have also seen mxnet getting stuck
without progressing with this build configuration and using no CPU at
all when running unit
What’s the reason for the assertion failure? Btw, classifying an assertion
failure as a “crash” is debatable. As I stated in the original issue a long
time ago, it’s possible something shady is being done when forking
that should be fixed. The assertion should be root caused.
On Mon, Jun 24,
Added a dockerfile, and reports of a crash on my local machine when
running MKL+OMP+DEBUG. The crash happened with Anton's branch as well.
I couldn't reproduce the crash on my EC2 machine:
Added the backtrace of the crash as well.
https://github.com/apache/incubator-mxnet/issues/10856
Dockerfile