Re: segmentation fault in master using mkdlnn

2018-05-04 Thread Da Zheng
I have come up with a temporary solution for this memory error: https://github.com/apache/incubator-mxnet/pull/10812 I tested it with Anirudh's command, and it works fine. I call it a temporary solution because it only fixes the segfault. It seems to me that the race condition can potentially corrupt data in

Re: segmentation fault in master using mkdlnn

2018-05-04 Thread Zheng, Da
Hello Pedro, I did exactly what you said in your previous email. I edited ci/docker/runtime_functions.sh based on your patch, and here is the history of running your commands: 2004 vim ci/docker/runtime_functions.sh 2005 ci/docker/runtime_functions.sh clean_repo 2006 ci/build.py -p ubuntu_cp

Re: segmentation fault in master using mkdlnn

2018-05-04 Thread Pedro Larroy
Hi Da. I ran it both on my Ubuntu 16.04 workstation and on a p3 instance with DLAMI. I'm pretty confident it runs in most Linux environments. Can you post the exact commands that you ran? It is not clear to me from your paste what the problem is. Please make sure your repo is clean and all your subrepos

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Marco de Abreu
Da, it seems like you have a problem with your internet connection, leading to a timeout to the keyserver. -Marco On Thu, May 3, 2018 at 8:20 PM, Anirudh wrote: > Hi Pedro and Da, > > I am not sure how to install mkldnn with cmake. But for make to reproduce > you can do the following: > > make

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Anirudh
Hi Pedro and Da, I am not sure how to install mkldnn with cmake. But to reproduce with make, you can do the following: make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1 export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0 export MXNET_TEST_SEED=11 export

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Zheng, Da
Hello Pedro, I tried your instructions. It seems I can't run the Docker container on EC2 instances. Where did you reproduce the error? Thanks, Da + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/' + gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9 gpg: directory `/root/.gnupg' created g

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Pedro Larroy
I tried to compile with MKLDNN using CMake + CLion and ran into some difficulties, even though I have mkldnn in the 3rdparty folder and installed mkl in user local. What exactly are the steps to compile with MKLDNN using CMake? I saw this documented only for Make. Pedro. On Thu, May 3, 2018 at 4:59 P

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Pedro Larroy
Hi Da Reproduction instructions: On the host: Adjust core pattern: $ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern Use the following patch: === diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn --- a/3rdparty/mkldnn +++ b/3rdparty/mkldnn @@ -1 +1 @@ -Subproject commit b

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Zheng, Da
Thanks a lot for locating the error. Could you tell me how you reproduced the error? On 5/3/18, 7:45 AM, "Pedro Larroy" wrote: Looks like a problem in mkl's same_shape: the pointer to mkldnn::memory::desc &desc looks invalid. (More stack frames follow...) (gdb) p desc

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Pedro Larroy
Looks like a problem in mkl's same_shape: the pointer to mkldnn::memory::desc &desc looks invalid. (More stack frames follow...) (gdb) p desc $1 = (const mkldnn::memory::desc &) @0x10: (gdb) p dtype $2 = 0 (gdb) p shape $3 = (const mxnet::TShape &) @0x7f3905a58b50: {> = {static kStackCache = , n

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Pedro Larroy
Hi Bhavin, Good suggestion. I tried 1) but I can't get a core inside the container, even with ulimit -c unlimited. I found out that /proc/sys/kernel/core_pattern in Ubuntu by default uses a pipe to /usr/share/apport/apport, which doesn't exist inside the container; changing it outside the contain
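
The core-dump obstacle described in this message can be checked from a script. What follows is a minimal Python sketch and not something posted in the thread: the helper names `core_pattern_is_piped` and `describe_core_dump_setup` are hypothetical, and the sketch assumes a Linux host where /proc/sys/kernel/core_pattern is readable. It reports the core size limit (the value `ulimit -c` controls) and whether the kernel is configured to pipe cores to a helper program.

```python
import resource
from pathlib import Path

# Kernel setting discussed in the message above (Linux only).
CORE_PATTERN_PATH = Path("/proc/sys/kernel/core_pattern")


def core_pattern_is_piped(pattern: str) -> bool:
    """Return True if the pattern sends cores to a helper program.

    A leading '|' tells the kernel to pipe the core to the named program
    instead of writing a file.
    """
    return pattern.startswith("|")


def describe_core_dump_setup() -> dict:
    """Collect the settings that decide whether a core file appears."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    pattern = None
    if CORE_PATTERN_PATH.exists():
        pattern = CORE_PATTERN_PATH.read_text().strip()
    return {
        "soft_limit": soft,
        "hard_limit": hard,
        # 'ulimit -c unlimited' corresponds to RLIM_INFINITY
        "unlimited": soft == resource.RLIM_INFINITY,
        "core_pattern": pattern,
        "piped": core_pattern_is_piped(pattern) if pattern is not None else None,
    }


if __name__ == "__main__":
    for key, value in describe_core_dump_setup().items():
        print(f"{key}: {value}")
```

Running this both on the host and inside the container should show the same `core_pattern` value, since the setting is a kernel-wide one, while the limits can differ per process.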

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Zheng, Da
There are a few problems with valgrind that make it a less than ideal tool for MXNet with its Python interface. First, valgrind generates a huge number of irrelevant messages, most of them from Python itself. Second, valgrind can't emulate all CPU instructions. I remember that when I run valgrind

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Bhavin Thaker
Hi Pedro, All, 1) I would suggest that we run “ulimit -c unlimited” on every CI Slave machine at startup to enable core dumps and get stack traces. 2) Valgrind on Python generates so much noise that extracting signal from it is painful, but it is still worth trying it out and looking at the messages t

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Pedro Larroy
Hi, I managed to get a stack trace: + nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape [WARNING] *** module-level seed is set: all tests running deterministically *** [INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=812478194 to reproduce. [WARN

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Pedro Larroy
@Chris, it seems Intel Inspector requires purchasing, right? Maybe some of us already own a license and can execute the test that fails intermittently: test_module.py:test_forward_reshape On Thu, May 3, 2018 at 3:49 PM, Pedro Larroy wrote: > It's very difficult to reproduce, non-deterministic. We w
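
Since the test fails only intermittently, one way to catch a failure is to rerun it in a loop until it crashes. The following Python sketch illustrates that idea under stated assumptions: the helper names `run_until_failure` and `describe_returncode` are hypothetical and were not posted in the thread, and the signal decoding relies on POSIX behaviour, where `subprocess` reports a child killed by signal N as return code -N.

```python
import signal
import subprocess
import sys
from typing import Optional, Sequence


def describe_returncode(code: int) -> str:
    """Turn a subprocess return code into a readable description."""
    # On POSIX, a negative return code -N means the child died from signal N.
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = f"signal {-code}"
        return f"killed by {name}"
    return f"exit status {code}"


def run_until_failure(cmd: Sequence[str], max_runs: int = 100) -> Optional[int]:
    """Run cmd up to max_runs times.

    Returns the first non-zero return code, or None if every run passed.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(
                f"run {attempt} failed: {describe_returncode(result.returncode)}",
                file=sys.stderr,
            )
            return result.returncode
    return None
```

With the test named in this thread, a call could look like `run_until_failure(["nosetests-2.7", "--verbose", "tests/python/unittest/test_module.py:test_forward_reshape"], max_runs=50)`. A segmentation fault would then be reported as "killed by SIGSEGV".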

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Shaikh, Eftiquar
If the issue is platform neutral, I can try reproducing it on Windows. A fault in native code should produce a dump that can be analyzed. I am currently working on building mxnet from source, and can spend some time on this. Sent from my iPhone > On May 3, 2018, at 6:51 AM, Pedro Larroy wrote:

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Pedro Larroy
It's very difficult to reproduce, non-deterministic. We were also running without signal handlers in CI so there are no stack traces unfortunately. Care to elaborate why valgrind doesn't work with Python? On Thu, May 3, 2018 at 3:32 PM, Da Zheng wrote: > can we build it in CI?segfault doesn't
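
When no signal handler is installed, a crash leaves no trace at the Python level either. As a hedged illustration only, the sketch below uses the standard-library `faulthandler` module to print the Python traceback when a fatal signal such as SIGSEGV arrives. It assumes Python 3, whereas the CI run in this thread used nosetests-2.7, and it is not what the CI used. The wrapper name `enable_crash_tracebacks` is hypothetical. It shows Python frames only, so it complements a native stack trace from gdb and does not replace one.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=sys.stderr) -> bool:
    """Dump Python tracebacks to stream on fatal signals.

    The stream must be backed by a real file descriptor and must stay open
    for as long as the handler is enabled.
    """
    if not faulthandler.is_enabled():
        # Installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_tracebacks()` at the top of a test module is one option. Setting the environment variable `PYTHONFAULTHANDLER=1` before starting the interpreter has the same effect without code changes.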

Re: segmentation fault in master using mkdlnn

2018-05-03 Thread Da Zheng
Can we build it in CI? The segfault doesn't happen infrequently. On May 2, 2018, at 11:34 PM, "Chris Olivier" wrote: > you can try Intel Inspector, which is like an enhanced version of valgrind > with a GUI and whatnot. > > On Wed, May 2, 2018 at 9:42 PM Da Zheng wrote: > > > valgrind doesn't work with Python. als

Re: segmentation fault in master using mkdlnn

2018-05-02 Thread Chris Olivier
You can try Intel Inspector, which is like an enhanced version of valgrind with a GUI and whatnot. On Wed, May 2, 2018 at 9:42 PM Da Zheng wrote: > valgrind doesn't work with Python. also, valgrind doesn't support some > CPU instructions used by MXNet (I think some instructions related to > rand

Re: segmentation fault in master using mkdlnn

2018-05-02 Thread Da Zheng
Valgrind doesn't work with Python. Also, valgrind doesn't support some CPU instructions used by MXNet (I think some instructions related to the random generator). On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker wrote: > Have you tried running with valgrind to get some clues on the root-cause? > > Bhav

Re: segmentation fault in master using mkdlnn

2018-05-02 Thread Bhavin Thaker
Have you tried running with valgrind to get some clues on the root-cause? Bhavin Thaker. On Wed, May 2, 2018 at 8:55 PM Da Zheng wrote: > It might also be possible that this isn't an MKLDNN bug. > I just saw a similar memory error without MKLDNN build. > > http://jenkins.mxnet-ci.amazon-ml.com/

Re: segmentation fault in master using mkdlnn

2018-05-02 Thread Da Zheng
It might also be possible that this isn't an MKLDNN bug. I just saw a similar memory error in a build without MKLDNN. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline Best, Da On Wed, May 2, 2018 at 2:14 PM, Zheng, Da wrote: > There might be

Re: segmentation fault in master using mkdlnn

2018-05-02 Thread Zheng, Da
There might be a race condition that causes the memory error. It might be caused by this PR: https://github.com/apache/incubator-mxnet/pull/10706/files This PR removes MKLDNN memory from NDArray. However, I don't know why this causes a memory error. If someone is using the memory, it should still ho

Re: segmentation fault in master using mkdlnn

2018-05-02 Thread Pedro Larroy
I couldn't reproduce locally with: ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy wrote: > Hi > > Seems master is not running