I have come up with a temporary solution for this memory error.
https://github.com/apache/incubator-mxnet/pull/10812
I tested with Anirudh's command. It works fine.
I call it a temporary solution because it only fixes the segfault. It
seems to me that the race condition can potentially corrupt data in
Hello Pedro,
I did exactly what you said in your previous email.
I edited ci/docker/runtime_functions.sh based on your patch, and here is the
history of running your commands:
2004 vim ci/docker/runtime_functions.sh
2005 ci/docker/runtime_functions.sh clean_repo
2006 ci/build.py -p ubuntu_cp
Hi Da. I ran it both on my Ubuntu 16.04 workstation and in a p3 instance with
the DLAMI. I'm pretty confident it runs in most Linux environments.
Can you post the exact commands that you ran? It's not clear to me what the
problem is from your paste. Please make sure your repo is clean and all your
subrepos
Da, it seems like you have a problem with your internet connection, leading
to a timeout to the keyserver.
-Marco
On Thu, May 3, 2018 at 8:20 PM, Anirudh wrote:
> Hi Pedro and Da,
>
> I am not sure how to install mkldnn with cmake. But to reproduce with make
> you can do the following:
>
> make
Hi Pedro and Da,
I am not sure how to install mkldnn with cmake. But to reproduce with make
you can do the following:
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_DIST_KVSTORE=0 \
    USE_CUDA=0 USE_CUDNN=0 USE_MKLDNN=1
export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
export MXNET_TEST_SEED=11
export
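The remaining export lines in the message above are cut off. As a hedged guess at how the reproduction continues, the test invocation that appears later in this thread can be run after the build and exports:
# assumption: the intermittently failing test named later in the thread is the one being reproduced
nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape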
Hello Pedro,
I tried your instructions. It seems I can't run Docker in EC2 instances.
Where did you reproduce the error?
Thanks,
Da
+ echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
+ gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg: directory `/root/.gnupg' created
g
I tried to compile with MKLDNN using CMake + CLion and ran into some
difficulties, even though I have mkldnn in the 3rdparty folder and
mkl installed in user local.
What exactly are the steps to compile with MKLDNN using CMake? I saw this
documented only for Make.
Pedro.
On Thu, May 3, 2018 at 4:59 P
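For the CMake question above, a minimal sketch of an MKLDNN-enabled configuration, assuming the build exposes USE_MKLDNN, USE_CUDA and USE_OPENCV options (the option names are assumptions, not confirmed in this thread):
# from the repo root, configure and build out of source
mkdir -p build && cd build
cmake -DUSE_MKLDNN=ON -DUSE_CUDA=OFF -DUSE_OPENCV=ON ..
make -j $(nproc)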
Hi Da
Reproduction instructions:
On the host:
Adjust core pattern:
$ echo '/tmp/core.%h.%e.%t' > /proc/sys/kernel/core_pattern
Use the following patch:
===
diff --git a/3rdparty/mkldnn b/3rdparty/mkldnn
--- a/3rdparty/mkldnn
+++ b/3rdparty/mkldnn
@@ -1 +1 @@
-Subproject commit b
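The diff above is truncated before the new submodule commit hash. A sketch of the equivalent manual step, with the hash left as a placeholder since it is cut off:
# pin the mkldnn submodule to the commit from the (truncated) patch
cd 3rdparty/mkldnn
git checkout <commit-from-patch>
cd -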
Thanks a lot for locating the error.
Could you tell me how you reproduced the error?
On 5/3/18, 7:45 AM, "Pedro Larroy" wrote:
Looks like a problem in mkl's same_shape
the pointer to mkldnn::memory::desc &desc looks invalid.
(More stack frames follow...)
(gdb) p desc
Looks like a problem in mkl's same_shape
the pointer to mkldnn::memory::desc &desc looks invalid.
(More stack frames follow...)
(gdb) p desc
$1 = (const mkldnn::memory::desc &) @0x10:
(gdb) p dtype
$2 = 0
(gdb) p shape
$3 = (const mxnet::TShape &) @0x7f3905a58b50: {> =
{static kStackCache = , n
Hi Bhavin
Good suggestion
I tried 1) but I can't get a core inside the container, even with ulimit -c
unlimited.
I found out that /proc/sys/kernel/core_pattern in Ubuntu by default uses
a pipe to /usr/share/apport/apport, which doesn't exist inside the
container;
changing it outside the contain
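A sketch of the core-dump setup being described here, combining the host-side core_pattern change from Pedro's instructions with ulimit inside the container (the python2.7 binary name and the core path are assumptions):
# on the host: write cores to /tmp instead of piping to apport, which is absent in the container
echo '/tmp/core.%h.%e.%t' | sudo tee /proc/sys/kernel/core_pattern
# inside the container: allow core files, then run the failing test
ulimit -c unlimited
nosetests-2.7 --verbose tests/python/unittest/test_module.py:test_forward_reshape
# inspect the core with gdb once the segfault reproduces
gdb "$(which python2.7)" /tmp/core.*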
There are a few problems with valgrind that make it less than ideal for
MXNet with the Python interface.
First, valgrind generates a huge number of irrelevant messages, most of them
from Python itself.
Second, valgrind can't emulate all CPU instructions. I remember that when I ran
valgrind
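For the noise problem specifically, a hedged sketch of running under valgrind with CPython's suppression file (valgrind-python.supp ships in the CPython source tree; whether it removes enough noise here is untested, and the file path and nose availability are assumptions):
# suppress known-benign Python allocator reports and log to a file
valgrind --tool=memcheck --suppressions=valgrind-python.supp \
    --track-origins=yes --log-file=valgrind.log \
    python2.7 -m nose -v tests/python/unittest/test_module.py:test_forward_reshape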
Hi Pedro, All,
1) I would suggest that we run “ulimit -c unlimited” on every CI slave
machine at startup to enable core dumps and get a stack trace (see the sketch
below).
2) Valgrind on Python generates so much noise that extracting signal from
it is painful, but it is still worth trying it out and looking at the messages
t
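A sketch of making suggestion 1) persistent on a CI slave via /etc/security/limits.conf (the exact provisioning mechanism for the slaves is not discussed in the thread, so this is only one option):
# allow core files of unlimited size for all users; applies to new login sessions
echo '* soft core unlimited' | sudo tee -a /etc/security/limits.conf
echo '* hard core unlimited' | sudo tee -a /etc/security/limits.conf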
Hi
Managed to get a stack trace:
+ nosetests-2.7 --verbose
tests/python/unittest/test_module.py:test_forward_reshape
[WARNING] *** module-level seed is set: all tests running deterministically
***
[INFO] Setting module np/mx/python random seeds, use
MXNET_MODULE_SEED=812478194 to reproduce.
[WARN
@Chris it seems Intel Inspector requires purchasing a license, right? Maybe some
of us already own a license and can execute the test that fails intermittently:
test_module.py:test_forward_reshape
On Thu, May 3, 2018 at 3:49 PM, Pedro Larroy
wrote:
> It's very difficult to reproduce, non-deterministic. We w
If the issue is platform-neutral, I can try reproducing on Windows. A fault in
native code should produce a dump that can be analyzed.
I am currently working on building MXNet from source, and can spend some time on
this.
Sent from my iPhone
> On May 3, 2018, at 6:51 AM, Pedro Larroy wrote:
It's very difficult to reproduce and non-deterministic. We were also running
without signal handlers in CI, so unfortunately there are no stack traces.
Care to elaborate why valgrind doesn't work with Python?
On Thu, May 3, 2018 at 3:32 PM, Da Zheng wrote:
> Can we build it in CI? The segfault doesn't
Can we build it in CI? The segfault doesn't happen infrequently.
On May 2, 2018 at 11:34 PM, "Chris Olivier" wrote:
> you can try Intel Inspector, which is like an enhanced version of valgrind
> with a GUI and whatnot.
>
> On Wed, May 2, 2018 at 9:42 PM Da Zheng wrote:
>
> > valgrind doesn't work with Python. als
you can try Intel Inspector, which is like an enhanced version of valgrind
with a GUI and whatnot.
On Wed, May 2, 2018 at 9:42 PM Da Zheng wrote:
> Valgrind doesn't work with Python. Also, valgrind doesn't support some
> CPU instructions used by MXNet (I think some instructions related to
> rand
Valgrind doesn't work with Python. Also, valgrind doesn't support some
CPU instructions used by MXNet (I think some instructions related to the
random generator).
On Wed, May 2, 2018 at 8:59 PM, Bhavin Thaker wrote:
> Have you tried running with valgrind to get some clues on the root-cause?
>
> Bhav
Have you tried running with valgrind to get some clues on the root-cause?
Bhavin Thaker.
On Wed, May 2, 2018 at 8:55 PM Da Zheng wrote:
> It might also be possible that this isn't an MKLDNN bug.
> I just saw a similar memory error in a build without MKLDNN.
>
> http://jenkins.mxnet-ci.amazon-ml.com/
It might also be possible that this isn't an MKLDNN bug.
I just saw a similar memory error in a build without MKLDNN.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10783/1/pipeline
Best,
Da
On Wed, May 2, 2018 at 2:14 PM, Zheng, Da wrote:
> There might be
There might be a race condition that causes the memory error.
It might be caused by this PR:
https://github.com/apache/incubator-mxnet/pull/10706/files
This PR removes MKLDNN memory from NDArray.
However, I don't know why this causes a memory error. If someone is using the
memory, it should still ho
I couldn't reproduce locally with:
ci/build.py -p ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu_mkldnn && \
  ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh unittest_ubuntu_python2_cpu
On Wed, May 2, 2018 at 8:50 PM, Pedro Larroy
wrote:
> Hi
>
> Seems master is not running