I have an application that works on other systems, but on the system I'm 
currently running on I'm seeing the following crash:

[dt04:22457] *** Process received signal ***
[dt04:22457] Signal: Segmentation fault (11)
[dt04:22457] Signal code: Address not mapped (1)
[dt04:22457] Failing at address: 0x55556a1da250
[dt04:22457] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2aaaab353370]
[dt04:22457] [ 1] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x50)[0x2aaaacbcf810]
[dt04:22457] [ 2] /home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x9b)[0x2aaaacbcff3b]
[dt04:22457] [ 3] ./hacc_tpm[0x42f068]
[dt04:22457] [ 4] ./hacc_tpm[0x42f231]
[dt04:22457] [ 5] ./hacc_tpm[0x40f64d]
[dt04:22457] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaac30db35]
[dt04:22457] [ 7] ./hacc_tpm[0x4115cf]
[dt04:22457] *** End of error message ***


This is a CUDA app, but it doesn't use GPUDirect, so that should be irrelevant.

I'm building with gcc/5.3.0  cuda/8.0.44  openmpi/1.10.7

I'm running this on CentOS 7 and am using a vanilla Open MPI configure line:  
./configure --prefix=/home/jluitjens/libs/openmpi/

Currently I'm trying to do this with just a single MPI process, but runs with 
multiple MPI processes fail in the same way:

mpirun  --oversubscribe -np 1 ./command

What is odd is that the crash occurs around the same spot in the code, but not 
consistently at exactly the same spot.  The code that the application's single 
thread is executing at the time of the crash is nowhere near any MPI code; it 
is just using malloc to allocate some memory.  This makes me think the crash 
is caused either by a thread outside of the application I'm working on 
(perhaps in Open MPI itself) or by Open MPI hijacking malloc/free.
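
A crash inside free can also be a symptom of an earlier out-of-bounds write in 
the application itself, so one thing I could try is wrapping a few of the 
suspect allocations in guard bytes and checking them when they are freed.  The 
following is only a minimal sketch in C: guarded_malloc, guarded_check and 
guarded_free are hypothetical helper names, not part of hacc_tpm or of Open 
MPI, and the 16-byte guard and header sizes are arbitrary assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical debugging helpers (assumed names, not from hacc_tpm or Open MPI).
 * Layout of one allocation:
 *   [header: requested size][front guard][user data][back guard]            */
#define HEADER_BYTES 16u   /* stores the requested size, keeps user data 16-byte aligned */
#define GUARD_BYTES  16u   /* width of each guard zone (arbitrary choice) */
#define GUARD_FILL   0xA5  /* byte pattern written into the guard zones */

/* Allocate n user bytes surrounded by guard zones. Returns NULL on failure. */
void *guarded_malloc(size_t n)
{
    if (n > SIZE_MAX - (HEADER_BYTES + 2u * GUARD_BYTES))
        return NULL;                       /* total size would overflow */

    unsigned char *raw = malloc(HEADER_BYTES + 2u * GUARD_BYTES + n);
    if (raw == NULL)
        return NULL;

    memcpy(raw, &n, sizeof n);             /* remember the user size */
    memset(raw + HEADER_BYTES, GUARD_FILL, GUARD_BYTES);
    memset(raw + HEADER_BYTES + GUARD_BYTES + n, GUARD_FILL, GUARD_BYTES);
    return raw + HEADER_BYTES + GUARD_BYTES;
}

/* Returns 0 if both guards are intact, -1 if the front guard was
 * overwritten, 1 if the back guard was overwritten. */
int guarded_check(const void *p)
{
    assert(p != NULL);
    const unsigned char *user = p;
    const unsigned char *raw = user - HEADER_BYTES - GUARD_BYTES;
    size_t n;
    memcpy(&n, raw, sizeof n);

    for (size_t i = 0; i < GUARD_BYTES; i++)
        if (raw[HEADER_BYTES + i] != GUARD_FILL)
            return -1;
    for (size_t i = 0; i < GUARD_BYTES; i++)
        if (user[n + i] != GUARD_FILL)
            return 1;
    return 0;
}

/* Check the guards, then release the block. Returns the guarded_check
 * result (0 for a NULL pointer). */
int guarded_free(void *p)
{
    if (p == NULL)
        return 0;
    int status = guarded_check(p);
    free((unsigned char *)p - HEADER_BYTES - GUARD_BYTES);
    return status;
}
```

If guarded_free returns nonzero for a buffer, something wrote outside that 
buffer's bounds before it was freed, which would suggest a problem in the 
application rather than in Open MPI.  It only catches writes that land within 
the guard zones, so a clean result does not rule out corruption elsewhere.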

Does anyone have any ideas of what I could try to work around this issue?

Thanks,
Justin

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
