Hi all,

I hope to provide enough information to make my problem clear. I have been debugging a lot after continually getting a segfault in my program, but then I decided to try and run it on another node, and it didn't segfault! The program which causes this strange behaviour can be downloaded with:
$ git clone https://toothbr...@github.com/toothbrush/bsp-cg.git

It depends on bsponmpi (can be found at: http://bsponmpi.sourceforge.net/ ).

The machine on which I get a segfault is:

Linux scarlatti 2.6.38-2-amd64 #1 SMP Thu Apr 7 04:28:07 UTC 2011 x86_64 GNU/Linux

OpenMPI --version:
mpirun (Open MPI) 1.4.3

And the error message is:

[scarlatti:22100] *** Process received signal ***
[scarlatti:22100] Signal: Segmentation fault (11)
[scarlatti:22100] Signal code: (128)
[scarlatti:22100] Failing at address: (nil)
[scarlatti:22100] [ 0] /lib/libpthread.so.0(+0xef60) [0x7f33ca69ef60]
[scarlatti:22100] [ 1] /lib/libc.so.6(+0x74121) [0x7f33ca3a3121]
[scarlatti:22100] [ 2] /lib/libc.so.6(__libc_malloc+0x70) [0x7f33ca3a5930]
[scarlatti:22100] [ 3] src/cg(vecalloci+0x2c) [0x401789]
[scarlatti:22100] [ 4] src/cg(bspmv_init+0x60) [0x40286a]
[scarlatti:22100] [ 5] src/cg(bspcg+0x63b) [0x401f8b]
[scarlatti:22100] [ 6] src/cg(main+0xd3) [0x402517]
[scarlatti:22100] [ 7] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f33ca34dc4d]
[scarlatti:22100] [ 8] src/cg() [0x401609]
[scarlatti:22100] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 22100 on node scarlatti exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The program can be invoked (after downloading the source, running make, and cd'ing into the project's root directory) like:

$ mpirun -np 2 src/cg examples/test.mtx-P2 examples/test.mtx-v2 examples/test.mtx-u2

The program seems to fail at src/bspedupack.c:vecalloci(), but printf'ing the pointer that's returned by malloc() looks okay.
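The check described above (printing the pointer returned by malloc()) can be sketched roughly as follows. This is a hypothetical stand-in, not the actual code from src/bspedupack.c: the name vecalloci_debug, its signature, and its body are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for vecalloci(); NOT the code from
 * src/bspedupack.c. Allocates a vector of n ints and prints the
 * pointer returned by malloc() to stderr, so the value can be
 * inspected just before the function returns. */
int *vecalloci_debug(int n)
{
    int *pi = NULL;

    assert(n >= 0);  /* a negative length would indicate a caller bug */

    if (n > 0) {
        pi = malloc((size_t)n * sizeof *pi);
        if (pi == NULL) {
            fprintf(stderr, "vecalloci_debug: not enough memory\n");
            exit(EXIT_FAILURE);
        }
    }

    /* The debugging output: requested length and resulting pointer. */
    fprintf(stderr, "vecalloci_debug: n=%d -> %p\n", n, (void *)pi);
    return pi;
}
```

Calling vecalloci_debug(n) in place of the real allocator prints one line per allocation. Note that the backtrace shows the fault inside __libc_malloc itself, so a print placed after the malloc() call would not be reached by the failing call.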
The node on which the program DOES run without a segfault is as follows (OS X laptop):

Darwin purcell 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29 15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386

OpenMPI --version:
mpirun (Open MPI) 1.2.8

Please let me know whether this is a real bug in OpenMPI, or whether I'm coding something incorrectly. Note that I'm not asking anyone to debug my code for me; the source is provided purely in case people want to try to reproduce my error locally.

If I can provide more info, please advise. I'm not an MPI expert, unfortunately.

Kind regards,
Paul van der Walt

--
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org