We have an application that runs for a very long time on 16 processes (of the order of a few months; we do have checkpoints, but that is not the issue here). Twice so far, it has failed with the error message appended below after running undisturbed for 20-25 days. The error is not systematically reproducible, and I believe this is not just because the program is parallel. We use openmpi-1.2.5 as distributed in Scientific Linux, the RH 5.2 clone on which our cluster is based. Does this stack trace suggest anything to eyes more trained than mine?

Many thanks,
Biagio Lucini

-----------------------------------------------------------------------------------------------------------------------------------------

[node20:04178] *** Process received signal ***
[node20:04178] Signal: Segmentation fault (11)
[node20:04178] Signal code: Address not mapped (1)
[node20:04178] Failing at address: 0x2aaadb8b31a0
[node20:04178] [ 0] /lib64/libpthread.so.0 [0x2b5d9c3ebe80]
[node20:04178] [ 1] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(_int_malloc+0x1d4) [0x2b5d9ccb2f84]
[node20:04178] [ 2] /usr/lib64/openmpi/1.2.5-gcc/lib/libopen-pal.so.0(malloc+0x93) [0x2b5d9ccb4d93]
[node20:04178] [ 3] /lib64/libc.so.6 [0x2b5d9d77729a]
[node20:04178] [ 4] /usr/lib64/libstdc++.so.6(_ZNSt12__basic_fileIcE4openEPKcSt13_Ios_Openmodei+0x54) [0x2b5d9bf05cb4]
[node20:04178] [ 5] /usr/lib64/libstdc++.so.6(_ZNSt13basic_filebufIcSt11char_traitsIcEE4openEPKcSt13_Ios_Openmode+0x83) [0x2b5d9beb45c3]
[node20:04178] [ 6] ./k-string(wait_thread_+0x2a1) [0x42e101]
[node20:04178] [ 7] ./k-string(MAIN__+0x2a72) [0x4212d2]
[node20:04178] [ 8] ./k-string(main+0xe) [0x42e2ce]
[node20:04178] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5d9d7338b4]
[node20:04178] [10] ./k-string(__gxx_personality_v0+0xb9) [0x404719]
[node20:04178] *** End of error message ***
mpirun noticed that job rank 0 with PID 4152 on node node19 exited on signal 15 (Terminated).
