Can you send a small program that reproduces the problem, perchance?

-jms
Sent from my PDA.  No type good.
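
(For reference, a minimal program of the requested kind might look like the
sketch below. It simply ping-pongs a buffer between two ranks with
MPI_Isend/MPI_Irecv completed by MPI_Wait, which is the path visible in the
backtrace quoted below (mpi_wait_ -> PMPI_Wait -> opal_progress ->
mca_btl_tcp). The message size, iteration count and communication pattern are
assumptions; the actual pattern in roms is not known from the trace, so this
is only an illustration, not the user's code.)

/* sketch: nonblocking exchange completed by MPI_Wait over two ranks */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT (1 << 20)   /* 1 Mi doubles per message (assumed size)       */
#define ITERS 1000        /* assumed iteration count                       */

int main(int argc, char **argv)
{
    int rank, size, peer, i, iter;
    double *sendbuf, *recvbuf;
    MPI_Request sreq, rreq;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    sendbuf = malloc(COUNT * sizeof(double));
    recvbuf = malloc(COUNT * sizeof(double));
    for (i = 0; i < COUNT; i++) sendbuf[i] = (double)rank;

    /* ranks 0 and 1 ping-pong; any extra ranks sit out via MPI_PROC_NULL */
    peer = (rank == 0) ? 1 : (rank == 1 ? 0 : MPI_PROC_NULL);

    for (iter = 0; iter < ITERS; iter++) {
        MPI_Irecv(recvbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &rreq);
        MPI_Isend(sendbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &sreq);
        /* same MPI_Wait call that shows up as frames [6]/[7] in the trace */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    }

    if (rank == 0) printf("completed %d iterations\n", ITERS);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

(Built with mpicc and launched across at least two nodes, for example
"mpirun -np 2 -hostfile hosts ./tcp_wait_test", where tcp_wait_test and hosts
are placeholder names, it drives the same MPI_Wait / TCP BTL progress path as
in the trace below.)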

----- Original Message -----
From: users-boun...@open-mpi.org <users-boun...@open-mpi.org>
To: us...@open-mpi.org <us...@open-mpi.org>
Sent: Thu Apr 15 01:57:10 2010
Subject: [OMPI users] Segmentation fault in mca_btl_tcp

Hi,

We are using openmpi 1.4.1 on our cluster (in conjunction with Torque). One of 
our users has a problem with his jobs generating a segmentation fault on one 
of the slave nodes; this is the backtrace:

[cstone-00613:28461] *** Process received signal ***
[cstone-00613:28461] Signal: Segmentation fault (11)
[cstone-00613:28461] Signal code:  (128)
[cstone-00613:28461] Failing at address: (nil)
[cstone-00613:28462] *** Process received signal ***
[cstone-00613:28462] Signal: Segmentation fault (11)
[cstone-00613:28462] Signal code: Address not mapped (1)
[cstone-00613:28462] Failing at address: (nil)
[cstone-00613:28461] [ 0] /lib64/libc.so.6 [0x2ba1933dce20]
[cstone-00613:28461] [ 1] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so 
[0x2ba19530ec7a]
[cstone-00613:28461] [ 2] /opt/openmpi-1.3/lib/openmpi/mca_btl_tcp.so 
[0x2ba19530d860]
[cstone-00613:28461] [ 3] /opt/openmpi/lib/libopen-pal.so.0 [0x2ba1938eb16b]
[cstone-00613:28461] [ 4] /opt/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) 
[0x2ba1938e072e]
[cstone-00613:28461] [ 5] /opt/openmpi/lib/libmpi.so.0 [0x2ba193621b38]
[cstone-00613:28461] [ 6] /opt/openmpi/lib/libmpi.so.0(PMPI_Wait+0x5b) 
[0x2ba19364c63b]
[cstone-00613:28461] [ 7] /opt/openmpi/lib/libmpi_f77.so.0(mpi_wait_+0x3a) 
[0x2ba192e98b8a]
[cstone-00613:28461] [ 8] ./roms [0x44976c]
[cstone-00613:28461] [ 9] ./roms [0x449d96]
[cstone-00613:28461] [10] ./roms [0x422708]
[cstone-00613:28461] [11] ./roms [0x402908]
[cstone-00613:28461] [12] ./roms [0x402467]
[cstone-00613:28461] [13] ./roms [0x46d20e]
[cstone-00613:28461] [14] /lib64/libc.so.6(__libc_start_main+0xf4) 
[0x2ba1933ca164]
[cstone-00613:28461] [15] ./roms [0x401dd9]
[cstone-00613:28461] *** End of error message ***
[cstone-00613:28462] [ 0] /lib64/libc.so.6 [0x2b5d57db6e20]
[cstone-00613:28462] *** End of error message ***

The other slaves crash with:
[cstone-00612][[21785,1],35][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

Since this problem seems to happen in the network part of MPI, my guess is 
that there is either something wrong with the network or a bug in Open MPI. 
The same problem also appeared when we were using openmpi 1.3.

How could this problem be solved?

(for more info about the system see attachments)

Thx,

Werner Van Geit

