Pacey, Mike wrote:
One of my users recently reported random hangs of his OpenMPI
application. I've run some tests using multiple 2-node, 16-core runs of
the IMB benchmark and can occasionally replicate the problem. Looking
through the mail archive, a previous occurrence of this error seems to
have been down to suspect code, but as it's IMB failing here, I suspect
the problem lies elsewhere. The full set of errors generated by a
failed run is:

[lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],8][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],14][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],14][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],0][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],12][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],12][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],2][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],10][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],8][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[lancs2-015][[37376,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

I'm used to OpenMPI terminating cleanly, but that's not happening in
this case. All the OpenMPI processes on one node terminate, while the
processes on the other simply spin at 100% CPU utilisation. I've run
this 2-node test a number of times and I'm not seeing any pattern
(i.e., I can't pin it down to a single node; a subsequent run using
the two nodes involved above ran fine).

Can anyone provide any pointers for tracking down this problem? System
details are below, along with a sketch of the kind of launch line I
can test with:

- OpenMPI 1.3.3, compiled with gcc version 4.1.2 20080704 (Red Hat
  4.1.2-44), using only the --prefix and --with-sge configure options.
- OS is Scientific Linux SL release 5.3
- CPUs are 2.3 GHz Opteron 2356
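
If it's useful, this is roughly the kind of launch line I can re-test
with, forcing the TCP BTL and turning up its verbosity (the slot count
and benchmark binary name are placeholders rather than my exact job
script):

mpirun -np $NSLOTS \
       --mca btl tcp,self \
       --mca btl_base_verbose 30 \
       ./IMB-MPI1

Restricting the BTLs to tcp and self just keeps the failing code path
in play, and the btl_base_verbose setting is only there to coax more
detail out of the TCP BTL when a run hangs.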

Regards,
Mike.

-----

Dr Mike Pacey,                         Email: m.pa...@lancaster.ac.uk
High Performance Systems Support,      Phone: 01524 593543
Information Systems Services,            Fax: 01524 594459
Lancaster University,
Lancaster LA1 4YW




I got a similar error when using non-blocking communication on large
datasets. I could not figure out why it was happening, since it seemed
fairly random. I eventually bypassed the problem by switching to
blocking communication, which felt like a bit of a step backwards...
roughly the kind of change sketched below. If anyone knows whether
this is a bug in OpenMPI or somehow connected to hardware, please
share.
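
To show what I mean, here's a minimal sketch of the kind of change (a
made-up ring exchange, not my actual code; the message size, tags and
buffer handling are arbitrary). Building with -DUSE_NONBLOCKING gives
the original non-blocking pattern; the default build uses the blocking
workaround:

/* Sketch only: swapping a non-blocking neighbour exchange for a
 * blocking MPI_Sendrecv.  Sizes and the ring pattern are invented
 * purely for illustration. */
#include <mpi.h>
#include <stdlib.h>

#define COUNT (1 << 20)            /* ~8 MB of doubles per exchange */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* ring neighbours */
    int left  = (rank - 1 + size) % size;

    double *sendbuf = calloc(COUNT, sizeof *sendbuf);
    double *recvbuf = calloc(COUNT, sizeof *recvbuf);

#ifdef USE_NONBLOCKING
    /* Original style: post both sides, then wait.  This is the
     * variant that occasionally failed for me. */
    MPI_Request req[2];
    MPI_Irecv(recvbuf, COUNT, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, COUNT, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
#else
    /* Workaround: a single blocking send/receive pair. */
    MPI_Sendrecv(sendbuf, COUNT, MPI_DOUBLE, right, 0,
                 recvbuf, COUNT, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#endif

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}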

- Atle
