No, it is not obvious, unfortunately. Can you send all the
information listed here:
http://www.open-mpi.org/community/help/
On Mar 3, 2009, at 5:22 AM, Ondrej Marsalek wrote:
Dear everyone,
I have a calculation (the CP2K program) using MPI over Infiniband and
it is stuck. All processes (16 on 4 nodes) are running, taking 100%
CPU. Attaching a debugger reveals this (only the end of the stack
shown here):
(gdb) backtrace
#0 0x00002b3460916dbf in btl_openib_component_progress () from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_btl_openib.so
#1 0x00002b345c22c778 in opal_progress () from /home/marsalek/opt/openmpi-1.3-intel/lib/libopen-pal.so.0
#2 0x00002b345bd2d66d in ompi_request_default_wait_any () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#3 0x00002b345bd6021a in PMPI_Waitany () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#4 0x00002b345bae77f1 in pmpi_waitany__ () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi_f77.so.0
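To make the backtrace above easier to read, here is a minimal sketch that lists which shared library each frame belongs to. It is only an illustration: the function name parse_frames and the regular expression are assumptions of this example, not part of gdb or Open MPI, and the pattern is only expected to match frames shaped like the ones quoted here.

```python
import re
from pathlib import PurePosixPath

# Backtrace copied from the message above; mail wrapping split each frame in two.
BACKTRACE = """\
#0 0x00002b3460916dbf in btl_openib_component_progress () from
/home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_btl_openib.so
#1 0x00002b345c22c778 in opal_progress () from
/home/marsalek/opt/openmpi-1.3-intel/lib/libopen-pal.so.0
#2 0x00002b345bd2d66d in ompi_request_default_wait_any () from
/home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#3 0x00002b345bd6021a in PMPI_Waitany () from
/home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#4 0x00002b345bae77f1 in pmpi_waitany__ () from
/home/marsalek/opt/openmpi-1.3-intel/lib/libmpi_f77.so.0
"""

# Hypothetical pattern: "#N 0xADDR in FUNCTION (...) from /path/to/library"
FRAME_RE = re.compile(
    r"#(\d+)\s+0x[0-9a-f]+\s+in\s+(\S+)\s+\(.*?\)\s+from\s+(\S+)"
)


def parse_frames(text):
    """Return a list of (frame_number, function, library_file_name)."""
    # Collapse all whitespace so frames wrapped across lines are rejoined.
    joined = " ".join(text.split())
    return [
        (int(number), function, PurePosixPath(path).name)
        for number, function, path in FRAME_RE.findall(joined)
    ]


if __name__ == "__main__":
    for number, function, library in parse_frames(BACKTRACE):
        print(f"#{number}  {function:32s} {library}")
```

Every frame shown lies inside an Open MPI library, with the innermost one in the openib BTL progress function. That matches processes spinning at 100% CPU inside MPI_Waitany, but on its own it does not show whether the cause is in IB, Open MPI, or the application.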
It has survived a restart of the IB switch, unlike "healthy" runs. My
question is: is it obvious at what level the problem is? IB, Open
MPI, or the application? I would be glad to provide detailed information
if anyone is willing to help. I want to work on this, but unfortunately
I am not sure where to begin.
Best regards,
Ondrej Marsalek
--
Jeff Squyres
Cisco Systems