No, it is not obvious, unfortunately. Can you send all the information listed here:

    http://www.open-mpi.org/community/help/


On Mar 3, 2009, at 5:22 AM, Ondrej Marsalek wrote:

Dear everyone,

I have a calculation (the CP2K program) running with MPI over InfiniBand,
and it is stuck. All processes (16 on 4 nodes) are running, taking 100%
CPU. Attaching a debugger reveals this (only the end of the stack is
shown here):

(gdb) backtrace
#0  0x00002b3460916dbf in btl_openib_component_progress () from /home/marsalek/opt/openmpi-1.3-intel/lib/openmpi/mca_btl_openib.so
#1  0x00002b345c22c778 in opal_progress () from /home/marsalek/opt/openmpi-1.3-intel/lib/libopen-pal.so.0
#2  0x00002b345bd2d66d in ompi_request_default_wait_any () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#3  0x00002b345bd6021a in PMPI_Waitany () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi.so.0
#4  0x00002b345bae77f1 in pmpi_waitany__ () from /home/marsalek/opt/openmpi-1.3-intel/lib/libmpi_f77.so.0

It has survived a restart of the IB switch, unlike "healthy" runs. My
question is: is it obvious at what level the problem is? IB, Open
MPI, or the application? I would be glad to provide detailed information
if anyone is willing to help. I want to work on this, but unfortunately
I am not sure where to begin.

Best regards,
Ondrej Marsalek


--
Jeff Squyres
Cisco Systems
