Re: [OMPI users] RETRY EXCEEDED ERROR

Jeff Squyres Wed, 4 Mar 2009 16:02:40 -0500

This *usually* indicates a physical / layer 0 problem in your IB fabric. You should run diagnostics on your HCAs, cables, and switches.

Increasing the timeout value should only be necessary on very large IB fabrics and/or very congested networks.
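
For a sense of scale, here is a small sketch of what the timeout values discussed in this thread mean in wall-clock terms. It assumes the formula given in Open MPI's help text for btl_openib_ib_timeout, 4.096 microseconds * 2^value, with values from 0 to 31; the function name is made up for illustration and is not part of Open MPI.

```python
# Sketch only: converts a btl_openib_ib_timeout value into seconds,
# assuming the documented formula 4.096 us * 2**value.
# ib_timeout_seconds is a hypothetical helper, not an Open MPI API.

def ib_timeout_seconds(value: int) -> float:
    """Return the per-attempt transmit timeout implied by the MCA value."""
    if not 0 <= value <= 31:
        raise ValueError("btl_openib_ib_timeout must be between 0 and 31")
    return 4.096e-6 * (2 ** value)


if __name__ == "__main__":
    # 10 and 15 are the two values mentioned in the original question.
    for value in (10, 15):
        ms = ib_timeout_seconds(value) * 1e3
        print(f"btl_openib_ib_timeout={value}: about {ms:.3f} ms per attempt")
```

Under that assumption, going from 10 to 15 lengthens each attempt's timeout by a factor of 32 (2^5). The value itself can be set at run time with something like "mpirun --mca btl_openib_ib_timeout 15 ...", but if a healthy, modestly sized fabric needs that much extra time, that points back to the hardware.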



On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:

I found several reports on the Open MPI users mailing list from users
who needed to bump up the default value for btl_openib_ib_timeout.
We also have some applications on our cluster that have problems
unless we raise this value from the default of 10 to 15:

[[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 250450816 opcode 11048 qp_idx 3

This is seen with Open MPI 1.3 and OpenFabrics 1.4.

Is this normal, or is it an indicator of other problems, maybe related to hardware?
Are there other parameters that need to be looked at too?

Thanks for any insight on this!

Regards,
Jan Lindheim



--
Jeff Squyres
Cisco Systems
