This *usually* indicates a physical-layer ("layer 0") problem in your IB fabric. You should run diagnostics on your HCAs, cables, and switches.

Increasing the timeout value should only be necessary on very large IB fabrics and/or very congested networks.
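
For a sense of scale, here is a rough sketch of what the two values mentioned in this thread mean in wall-clock terms. It assumes the usual InfiniBand local ACK timeout encoding, 4.096 microseconds * 2^btl_openib_ib_timeout; the function name ack_timeout_us is made up for illustration and is not part of Open MPI.

```python
# Sketch only: converts a btl_openib_ib_timeout exponent into a wait time.
# Assumption: timeout = 4.096 us * 2^value (InfiniBand local ACK timeout encoding).
# ack_timeout_us is a hypothetical helper, not an Open MPI function.

BASE_US = 4.096  # base unit of the encoding, in microseconds


def ack_timeout_us(value: int) -> float:
    """Return the per-attempt ACK timeout in microseconds for an exponent."""
    # The field is 5 bits wide; 0 is a special value and is left out of this sketch.
    if not 1 <= value <= 31:
        raise ValueError("timeout exponent must be between 1 and 31")
    return BASE_US * (2 ** value)


if __name__ == "__main__":
    for value in (10, 15):
        millis = ack_timeout_us(value) / 1000
        print(f"btl_openib_ib_timeout={value}: {millis:.3f} ms")
```

Under that assumption, going from 10 to 15 stretches each wait from roughly 4.2 ms to roughly 134 ms, a factor of 32. That may hide a marginal cable or port, but it does not fix one.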


On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:

I found several reports on the Open MPI users mailing list from users
who needed to bump up the default value of btl_openib_ib_timeout.
We also have some applications on our cluster that have problems
unless we raise this value from the default of 10 to 15:

[[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 250450816 opcode 11048 qp_idx 3

This is seen with OpenMPI 1.3 and OpenFabrics 1.4.

Is this normal or is it an indicator of other problems, maybe related to
hardware?
Are there other parameters that need to be looked at too?

Thanks for any insight on this!

Regards,
Jan Lindheim


--
Jeff Squyres
Cisco Systems
