This *usually* indicates a physical / layer 0 problem in your IB
fabric. You should run diagnostics on your HCAs, cables, and switches.
Increasing the timeout value should only be necessary on very large IB
fabrics and/or very congested networks.
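
To put the numbers in perspective, here is a rough sketch of what the timeout value means in wall-clock terms. It assumes the usual InfiniBand local ACK timeout formula of 4.096 microseconds * 2^value, which is how btl_openib_ib_timeout is interpreted as far as I know. The function name and the range check are just illustrative, not anything from Open MPI itself.

```python
# Sketch only: converts an IB timeout exponent into a wall-clock ACK timeout.
# Assumes the local ACK timeout is 4.096 us * 2**exponent.

BASE_US = 4.096  # base unit in microseconds


def ack_timeout_seconds(exponent: int) -> float:
    """Return the local ACK timeout in seconds for a given exponent."""
    # The exponent is a 5-bit field; 0 is treated as "no timeout", so it is
    # excluded here (hypothetical validation, for illustration).
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be in the range 1..31")
    return BASE_US * (2 ** exponent) / 1e6


if __name__ == "__main__":
    for value in (10, 15):
        ms = ack_timeout_seconds(value) * 1e3
        print(f"btl_openib_ib_timeout={value}: {ms:.3f} ms per attempt")
```

So going from 10 to 15 stretches each attempt from about 4.2 ms to about 134 ms, a factor of 32. If a healthy fabric needs that much slack, something is probably dropping or delaying packets.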
On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:
I found several reports on the openmpi users mailing list from users
who needed to raise the default value of btl_openib_ib_timeout.
We also have some applications on our cluster that fail unless we
raise this value from the default of 10 to 15:
[[24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174 to: shc175 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 250450816 opcode 11048 qp_idx 3
This is seen with OpenMPI 1.3 and OpenFabrics 1.4.
Is this normal or is it an indicator of other problems, maybe
related to hardware?
Are there other parameters that need to be looked at too?
Thanks for any insight on this!
Regards,
Jan Lindheim
--
Jeff Squyres
Cisco Systems