This means that you have some problem on that node,
and it's probably unrelated to Open MPI.
Bad cable? Bad port? FW/driver in some bad state?
Do other IB performance tests work OK on this node?
Try rebooting the node.
-- YK
On 12-Sep-11 7:52 AM, Ahsan Ali wrote:
> Hello all
>
> I am getting fol
Hello all
I am getting following error during an application run which causes it to
crash.
*[[36944,1],41][btl_openib_component.c:3227:handle_wc] from
compute-01-19.private.dns.zone to: compute-01-04 error polling LP CQ with
status RETRY EXCEEDED ERROR status number 12 for wr_id 167703304 opcode
It would be best if an IB vendor replies (hint hint!), but it is likely that
you have some kind of hardware issue on that node (e.g., a bad / flaky HCA,
etc.). You should probably run a full set of layer-0 diagnostics on your
fabric to make sure it's clean.
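
When many such messages pile up in a job log, tallying the hosts they name can help narrow down which node to diagnose first. The following is only a minimal sketch in Python: the regular expression is modeled on the handle_wc message quoted in this thread (the wording may differ between Open MPI versions), and the helper names parse_wc_error and count_hosts are hypothetical, not part of Open MPI. For reference, status number 12 matches IBV_WC_RETRY_EXC_ERR in the libibverbs ibv_wc_status enum.

```python
import re
from collections import Counter

# Pattern modeled on the handle_wc message quoted in this thread; the exact
# wording may differ between Open MPI versions, so treat it as an assumption.
WC_ERROR = re.compile(
    r"from\s+(?P<src>\S+)\s+to:\s+(?P<dst>\S+)\s+error polling \w+ CQ "
    r"with status\s+(?P<status>.+?)\s+status number\s+(?P<num>\d+)"
)


def parse_wc_error(text):
    """Return src, dst, status and status number from one message, or None."""
    flat = " ".join(text.split())  # undo mail/terminal line wrapping
    m = WC_ERROR.search(flat)
    if m is None:
        return None
    return {
        "src": m["src"],
        "dst": m["dst"],
        "status": m["status"],
        "num": int(m["num"]),
    }


def count_hosts(messages):
    """Count how often each host appears as an endpoint of a failed transfer."""
    hosts = Counter()
    for msg in messages:
        rec = parse_wc_error(msg)
        if rec:
            hosts[rec["src"]] += 1
            hosts[rec["dst"]] += 1
    return hosts
```

A host that shows up in most of the failing pairs is the natural first candidate for cable, port, and HCA checks; a single message, as in this report, cannot by itself say which end is at fault.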
I say this because back when Cisco
Dear all,
I would like to ask for help with understanding an error message I get
when communication using Open MPI 1.4.1 over InfiniBand fails. After
several hours of operation, communication with one particular node
(f24) fails with something like:
[[20265,1],79][btl_openib_component.c:2951:hand