I'm not sure whether these are being reported by OpenMPI itself or passed
through OpenMPI from OpenFabrics, but I figured this would be a good place
to start.

On one node we received the errors below. I'm not sure I understand the
error sequence; hopefully someone can shed some light on what happened.

[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node27 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id c30b100 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node26 to:
node28 error polling LP CQ with status RETRY EXCEEDED ERROR status
number 12 for wr_id 1755c900 opcode 1 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from (null) to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 1779b180 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node20 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 8e1aa80 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node24 to:
node28 error polling LP CQ with status RETRY EXCEEDED ERROR status
number 12 for wr_id 1164b600 opcode 1 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from (null) to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 118c3f80 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node12 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 1b8f0080 opcode 128 vendor error 0 qp_idx 0

It was the only node out of a 75-node run that spit out these errors. I
rechecked the node: there are no symbol/link recovery errors on the network,
and I ran Pallas between it and several other machines with no errors.

The network is QLogic QDR end to end, running OpenMPI 1.5 and OFED 1.5.2 (Q stack).
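
In case it helps anyone reading the messages above: the status numbers are
verbs work completion codes (5 = work request flushed, 12 = retry exceeded),
and a small decoding aid of my own (not part of OpenMPI) can print the
libibverbs names for them, assuming the libibverbs headers are installed
(compile with -libverbs):

/* Decode the completion status numbers from the handle_wc messages
 * using libibverbs' own string table. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int i;
    int statuses[] = { 5, 12 };   /* values taken from the log above */

    for (i = 0; i < 2; i++)
        printf("status %d = %s\n", statuses[i],
               ibv_wc_status_str((enum ibv_wc_status) statuses[i]));
    return 0;
}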

thanks