Re: [OMPI users] CQ errors

2011-01-10 Thread Michael Di Domenico
2011/1/10 Peter Kjellström :
> On Monday, January 10, 2011 03:06:06 pm Michael Di Domenico wrote:
>> I'm not sure if these are being reported from OpenMPI or through
>> OpenMPI from OpenFabrics, but i figured this would be a good place to
>> start
>>
>> On one node we received the errors below. I'm not sure I understand
>> the error sequence; hopefully someone can shed some light on what
>> happened.
>>
>> [[5691,1],49][btl_openib_component.c:3294:handle_wc] from node27 to:
> ...
>> network is qlogic qdr end to end, openmpi 1.5 and ofed 1.5.2 (q stack)
>
> Not really addressing your problem, but, with qlogic you should be using psm,
> not verbs (btl_openib).
>
> That said, openib should work (slowly).

Yes, you are correct.  We're running via verbs at the moment because
of a slurm interop issue.  I have a patch from Ralph but have not
tested it yet.

So far the only noticeable effect of running non-psm is a 5usec hit
on each packet.  Otherwise, functionally we seem okay.



Re: [OMPI users] CQ errors

2011-01-10 Thread Peter Kjellström
On Monday, January 10, 2011 03:06:06 pm Michael Di Domenico wrote:
> I'm not sure if these are being reported from OpenMPI or through
> OpenMPI from OpenFabrics, but i figured this would be a good place to
> start
> 
> On one node we received the errors below. I'm not sure I understand
> the error sequence; hopefully someone can shed some light on what
> happened.
> 
> [[5691,1],49][btl_openib_component.c:3294:handle_wc] from node27 to:
...
> network is qlogic qdr end to end, openmpi 1.5 and ofed 1.5.2 (q stack)

Not really addressing your problem, but, with qlogic you should be using psm, 
not verbs (btl_openib).

That said, openib should work (slowly).

/Peter



[OMPI users] CQ errors

2011-01-10 Thread Michael Di Domenico
I'm not sure if these are being reported from OpenMPI or through
OpenMPI from OpenFabrics, but i figured this would be a good place to
start

On one node we received the errors below. I'm not sure I understand
the error sequence; hopefully someone can shed some light on what
happened.

[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node27 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id c30b100 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node26 to:
node28 error polling LP CQ with status RETRY EXCEEDED ERROR status
number 12 for wr_id 1755c900 opcode 1 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from (null) to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 1779b180 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node20 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 8e1aa80 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node24 to:
node28 error polling LP CQ with status RETRY EXCEEDED ERROR status
number 12 for wr_id 1164b600 opcode 1 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from (null) to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 118c3f80 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node12 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id 1b8f0080 opcode 128 vendor error 0 qp_idx 0

It was the only node out of a 75-node run that spit out the error.  I
rechecked the node: no symbol/link recovery errors on the network, and
I ran Pallas between it and several other machines with no errors.

network is qlogic qdr end to end, openmpi 1.5 and ofed 1.5.2 (q stack)

thanks
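
For anyone trying to read the error sequence in the original post, here is
a minimal Python sketch that parses the wrapped handle_wc lines and maps the
numeric fields to the names used in libibverbs' verbs.h. The helper name
parse_cq_errors and the regular expression are illustrative assumptions, not
part of Open MPI. Only the two status values and two opcode values that appear
in this thread are mapped. Note that, per the ibv_poll_cq documentation, the
opcode field is not guaranteed to be meaningful for a failed completion, so
treat the opcode names as a hint at best.

```python
import re
from collections import Counter

# Subset of enum ibv_wc_status and enum ibv_wc_opcode (libibverbs verbs.h);
# only the values that occur in this thread are listed.
WC_STATUS = {5: "IBV_WC_WR_FLUSH_ERR", 12: "IBV_WC_RETRY_EXC_ERR"}
WC_OPCODE = {1: "IBV_WC_RDMA_WRITE", 128: "IBV_WC_RECV"}

# Hypothetical pattern written against the log lines quoted in this thread.
ENTRY = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling (?P<cq>HP|LP) CQ "
    r"with status (?P<text>[A-Z_ ]+?) status number (?P<status>\d+) "
    r"for wr_id (?P<wr_id>[0-9a-f]+) opcode (?P<opcode>\d+)"
)


def parse_cq_errors(log_text):
    """Return one dict per CQ error found in the (possibly wrapped) log text."""
    flat = " ".join(log_text.split())  # undo the mail client's line wrapping
    entries = []
    for m in ENTRY.finditer(flat):
        status = int(m.group("status"))
        opcode = int(m.group("opcode"))
        entries.append({
            "src": m.group("src"),
            "dst": m.group("dst"),
            "cq": m.group("cq"),
            "status": status,
            "status_name": WC_STATUS.get(status, "unknown"),
            # opcode may be invalid on error completions; informational only
            "opcode_name": WC_OPCODE.get(opcode, "unknown"),
            "wr_id": m.group("wr_id"),
        })
    return entries


# First two entries from the original post, copied verbatim.
SAMPLE = """\
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node27 to:
node28 error polling HP CQ with status WORK_REQUEST FLUSHED ERROR
status number 5 for wr_id c30b100 opcode 128 vendor error 0 qp_idx 0
[[5691,1],49][btl_openib_component.c:3294:handle_wc] from node26 to:
node28 error polling LP CQ with status RETRY EXCEEDED ERROR status
number 12 for wr_id 1755c900 opcode 1 vendor error 0 qp_idx 0
"""

if __name__ == "__main__":
    errors = parse_cq_errors(SAMPLE)
    for e in errors:
        print(e["src"], "->", e["dst"], e["cq"], e["status_name"])
    print(Counter(e["status_name"] for e in errors))
```

A common reading of this kind of output, offered as an assumption rather than
a diagnosis: a retry-exceeded completion usually means the sender stopped
receiving acknowledgements from the peer, and the flush errors are the
remaining work requests being drained once a queue pair has entered the error
state.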