Re: [OMPI users] Infiniband Error

2011-09-12 Thread Yevgeny Kliteynik
This means that you have some problem on that node, and it's probably unrelated to Open MPI. Bad cable? Bad port? FW/driver in some bad state? Do other IB performance tests work OK on this node? Try rebooting the node.
-- YK
On 12-Sep-11 7:52 AM, Ahsan Ali wrote:
> Hello all
>
> I am getting fol

[OMPI users] Infiniband Error

2011-09-12 Thread Ahsan Ali
Hello all, I am getting the following error during an application run, which causes it to crash:
[[36944,1],41][btl_openib_component.c:3227:handle_wc] from compute-01-19.private.dns.zone to: compute-01-04 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 167703304 opcode

Re: [OMPI users] Infiniband error

2010-11-12 Thread Jeff Squyres
It would be best if an IB vendor replies (hint hint!), but it is likely that you have some kind of hardware issue on that node (e.g., a bad / flakey HCA, etc.). You should probably run a full set of layer-0 diagnostics on your fabric to make sure it's clean. I say this because back when Cisco

[OMPI users] Infiniband error

2010-11-04 Thread Ondrej Marsalek
Dear all, I would like to ask for help with understanding an error message I get when communication using Open MPI 1.4.1 over Infiniband fails. After several hours of operation, communication with one particular node (f24) fails with something like:
[[20265,1],79][btl_openib_component.c:2951:hand
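
Both threads point at a single misbehaving node, so it can help to tally which peer host shows up in the openib error lines of a job log. The following is a minimal Python sketch, not part of Open MPI: the function names `parse_openib_error` and `tally_peers` and the field labels are hypothetical, and the pattern covers only the fields visible in the error line quoted in the 2011 post (the archive truncates the message after "opcode", so later fields are deliberately ignored).

```python
import re
from collections import Counter

# Pattern derived only from the error line quoted in the 2011 post.
# Group names are illustrative labels, not Open MPI terminology.
ERROR_RE = re.compile(
    r"\[\[(?P<job_major>\d+),(?P<job_minor>\d+)\],(?P<proc>\d+)\]"
    r"\[(?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\]\s+"
    r"from (?P<src>\S+) to: (?P<dst>\S+) "
    r"error polling (?P<cq>\w+) CQ with status (?P<status>.+?) "
    r"status number (?P<code>\d+)"
)


def parse_openib_error(text):
    """Return a dict of fields from one error line, or None if it does not match."""
    match = ERROR_RE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared and sorted as integers.
    for key in ("job_major", "job_minor", "proc", "line", "code"):
        fields[key] = int(fields[key])
    return fields


def tally_peers(lines):
    """Count how often each destination host appears in matching error lines."""
    counts = Counter()
    for text in lines:
        fields = parse_openib_error(text)
        if fields is not None:
            counts[fields["dst"]] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "[[36944,1],41][btl_openib_component.c:3227:handle_wc] "
        "from compute-01-19.private.dns.zone to: compute-01-04 "
        "error polling LP CQ with status RETRY EXCEEDED ERROR "
        "status number 12 for wr_id 167703304 opcode"
    )
    print(parse_openib_error(sample))
    print(tally_peers([sample, "unrelated log line"]))
```

If one host dominates the tally across many ranks, that is consistent with the advice in the replies to check that node's cable, port, and HCA first; the tally only narrows the search and does not replace fabric diagnostics.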