2009/2/26 Brett Pemberton <br...@vpac.org>:
> [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org
> to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status
> number 12 for wr_id 38996224 opcode 0 qp_idx 0

What OS are you using?

I've seen this error and many other InfiniBand-related errors on Red Hat Enterprise Linux 4 Update 4, with ConnectX cards and various versions of OFED up to 1.3. Depending on the MCA parameters, I also see hangs often enough to make native InfiniBand unusable on this OS.

However, the openib BTL works just fine on the same hardware and the same OFED/Open MPI stack under CentOS 4.6. I suspect something in the kernel is contributing to these problems, but I haven't had a chance to test the 4.6 kernel on 4.4.

mch