> have a user whose code at scale dies reliably with the errors (new hosts each
> time):
>
> We have been using for this code:
> -mca btl_openib_receive_queues X,4096,128:X,12288,128:X,65536,12
>
> Without that option it dies with an out of memory message reliably.
>
> Note this code runs fine at the same scale on Pilaties (NASA SGI box) using
> MPT,
>
> Are we running out of QP? Is that possible?

I don't think this is a running-out-of-QP error. The initiator gets a NACK on its request, which essentially says that the request is invalid. The passive side reports a QP access error.

Do you observe this error on small-scale runs, say 8-16 nodes?
Did you try replacing all the "X" with "S" to see what happens?
Do you know which OFED version is installed on your system? The last time I tested XRC (X) was with OFED 1.5.1. I'm wondering if a newer OFED version changed the XRC behavior.

Regards,
Pasha

> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event. Open MPI
> will try to continue, but your job may end up failing.
>
> Local host: nyx5608.engin.umich.edu
> MPI process PID: 42036
> Error number: 3 (IBV_EVENT_QP_ACCESS_ERR)
>
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
> [[9462,1],3][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3394:handle_wc]
> from nyx5608.engin.umich.edu to: nyx5022 error polling LP CQ with status
> INVALID REQUEST ERROR status number 9 for wr_id 14d6d00 opcode 0 vendor
> error 138 qp_idx 0
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event. Open MPI
> will try to continue, but your job may end up failing.
>
> Local host: (null)
> MPI process PID: 42038
> Error number: 3 (IBV_EVENT_QP_ACCESS_ERR)
>
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
>
>
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
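
For reference, the X-to-S substitution suggested above can be sketched as follows. This is only an illustration: it assumes the queue sizes and buffer counts from the original option are kept unchanged, and the variable names XRC_QUEUES and SRQ_QUEUES are made up for the example.

```shell
# Original queue specification from the report (XRC queues, prefix "X").
XRC_QUEUES="X,4096,128:X,12288,128:X,65536,12"

# Swap each queue-type prefix "X" for "S" (shared receive queue),
# leaving the sizes and buffer counts untouched.
SRQ_QUEUES=$(printf '%s\n' "$XRC_QUEUES" | sed -e 's/^X,/S,/' -e 's/:X,/:S,/g')

# Print the option as it would be given to mpirun.
echo "-mca btl_openib_receive_queues $SRQ_QUEUES"
```

The resulting option can be passed to mpirun in place of the X-based one. If the error no longer appears with S queues, that would point toward the XRC path rather than QP exhaustion. On systems with the OFED utilities installed, `ofed_info -s` reports the installed OFED version.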