> have a user whose code at scale dies reliably with the following errors (new 
> hosts each time):
> 
> For this code we have been using:
> -mca btl_openib_receive_queues X,4096,128:X,12288,128:X,65536,12
> 
> Without that option it reliably dies with an out-of-memory message. 
> 
> Note this code runs fine at the same scale on Pleiades (NASA SGI box) using 
> MPT. 
> 
> Are we running out of QP?  Is that possible?
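
As far as I remember, each colon-separated entry in btl_openib_receive_queues has 
the form <type>,<buffer size in bytes>,<number of buffers>[,further tuning fields], 
so the setting above requests three XRC ("X") receive queues with 4096-, 12288-, 
and 65536-byte buffers.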

I don't think this is a running-out-of-QP error. 

The initiator gets a NACK on the request, which essentially says that the request 
isn't valid. The passive side reports a QP access error.
Do you observe this error on small-scale runs, say 8-16 nodes?

Did you try replacing all of the "X" entries with "S" to see what happens? Do you 
know what OFED version is installed on your system?
The last time I tested XRC ("X") was with OFED 1.5.1; I'm wondering whether a newer 
OFED version has changed the XRC behavior.
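
Something along these lines would do it (a sketch only: the process count and 
binary name are placeholders, and the queue sizes are simply your current values 
with "S" in place of "X"):

  mpirun -np 512 \
      -mca btl_openib_receive_queues S,4096,128:S,12288,128:S,65536,12 \
      ./your_app

If the OFED userspace tools are installed, "ofed_info -s" should print the 
installed OFED version.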


Regards,
Pasha


> 
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event.  Open MPI
> will try to continue, but your job may end up failing.
> 
>  Local host:        nyx5608.engin.umich.edu
>  MPI process PID:   42036
>  Error number:      3 (IBV_EVENT_QP_ACCESS_ERR)
> 
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --------------------------------------------------------------------------
> [[9462,1],3][../../../../../openmpi-1.6/ompi/mca/btl/openib/btl_openib_component.c:3394:handle_wc]
>  from nyx5608.engin.umich.edu to: nyx5022 error polling LP CQ with status 
> INVALID REQUEST ERROR status number 9 for wr_id 14d6d00 opcode 0  vendor 
> error 138 qp_idx 0
> --------------------------------------------------------------------------
> The OpenFabrics stack has reported a network error event.  Open MPI
> will try to continue, but your job may end up failing.
> 
>  Local host:        (null)
>  MPI process PID:   42038
>  Error number:      3 (IBV_EVENT_QP_ACCESS_ERR)
> 
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> CAEN Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

