Pardon me if this has already been addressed, but I could not find an answer
in the Open MPI FAQ or by searching the site with Google.

We are in the process of analyzing and troubleshooting MPI jobs of increasingly
large scale (Open MPI 1.6.5).  At a sufficiently large scale (number of cores),
a job ends up failing with errors similar to:

[yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
[xxxxx:29318] 853 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed

My educated guess is that we are running into a memory limitation when the
queue pairs needed to support such a huge mesh are being created.  We are now
investigating the XRC transport as a way to decrease memory consumption.

Anyway, my questions are:

1. How do we determine HOW MUCH memory is being pinned by an MPI job on a
   node?  (If pmap, what exactly are we looking for?)

2. How do we determine WHERE these pinned memory regions are?
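
To make question 1 concrete, here is a minimal sketch of the kind of per-process
check we have in mind.  It assumes the kernel reports locked and pinned pages in
the VmLck and VmPin fields of /proc/<pid>/status; VmPin only exists on newer
kernels and may be absent on RedHat 6.x, and whether memory registered through
the openib BTL is counted in either field is an assumption we have not verified.
The function names are hypothetical.

```python
import re
import sys

# Fields of /proc/<pid>/status that may reflect locked or pinned memory.
# Assumption: registered (pinned) memory is accounted in one of these.
FIELDS = ("VmLck", "VmPin")


def parse_locked_kb(status_text):
    """Return {field: kB} for the FIELDS present in a /proc/<pid>/status dump."""
    found = {}
    for line in status_text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)\s+kB", line)
        if match and match.group(1) in FIELDS:
            found[match.group(1)] = int(match.group(2))
    return found


def locked_kb_for_pid(pid):
    """Read /proc/<pid>/status for a live process and parse it."""
    with open("/proc/%s/status" % pid) as handle:
        return parse_locked_kb(handle.read())


if __name__ == "__main__":
    # Usage: python locked.py PID [PID ...]
    for pid in sys.argv[1:]:
        print("%s %s" % (pid, locked_kb_for_pid(pid)))
```

Run against the PIDs of the MPI ranks on one node, this would give one number
per rank to sum up.  It says nothing about WHERE the regions are, which is
question 2.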

We are running RedHat 6.x.  Thank you!

