Pardon if this has been addressed already, but I could not find the answer 
after going through the OpenMPI FAQ and doing Google searches of the 
open-mpi.org site.

We are analyzing and troubleshooting MPI jobs of increasingly large scale 
(OpenMPI 1.6.5).  At a sufficiently large scale (number of cores), a job 
fails with errors similar to:

[yyyyy][[56933,1],1904][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
[xxxxx:29318] 853 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed

My educated guess is that we are running into a memory limitation when 
queue pairs are created to support such a large mesh of connections.  We are 
now investigating the XRC transport to reduce memory consumption.

Anyway, my questions are:

1. How do we determine HOW MUCH memory is being pinned by an MPI job on a 
   node?  (If pmap, what exactly are we looking for?)

2. How do we determine WHERE these pinned memory regions are?
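
One possible starting point for question 1, offered as a sketch and not as a confirmed answer: Linux reports per-process locked and pinned memory in /proc/<pid>/status as the VmLck and VmPin fields. VmPin only exists on Linux 3.2 and later, so on a RHEL 6 kernel only VmLck may be present, and whether it accounts for memory registered through the openib BTL is an assumption that would need to be verified. The helper names below are hypothetical.

```python
import re
import sys


def parse_locked_pinned_kb(status_text):
    """Extract VmLck/VmPin values (in kB) from /proc/<pid>/status text.

    Fields that are absent (e.g. VmPin on older kernels) are simply
    left out of the returned dict.
    """
    result = {}
    for line in status_text.splitlines():
        m = re.match(r"^(VmLck|VmPin):\s+(\d+)\s+kB", line)
        if m:
            result[m.group(1)] = int(m.group(2))
    return result


def read_status(pid):
    """Read the raw status text for one process (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return f.read()


if __name__ == "__main__":
    # PIDs of the MPI ranks on this node are passed on the command line.
    for pid in map(int, sys.argv[1:]):
        print(pid, parse_locked_pinned_kb(read_status(pid)))
```

Run on a compute node while the job is active, passing the PIDs of the local MPI ranks as arguments; summing the values across ranks would give a rough per-node figure under the assumptions above.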

We are running RedHat 6.x.  Thank you!

--john
