Hi,

Thanks for getting back to me (and thanks to Jeff for the explanation too).
On Thu, 2011-05-19 at 09:59 -0600, Samuel K. Gutierrez wrote:
> Hi,
>
> On May 19, 2011, at 9:37 AM, Robert Horton wrote:
>
> > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> >> Hi,
> >>
> >> Try the following QP parameters that only use shared receive queues.
> >>
> >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >>
> >
> > Thanks for that. If I run the job over 2 x 48 cores it now works and the
> > performance seems reasonable (I need to do some more tuning), but when I
> > go up to 4 x 48 cores I'm getting the same problem:
> >
> > [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> > error creating qp errno says Cannot allocate memory
> > [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >
> > Any thoughts?
>
> How much memory does each node have? Does this happen at startup?

Each node has 64GB of RAM. The error happens fairly soon after the job starts.

> Try adding:
>
> -mca btl_openib_cpc_include rdmacm

Ah - that looks much better. I can now run hpcc over all 15 x 48 cores. I need
to look at the performance in a bit more detail, but it seems to be
"reasonable" at least :)

One thing is puzzling me: when I compile Open MPI myself it seems to lack
rdmacm support, whereas the build produced by the OFED install process does
include it. I'm configuring with:

'--prefix=/share/apps/openmpi/1.4.3/gcc' '--with-sge' '--with-openib' '--enable-openib-rdmacm'

Any idea what might be going on there?

> I'm not sure if your version of OFED supports this feature, but maybe using
> XRC may help. I **think** other tweaks are needed to get this going, but I'm
> not familiar with the details.

I'm using QLogic (QLE7340) rather than Mellanox cards, so that doesn't seem to
be an option for me (?). It would be interesting to know how much difference it
would make, though...

Thanks again for your help, and have a good weekend.

Rob
-- 
Robert Horton
System Administrator (Research Support) - School of Mathematical Sciences
Queen Mary, University of London
r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345
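
P.S. In case it's useful to anyone searching the archives later: the run
command I'm now testing with looks roughly like the sketch below. The BTL
list, process count (15 x 48 = 720) and the hpcc path are just from my own
setup, so treat them as placeholders rather than recommended values.

    mpirun -np 720 \
        -mca btl openib,sm,self \
        -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 \
        -mca btl_openib_cpc_include rdmacm \
        ./hpcc

To chase down why my own build is missing rdmacm, I'm planning to compare what
ompi_info reports for the two installs, something along the lines of:

    ompi_info --param btl openib | grep cpc

I'm assuming that will show the btl_openib_cpc_include parameter (and hence
which connection managers were actually built in), but I haven't
double-checked the exact output yet.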