Thank you, Nathan. The default btl_openib_receive_queues setting is:

    P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

which defines four queue pairs per connection: one per-peer (P) queue plus three shared receive queues (S).
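As a quick sanity check on that count, splitting the setting string on its colons (nothing Open MPI specific, just counting the queue specifications):

    rq="P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64"
    echo "$rq" | tr ':' '\n' | wc -l    # prints 4, i.e. four QPs per connection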
With max_qp = 392632 and those four QPs per connection, the "actual" max would be 392632 / 4 = 98158. Using this value in my prior math, the upper bound on the number of 24-core nodes would be 98158 / 24^2 ~ 170, which comes closer to the limit I encountered while testing. I'm sure there are other particulars I am not accounting for in this math, but the approximation is reasonable.

Thanks for the clarification, Nathan!

--john

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
Sent: Thursday, June 16, 2016 9:56 AM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

XRC support is greatly improved in 1.10.x and 2.0.0. It would be interesting to see whether a newer version fixes the shutdown hang.

When calculating the required number of queue pairs, you also have to divide by the number of queue pairs in the btl_openib_receive_queues parameter. Additionally, Open MPI uses 1 QP per rank for connections (1.7+), and some QPs are in use by IPoIB and other services.

-Nathan

On Jun 16, 2016, at 7:15 AM, Sasso, John (GE Power, Non-GE) <john1.sa...@ge.com> wrote:

Nathan,

Thank you for the suggestion. I tried your btl_openib_receive_queues setting with a 4200+ core IMB job, and the job ran (great!). However, the shutdown of the job took so long that after 6 minutes I had to force-terminate it.

When I tried this scheme before, using the following settings recommended by the Open MPI FAQ, I got some odd errors:

    --mca btl openib,sm,self --mca btl_openib_receive_queues X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32

However, when I tried:

    --mca btl openib,sm,self --mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512

my aforementioned job succeeded.

I am going to do more testing, with the goal of getting a 5000-core job to run successfully. If that works, my concern down the road is the impact the btl_openib_receive_queues MCA parameter (above) will have on smaller-scale (< 1024 core) jobs, since changing it in the global Open MPI config file would affect ALL users and jobs at every scale.

Chuck, as I noted in my first email, log_num_mtt was set fine, so that is not the issue here.

Finally, regarding running out of QPs, I examined the output of 'ibv_devinfo -v' on our compute nodes. I see the following pertinent settings:

    max_qp:              392632
    max_qp_wr:           16351
    max_qp_rd_atom:      16
    max_qp_init_rd_atom: 128
    max_cq:              65408
    max_cqe:             4194303

Figuring that max_qp is the main limit I am running into when using the PP and SRQ QPs, and considering 24 cores per node, this would seem to imply an upper bound of 392632 / 24^2 ~ 681 nodes. That does not make sense, because I saw the QP creation failure (again, NO error about failing to register enough memory) with as few as 177 24-core nodes! I don't know how to make sense of this, though I don't question that we were running out of QPs.

--john


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
Sent: Wednesday, June 15, 2016 2:43 PM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

You ran out of queue pairs. There is no way around this for larger all-to-all transfers when using the openib BTL and SRQ: you need O(cores^2) QPs to fully connect with SRQ or PP QPs.
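As a rough back-of-the-envelope sketch of that scaling (illustrative only: it ignores the extra QPs consumed by IPoIB and other services, and assumes on-node traffic stays on the sm BTL):

    # QPs one HCA must supply when every local rank connects to every remote rank,
    # with 4 QPs per connection (the default receive_queues layout above)
    nodes=177; cores=24; qps_per_conn=4
    echo $(( qps_per_conn * cores * cores * (nodes - 1) ))   # 405504, already more than max_qp = 392632

With 24-core nodes and four QPs per connection, an HCA reporting max_qp = 392632 therefore tops out around 170 fully connected nodes, consistent with the failures seen at 177 nodes.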
I recommend using XRC instead by adding:

    btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512

to your openmpi-mca-params.conf, or

    -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512

to the mpirun command line.

-Nathan

On Jun 15, 2016, at 12:35 PM, "Sasso, John (GE Power, Non-GE)" <john1.sa...@ge.com> wrote:

Chuck,

The per-process limits appear fine, including those for the resource manager daemons:

    Limit                     Soft Limit    Hard Limit    Units
    Max address space         unlimited     unlimited     bytes
    Max core file size        0             0             bytes
    Max cpu time              unlimited     unlimited     seconds
    Max data size             unlimited     unlimited     bytes
    Max file locks            unlimited     unlimited     locks
    Max file size             unlimited     unlimited     bytes
    Max locked memory         unlimited     unlimited     bytes
    Max msgqueue size         819200        819200        bytes
    Max nice priority         0             0
    Max open files            16384         16384         files
    Max pending signals       515625        515625        signals
    Max processes             515625        515625        processes
    Max realtime priority     0             0
    Max realtime timeout      unlimited     unlimited     us
    Max resident set          unlimited     unlimited     bytes
    Max stack size            307200000     unlimited     bytes

As for the FAQ regarding registered memory, checking our Open MPI settings with ompi_info, we have:

    mpool_rdma_rcache_size_limit = 0   ==> Open MPI will register as much user memory as necessary
    btl_openib_free_list_max = -1      ==> Open MPI will try to allocate as many registered buffers as it needs
    btl_openib_eager_rdma_num = 16
    btl_openib_max_eager_rdma = 16
    btl_openib_eager_limit = 12288

Other suggestions welcome. Hitting a brick wall here. Thanks!

--john


-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 15, 2016 1:39 PM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

Hi John

1) For diagnostics, you could check the actual "per process" limits on the nodes while that big job is running:

    cat /proc/$PID/limits

2) If you're using a resource manager to launch the job, the resource manager daemon/daemons (local to the nodes) may have to set the memlock and other limits, so that the Open MPI processes inherit them.
I use Torque, so I put these lines in the pbs_mom (Torque local daemon) initialization script:

    # pbs_mom system limits
    # max file descriptors
    ulimit -n 32768
    # locked memory
    ulimit -l unlimited
    # stacksize
    ulimit -s unlimited

3) See also this FAQ entry related to registered memory. I set these parameters in /etc/modprobe.d/mlx4_core.conf, but where they're set may depend on the Linux distro/release and the OFED you're using.

https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

I hope this helps,
Gus Correa
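For a job that is already running, a quick way to spot-check the limits a rank actually inherited (an illustrative one-liner; the process-name match is a guess, so substitute the name of your MPI binary):

    pid=$(pgrep -n -f IMB)    # hypothetical match on an IMB benchmark process
    grep -E 'locked memory|open files|stack size' /proc/$pid/limits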
On 06/15/2016 11:05 AM, Sasso, John (GE Power, Non-GE) wrote:

In doing testing with IMB, I find that a 4200+ core run of the IMB Alltoall test, with message lengths of 16..1024 bytes (as per the -msglog 4:10 IMB option), fails with:

--------------------------------------------------------------------------
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:      node7106
Local device:    mlx4_0
Queue pair type: Reliable connected (RC)
--------------------------------------------------------------------------
[node7106][[51922,1],0][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
[node7106:06503] [[51922,0],0]-[[51922,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 6504 on
node node7106 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Yes, these are ALL of the error messages. I did not get a message about not being able to register enough memory. I verified that log_num_mtt = 24 and log_mtts_per_seg = 0 (both by catting their files in /sys/module/mlx4_core/parameters and by checking what is set in /etc/modprobe.d/mlx4_core.conf). While such a large-scale job runs, I run 'vmstat 10' to examine memory usage, but there appears to be a good amount of memory still available and swap is never used. In terms of settings in /etc/security/limits.conf:

    * soft memlock unlimited
    * hard memlock unlimited
    * soft stack 300000
    * hard stack unlimited

I don't know whether btl_openib_connect_oob.c or mca_oob_tcp_msg_recv are clues, but I am now at a loss as to where the problem lies.

This is for an application using Open MPI 1.6.5, and the systems have Mellanox OFED 3.1.1 installed.

--john
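For reference, the registered-memory ceiling implied by those mlx4_core values, using the formula from the Open MPI FAQ entry Gus linked above (a quick sketch; it assumes the usual 4 KiB page size):

    # max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size   (Open MPI FAQ formula)
    log_num_mtt=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)            # 24 here
    log_mtts_per_seg=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)  # 0 here
    page_size=$(getconf PAGE_SIZE)                                             # typically 4096
    echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size / (1024*1024*1024) )) GiB   # 64 GiB with these values

So the device can register up to roughly 64 GiB with these settings, which supports the point above that memory registration is not the limit here (assuming the nodes carry no more RAM than that).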
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/06/29467.php