Thank you, Nathan.  Since the default btl_openib_receive_queues setting is:

P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

this would mean that, with max_qp = 392632 and the 4 QPs per connection defined 
above, the "actual" max would be 392632 / 4 = 98158.  Using this value in my 
prior math, the upper bound on the number of 24-core nodes would be 
98158 / 24^2 ~ 170.  This comes closer to the limit I encountered while testing. 
I'm sure there are other particulars I am not accounting for in this math, but 
the approximation is reasonable.
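
For the record, here is that arithmetic as a quick shell one-liner (4 = the 
number of queue specifications in the default btl_openib_receive_queues, 24 = 
cores per node, 392632 = max_qp reported by ibv_devinfo -v; integer division):

  $ echo $(( 392632 / 4 / (24 * 24) ))
  170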

Thanks for the clarification, Nathan!

--john

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
Sent: Thursday, June 16, 2016 9:56 AM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
settings appear OK

XRC support is greatly improved in 1.10.x and 2.0.0. It would be interesting to 
see whether a newer version fixes the shutdown hang.

When calculating the required number of queue pairs, you also have to divide by 
the number of queue pairs in the btl_openib_receive_queues parameter. 
Additionally, Open MPI uses 1 QP/rank for connections (1.7+), and there are some 
in use by IPoIB and other services.
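
As a rough illustration (ignoring the connection and IPoIB QPs), with the 
default four receive queues and 24 ranks per node, a full all-to-all across N 
nodes consumes on the order of 4 * 24 * 24 * (N - 1) QPs on each HCA, e.g. at 
N = 170:

  $ echo $(( 4 * 24 * 24 * (170 - 1) ))
  389376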

-Nathan

> On Jun 16, 2016, at 7:15 AM, Sasso, John (GE Power, Non-GE) 
> <john1.sa...@ge.com> wrote:
> 
> Nathan,
> 
> Thank you for the suggestion.  I tried your btl_openib_receive_queues setting 
> with a 4200+ core IMB job, and the job ran (great!).  However, the shutdown of 
> the job took so long that after 6 minutes I had to force-terminate it.
> 
> When I tried this scheme before, with the following settings recommended by 
> the Open MPI FAQ, I got some odd errors:
> 
> --mca btl openib,sm,self --mca btl_openib_receive_queues \
>     X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32
> 
> However, when I tried:
> 
> --mca btl openib,sm,self --mca btl_openib_receive_queues \
>     X,4096,1024:X,12288,512:X,65536,512
> 
> I got success with my aforementioned job.
> 
> I am going to do more testing, with the goal of getting a 5000 core job to 
> run successfully.  If I can, then down the road my concern is the impact the 
> btl_openib_receive_queues mca parameter (above) will have on lower-scale (< 
> 1024 cores) jobs, since the parameter change to the global openmpi config 
> file would impact ALL users and jobs of all scales.
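> 
> One option I am weighing is to scope the parameter per job rather than 
> globally, e.g. via Open MPI's standard OMPI_MCA_ environment-variable form in 
> the job script (a sketch; only the job that exports it would see the XRC 
> queues):
> 
> export OMPI_MCA_btl_openib_receive_queues=X,4096,1024:X,12288,512:X,65536,512
> # ...then run mpirun as usual for that job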
> 
> Chuck – as I noted in my first email, log_num_mtt was set fine, so that is 
> not the issue here.
> 
> Finally, regarding running out of QPs, I examined the output of 'ibv_devinfo 
> -v' on our compute nodes.  I see the following pertinent settings:
> 
>         max_qp:                         392632
>         max_qp_wr:                      16351
>         max_qp_rd_atom:                 16
>         max_qp_init_rd_atom:            128
>         max_cq:                         65408
>         max_cqe:                        4194303
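> 
> (For reference, these were pulled with a simple grep over the verbose device 
> query, e.g.:)
> 
> $ ibv_devinfo -v | grep -E 'max_qp|max_cq'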
> 
> Figuring that max_qp is the prime limitation I am running into when using the 
> PP and SRQ QPs, and considering 24 cores per node, this would seem to imply an 
> upper bound of 392632 / 24^2 ~ 681 nodes.  This does not make sense, because I 
> saw the QP creation failure error (again, NO error about failure to register 
> enough memory) with as few as 177 24-core nodes!  I don't know how to make 
> sense of this, though I don't question that we were running out of QPs.
> 
> --john
> 
> 
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan 
> Hjelm
> Sent: Wednesday, June 15, 2016 2:43 PM
> To: Open MPI Users
> Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, 
> but settings appear OK
> 
> You ran out of queue pairs. There is no way around this for larger all-to-all 
> transfers when using the openib btl and SRQ: you need O(cores^2) QPs to fully 
> connect with SRQ or PP QPs. I recommend using XRC instead by adding:
> 
> btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512
> 
> 
> to your openmpi-mca-params.conf
> 
> or
> 
> -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512
> 
> 
> to the mpirun command line.
> 
> 
> -Nathan
> 
> On Jun 15, 2016, at 12:35 PM, "Sasso, John (GE Power, Non-GE)" 
> <john1.sa...@ge.com> wrote:
> 
> Chuck,
> 
> The per-process limits appear fine, including those for the resource mgr 
> daemons:
> 
> Limit                   Soft Limit    Hard Limit    Units
> Max address space       unlimited     unlimited     bytes
> Max core file size      0             0             bytes
> Max cpu time            unlimited     unlimited     seconds
> Max data size           unlimited     unlimited     bytes
> Max file locks          unlimited     unlimited     locks
> Max file size           unlimited     unlimited     bytes
> Max locked memory       unlimited     unlimited     bytes
> Max msgqueue size       819200        819200        bytes
> Max nice priority       0             0
> Max open files          16384         16384         files
> Max pending signals     515625        515625        signals
> Max processes           515625        515625        processes
> Max realtime priority   0             0
> Max realtime timeout    unlimited     unlimited     us
> Max resident set        unlimited     unlimited     bytes
> Max stack size          307200000     unlimited     bytes
> 
> 
> 
> As for the FAQ regarding registered memory, checking our Open MPI settings 
> with ompi_info, we have:
> 
> mpool_rdma_rcache_size_limit = 0   ==> Open MPI will register as much user memory as necessary
> btl_openib_free_list_max = -1      ==> Open MPI will try to allocate as many registered buffers as it needs
> btl_openib_eager_rdma_num = 16
> btl_openib_max_eager_rdma = 16
> btl_openib_eager_limit = 12288
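> 
> (Those values came from ompi_info; something along these lines, with the grep 
> pattern purely illustrative:)
> 
> $ ompi_info --all | grep -E 'rcache_size_limit|free_list_max|eager'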
> 
> 
> Other suggestions welcome. Hitting a brick wall here. Thanks!
> 
> --john
> 
> 
> 
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus 
> Correa
> Sent: Wednesday, June 15, 2016 1:39 PM
> To: Open MPI Users
> Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, 
> but settings appear OK
> 
> Hi John
> 
> 1) For diagnostics, you could check the actual "per process" limits on the 
> nodes while that big job is running:
> 
> cat /proc/$PID/limits
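> 
> For example, on a compute node while the job is running (a sketch; it assumes 
> the IMB binary is named IMB-MPI1):
> 
> # show the memlock and stack limits of every running IMB rank
> for pid in $(pgrep -f IMB-MPI1); do
>     grep -H -e "locked memory" -e "stack size" /proc/$pid/limits
> done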
> 
> 2) If you're using a resource manager to launch the job, the resource manager 
> daemon/daemons (local to the nodes) may have to set the memlock and other 
> limits, so that the Open MPI processes inherit them.
> I use Torque, so I put these lines in the pbs_mom (Torque local daemon) 
> initialization script:
> 
> # pbs_mom system limits
> # max file descriptors
> ulimit -n 32768
> # locked memory
> ulimit -l unlimited
> # stacksize
> ulimit -s unlimited
> 
> 3) See also this FAQ related to registered memory.
> I set these parameters in /etc/modprobe.d/mlx4_core.conf, but where they're 
> set may depend on the Linux distro/release and the OFED you're using.
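> 
> For example, with the values reported elsewhere in this thread (log_num_mtt = 
> 24, log_mtts_per_seg = 0), the modprobe.d form would look like:
> 
> # /etc/modprobe.d/mlx4_core.conf -- size log_num_mtt per the FAQ below
> options mlx4_core log_num_mtt=24 log_mtts_per_seg=0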
> 
> https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
> 
> I hope this helps,
> Gus Correa
> 
> On 06/15/2016 11:05 AM, Sasso, John (GE Power, Non-GE) wrote:
> 
> 
> In testing with IMB, I find that a 4200+ core run of the Alltoall test, with 
> message lengths of 16..1024 bytes (as per the -msglog 4:10 IMB option), fails 
> with:
> 
> --------------------------------------------------------------------------
> A process failed to create a queue pair. This usually means either
> the device has run out of queue pairs (too many connections) or
> there are insufficient resources available to allocate a queue pair
> (out of memory). The latter can happen if either 1) insufficient
> memory is available, or 2) no more physical memory can be registered
> with the device.
> 
> For more information on memory registration see the Open MPI FAQs at:
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> 
>   Local host:      node7106
>   Local device:    mlx4_0
>   Queue pair type: Reliable connected (RC)
> --------------------------------------------------------------------------
> [node7106][[51922,1],0][connect/btl_openib_connect_oob.c:867:rml_recv_cb]
> error in endpoint reply start connect
> [node7106:06503] [[51922,0],0]-[[51922,1],0] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 6504 on
> node node7106 exiting improperly. There are two reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> 
> Yes, these are ALL of the error messages. I did not get a message about not 
> being able to register enough memory. I verified that log_num_mtt = 24 and 
> log_mtts_per_seg = 0 (by catting their files in 
> /sys/module/mlx4_core/parameters and checking what is set in 
> /etc/modprobe.d/mlx4_core.conf). While such a large-scale job runs, I run 
> 'vmstat 10' to examine memory usage, but there appears to be a good amount of 
> memory still available and swap is never used. As for the settings in 
> /etc/security/limits.conf:
> 
> * soft memlock unlimited
> * hard memlock unlimited
> * soft stack 300000
> * hard stack unlimited
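> 
> (For completeness, the module-parameter check mentioned above was simply:)
> 
> $ cat /sys/module/mlx4_core/parameters/log_num_mtt
> 24
> $ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
> 0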
> 
> I don't know if btl_openib_connect_oob.c or mca_oob_tcp_msg_recv are 
> clues, but I am now at a loss as to where the problem lies.
> 
> This is for an application using OpenMPI 1.6.5, and the systems have 
> Mellanox OFED 3.1.1 installed.
> 
> --john
> 
> 
> 
