[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-16 Thread Sasso, John (GE Power, Non-GE)
Thank you, Nathan.  Since the default btl_openib_receive_queues setting is:

P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64

this would mean that, with max_qp = 392632 and the 4 QPs above, the "actual" max 
would be 392632 / 4 = 98158.  Using this value in my prior math, the upper bound 
on the number of 24-core nodes would be 98158 / 24^2 ~ 170.  This comes closer 
to the limit I encountered while testing.  I'm sure there are other particulars 
I am not accounting for in this math, but the approximation is reasonable.
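
(For what it's worth, the same back-of-envelope estimate as a small shell sketch; 
the 392632 and 4-QP figures are the ones quoted above, and 24 cores/node is just 
our node size, not something Open MPI reports:)

# rough upper bound on fully connected 24-core nodes
max_qp=392632        # from ibv_devinfo -v
qps_per_conn=4       # queue-pair specs in the default receive_queues value
cores_per_node=24
echo $(( (max_qp / qps_per_conn) / (cores_per_node * cores_per_node) ))   # ~170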

Thanks for the clarification, Nathan!

--john

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
Sent: Thursday, June 16, 2016 9:56 AM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
settings appear OK

XRC support is greatly improved in 1.10.x and 2.0.0. Would be interesting to 
see if a newer version fixed the shutdown hang.

When calculating the required number of queue pairs you also have to divide by 
the number of queue pairs in the btl_openib_receive_queues parameter. 
Additionally Open MPI uses 1 qp/rank for connections (1.7+) and there are some 
in use by IPoIB and other services.
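
(A quick way to read that per-connection count off the parameter itself -- a 
sketch using the default value quoted at the top of this thread:)

queues="P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64"
echo "$queues" | tr ':' '\n' | wc -l    # -> 4 queue pairs per connection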

-Nathan

> On Jun 16, 2016, at 7:15 AM, Sasso, John (GE Power, Non-GE) 
> <john1.sa...@ge.com> wrote:
> 
> Nathan,
> 
> Thank you for the suggestion.   I tried your btl_openib_receive_queues 
> setting with a 4200+ core IMB job, and the job ran (great!).   The shutdown 
> of the job took such a long time that after 6 minutes, I had to 
> force-terminate the job.
> 
> When I tried using this scheme before, with the following recommended by the 
> OpenMPI FAQ, I got some odd errors:
> 
> --mca btl openib,sm,self --mca btl_openib_receive_queues 
> X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32
> 
> However, when I tried:
> 
> --mca btl openib,sm,self --mca btl_openib_receive_queues 
> X,4096,1024:X,12288,512:X,65536,512
> 
> I got success with my aforementioned job.
> 
> I am going to do more testing, with the goal of getting a 5000 core job to 
> run successfully.  If I can, then down the road my concern is the impact the 
> btl_openib_receive_queues mca parameter (above) will have on lower-scale (< 
> 1024 cores) jobs, since the parameter change to the global openmpi config 
> file would impact ALL users and jobs of all scales.
> 
> Chuck – as I noted in my first email, log_num_mtt was set fine, so that is 
> not the issue here.
> 
> Finally, regarding running out of QPs, I examined the output of 'ibv_devinfo 
> -v' on our compute nodes.  I see the following pertinent settings:
> 
> max_qp:                 392632
> max_qp_wr:              16351
> max_qp_rd_atom:         16
> max_qp_init_rd_atom:    128
> max_cq:                 65408
> max_cqe:                4194303
> 
> Figuring that max_qp is the prime limitation I am running into when using the 
> PP and SRQ QPs, and considering 24 cores per node, this would seem to imply an 
> upper bound of 392632 / 24^2 ~ 681 nodes.  This does not make sense, because I 
> saw the QP creation failure error (again, NO error about failure to register 
> enough memory) with as few as 177 24-core nodes!  I don't know how to make 
> sense of this, though I don't question that we were running out of QPs.
> 
> --john
> 
> 
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan 
> Hjelm
> Sent: Wednesday, June 15, 2016 2:43 PM
> To: Open MPI Users
> Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, 
> but settings appear OK
> 
> You ran out of queue pairs. There is no way around this for larger all-to-all 
> transfers when using the openib btl and SRQ. Need O(cores^2) QPs to fully 
> connect with SRQ or PP QPs. I recommend using XRC instead by adding:
> 
> btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512
> 
> 
> to your openmpi-mca-params.conf
> 
> or
> 
> -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512
> 
> 
> to the mpirun command line.
> 
> 
> -Nathan
> 
> On Jun 15, 2016, at 12:35 PM, "Sasso, John (GE Power, Non-GE)" 
> <john1.sa...@ge.com> wrote:
> 
> Chuck,
> 
> The per-process limits appear fine, including those for the resource mgr 
> daemons:
> 
> Limit                     Soft Limit   Hard Limit   Units
> Max address space         unlimited    unlimited    bytes
> Max core file size        0            0            bytes
> Max cpu time              unlimited    unlimited    seconds
> Max data size             unlimited    unlimited    bytes
> Max file locks            unlimited    unlimited    locks

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-16 Thread Nathan Hjelm
XRC support is greatly improved in 1.10.x and 2.0.0. Would be interesting to 
see if a newer version fixed the shutdown hang.

When calculating the required number of queue pairs you also have to divide by 
the number of queue pairs in the btl_openib_receive_queues parameter. 
Additionally Open MPI uses 1 qp/rank for connections (1.7+) and there are some 
in use by IPoIB and other services.

-Nathan

> On Jun 16, 2016, at 7:15 AM, Sasso, John (GE Power, Non-GE) 
> <john1.sa...@ge.com> wrote:
> 
> Nathan,
> 
> Thank you for the suggestion.   I tried your btl_openib_receive_queues 
> setting with a 4200+ core IMB job, and the job ran (great!).   The shutdown 
> of the job took such a long time that after 6 minutes, I had to 
> force-terminate the job.
> 
> When I tried using this scheme before, with the following recommended by the 
> OpenMPI FAQ, I got some odd errors:
> 
> --mca btl openib,sm,self --mca btl_openib_receive_queues 
> X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32
> 
> However, when I tried:
> 
> --mca btl openib,sm,self --mca btl_openib_receive_queues 
> X,4096,1024:X,12288,512:X,65536,512
> 
> I got success with my aforementioned job.
> 
> I am going to do more testing, with the goal of getting a 5000 core job to 
> run successfully.  If I can, then down the road my concern is the impact the 
> btl_openib_receive_queues mca parameter (above) will have on lower-scale (< 
> 1024 cores) jobs, since the parameter change to the global openmpi config 
> file would impact ALL users and jobs of all scales.
> 
> Chuck – as I noted in my first email, log_num_mtt was set fine, so that is 
> not the issue here.
> 
> Finally, regarding running out of QPs, I examined the output of 'ibv_devinfo 
> -v' on our compute nodes.  I see the following pertinent settings:
> 
> max_qp:                 392632
> max_qp_wr:              16351
> max_qp_rd_atom:         16
> max_qp_init_rd_atom:    128
> max_cq:                 65408
> max_cqe:                4194303
> 
> Figuring that max_qp is the prime limitation I am running into when using the 
> PP and SRQ QPs, and considering 24 cores per node, this would seem to imply an 
> upper bound of 392632 / 24^2 ~ 681 nodes.  This does not make sense, because I 
> saw the QP creation failure error (again, NO error about failure to register 
> enough memory) with as few as 177 24-core nodes!  I don't know how to make 
> sense of this, though I don't question that we were running out of QPs.
> 
> --john
> 
> 
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
> Sent: Wednesday, June 15, 2016 2:43 PM
> To: Open MPI Users
> Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
> settings appear OK
> 
> You ran out of queue pairs. There is no way around this for larger all-to-all 
> transfers when using the openib btl and SRQ. Need O(cores^2) QPs to fully 
> connect with SRQ or PP QPs. I recommend using XRC instead by adding:
> 
> btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512
> 
> 
> to your openmpi-mca-params.conf
> 
> or
> 
> -mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512
> 
> 
> to the mpirun command line.
> 
> 
> -Nathan
> 
> On Jun 15, 2016, at 12:35 PM, "Sasso, John (GE Power, Non-GE)" 
> <john1.sa...@ge.com> wrote:
> 
> Chuck,
> 
> The per-process limits appear fine, including those for the resource mgr 
> daemons:
> 
> Limit Soft Limit Hard Limit Units
> Max address space unlimited unlimited bytes
> Max core file size 0 0 bytes
> Max cpu time unlimited unlimited seconds
> Max data size unlimited unlimited bytes
> Max file locks unlimited unlimited locks
> Max file size unlimited unlimited bytes
> Max locked memory unlimited unlimited bytes
> Max msgqueue size 819200 819200 bytes
> Max nice priority 0 0
> Max open files 16384 16384 files
> Max pending signals 515625 515625 signals
> Max processes 515625 515625 processes
> Max realtime priority 0 0
> Max realtime timeout unlimited unlimited us
> Max resident set unlimited unlimited bytes
> Max stack size 30720 unlimited bytes
> 
> 
> 
> As for the FAQ re registered memory, checking our OpenMPI settings with 
> ompi_info, we have:
> 
> mpool_rdma_rcache_size_limit = 0 ==> Open MPI will register as much user 
> memory as necessary
> btl_openib_free_list_max = -1 ==> Open MPI will try to allocate as many 
> registered buffers as it needs
> btl_openib_eager_rdma_num = 16
> btl_openib_max_eager_rdma = 16

[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-16 Thread Sasso, John (GE Power, Non-GE)
Nathan,

Thank you for the suggestion.   I tried your btl_openib_receive_queues setting 
with a 4200+ core IMB job, and the job ran (great!).   The shutdown of the job 
took such a long time that after 6 minutes, I had to force-terminate the job.

When I tried using this scheme before, with the following recommended by the 
OpenMPI FAQ, I got some odd errors:

--mca btl openib,sm,self --mca btl_openib_receive_queues 
X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32

However, when I tried:

--mca btl openib,sm,self --mca btl_openib_receive_queues 
X,4096,1024:X,12288,512:X,65536,512

I got success with my aforementioned job.

I am going to do more testing, with the goal of getting a 5000 core job to run 
successfully.  If I can, then down the road my concern is the impact the 
btl_openib_receive_queues mca parameter (above) will have on lower-scale (< 
1024 cores) jobs, since the parameter change to the global openmpi config file 
would impact ALL users and jobs of all scales.

Chuck - as I noted in my first email, log_num_mtt was set fine, so that is not 
the issue here.

Finally, regarding running out of QPs, I examined the output of 'ibv_devinfo 
-v' on our compute nodes.  I see the following pertinent settings:

max_qp:                 392632
max_qp_wr:              16351
max_qp_rd_atom:         16
max_qp_init_rd_atom:    128
max_cq:                 65408
max_cqe:                4194303

Figuring that max_qp is the prime limitation I am running into when using the 
PP and SRQ QPs, and considering 24 cores per node, this would seem to imply an 
upper bound of 392632 / 24^2 ~ 681 nodes.  This does not make sense, because I 
saw the QP creation failure error (again, NO error about failure to register 
enough memory) with as few as 177 24-core nodes!  I don't know how to make 
sense of this, though I don't question that we were running out of QPs.

--john


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Nathan Hjelm
Sent: Wednesday, June 15, 2016 2:43 PM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
settings appear OK

You ran out of queue pairs. There is no way around this for larger all-to-all 
transfers when using the openib btl and SRQ. Need O(cores^2) QPs to fully 
connect with SRQ or PP QPs. I recommend using XRC instead by adding:


btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512

to your openmpi-mca-params.conf

or

-mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512


to the mpirun command line.


-Nathan

On Jun 15, 2016, at 12:35 PM, "Sasso, John (GE Power, Non-GE)" 
<john1.sa...@ge.com<mailto:john1.sa...@ge.com>> wrote:
Chuck,

The per-process limits appear fine, including those for the resource mgr 
daemons:

Limit Soft Limit Hard Limit Units
Max address space unlimited unlimited bytes
Max core file size 0 0 bytes
Max cpu time unlimited unlimited seconds
Max data size unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max file size unlimited unlimited bytes
Max locked memory unlimited unlimited bytes
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max open files 16384 16384 files
Max pending signals 515625 515625 signals
Max processes 515625 515625 processes
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Max resident set unlimited unlimited bytes
Max stack size 30720 unlimited bytes



As for the FAQ re registered memory, checking our OpenMPI settings with 
ompi_info, we have:

mpool_rdma_rcache_size_limit = 0 ==> Open MPI will register as much user memory 
as necessary
btl_openib_free_list_max = -1 ==> Open MPI will try to allocate as many 
registered buffers as it needs
btl_openib_eager_rdma_num = 16
btl_openib_max_eager_rdma = 16
btl_openib_eager_limit = 12288


Other suggestions welcome. Hitting a brick wall here. Thanks!

--john



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 15, 2016 1:39 PM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
settings appear OK

Hi John

1) For diagnostic, you could check the actual "per process" limits on the nodes 
while that big job is running:

cat /proc/$PID/limits

2) If you're using a resource manager to launch the job, the resource manager 
daemons (local to the nodes) may have to set the memlock and other limits, so 
that the Open MPI processes inherit them.
I use Torque, so I put these lines in the pbs_mom (Torque local daemon) 
initialization script:

# pbs_mom system limits
# max file descriptors
ulimit -n 32768
# locked memory
ulimit -l unlimited
# stacksize
ulimit -s unlimited

3) See also this FAQ related to registered memory.
I set these parameters in /etc/modprobe.d/mlx4_core.conf, but where they're set 
may depend on the Linux distro/release and the OFED you're using.

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Gus Correa

On 06/15/2016 02:35 PM, Sasso, John (GE Power, Non-GE) wrote:

Chuck,

The per-process limits appear fine, including those for the resource mgr 
daemons:

Limit                     Soft Limit   Hard Limit   Units
Max address space         unlimited    unlimited    bytes
Max core file size        0            0            bytes
Max cpu time              unlimited    unlimited    seconds
Max data size             unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max file size             unlimited    unlimited    bytes
Max locked memory         unlimited    unlimited    bytes
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max open files            16384        16384        files
Max pending signals       515625       515625       signals
Max processes             515625       515625       processes
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us
Max resident set          unlimited    unlimited    bytes
Max stack size            30720        unlimited    bytes



As for the FAQ re registered memory, checking our OpenMPI settings with 
ompi_info, we have:



Hi John

The FAQ item I referred to (#18 in the section on tuning run-time MPI over 
OpenFabrics) concerns the OFED kernel module parameters log_num_mtt and 
log_mtts_per_seg, not the openib btl MCA parameters.
They may default to less-than-optimal values.

https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

Gus Correa (not Chuck!)


mpool_rdma_rcache_size_limit = 0  ==> Open MPI will register as much user 
memory as necessary
btl_openib_free_list_max = -1 ==> Open MPI will try to allocate as many 
registered buffers as it needs
btl_openib_eager_rdma_num = 16
btl_openib_max_eager_rdma = 16
btl_openib_eager_limit = 12288


Other suggestions welcome.   Hitting a brick wall here.  Thanks!

--john



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 15, 2016 1:39 PM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
settings appear OK

Hi John

1) For diagnostic, you could check the actual "per process" limits on the nodes 
while that big job is running:

cat /proc/$PID/limits

2) If you're using a resource manager to launch the job, the resource manager 
daemons (local to the nodes) may have to set the memlock and other limits, so 
that the Open MPI processes inherit them.
I use Torque, so I put these lines in the pbs_mom (Torque local daemon) 
initialization script:

# pbs_mom system limits
# max file descriptors
ulimit -n 32768
# locked memory
ulimit -l unlimited
# stacksize
ulimit -s unlimited

3) See also this FAQ related to registered memory.
I set these parameters in /etc/modprobe.d/mlx4_core.conf, but where they're set 
may depend on the Linux distro/release and the OFED you're using.

https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

I hope this helps,
Gus Correa

On 06/15/2016 11:05 AM, Sasso, John (GE Power, Non-GE) wrote:


In doing testing with IMB, I find that running a 4200+ core case with
the IMB test Alltoall, and message lengths of 16..1024 bytes (as per
-msglog 4:10 IMB option), it fails with:

--


A process failed to create a queue pair. This usually means either

the device has run out of queue pairs (too many connections) or

there are insufficient resources available to allocate a queue pair

(out of memory). The latter can happen if either 1) insufficient

memory is available, or 2) no more physical memory can be registered

with the device.

For more information on memory registration see the Open MPI FAQs at:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: node7106

Local device:   mlx4_0

Queue pair type:    Reliable connected (RC)

--


[node7106][[51922,1],0][connect/btl_openib_connect_oob.c:867:rml_recv_
cb]
error in endpoint reply start connect

[node7106:06503] [[51922,0],0]-[[51922,1],0] mca_oob_

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Nathan Hjelm

ibv_devinfo -v
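
(e.g., to pull out just the QP/CQ-related capabilities from that output -- a 
trivial sketch:)

ibv_devinfo -v | grep -Ei 'max_(qp|cq|srq)'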

-Nathan

On Jun 15, 2016, at 12:43 PM, "Sasso, John (GE Power, Non-GE)" 
<john1.sa...@ge.com> wrote:

QUESTION: Since the error said the system may have run out of queue pairs, how 
do I determine the # of queue pairs the IB HCA can support?


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Sasso, John (GE 
Power, Non-GE)
Sent: Wednesday, June 15, 2016 2:35 PM
To: Open MPI Users
Subject: EXT: [OMPI users] "failed to create queue pair" problem, but settings 
appear OK

Chuck, 


The per-process limits appear fine, including those for the resource mgr 
daemons:

Limit Soft Limit Hard Limit Units 
Max address space unlimited unlimited bytes 
Max core file size 0 0 bytes 
Max cpu time unlimited unlimited seconds 
Max data size unlimited unlimited bytes 
Max file locks unlimited unlimited locks 
Max file size unlimited unlimited bytes 
Max locked memory unlimited unlimited bytes 
Max msgqueue size 819200 819200 bytes 
Max nice priority 0 0 
Max open files 16384 16384 files 
Max pending signals 515625 515625 signals 
Max processes 515625 515625 processes 
Max realtime priority 0 0 
Max realtime timeout unlimited unlimited us 
Max resident set unlimited unlimited bytes 
Max stack size 30720 unlimited bytes 




As for the FAQ re registered memory, checking our OpenMPI settings with 
ompi_info, we have:

mpool_rdma_rcache_size_limit = 0 ==> Open MPI will register as much user memory as necessary 
btl_openib_free_list_max = -1 ==> Open MPI will try to allocate as many registered buffers as it needs
btl_openib_eager_rdma_num = 16 
btl_openib_max_eager_rdma = 16 
btl_openib_eager_limit = 12288 



Other suggestions welcome. Hitting a brick wall here. Thanks!

--john



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 15, 2016 1:39 PM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
settings appear OK

Hi John

1) For diagnostic, you could check the actual "per process" limits on the nodes 
while that big job is running:

cat /proc/$PID/limits

2) If you're using a resource manager to launch the job, the resource manager 
daemons (local to the nodes) may have to set the memlock and other limits, so 
that the Open MPI processes inherit them.
I use Torque, so I put these lines in the pbs_mom (Torque local daemon) 
initialization script:

# pbs_mom system limits
# max file descriptors
ulimit -n 32768
# locked memory
ulimit -l unlimited
# stacksize
ulimit -s unlimited

3) See also this FAQ related to registered memory.
I set these parameters in /etc/modprobe.d/mlx4_core.conf, but where they're set 
may depend on the Linux distro/release and the OFED you're using.

https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem


I hope this helps,
Gus Correa

On 06/15/2016 11:05 AM, Sasso, John (GE Power, Non-GE) wrote:

In doing testing with IMB, I find that running a 4200+ core case with
the IMB test Alltoall, and message lengths of 16..1024 bytes (as per
-msglog 4:10 IMB option), it fails with:

--


A process failed to create a queue pair. This usually means either

the device has run out of queue pairs (too many connections) or

there are insufficient resources available to allocate a queue pair

(out of memory). The latter can happen if either 1) insufficient

memory is available, or 2) no more physical memory can be registered

with the device.

For more information on memory registration see the Open MPI FAQs at:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: node7106

Local device: mlx4_0

Queue pair type: Reliable connected (RC)

--


[node7106][[51922,1],0][connect/btl_openib_connect_oob.c:867:rml_recv_
cb]
error in endpoint reply start connect

[node7106:06503] [[51922,0],0]-[[51922,1],0] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)

--


mpirun has exited due to process rank 0 with PID 6504 on

node node7106 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in

the job did. This can cause a job to hang indefinitely while it waits

for all

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Nathan Hjelm

You ran out of queue pairs. There is no way around this for larger all-to-all 
transfers when using the openib btl and SRQ. Need O(cores^2) QPs to fully 
connect with SRQ or PP QPs. I recommend using XRC instead by adding:

btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512

to your openmpi-mca-params.conf

or

-mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512

to the mpirun command line.
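
(For example, a full command line might look like the following -- a sketch 
only; the btl list mirrors the openib,sm,self selection used elsewhere in this 
thread, and the rank count and benchmark arguments are placeholders:)

mpirun -np 4224 \
  --mca btl openib,sm,self \
  --mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 \
  ./IMB-MPI1 -msglog 4:10 Alltoall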

-Nathan

On Jun 15, 2016, at 12:35 PM, "Sasso, John (GE Power, Non-GE)" 
<john1.sa...@ge.com> wrote:

Chuck, 


The per-process limits appear fine, including those for the resource mgr 
daemons:

Limit Soft Limit Hard Limit Units 
Max address space unlimited unlimited bytes 
Max core file size 0 0 bytes 
Max cpu time unlimited unlimited seconds 
Max data size unlimited unlimited bytes 
Max file locks unlimited unlimited locks 
Max file size unlimited unlimited bytes 
Max locked memory unlimited unlimited bytes 
Max msgqueue size 819200 819200 bytes 
Max nice priority 0 0 
Max open files 16384 16384 files 
Max pending signals 515625 515625 signals 
Max processes 515625 515625 processes 
Max realtime priority 0 0 
Max realtime timeout unlimited unlimited us 
Max resident set unlimited unlimited bytes 
Max stack size 30720 unlimited bytes 




As for the FAQ re registered memory, checking our OpenMPI settings with 
ompi_info, we have:

mpool_rdma_rcache_size_limit = 0 ==> Open MPI will register as much user memory as necessary 
btl_openib_free_list_max = -1 ==> Open MPI will try to allocate as many registered buffers as it needs
btl_openib_eager_rdma_num = 16 
btl_openib_max_eager_rdma = 16 
btl_openib_eager_limit = 12288 



Other suggestions welcome. Hitting a brick wall here. Thanks!

--john



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 15, 2016 1:39 PM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
settings appear OK

Hi John

1) For diagnostic, you could check the actual "per process" limits on the nodes 
while that big job is running:

cat /proc/$PID/limits

2) If you're using a resource manager to launch the job, the resource manager 
daemons (local to the nodes) may have to set the memlock and other limits, so 
that the Open MPI processes inherit them.
I use Torque, so I put these lines in the pbs_mom (Torque local daemon) 
initialization script:

# pbs_mom system limits
# max file descriptors
ulimit -n 32768
# locked memory
ulimit -l unlimited
# stacksize
ulimit -s unlimited

3) See also this FAQ related to registered memory.
I set these parameters in /etc/modprobe.d/mlx4_core.conf, but where they're set 
may depend on the Linux distro/release and the OFED you're using.

https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem


I hope this helps,
Gus Correa

On 06/15/2016 11:05 AM, Sasso, John (GE Power, Non-GE) wrote:

In doing testing with IMB, I find that running a 4200+ core case with
the IMB test Alltoall, and message lengths of 16..1024 bytes (as per
-msglog 4:10 IMB option), it fails with:

--


A process failed to create a queue pair. This usually means either

the device has run out of queue pairs (too many connections) or

there are insufficient resources available to allocate a queue pair

(out of memory). The latter can happen if either 1) insufficient

memory is available, or 2) no more physical memory can be registered

with the device.

For more information on memory registration see the Open MPI FAQs at:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: node7106

Local device: mlx4_0

Queue pair type: Reliable connected (RC)

--


[node7106][[51922,1],0][connect/btl_openib_connect_oob.c:867:rml_recv_
cb]
error in endpoint reply start connect

[node7106:06503] [[51922,0],0]-[[51922,1],0] mca_oob_tcp_msg_recv:
readv failed: Connection reset by peer (104)

--


mpirun has exited due to process rank 0 with PID 6504 on

node node7106 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in

the job did. This can cause a job to hang indefinitely while it waits

for all processes

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Sasso, John (GE Power, Non-GE)
QUESTION:   Since the error said the system may have run out of queue pairs, 
how do I determine the # of queue pairs the IB HCA can support?


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Sasso, John (GE 
Power, Non-GE)
Sent: Wednesday, June 15, 2016 2:35 PM
To: Open MPI Users
Subject: EXT: [OMPI users] "failed to create queue pair" problem, but settings 
appear OK

Chuck, 

The per-process limits appear fine, including those for the resource mgr 
daemons:

Limit                     Soft Limit   Hard Limit   Units
Max address space         unlimited    unlimited    bytes
Max core file size        0            0            bytes
Max cpu time              unlimited    unlimited    seconds
Max data size             unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max file size             unlimited    unlimited    bytes
Max locked memory         unlimited    unlimited    bytes
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max open files            16384        16384        files
Max pending signals       515625       515625       signals
Max processes             515625       515625       processes
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us
Max resident set          unlimited    unlimited    bytes
Max stack size            30720        unlimited    bytes



As for the FAQ re registered memory, checking our OpenMPI settings with 
ompi_info, we have:

mpool_rdma_rcache_size_limit = 0  ==> Open MPI will register as much user 
memory as necessary 
btl_openib_free_list_max = -1 ==> Open MPI will try to allocate as many 
registered buffers as it needs
btl_openib_eager_rdma_num = 16 
btl_openib_max_eager_rdma = 16 
btl_openib_eager_limit = 12288   


Other suggestions welcome.   Hitting a brick wall here.  Thanks!

--john



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 15, 2016 1:39 PM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
settings appear OK

Hi John

1) For diagnostic, you could check the actual "per process" limits on the nodes 
while that big job is running:

cat /proc/$PID/limits

2) If you're using a resource manager to launch the job, the resource manager 
daemons (local to the nodes) may have to set the memlock and other limits, so 
that the Open MPI processes inherit them.
I use Torque, so I put these lines in the pbs_mom (Torque local daemon) 
initialization script:

# pbs_mom system limits
# max file descriptors
ulimit -n 32768
# locked memory
ulimit -l unlimited
# stacksize
ulimit -s unlimited

3) See also this FAQ related to registered memory.
I set these parameters in /etc/modprobe.d/mlx4_core.conf, but where they're set 
may depend on the Linux distro/release and the OFED you're using.

https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

I hope this helps,
Gus Correa

On 06/15/2016 11:05 AM, Sasso, John (GE Power, Non-GE) wrote:
>
> In doing testing with IMB, I find that running a 4200+ core case with 
> the IMB test Alltoall, and message lengths of 16..1024 bytes (as per 
> -msglog 4:10 IMB option), it fails with:
>
> --
> 
>
> A process failed to create a queue pair. This usually means either
>
> the device has run out of queue pairs (too many connections) or
>
> there are insufficient resources available to allocate a queue pair
>
> (out of memory). The latter can happen if either 1) insufficient
>
> memory is available, or 2) no more physical memory can be registered
>
> with the device.
>
> For more information on memory registration see the Open MPI FAQs at:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>

[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Sasso, John (GE Power, Non-GE)
Chuck, 

The per-process limits appear fine, including those for the resource mgr 
daemons:

Limit                     Soft Limit   Hard Limit   Units
Max address space         unlimited    unlimited    bytes
Max core file size        0            0            bytes
Max cpu time              unlimited    unlimited    seconds
Max data size             unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max file size             unlimited    unlimited    bytes
Max locked memory         unlimited    unlimited    bytes
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max open files            16384        16384        files
Max pending signals       515625       515625       signals
Max processes             515625       515625       processes
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us
Max resident set          unlimited    unlimited    bytes
Max stack size            30720        unlimited    bytes



As for the FAQ re registered memory, checking our OpenMPI settings with 
ompi_info, we have:

mpool_rdma_rcache_size_limit = 0  ==> Open MPI will register as much user 
memory as necessary 
btl_openib_free_list_max = -1 ==> Open MPI will try to allocate as many 
registered buffers as it needs
btl_openib_eager_rdma_num = 16 
btl_openib_max_eager_rdma = 16 
btl_openib_eager_limit = 12288   
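
(Those values came straight from ompi_info; something like the following against 
our 1.6.5 install should reproduce them -- a sketch, parameter names may differ 
in other versions:)

ompi_info --param btl openib | grep -E 'receive_queues|free_list_max|eager'
ompi_info --param mpool rdma | grep rcache_size_limit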


Other suggestions welcome.   Hitting a brick wall here.  Thanks!

--john



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 15, 2016 1:39 PM
To: Open MPI Users
Subject: EXT: Re: [OMPI users] "failed to create queue pair" problem, but 
settings appear OK

Hi John

1) For diagnostic, you could check the actual "per process" limits on the nodes 
while that big job is running:

cat /proc/$PID/limits

2) If you're using a resource manager to launch the job, the resource manager 
daemons (local to the nodes) may have to set the memlock and other limits, so 
that the Open MPI processes inherit them.
I use Torque, so I put these lines in the pbs_mom (Torque local daemon) 
initialization script:

# pbs_mom system limits
# max file descriptors
ulimit -n 32768
# locked memory
ulimit -l unlimited
# stacksize
ulimit -s unlimited

3) See also this FAQ related to registered memory.
I set these parameters in /etc/modprobe.d/mlx4_core.conf, but where they're set 
may depend on the Linux distro/release and the OFED you're using.

https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

I hope this helps,
Gus Correa

On 06/15/2016 11:05 AM, Sasso, John (GE Power, Non-GE) wrote:
>
> In doing testing with IMB, I find that running a 4200+ core case with 
> the IMB test Alltoall, and message lengths of 16..1024 bytes (as per 
> -msglog 4:10 IMB option), it fails with:
>
> --
> 
>
> A process failed to create a queue pair. This usually means either
>
> the device has run out of queue pairs (too many connections) or
>
> there are insufficient resources available to allocate a queue pair
>
> (out of memory). The latter can happen if either 1) insufficient
>
> memory is available, or 2) no more physical memory can be registered
>
> with the device.
>
> For more information on memory registration see the Open MPI FAQs at:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> Local host: node7106
>
> Local device:   mlx4_0
>
> Queue pair type:    Reliable connected (RC)
>
> --
> 
>
> [node7106][[51922,1],0][connect/btl_openib_connect_oob.c:867:rml_recv_
> cb]
> error in endpoint reply start connect
>
> [node7106:06503] [[51922,0],0]-[[51922,1],0] mca_oob_tcp_msg_recv: 
> readv failed: Connection reset by peer (104)
>
> ---

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Gus Correa

Hi John

1) For diagnostic, you could check the actual "per process" limits on 
the nodes while that big job is running:


cat /proc/$PID/limits

2) If you're using a resource manager to launch the job,
the resource manager daemons (local to the nodes) may have to
set the memlock and other limits, so that the Open MPI processes
inherit them.
I use Torque, so I put these lines in the pbs_mom (Torque local daemon) 
initialization script:


# pbs_mom system limits
# max file descriptors
ulimit -n 32768
# locked memory
ulimit -l unlimited
# stacksize
ulimit -s unlimited

3) See also this FAQ related to registered memory.
I set these parameters in /etc/modprobe.d/mlx4_core.conf,
but where they're set may depend on the Linux distro/release and the 
OFED you're using.


https://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
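
(As an illustration only -- the right values depend on your RAM and page size; 
per that FAQ, registerable memory scales as (2^log_num_mtt) * (2^log_mtts_per_seg) 
* page_size:)

# /etc/modprobe.d/mlx4_core.conf -- example values, adjust per the FAQ
# with log_num_mtt=24, log_mtts_per_seg=3 and 4 KiB pages:
#   2^24 * 2^3 * 4096 bytes = 512 GiB registerable
options mlx4_core log_num_mtt=24 log_mtts_per_seg=3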

I hope this helps,
Gus Correa

On 06/15/2016 11:05 AM, Sasso, John (GE Power, Non-GE) wrote:


In doing testing with IMB, I find that running a 4200+ core case with 
the IMB test Alltoall, and message lengths of 16..1024 bytes (as per 
-msglog 4:10 IMB option), it fails with:


--

A process failed to create a queue pair. This usually means either

the device has run out of queue pairs (too many connections) or

there are insufficient resources available to allocate a queue pair

(out of memory). The latter can happen if either 1) insufficient

memory is available, or 2) no more physical memory can be registered

with the device.

For more information on memory registration see the Open MPI FAQs at:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: node7106

Local device:   mlx4_0

Queue pair type:    Reliable connected (RC)

--

[node7106][[51922,1],0][connect/btl_openib_connect_oob.c:867:rml_recv_cb] 
error in endpoint reply start connect


[node7106:06503] [[51922,0],0]-[[51922,1],0] mca_oob_tcp_msg_recv: 
readv failed: Connection reset by peer (104)


--

mpirun has exited due to process rank 0 with PID 6504 on

node node7106 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in

the job did. This can cause a job to hang indefinitely while it waits

for all processes to call "init". By rule, if one process calls "init",

then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".

By rule, all processes that call "init" MUST call "finalize" prior to

exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be

terminated by signals sent by mpirun (as reported here).

--

Yes, these are ALL of the error messages. I did not get a message 
about not being able to register enough memory.   I verified that 
log_num_mtt = 24  and log_mtts_per_seg = 0 (via catting of their files 
in /sys/module/mlx4_core/parameters and what is set in 
/etc/modprobe.d/mlx4_core.conf).  While such a large-scale job runs, I 
run ‘vmstat 10’ to examine memory usage, but there appears to be a 
good amount of memory still available and swap is never used.   In 
terms of settings in /etc/security/limits.conf:


* soft memlock  unlimited

* hard memlock  unlimited

* soft stack 30

* hard stack unlimited

I don’t know if btl_openib_connect_oob.c or mca_oob_tcp_msg_recv are 
clues, but I am now at a loss as to where the problem lies.


This is for an application using OpenMPI 1.6.5, and the systems have 
Mellanox OFED 3.1.1 installed.


--john







[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Sasso, John (GE Power, Non-GE)
In doing testing with IMB, I find that running a 4200+ core case with the IMB 
test Alltoall, and message lengths of 16..1024 bytes (as per -msglog 4:10 IMB 
option), it fails with:

--
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: node7106
Local device:   mlx4_0
Queue pair type:    Reliable connected (RC)
--
[node7106][[51922,1],0][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error 
in endpoint reply start connect
[node7106:06503] [[51922,0],0]-[[51922,1],0] mca_oob_tcp_msg_recv: readv 
failed: Connection reset by peer (104)
--
mpirun has exited due to process rank 0 with PID 6504 on
node node7106 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--

Yes, these are ALL of the error messages.  I did not get a message about not 
being able to register enough memory.   I verified that log_num_mtt = 24  and 
log_mtts_per_seg = 0 (via catting of their files in 
/sys/module/mlx4_core/parameters and what is set in 
/etc/modprobe.d/mlx4_core.conf).  While such a large-scale job runs, I run 
'vmstat 10' to examine memory usage, but there appears to be a good amount of 
memory still available and swap is never used.   In terms of settings in 
/etc/security/limits.conf:

* soft memlock  unlimited
* hard memlock  unlimited
* soft stack 30
* hard stack unlimited
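
(A quick way to double-check those module parameters and the effective shell 
limits on a compute node -- a sketch, assuming the mlx4 driver:)

cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
ulimit -a | grep -E 'locked memory|stack size'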

I don't know if btl_openib_connect_oob.c or mca_oob_tcp_msg_recv are clues, but 
I am now at a loss as to where the problem lies.

This is for an application using OpenMPI 1.6.5, and the systems have Mellanox 
OFED 3.1.1 installed.

--john