Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Gus Correa
On 06/15/2016 02:35 PM, Sasso, John (GE Power, Non-GE) wrote: Chuck, The per-process limits appear fine, including those for the resource mgr daemons: Limit Soft Limit Hard Limit Units Max address space unlimited unlimited

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Nathan Hjelm
ibv_devinfo -v -Nathan On Jun 15, 2016, at 12:43 PM, "Sasso, John (GE Power, Non-GE)" wrote: QUESTION: Since the error said the system may have run out of queue pairs, how do I determine the # of queue pairs the IB HCA can support? -Original Message- From:
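Expanding on Nathan's one-line answer: `ibv_devinfo -v` dumps the verbose device attributes of each HCA, which include the queue-pair limits. A sketch of how one might filter for them (the device name `mlx4_0` and the exact attribute names are assumptions; run `ibv_devinfo -l` to list your devices):

```shell
# Verbose device attributes for one HCA (requires the libibverbs utilities).
# max_qp is the total number of queue pairs the HCA supports; max_qp_wr is
# the maximum number of work requests per queue pair.
ibv_devinfo -v -d mlx4_0 | grep -E 'max_qp:|max_qp_wr:'
```

Comparing `max_qp` against the O(cores^2) connection count of a large all-to-all gives a quick sanity check on whether the job can exhaust the HCA.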

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Nathan Hjelm
You ran out of queue pairs. There is no way around this for larger all-to-all transfers when using the openib btl and SRQ. Need O(cores^2) QPs to fully connect with SRQ or PP QPs. I recommend using XRC instead by adding: btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512 to your
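The receive-queues value above is Nathan's; where to put it is a sketch under the usual Open MPI 1.x conventions (paths and the per-job flag syntax are assumptions, not from the original message):

```shell
# Option 1: per job, on the mpirun command line.
mpirun --mca btl_openib_receive_queues X,4096,1024:X,12288,512:X,65536,512 ./a.out

# Option 2: for all jobs, add the line from Nathan's message to
# $OMPI_PREFIX/etc/openmpi-mca-params.conf:
#   btl_openib_receive_queues = X,4096,1024:X,12288,512:X,65536,512
```

The leading `X` in each triple selects XRC queues, which share receive resources across processes on a node instead of requiring per-peer SRQ/PP queue pairs.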

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Sasso, John (GE Power, Non-GE)
QUESTION: Since the error said the system may have run out of queue pairs, how do I determine the # of queue pairs the IB HCA can support? -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Sasso, John (GE Power, Non-GE) Sent: Wednesday, June 15, 2016

[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Sasso, John (GE Power, Non-GE)
Chuck, The per-process limits appear fine, including those for the resource mgr daemons: Limit Soft Limit Hard Limit Units Max address space unlimited unlimited bytes Max core file size 0

Re: [OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Gus Correa
Hi John 1) For diagnostics, you could check the actual "per process" limits on the nodes while that big job is running: cat /proc/$PID/limits 2) If you're using a resource manager to launch the job, the resource manager daemon/daemons (local to the nodes) may have to set the memlock and
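Gus's first diagnostic can be scripted over every rank on a node. A minimal sketch, assuming the benchmark binary is named `IMB-MPI1` (substitute your own process name):

```shell
# On a compute node while the job is running: print the effective
# memlock and address-space limits of each MPI rank, as seen by the
# kernel, rather than the launching shell's ulimit values.
for pid in $(pgrep IMB-MPI1); do
    echo "== PID $pid =="
    grep -E 'Max locked memory|Max address space' "/proc/$pid/limits"
done
```

This matters because the limits inherited from the resource-manager daemon can differ from those in an interactive login shell.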

Re: [OMPI users] Client-Server Shared Memory Transport

2016-06-15 Thread Ralph Castain
Oh sure - just not shared memory > On Jun 15, 2016, at 8:29 AM, Louis Williams wrote: > > Ralph, thanks for the quick reply. Is cross-job fast transport like > InfiniBand supported? > > Louis > > On Tue, Jun 14, 2016 at 3:53 PM Ralph Castain

Re: [OMPI users] Client-Server Shared Memory Transport

2016-06-15 Thread Louis Williams
Ralph, thanks for the quick reply. Is cross-job fast transport like InfiniBand supported? Louis On Tue, Jun 14, 2016 at 3:53 PM Ralph Castain wrote: > Nope - we don’t currently support cross-job shared memory operations. > Nathan has talked about doing so for vader, but not

[OMPI users] "failed to create queue pair" problem, but settings appear OK

2016-06-15 Thread Sasso, John (GE Power, Non-GE)
In doing testing with IMB, I find that running a 4200+ core case with the IMB test Alltoall, and message lengths of 16..1024 bytes (as per -msglog 4:10 IMB option), it fails with: -- A process failed to create a queue pair.
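A hedged reconstruction of the failing run described above, for anyone reproducing it (the process count, hostfile name, and binary path are placeholders; only the benchmark, message-size range, and ~4200-core scale come from the report):

```shell
# IMB Alltoall at 4200+ ranks, restricted to 16..1024-byte messages.
# -msglog 4:10 selects message lengths 2^4 .. 2^10 bytes.
mpirun -np 4200 --hostfile hosts \
       ./IMB-MPI1 -msglog 4:10 Alltoall
```

At this scale the openib BTL's per-peer queue-pair demand is what triggers the "failed to create a queue pair" error discussed in the rest of the thread.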

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Llolsten Kaonga
Hello Mehmet, When we do OS installs, our lab usually just downloads the latest stable version of Open MPI. We try not to move versions of Open MPI we may already have lying around - mostly because we don't trust our book-keeping abilities. We have not had any trouble using this approach

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Llolsten Kaonga
Hello Sreenidhi, In our testing, we cannot use Mellanox OFED for compliance reasons. So, we use regular OFED. We test both Mellanox and Intel DUTs (NICs, switches, gateways, etc). I thank you. -- Llolsten From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Sreenidhi

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Peter Kjellström
On Wed, 15 Jun 2016 15:00:05 +0530 Sreenidhi Bharathkar Ramesh wrote: > hi Mehmet / Llolsten / Peter, > > Just curious to know what is the NIC or fabric you are using in your > respective clusters. > > If it is Mellanox, is it not better to use the

Re: [OMPI users] runtime error in orte/loop_spawn test using OMPI 1.10.2

2016-06-15 Thread Gilles Gouaillardet
Jason, How many nodes are you running on? Since you have an IB network, IB is used for intra-node communication between tasks that are not part of the same Open MPI job (read: spawn group). I can make a simple patch to use tcp instead of IB for these intra-node communications. Let me know if you
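While waiting for such a patch, one blunt workaround sketch is to exclude the openib BTL entirely for a test run (an assumption on my part, not Gilles' patch; note this also forces tcp for inter-node traffic, so it is only suitable for diagnosis):

```shell
# Restrict Open MPI to the tcp and self BTLs so the spawn test's
# cross-job intra-node traffic cannot touch openib queue pairs.
mpirun --mca btl tcp,self ./loop_spawn
```

If the loop_spawn failure disappears under tcp, that supports the diagnosis that openib's cross-job intra-node path is at fault.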

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Sreenidhi Bharathkar Ramesh
hi Mehmet / Llolsten / Peter, Just curious to know what is the NIC or fabric you are using in your respective clusters. If it is Mellanox, is it not better to use the MLNX_OFED ? This information may help us build our cluster. Hence, asking. Thanks, - Sreenidhi. On Wed, Jun 15, 2016 at 1:17

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Peter Kjellström
On Tue, 14 Jun 2016 13:18:33 -0400 "Llolsten Kaonga" wrote: > Hello Grigory, > > I am not sure what Redhat does exactly but when you install the OS, > there is always an InfiniBand Support module during the installation > process. We never check/install that module when we

Re: [OMPI users] Big jump from OFED 1.5.4.1 -> recent (stable). Any suggestions?

2016-06-15 Thread Peter Kjellström
On Tue, 14 Jun 2016 16:20:42 + Grigory Shamov wrote: > On 2016-06-14, 3:42 AM, "users on behalf of Peter Kjellström" > wrote: > > >On Mon, 13 Jun 2016 19:04:59 -0400 > >Mehmet Belgin