Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
I just tried that and it does indeed work with pbs and without Mellanox (until a reboot makes it complain about Mellanox/IB-related defaults, as no drivers were installed, etc.). After installing the Mellanox drivers, I used ./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no
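(For reference, the complete build sequence implied by that configure line typically looks like the sketch below. The configure flags are taken from the message; the make and install steps, and the parallelism value, are assumed standard practice rather than anything stated in the thread.)

./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no
make -j 16
sudo make install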

Re: [OMPI users] OpenMPI 4.0.2 with PGI 19.10, will not build with hcoll

2020-01-28 Thread Ray Muno via users
I opened a case with pgroup support regarding this. We are also using Slurm along with HCOLL. -Ray Muno On 1/26/20 5:52 AM, Åke Sandgren via users wrote: Note that when built against SLURM it will pick up pthread from libslurm.la too. On 1/26/20 4:37 AM, Gilles Gouaillardet via users wrote:
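(For reference, a configure line for this PGI + Slurm + HCOLL combination might look like the sketch below. The compiler names are the standard PGI 19.x drivers; the HCOLL path is an assumption based on the usual MLNX_OFED/HPC-X install location and is not quoted from the thread.)

./configure CC=pgcc CXX=pgc++ FC=pgfortran \
  --with-slurm --with-hcoll=/opt/mellanox/hcoll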

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Does it work with pbs but not Mellanox? Just trying to isolate the problem. On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users <users@lists.open-mpi.org> wrote: Hello, I have done some additional testing and I can say that it works correctly with gcc8 and no mellanox or pbs

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Also, can you try running: mpirun -np 128 hostname Josh On Tue, Jan 28, 2020 at 11:49 AM Joshua Ladd wrote: > I don't see how this can be diagnosed as a "problem with the Mellanox > Software". This is on a single node. What happens when you try to launch on > more than one node? > > Josh > >

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Josh - if you read through the thread, you will see that disabling Mellanox/IB drivers allows the program to run. It only fails when they are enabled. On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote: I don't see how this can be diagnosed as a "problem with the

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname Output: [Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs [Gen2Node3:54039] [[16643,0],0] orted_cmd: received exit cmd [Gen2Node3:54039] [[16643,0],0] orted_cmd: all routes and children gone - exiting

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Can you send the output of a failed run, including your command line? Josh On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <users@lists.open-mpi.org> wrote: > Okay, so this is a problem with the Mellanox software - copying Artem. > > On Jan 28, 2020, at 8:15 AM, Collin Strassburger >

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Here’s the I/O for these high local core count runs. (“xhpcg” is the standard hpcg benchmark) Run command: mpirun -np 128 bin/xhpcg Output: -- mpirun was unable to start the specified application as it encountered an error:

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
Interesting. Can you try: mpirun -np 128 --debug-daemons hostname Josh On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote: > In relation to the multi-node attempt, I haven’t yet set that up yet as > the per-node configuration doesn’t pass its tests (full

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Input: mpirun -np 128 --debug-daemons hostname Output: [Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs [Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd [Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone - exiting

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, so this is a problem with the Mellanox software - copying Artem. On Jan 28, 2020, at 8:15 AM, Collin Strassburger <cstrassbur...@bihrle.com> wrote: I just tried that and it does indeed work with pbs and without Mellanox (until a reboot makes it complain about Mellanox/IB-related

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
I don't see how this can be diagnosed as a "problem with the Mellanox Software". This is on a single node. What happens when you try to launch on more than one node? Josh On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote: > Here’s the I/O for these high

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
In relation to the multi-node attempt, I haven’t set that up yet, as the per-node configuration doesn’t pass its tests (full node utilization, etc). Here are the results for the hostname test: Input: mpirun -np 128 hostname Output:

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
I agree that it is odd that the issue does not appear until after the Mellanox drivers have been installed (and the configure flags set to use them). As requested, here are the results Input: mpirun -np 128 --mca odls_base_verbose 10 --mca state_base_verbose 10 hostname Output:

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Joshua Ladd via users
OK. Please try: mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname Josh On Tue, Jan 28, 2020 at 12:49 PM Collin Strassburger <cstrassbur...@bihrle.com> wrote: > Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname > > > > Output: > > [Gen2Node3:54039] [[16643,0],0]

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, debug-daemons isn't going to help as we aren't launching any daemons. This is all one node. So try adding "--mca odls_base_verbose 10 --mca state_base_verbose 10" to the cmd line and let's see what is going on. I agree with Josh - neither mpirun nor hostname are invoking the Mellanox

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Wonderful! I am happy to confirm that this resolves the issue! Many thanks to everyone for their assistance, Collin

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Collin Strassburger via users
Same result. (It works through 102 processes but not more than that.) Input: mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname Output: [Gen2Node3:54348] [[18008,0],0] orted_cmd: received add_local_procs [Gen2Node3:54348] [[18008,0],0] orted_cmd: received exit cmd [Gen2Node3:54348]

Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-28 Thread Ralph Castain via users
Okay, that nailed it down - the problem is that the number of open file descriptors is exceeding your system limit. I suspect the connection to the Mellanox drivers is solely due to them also having some descriptors open, and you are just close enough to the boundary that it causes you to hit it. See
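(For reference, checking and raising the per-process open-file-descriptor limit generally follows the pattern below; the value 65536 is illustrative rather than a number taken from the thread.)

ulimit -n            # show the current soft limit (commonly 1024)
ulimit -n 65536      # raise it for the current shell, up to the hard limit

To make the change persistent, add nofile entries in /etc/security/limits.conf (or a drop-in under /etc/security/limits.d/), for example:

*  soft  nofile  65536
*  hard  nofile  65536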