I just tried that and it does indeed work with PBS and without Mellanox (until
a reboot makes it complain about Mellanox/IB related defaults as no drivers
were installed, etc).
After installing the Mellanox drivers, I used
./configure --prefix=/usr/ --with-tm=/opt/pbs/ --with-slurm=no
I opened a case with pgroup support regarding this.
We are also using Slurm along with HCOLL.
-Ray Muno
On 1/26/20 5:52 AM, Åke Sandgren via users wrote:
Note that when built against SLURM it will pick up pthread from
libslurm.la too.
On 1/26/20 4:37 AM, Gilles Gouaillardet via users wrote:
Does it work with PBS but not Mellanox? Just trying to isolate the problem.
On Jan 28, 2020, at 6:39 AM, Collin Strassburger via users
<users@lists.open-mpi.org> wrote:
Hello,
I have done some additional testing and I can say that it works correctly with
gcc8 and no Mellanox or PBS.
Also, can you try running:
mpirun -np 128 hostname
Josh
On Tue, Jan 28, 2020 at 11:49 AM Joshua Ladd wrote:
> I don't see how this can be diagnosed as a "problem with the Mellanox
> Software". This is on a single node. What happens when you try to launch on
> more than one node?
> Josh
Josh - if you read through the thread, you will see that disabling the
Mellanox/IB drivers allows the program to run. It only fails when they are
enabled.
On Jan 28, 2020, at 8:49 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
I don't see how this can be diagnosed as a "problem with the Mellanox
Software".
Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname
Output:
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54039] [[16643,0],0] orted_cmd: received exit cmd
[Gen2Node3:54039] [[16643,0],0] orted_cmd: all routes and children gone -
exiting
Can you send the output of a failed run, including your command line?
Josh
On Tue, Jan 28, 2020 at 11:26 AM Ralph Castain via users <
users@lists.open-mpi.org> wrote:
> Okay, so this is a problem with the Mellanox software - copying Artem.
>
> On Jan 28, 2020, at 8:15 AM, Collin Strassburger
> <cstrassbur...@bihrle.com> wrote:
Here’s the I/O for these high local core count runs. (“xhpcg” is the standard
HPCG benchmark.)
Run command: mpirun -np 128 bin/xhpcg
Output:
--
mpirun was unable to start the specified application as it encountered an
error:
Interesting. Can you try:
mpirun -np 128 --debug-daemons hostname
Josh
On Tue, Jan 28, 2020 at 12:14 PM Collin Strassburger <
cstrassbur...@bihrle.com> wrote:
> In relation to the multi-node attempt, I haven’t set that up yet as the
> per-node configuration doesn’t pass its tests (full node utilization, etc).
Input: mpirun -np 128 --debug-daemons hostname
Output:
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54023] [[16659,0],0] orted_cmd: received exit cmd
[Gen2Node3:54023] [[16659,0],0] orted_cmd: all routes and children gone -
exiting
Okay, so this is a problem with the Mellanox software - copying Artem.
On Jan 28, 2020, at 8:15 AM, Collin Strassburger
<cstrassbur...@bihrle.com> wrote:
I just tried that and it does indeed work with PBS and without Mellanox (until
a reboot makes it complain about Mellanox/IB related defaults as no drivers
were installed, etc).
I don't see how this can be diagnosed as a "problem with the Mellanox
Software". This is on a single node. What happens when you try to launch on
more than one node?
Josh
On Tue, Jan 28, 2020 at 11:43 AM Collin Strassburger <
cstrassbur...@bihrle.com> wrote:
> Here’s the I/O for these high local core count runs.
In relation to the multi-node attempt, I haven’t set that up yet as the
per-node configuration doesn’t pass its tests (full node utilization, etc).
Here are the results for the hostname test:
Input: mpirun -np 128 hostname
Output:
I agree that it is odd that the issue does not appear until after the Mellanox
drivers have been installed (and the configure flags set to use them). As
requested, here are the results:
Input: mpirun -np 128 --mca odls_base_verbose 10 --mca state_base_verbose 10
hostname
Output:
OK. Please try:
mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname
Josh
On Tue, Jan 28, 2020 at 12:49 PM Collin Strassburger <
cstrassbur...@bihrle.com> wrote:
> Input: mpirun -np 128 --debug-daemons -mca plm rsh hostname
>
>
>
> Output:
>
> [Gen2Node3:54039] [[16643,0],0] orted_cmd: received add_local_procs
Okay, debug-daemons isn't going to help as we aren't launching any daemons.
This is all one node. So try adding "--mca odls_base_verbose 10 --mca
state_base_verbose 10" to the cmd line and let's see what is going on.
I agree with Josh - neither mpirun nor hostname invokes the Mellanox drivers.
Wonderful! I am happy to confirm that this resolves the issue!
Many thanks to everyone for their assistance,
Collin
Same result. (It works through 102 but not greater than that.)
Input: mpirun -np 128 --debug-daemons --map-by ppr:64:socket hostname
Output:
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received add_local_procs
[Gen2Node3:54348] [[18008,0],0] orted_cmd: received exit cmd
[Gen2Node3:54348] [[18008,0],0] orted_cmd: all routes and children gone -
exiting
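Since launches succeed up to 102 processes but fail above that, one way to check whether descriptors are the bottleneck is to count how many a process has open. This is a sketch using Linux /proc; it inspects the current shell for illustration, and you would substitute the PID of mpirun or an orted daemon:

```shell
# Count the file descriptors currently open by a process; here we
# inspect the current shell ($$) -- substitute the PID of mpirun or
# an orted daemon to check the failing case (Linux-only, uses /proc)
ls /proc/$$/fd | wc -l

# The per-process limit that count is checked against
grep "open files" /proc/$$/limits
```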
Okay, that nailed it down - the problem is that the number of open file
descriptors is exceeding your system limit. I suspect the connection to the
Mellanox drivers is solely due to them also having some descriptors open, and
you are just close enough to the boundary that it causes you to hit it.
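If descriptor exhaustion is the cause, checking and raising the soft limit in the shell before launching is a quick way to confirm it. A sketch; the value 65536 is illustrative, not a recommendation from this thread:

```shell
# Show the current soft limit on open file descriptors
ulimit -n

# Show the hard limit (the ceiling a non-root user may raise to)
ulimit -Hn

# Raise the soft limit for this shell; an mpirun launched from the
# same shell inherits it (65536 is an illustrative value)
ulimit -n 65536
```

For a permanent change, the limit is typically set in /etc/security/limits.conf (or an equivalent for your distribution) rather than per shell.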
See