Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

2022-02-07 Thread Diego Zuccato via users

Sorry for the late answer.

I thought the same, but after more testing I no longer do: re-running
the same code on the same data, on the same node, with the same parameters
sometimes works and sometimes doesn't.

The user says it works (reliably) unmodified on other clusters.
We'll try contacting the Gadget2 authors, too.
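
For anyone trying to reproduce this, the "sometimes works, sometimes doesn't" behaviour could be put into numbers before reporting it upstream. The following is only a minimal sketch: the helper name `tally_exit_codes` is made up for illustration, and it assumes that a failed Gadget2 run ends with a nonzero exit status, which should be checked on the actual cluster.

```python
import subprocess
import sys
from collections import Counter


def tally_exit_codes(cmd, runs=10, timeout=None):
    """Run `cmd` `runs` times and count how often each exit code occurs.

    `cmd` is a list such as ["srun", "--mpi=pmix_v4", "./Gadget2", "paramfile"].
    """
    counts = Counter()
    for _ in range(runs):
        # Output is discarded; only the exit code of each run is recorded.
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        counts[result.returncode] += 1
    return counts


def report(counts):
    """Print one line per exit code seen, e.g. 'exit code 0: 7 runs'."""
    for code, n in sorted(counts.items()):
        print(f"exit code {code}: {n} runs")


# Hypothetical usage inside a job script (not executed here):
#   report(tally_exit_codes(["srun", "--mpi=pmix_v4", "./Gadget2", "paramfile"], runs=20))
```

Running the same tally on a node that "always fails" and on one that fails only sometimes would give a failure rate per node that can be passed on to the Gadget2 authors.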

On 27/01/2022 14:52, Jeff Squyres (jsquyres) wrote:

I'm afraid that without any further details, it's hard to help. I don't know 
why Gadget2 would complain about its parameters file.  From what you've stated, 
it could be a problem with the application itself.

Have you talked to the Gadget2 authors?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Diego Zuccato via users 

Sent: Wednesday, January 26, 2022 2:06 AM
To: users@lists.open-mpi.org
Cc: Diego Zuccato
Subject: Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

On 26/01/2022 02:10, Jeff Squyres (jsquyres) via users wrote:


I'm afraid I don't know anything about Gadget, so I can't comment there.  How 
exactly does the application fail?

Neither did I :(
It fails saying a 'timestep' is 0, and that's usually caused by an error
in the parameters file. But the parameters file is OK, and it actually
works if the user runs it in a single process. Or even with
multithreaded runs, sometimes and on some nodes. That's quite random :(
But the runs are usually single-node (simple examples for students).


Can you try upgrading to Open MPI v4.1.2?

That would be a real mess. I'm stuck with packages provided by Debian
stable. I lack both the manpower and the knowledge to compile everything
from scratch, given the intricate relations between slurm, openmpi,
infiniband, etc. :(


What networking are you using?

Infiniband (Mellanox cards, w/ Debian-supplied drivers and support
programs) and ethernet. Infiniband is also used by IPoIB to reach the
storage servers (gluster). Some nodes lack IB, so access to the storage
is achieved by a couple of iptables rules.



From: users  on behalf of Diego Zuccato via users 

Sent: Tuesday, January 25, 2022 5:43 AM
To: Open MPI Users
Cc: Diego Zuccato
Subject: [OMPI users] Gadget2 error 818 when using more than 1 process?

Hello all.

A user of our cluster is experiencing a weird problem that I can't pinpoint.

He has a job script that worked well on every node. It's based on
Gadget2.

Lately, *sometimes*, the same executable with the same parameters file
works, sometimes it fails. On the same node and submitting with the same
command. On some nodes it always fails. But if it gets reduced to
sequential (asking for just one process), it completes correctly (so the
parameters file, common source of Gadget2 error 818, seems innocent).

The cluster uses SLURM and limits resources using cgroups, if that matters.

Seems most of the issues started after upgrading from openmpi 3.1.3 to
4.1.0 in September.

Maybe related, the nodes started spitting out these warnings (that IIUC
should be harmless... but I'd like to debug & resolve anyway):
-8<--
Open MPI's OFI driver detected multiple equidistant NICs from the
current process, but had insufficient information to ensure MPI
processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is
necessary to resolve this issue.
-8<--

Code is run (from the jobfile) with:
srun --mpi=pmix_v4 ./Gadget2 paramfile
(we also tried with a simple mpirun w/ no extra parameters leveraging
SLURM's integration/autodetection -- same result)

Any hints?

TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

2022-01-27 Thread Jeff Squyres (jsquyres) via users
I'm afraid that without any further details, it's hard to help. I don't know 
why Gadget2 would complain about its parameters file.  From what you've stated, 
it could be a problem with the application itself.

Have you talked to the Gadget2 authors?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Diego Zuccato via 
users 
Sent: Wednesday, January 26, 2022 2:06 AM
To: users@lists.open-mpi.org
Cc: Diego Zuccato
Subject: Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

On 26/01/2022 02:10, Jeff Squyres (jsquyres) via users wrote:

> I'm afraid I don't know anything about Gadget, so I can't comment there.  How 
> exactly does the application fail?
Neither did I :(
It fails saying a 'timestep' is 0, and that's usually caused by an error
in the parameters file. But the parameters file is OK, and it actually
works if the user runs it in a single process. Or even with
multithreaded runs, sometimes and on some nodes. That's quite random :(
But the runs are usually single-node (simple examples for students).

> Can you try upgrading to Open MPI v4.1.2?
That would be a real mess. I'm stuck with packages provided by Debian
stable. I lack both the manpower and the knowledge to compile everything
from scratch, given the intricate relations between slurm, openmpi,
infiniband, etc. :(

> What networking are you using?
Infiniband (Mellanox cards, w/ Debian-supplied drivers and support
programs) and ethernet. Infiniband is also used by IPoIB to reach the
storage servers (gluster). Some nodes lack IB, so access to the storage
is achieved by a couple of iptables rules.

> 
> From: users  on behalf of Diego Zuccato via 
> users 
> Sent: Tuesday, January 25, 2022 5:43 AM
> To: Open MPI Users
> Cc: Diego Zuccato
> Subject: [OMPI users] Gadget2 error 818 when using more than 1 process?
>
> Hello all.
>
> A user of our cluster is experiencing a weird problem that I can't pinpoint.
>
> He has a job script that worked well on every node. It's based on
> Gadget2.
>
> Lately, *sometimes*, the same executable with the same parameters file
> works, sometimes it fails. On the same node and submitting with the same
> command. On some nodes it always fails. But if it gets reduced to
> sequential (asking for just one process), it completes correctly (so the
> parameters file, common source of Gadget2 error 818, seems innocent).
>
> The cluster uses SLURM and limits resources using cgroups, if that matters.
>
> Seems most of the issues started after upgrading from openmpi 3.1.3 to
> 4.1.0 in September.
>
> Maybe related, the nodes started spitting out these warnings (that IIUC
> should be harmless... but I'd like to debug & resolve anyway):
> -8<--
> Open MPI's OFI driver detected multiple equidistant NICs from the
> current process, but had insufficient information to ensure MPI
> processes fairly pick a NIC for use.
> This may negatively impact performance. A more modern PMIx server is
> necessary to resolve this issue.
> -8<--
>
> Code is run (from the jobfile) with:
> srun --mpi=pmix_v4 ./Gadget2 paramfile
> (we also tried with a simple mpirun w/ no extra parameters leveraging
> SLURM's integration/autodetection -- same result)
>
> Any hints?
>
> TIA
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786


Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

2022-01-25 Thread Diego Zuccato via users

On 26/01/2022 02:10, Jeff Squyres (jsquyres) via users wrote:


> I'm afraid I don't know anything about Gadget, so I can't comment there.  How
> exactly does the application fail?

Neither did I :(
It fails saying a 'timestep' is 0, and that's usually caused by an error 
in the parameters file. But the parameters file is OK, and it actually 
works if the user runs it in a single process. Or even with 
multithreaded runs, sometimes and on some nodes. That's quite random :(

But the runs are usually single-node (simple examples for students).


> Can you try upgrading to Open MPI v4.1.2?
That would be a real mess. I'm stuck with packages provided by Debian 
stable. I lack both the manpower and the knowledge to compile everything 
from scratch, given the intricate relations between slurm, openmpi, 
infiniband, etc. :(



> What networking are you using?
Infiniband (Mellanox cards, w/ Debian-supplied drivers and support
programs) and ethernet. Infiniband is also used by IPoIB to reach the
storage servers (gluster). Some nodes lack IB, so access to the storage
is achieved by a couple of iptables rules.




From: users  on behalf of Diego Zuccato via users 

Sent: Tuesday, January 25, 2022 5:43 AM
To: Open MPI Users
Cc: Diego Zuccato
Subject: [OMPI users] Gadget2 error 818 when using more than 1 process?

Hello all.

A user of our cluster is experiencing a weird problem that I can't pinpoint.

He has a job script that worked well on every node. It's based on
Gadget2.

Lately, *sometimes*, the same executable with the same parameters file
works, sometimes it fails. On the same node and submitting with the same
command. On some nodes it always fails. But if it gets reduced to
sequential (asking for just one process), it completes correctly (so the
parameters file, common source of Gadget2 error 818, seems innocent).

The cluster uses SLURM and limits resources using cgroups, if that matters.

Seems most of the issues started after upgrading from openmpi 3.1.3 to
4.1.0 in September.

Maybe related, the nodes started spitting out these warnings (that IIUC
should be harmless... but I'd like to debug & resolve anyway):
-8<--
Open MPI's OFI driver detected multiple equidistant NICs from the
current process, but had insufficient information to ensure MPI
processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is
necessary to resolve this issue.
-8<--

Code is run (from the jobfile) with:
srun --mpi=pmix_v4 ./Gadget2 paramfile
(we also tried with a simple mpirun w/ no extra parameters leveraging
SLURM's integration/autodetection -- same result)

Any hints?

TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [OMPI users] Gadget2 error 818 when using more than 1 process?

2022-01-25 Thread Jeff Squyres (jsquyres) via users
I'm afraid I don't know anything about Gadget, so I can't comment there.  How 
exactly does the application fail?

Can you try upgrading to Open MPI v4.1.2?

What networking are you using?

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Diego Zuccato via 
users 
Sent: Tuesday, January 25, 2022 5:43 AM
To: Open MPI Users
Cc: Diego Zuccato
Subject: [OMPI users] Gadget2 error 818 when using more than 1 process?

Hello all.

A user of our cluster is experiencing a weird problem that I can't pinpoint.

He has a job script that worked well on every node. It's based on
Gadget2.

Lately, *sometimes*, the same executable with the same parameters file
works, sometimes it fails. On the same node and submitting with the same
command. On some nodes it always fails. But if it gets reduced to
sequential (asking for just one process), it completes correctly (so the
parameters file, common source of Gadget2 error 818, seems innocent).

The cluster uses SLURM and limits resources using cgroups, if that matters.

Seems most of the issues started after upgrading from openmpi 3.1.3 to
4.1.0 in September.

Maybe related, the nodes started spitting out these warnings (that IIUC
should be harmless... but I'd like to debug & resolve anyway):
-8<--
Open MPI's OFI driver detected multiple equidistant NICs from the
current process, but had insufficient information to ensure MPI
processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is
necessary to resolve this issue.
-8<--

Code is run (from the jobfile) with:
srun --mpi=pmix_v4 ./Gadget2 paramfile
(we also tried with a simple mpirun w/ no extra parameters leveraging
SLURM's integration/autodetection -- same result)

Any hints?

TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786


[OMPI users] Gadget2 error 818 when using more than 1 process?

2022-01-25 Thread Diego Zuccato via users

Hello all.

A user of our cluster is experiencing a weird problem that I can't pinpoint.

He has a job script that worked well on every node. It's based on
Gadget2.


Lately, *sometimes*, the same executable with the same parameters file 
works, sometimes it fails. On the same node and submitting with the same 
command. On some nodes it always fails. But if it gets reduced to 
sequential (asking for just one process), it completes correctly (so the 
parameters file, common source of Gadget2 error 818, seems innocent).


The cluster uses SLURM and limits resources using cgroups, if that matters.

Seems most of the issues started after upgrading from openmpi 3.1.3 to 
4.1.0 in September.


Maybe related, the nodes started spitting out these warnings (that IIUC 
should be harmless... but I'd like to debug & resolve anyway):

-8<--
Open MPI's OFI driver detected multiple equidistant NICs from the 
current process, but had insufficient information to ensure MPI 
processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is 
necessary to resolve this issue.

-8<--

Code is run (from the jobfile) with:
srun --mpi=pmix_v4 ./Gadget2 paramfile
(we also tried with a simple mpirun w/ no extra parameters leveraging 
SLURM's integration/autodetection -- same result)
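
One way to check whether the OFI warning above has anything to do with the failures might be to launch the same command with Open MPI's OFI components excluded and see whether the behaviour changes. The sketch below is an assumption-laden illustration, not a known fix: the helper names are hypothetical, it relies on Open MPI's generic `OMPI_MCA_<framework>` environment variables with the `^component` exclusion syntax, and it assumes the launcher passes the environment through to the MPI processes.

```python
import os
import subprocess


def env_without_ofi(base_env=None):
    """Return a copy of the environment that asks Open MPI to skip OFI components."""
    env = dict(os.environ if base_env is None else base_env)
    # OMPI_MCA_<framework>=^<component> excludes that component from selection.
    # Whether this silences the warning or changes the failures is untested here.
    env["OMPI_MCA_mtl"] = "^ofi"
    env["OMPI_MCA_btl"] = "^ofi"
    return env


def run_without_ofi(cmd):
    """Run `cmd` (a list of arguments) with the modified environment; return its exit code."""
    return subprocess.run(cmd, env=env_without_ofi()).returncode


# Hypothetical usage (not executed here):
#   run_without_ofi(["srun", "--mpi=pmix_v4", "./Gadget2", "paramfile"])
```

If the run behaves the same with and without OFI, the warning is probably unrelated to error 818; if it changes, that narrows the search to the transport selection.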


Any hints?

TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786