Hello all.

A user of our cluster is experiencing a weird problem that I can't pinpoint.

He has a job script that used to work well on every node. It's based on Gadget2.

Lately, the same executable with the same parameter file *sometimes* works and sometimes fails, on the same node and submitted with the same command. On some nodes it always fails. But if the job is reduced to sequential (asking for just one process), it completes correctly, so the parameter file, a common source of Gadget2 error 818, seems innocent.
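Since the failures depend on the node, it could help to record where each run landed. The following is only a sketch of a few lines that could go near the top of the job file; the function name log_run_context is made up for illustration, and it assumes the usual SLURM job variables (SLURM_JOB_ID, SLURM_NTASKS, SLURM_JOB_NODELIST) are exported to the batch script:

```shell
#!/bin/bash
# Print one line describing where and how this job ran, so that failed and
# successful runs can be compared afterwards.
# Falls back to "n/a" when a SLURM variable is unset (e.g. outside a job).
log_run_context() {
    printf 'host=%s job=%s ntasks=%s nodelist=%s\n' \
        "$(uname -n)" \
        "${SLURM_JOB_ID:-n/a}" \
        "${SLURM_NTASKS:-n/a}" \
        "${SLURM_JOB_NODELIST:-n/a}"
}

log_run_context
```

Grepping the job output files for these lines afterwards would show whether the failing runs cluster on particular nodes or task counts.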

The cluster uses SLURM and limits resources using cgroups, if that matters.

It seems most of the issues started after upgrading from Open MPI 3.1.3 to 4.1.0 in September.

Possibly related: the nodes started emitting these warnings (which IIUC should be harmless, but I'd like to debug and resolve them anyway):
-8<--
Open MPI's OFI driver detected multiple equidistant NICs from the current process, but had insufficient information to ensure MPI processes fairly pick a NIC for use. This may negatively impact performance. A more modern PMIx server is necessary to resolve this issue.
-8<--
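One experiment that might show whether the OFI path is involved in the failures at all is to steer Open MPI away from it for a single test run. This is a sketch of a diagnostic, not a fix: it assumes Open MPI 4.1.x (where the shared-memory BTL is called vader) and that TCP between the nodes is usable, and it will likely be slower on a fast interconnect:

```shell
#!/bin/bash
# Diagnostic only: avoid the OFI/libfabric path for one run.
# Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables.
export OMPI_MCA_pml=ob1               # ob1 point-to-point layer (BTL-based)
export OMPI_MCA_btl=self,vader,tcp    # loopback, shared memory, TCP

# Then launch exactly as before, e.g.:
#   srun --mpi=pmix_v4 ./Gadget2 paramfile
echo "pml=$OMPI_MCA_pml btl=$OMPI_MCA_btl"
```

If both the warning and the intermittent failures disappear with these settings, that would point at the OFI component rather than at Gadget2 or the parameter file.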

The code is run (from the job file) with:
srun --mpi=pmix_v4 ./Gadget2 paramfile
(we also tried a plain mpirun with no extra parameters, relying on SLURM's integration/autodetection, with the same result)

Any hints?

TIA

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
