[OMPI users] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501; error in device init Mesh created.

2023-05-19 Thread Rob Kudyba via users
RHEL 8 with Open MPI 4.1.5a1 on an HPC cluster compute node, Singularity
version 3.7.1. I see the error in another issue mentioned on the Git page
and on SO, where a suggestion to set -mca orte_base_help_aggregate 0
suppresses it.
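
A sketch of how that suggestion would look on the command line used further down (the snappyHexMesh arguments are carried over from that run; treat this as a workaround sketch, not a fix for the underlying unpack error):

```
# Disable help-message aggregation; this is the commonly suggested workaround
# for the unpack error raised in util/show_help.c, since it avoids the
# aggregation code path. It does not explain why the message is malformed.
mpirun --mca orte_base_help_aggregate 0 -np 8 snappyHexMesh -overwrite -parallel > snappyHexMesh.out
```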

ompi_info | head

 Package: Open MPI root@c-141-88-1-005 Distribution
Open MPI: 4.1.5a1
  Open MPI repo revision: v4.1.4-32-g5abd86c
   Open MPI release date: Sep 05, 2022
Open RTE: 4.1.5a1
  Open RTE repo revision: v4.1.4-32-g5abd86c
   Open RTE release date: Sep 05, 2022
OPAL: 4.1.5a1
  OPAL repo revision: v4.1.4-32-g5abd86c
   OPAL release date: Sep 05, 2022



mpirun -np 8 -debug-devel -v snappyHexMesh -overwrite -parallel > snappyHexMesh.out

[g279:2750943] procdir: /tmp/ompi.g279.547289/pid.2750943/0/0
[g279:2750943] jobdir: /tmp/ompi.g279.547289/pid.2750943/0
[g279:2750943] top: /tmp/ompi.g279.547289/pid.2750943
[g279:2750943] top: /tmp/ompi.g279.547289
[g279:2750943] tmp: /tmp
[g279:2750943] sess_dir_cleanup: job session dir does not exist
[g279:2750943] sess_dir_cleanup: top session dir not empty - leaving
[g279:2750943] procdir: /tmp/ompi.g279.547289/pid.2750943/0/0
[g279:2750943] jobdir: /tmp/ompi.g279.547289/pid.2750943/0
[g279:2750943] top: /tmp/ompi.g279.547289/pid.2750943
[g279:2750943] top: /tmp/ompi.g279.547289
[g279:2750943] tmp: /tmp
[g279:2750943] [[20506,0],0] Releasing job data for [INVALID]

--
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter to true.

  Local host:  g279
  Local adapter:   mlx5_0
  Local port:  1
--
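
As the warning itself suggests, there are two usual ways to deal with this; a sketch, assuming UCX support is actually compiled into this Open MPI build (something `ompi_info | grep ucx` would confirm):

```
# Option 1: follow the stated intent and run over UCX, excluding the legacy
# openib BTL so the warning is not triggered at all.
mpirun --mca pml ucx --mca btl ^openib -np 8 snappyHexMesh -overwrite -parallel

# Option 2: keep the openib BTL and explicitly allow it to use IB ports,
# as the help text describes.
mpirun --mca btl_openib_allow_ib true -np 8 snappyHexMesh -overwrite -parallel
```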

[g279:2750947] procdir: /tmp/ompi.g279.547289/pid.2750943/1/0
[g279:2750947] jobdir: /tmp/ompi.g279.547289/pid.2750943/1
[g279:2750947] top: /tmp/ompi.g279.547289/pid.2750943
[g279:2750947] top: /tmp/ompi.g279.547289
[g279:2750947] tmp: /tmp
[g279:2750949] procdir: /tmp/ompi.g279.547289/pid.2750943/1/2
[g279:2750949] jobdir: /tmp/ompi.g279.547289/pid.2750943/1
[g279:2750949] top: /tmp/ompi.g279.547289/pid.2750943
[g279:2750949] top: /tmp/ompi.g279.547289
[g279:2750949] tmp: /tmp
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   g279
  Local device: mlx5_0
--
[g279:2750948] procdir: /tmp/ompi.g279.547289/pid.2750943/1/1
[g279:2750948] jobdir: /tmp/ompi.g279.547289/pid.2750943/1
[g279:2750948] top: /tmp/ompi.g279.547289/pid.2750943
[g279:2750948] top: /tmp/ompi.g279.547289
[g279:2750948] tmp: /tmp
[g279:2750953] procdir: /tmp/ompi.g279.547289/pid.2750943/1/4
[g279:2750953] jobdir: /tmp/ompi.g279.547289/pid.2750943/1
[g279:2750953] top: /tmp/ompi.g279.547289/pid.2750943
[g279:2750953] top: /tmp/ompi.g279.547289
[g279:2750953] tmp: /tmp
[g279:2750950] procdir: /tmp/ompi.g279.547289/pid.2750943/1/3
[g279:2750950] jobdir: /tmp/ompi.g279.547289/pid.2750943/1
[g279:2750950] top: /tmp/ompi.g279.547289/pid.2750943
[g279:2750950] top: /tmp/ompi.g279.547289
[g279:2750950] tmp: /tmp
[g279:2750954] procdir: /tmp/ompi.g279.547289/pid.2750943/1/5
[g279:2750954] jobdir: /tmp/ompi.g279.547289/pid.2750943/1
[g279:2750954] top: /tmp/ompi.g279.547289/pid.2750943
[g279:2750954] top: /tmp/ompi.g279.547289
[g279:2750954] tmp: /tmp
[g279:2750955] procdir: /tmp/ompi.g279.547289/pid.2750943/1/6
[g279:2750955] jobdir: /tmp/ompi.g279.547289/pid.2750943/1
[g279:2750955] top: /tmp/ompi.g279.547289/pid.2750943
[g279:2750955] top: /tmp/ompi.g279.547289
[g279:2750955] tmp: /tmp

  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_forward_output = 0
  MPIR_proctable_size = 8
  MPIR_proctable:
(i, host, exe, pid) = (0, g279, /usr/lib/openfoam/openfoam2212/platforms/linux64GccDPInt32Opt/bin/snappyHexMesh, 2750947)
(i, host, exe, pid) = (1, g279, /usr/lib/openfoam/openfoam2212/platforms/linux64GccDPInt32Opt/bin/snappyHexMesh, 2750948)
(i, host, exe, pid) = (2, g279, /usr/lib/openfoam/openfoam2212/platforms/linux64GccDPInt32Opt/bin/snappyHexMesh, 2750949)
(i, host, exe, pid) = (3, g279, /usr/lib/openfoam/openfoam2212/platforms/linux64GccDPInt32Opt/bin/snappyHexMesh, 2750950)
(i, host, exe, pid) = (4, g279, /usr/lib/openfoam/openfoam2212/platforms/linux64GccDPInt32Opt/bin/snappyHexMesh, 2750953)
(i, host, exe, pid) = (5, g279,

[OMPI users] mpirun seemingly requires --host and --oversubscribe when running more than -np 2 on some nodes

2023-05-19 Thread Morgan via users

Hi All,

I am seeing some funky behavior and am hoping someone has ideas on where 
to start looking. I have installed Open MPI 4.1.4 via Spack on this 
cluster, Slurm-aware. I then built Orca against that via Spack as well 
(for context). Orca calls MPI under the hood with a simple `mpirun -np X 
...`. However, I am running into a case where, on some nodes, I get 
`While computing bindings, we found no available cpus on the following 
node:` when trying to use more than `-np 2`. When I add `--oversubscribe` 
and `--host [hostname]`, I can run successfully.


The other weird part of this is that it does not happen on all of my 
compute nodes. All of the compute nodes are installed identically with 
Rocky 8.


Here are examples:

```
[user@node2428 sbatch_scripts]$ mpirun --display-allocation -np 4 hostname

==   ALLOCATED NODES   ==
    node2428: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
=
--
While computing bindings, we found no available cpus on
the following node:

  Node:  node2428

Please check your allocation.
--
[user@node2428 sbatch_scripts]$ mpirun --display-allocation --oversubscribe -np 4 hostname

==   ALLOCATED NODES   ==
    node2428: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
=
--
While computing bindings, we found no available cpus on
the following node:

  Node:  node2428

Please check your allocation.
--
[user@node2428 sbatch_scripts]$ mpirun --display-allocation --oversubscribe --host node2428 -np 4 hostname

==   ALLOCATED NODES   ==
    node2428: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
=
node2428
node2428
node2428
node2428
```
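
One thing worth checking (a sketch, not a confirmed diagnosis): whether the Slurm cgroup on the affected node exposes fewer CPUs to the job step than the four slots shown in the allocation. node2428 is taken from the output above; the commands are generic:

```
# How many CPUs can a step on that node actually bind to?
srun -w node2428 -N1 -n1 grep Cpus_allowed_list /proc/self/status

# What does Slurm think the node has and has allocated?
scontrol show node node2428 | grep -Ei 'cputot|cpualloc'

# If the binding computation is the only blocker, disabling binding is a
# possible (band-aid) workaround:
mpirun --bind-to none -np 4 hostname
```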

Thanks in advance!

--
Morgan Ludwig
Techsquare Inc.
http://www.techsquare.com/
mlud...@techsquare.com



Re: [OMPI users] psec warning when launching with srun

2023-05-19 Thread Zhéxué M. Krawutschke via users
Hello Christoph,
What exactly is your problem with Open MPI and Slurm?
Do you compile the products yourself? Which Linux distribution and version are 
you using?

If you compile the software yourself, could you please tell me what the 
"configure" command looks like and which MUNGE version is in use - from the 
distribution, or compiled yourself?

I would be very happy to take on this topic and help you. You can also reach me 
at +49 176 67270992.
Best regards from Berlin

Z. Matthias Krawutschke

> On Thursday, May 18, 2023 at 5:47 PM, christof.koehler--- via users 
> <users@lists.open-mpi.org> wrote:
> Hello again,
>
> I should add that the openmpi configure decides to use the internal pmix
>
> configure: WARNING: discovered external PMIx version is less than internal 
> version 3.x
> configure: WARNING: using internal PMIx
> ...
> ...
> checking if user requested PMI support... yes
> checking for pmi.h in /usr/include... not found
> checking for pmi.h in /usr/include/slurm... found
> checking pmi.h usability... yes
> checking pmi.h presence... yes
> checking for pmi.h... yes
> checking for libpmi in /usr/lib64... found
> checking for PMI_Init in -lpmi... yes
> checking for pmi2.h in /usr/include... not found
> checking for pmi2.h in /usr/include/slurm... found
> checking pmi2.h usability... yes
> checking pmi2.h presence... yes
> checking for pmi2.h... yes
> checking for libpmi2 in /usr/lib64... found
> checking for PMI2_Init in -lpmi2... yes
> checking for pmix.h in ... not found
> checking for pmix.h in /include... not found
> checking can PMI support be built... yes
> checking if user requested internal PMIx support(yes)... no
> checking for pmix.h in /usr... not found
> checking for pmix.h in /usr/include... found
> checking libpmix.* in /usr/lib64... found
> checking PMIx version... version file found
> checking version 4x... found
> checking PMIx version to be used... internal
>
> I am not sure how it decides that; the external one is already quite a
> new version.
>
> # srun --mpi=list
> MPI plugin types are...
> pmix
> cray_shasta
> none
> pmi2
> specific pmix plugin versions available: pmix_v4
>
>
> Best Regards
>
> Christof
>
> --
> Dr. rer. nat. Christof Köhler email: c.koeh...@uni-bremen.de
> Universitaet Bremen/FB1/BCCMS phone: +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.06 fax: +49-(0)421-218-62770
> 28359 Bremen
>
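
One way the external PMIx could be forced at build time, as a sketch (the /usr prefix follows the "checking for pmix.h in /usr/include... found" line above, the application name is a placeholder, and other configure options are omitted):

```
# Point configure explicitly at the external PMIx installation instead of
# letting it fall back to the bundled copy, keeping Slurm support enabled.
./configure --with-pmix=/usr --with-slurm ...

# With Slurm's pmix plugin available (see "srun --mpi=list" above), jobs can
# then be launched directly through PMIx:
srun --mpi=pmix_v4 ./my_mpi_app   # my_mpi_app is a placeholder
```

Whether 4.1.x will accept that particular external PMIx still depends on the version check that produced the "less than internal version" warning, so treat this only as a starting point.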