Re: [OMPI users] mpi-test-suite shows errors on openmpi 4.1.x

2022-05-03 Thread Alois Schlögl via users

Hello Gilles,

thanks for your response. I'm testing with 20 tasks, each using 8 
threads. When using a single node or only a few nodes, we do not see the 
issue either.


Attached are the Slurm script that was used (it also reports the environment 
variables) and the output logs from three different runs: with srun, with 
mpirun, and with mpirun --mca ...


It is correct that when running with mpirun we do not see this issue; 
the errors are only observed when running with srun.

Moreover, I notice that fewer tests are performed when using mpirun.

From that we can conclude that the issue is related to the Slurm/Open MPI 
interaction.


Switching from srun to mpirun also has some negative implications with 
respect to scheduling and robustness. Therefore, we would like to start the 
job with srun.
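
For reference, a minimal sketch of how the MCA settings Gilles suggested could 
be applied to an srun launch, assuming Open MPI honours the OMPI_MCA_* 
environment variables and that Slurm offers a PMIx plugin (use --mpi=pmi2 
otherwise); the script below is only illustrative, not our actual job script:

    #!/bin/bash
    #SBATCH --ntasks=20
    #SBATCH --cpus-per-task=8
    # mirror "mpirun --mca pml ob1 --mca btl tcp,self" via environment variables
    export OMPI_MCA_pml=ob1
    export OMPI_MCA_btl=tcp,self
    # "srun --mpi=list" shows which MPI plugins this Slurm installation offers
    srun --mpi=pmix ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective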



Cheers,
    Alois








On 5/3/22 at 12:52, Gilles Gouaillardet via users wrote:

Alois,

Thanks for the report.

FWIW, I am not seeing any errors on my Mac with Open MPI from brew (4.1.3)

How many MPI tasks are you running?
Can you please confirm you can evidence the error with

mpirun -np <number of tasks> ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective



Also, can you try the same command with
mpirun --mca pml ob1 --mca btl tcp,self ...

Cheers,

Gilles

On Tue, May 3, 2022 at 7:08 PM Alois Schlögl via users 
<users@lists.open-mpi.org> wrote:



Within our cluster (debian10/slurm16, debian11/slurm20) with
InfiniBand, we have several instances of Open MPI installed
through the Lmod module system. When testing the Open MPI
installations with the mpi-test-suite 1.1 [1], it shows errors
like these:

...
Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated
MPI_COMM_WORLD
(4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated
MPI_COMM_WORLD
(4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
...

when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The
number of errors may vary, but the first errors are always about
    ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD

When testing on openmpi/3.1.3, the tests run successfully, and there
are no failed tests.

Typically, the openmpi/4.1.x installation is configured with
 ./configure --prefix=${PREFIX} \
 --with-ucx=$UCX_HOME \
 --enable-orterun-prefix-by-default  \
 --enable-mpi-cxx \
 --with-hwloc \
 --with-pmi \
 --with-pmix \
 --with-cuda=$CUDA_HOME \
 --with-slurm

but I've also tried different compilation options, including with and
without --enable-mpi1-compatibility, with and without UCX, and using
hwloc from the OS or compiled from source. But I could not identify
any pattern.

Therefore, I'd like to ask you what the issue might be. Specifically,
I would like to know:

- Am I right in assuming that mpi-test-suite [1] is suitable for
testing Open MPI?
- What are possible causes for these types of errors?
- How would you recommend debugging these issues?

Kind regards,
   Alois


[1] https://github.com/open-mpi/mpi-test-suite



job-mpi-test3.sh
Description: application/shellscript
delta197
/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/ompi_info
 running on 20*8 cores with 20 MPI-tasks and 8 threads
SHELL=/bin/bash
SLURM_JOB_USER=schloegl
SLURM_TASKS_PER_NODE=2(x10)
SLURM_JOB_UID=10103
SLURM_TASK_PID=50793
PKG_CONFIG_PATH=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/lib/pkgconfig:/mnt/nfs/clustersw/Debian/bullseye/hwloc/2.7.1/lib/pkgconfig:/mnt/nfs/clustersw/shared/cuda/11.2.2/pkgconfig
SLURM_LOCALID=0
SLURM_SUBMIT_DIR=/nfs/scistore16/jonasgrp/schloegl/slurm
HOSTNAME=delta197
LANGUAGE=en_US:en
SLURMD_NODENAME=delta197
_ModuleTable002_=ewpmbiA9ICIvbW50L25mcy9jbHVzdGVyc3cvRGViaWFuL2J1bGxzZXllL21vZHVsZWZpbGVzL0NvcmUvaHdsb2MvMi43LjEubHVhIiwKZnVsbE5hbWUgPSAiaHdsb2MvMi43LjEiLApsb2FkT3JkZXIgPSAzLApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMSwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gImh3bG9jLzIuNy4xIiwKd1YgPSAiMDAwMDAwMDAyLjAwMDAwMDAwNy4wMDAwMDAwMDEuKnpmaW5hbCIsCn0sCm9wZW5tcGkgPSB7CmZuID0gIi9tbnQvbmZzL2NsdXN0ZXJzdy9EZWJpYW4vYnVsbHNleWUvbW9kdWxlZmlsZXMvQ29yZS9vcGVubXBpLzQuMS4zZC5sdWEiLApmdWxsTmFtZSA9ICJvcGVubXBpLzQuMS4zZCIsCmxvYWRPcmRlciA9IDQsCnByb3BUID0g
MPICC=/mnt/nfs/clustersw/Debian/bullseye/openmpi/4.1.3d/bin/mpicc
__LMOD_REF_COUNT_MODULEPATH=/mnt/nfs/clustersw/Debian/bullseye/modulefiles/MPI/openmpi/4.1.3d:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Linux:1;/mnt/nfs/clustersw/Debian/bullseye/modulefiles/Core:1;/mnt/nfs/clustersw/Debian/bullseye/lmod/lmod/modulefiles/Core:1
OMPI_MCA_btl=self,openib

Re: [OMPI users] mpi-test-suite shows errors on openmpi 4.1.x

2022-05-03 Thread Gilles Gouaillardet via users
Alois,

Thanks for the report.

FWIW, I am not seeing any errors on my Mac with Open MPI from brew (4.1.3)

How many MPI tasks are you running?
Can you please confirm you can evidence the error with

mpirun -np <number of tasks> ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective


Also, can you try the same command with
mpirun --mca pml ob1 --mca btl tcp,self ...
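
(For illustration only, assuming the 20 tasks mentioned elsewhere in this
thread, the fully spelled-out command might look like:

mpirun --mca pml ob1 --mca btl tcp,self -np 20 ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective

This forces the ob1 PML over the tcp and self BTLs, bypassing UCX, which
helps narrow down whether the failures are transport-specific.)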

Cheers,

Gilles

On Tue, May 3, 2022 at 7:08 PM Alois Schlögl via users <
users@lists.open-mpi.org> wrote:

>
> Within our cluster (debian10/slurm16, debian11/slurm20) with
> InfiniBand, we have several instances of Open MPI installed through
> the Lmod module system. When testing the Open MPI installations with the
> mpi-test-suite 1.1 [1], it shows errors like these:
>
> ...
> Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
> (Rank:0) tst_test_array[46]:Allreduce Sum
> (Rank:0) tst_test_array[47]:Alltoall
> Number of failed tests: 130
> Summary of failed tests:
> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
> (4), type MPI_TYPE_MIX (27) number of values:1000
> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
> (4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000
> ...
>
> when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of
> errors may vary, but the first errors are always about
> ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD
>
> When testing on openmpi/3.1.3, the tests run successfully, and there
> are no failed tests.
>
> Typically, the openmpi/4.1.x installation is configured with
>  ./configure --prefix=${PREFIX} \
>  --with-ucx=$UCX_HOME \
>  --enable-orterun-prefix-by-default  \
>  --enable-mpi-cxx \
>  --with-hwloc \
>  --with-pmi \
>  --with-pmix \
>  --with-cuda=$CUDA_HOME \
>  --with-slurm
>
> but I've also tried different compilation options, including with and without
> --enable-mpi1-compatibility, with and without UCX, and using hwloc from the OS or
> compiled from source. But I could not identify any pattern.
>
> Therefore, I'd like to ask you what the issue might be. Specifically,
> I would like to know:
>
> - Am I right in assuming that mpi-test-suite [1] is suitable for testing
> Open MPI?
> - What are possible causes for these types of errors?
> - How would you recommend debugging these issues?
>
> Kind regards,
>Alois
>
>
> [1] https://github.com/open-mpi/mpi-test-suite
>
>


[OMPI users] mpi-test-suite shows errors on openmpi 4.1.x

2022-05-03 Thread Alois Schlögl via users



Within our cluster (debian10/slurm16, debian11/slurm20) with 
InfiniBand, we have several instances of Open MPI installed through 
the Lmod module system. When testing the Open MPI installations with the 
mpi-test-suite 1.1 [1], it shows errors like these:


...
Rank:0) tst_test_array[45]:Allreduce Min/Max with MPI_IN_PLACE
(Rank:0) tst_test_array[46]:Allreduce Sum
(Rank:0) tst_test_array[47]:Alltoall
Number of failed tests: 130
Summary of failed tests:
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD 
(4), type MPI_TYPE_MIX (27) number of values:1000
ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD 
(4), type MPI_TYPE_MIX_ARRAY (28) number of values:1000

...

when using openmpi/4.1.x (I tested with 4.1.1 and 4.1.3). The number of 
errors may vary, but the first errors are always about

   ERROR class:P2P test:Ring Send Pack (7), comm Duplicated MPI_COMM_WORLD

When testing on openmpi/3.1.3, the tests run successfully, and there 
are no failed tests.


Typically, the openmpi/4.1.x installation is configured with
    ./configure --prefix=${PREFIX} \
    --with-ucx=$UCX_HOME \
    --enable-orterun-prefix-by-default  \
    --enable-mpi-cxx \
    --with-hwloc \
    --with-pmi \
    --with-pmix \
    --with-cuda=$CUDA_HOME \
    --with-slurm

but I've also tried different compilation options, including with and without 
--enable-mpi1-compatibility, with and without UCX, and using hwloc from the OS or 
compiled from source. But I could not identify any pattern.
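
(As a sanity check of each build, something along these lines can show which 
point-to-point and transport components were actually compiled in; the grep 
patterns are only illustrative:

    ompi_info | grep -E "MCA (pml|btl)"
    ompi_info | grep -i ucx
)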


Therefore, I'd like to ask you what the issue might be. Specifically, 
I would like to know:

- Am I right in assuming that mpi-test-suite [1] is suitable for testing 
Open MPI?
- What are possible causes for these types of errors?
- How would you recommend debugging these issues?

Kind regards,
  Alois


[1] https://github.com/open-mpi/mpi-test-suite



Re: [OMPI users] MPI test suite

2020-07-24 Thread Zhang, Junchao via users
Hi, Chris,
  The website you gave is almost empty.  svn checkout 
https://scm.projects.hlrs.de/anonscm/svn/mpitestsuite/ does not work.
  Our code uses MPI point-to-point, collectives, communicators, and attributes; 
basically MPI-2.1 stuff.
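
(Assuming the suite in question is the one that is nowadays hosted on GitHub, a 
possible alternative to the dead svn checkout above would be:

  git clone https://github.com/open-mpi/mpi-test-suite.git

but that is only an assumption about where the code moved.)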

  Thanks
--Junchao Zhang



On Jul 24, 2020, at 2:34 AM, Christoph Niethammer <nietham...@hlrs.de> wrote:

Hello,

What do you want to test in detail?

If you are interested in testing combinations of datatypes and communicators, 
the mpi_test_suite [1] may be of interest to you.

Best
Christoph Niethammer

[1] https://projects.hlrs.de/projects/mpitestsuite/



- Original Message -
From: "Open MPI Users" 
mailto:users@lists.open-mpi.org>>
To: "Open MPI Users" mailto:users@lists.open-mpi.org>>
Cc: "Zhang, Junchao" mailto:jczh...@mcs.anl.gov>>
Sent: Thursday, 23 July, 2020 22:25:18
Subject: Re: [OMPI users] MPI test suite

I know OSU micro-benchmarks.  But it is not an extensive test suite.

Thanks
--Junchao Zhang



On Jul 23, 2020, at 2:00 PM, Marco Atzeri via users 
<users@lists.open-mpi.org> wrote:

On 23.07.2020 20:28, Zhang, Junchao via users wrote:
Hello,
 Does OMPI have a test suite that can let me validate MPI implementations from 
other vendors?
 Thanks
--Junchao Zhang

Have you considered the OSU Micro-Benchmarks ?

http://mvapich.cse.ohio-state.edu/benchmarks/



Re: [OMPI users] MPI test suite

2020-07-24 Thread Christoph Niethammer via users
Hi,

MTT is a testing infrastructure that automates building MPI libraries and tests, 
running tests, and collecting test results, but it does not come with MPI test 
suites itself.

Best
Christoph

- Original Message -
From: "Open MPI Users" 
To: "Open MPI Users" 
Cc: "Joseph Schuchart" 
Sent: Friday, 24 July, 2020 09:00:34
Subject: Re: [OMPI users] MPI test suite

You may want to look into MTT: https://github.com/open-mpi/mtt

Cheers
Joseph

On 7/23/20 8:28 PM, Zhang, Junchao via users wrote:
> Hello,
>    Does OMPI have a test suite that can let me validate MPI 
> implementations from other vendors?
> 
>    Thanks
> --Junchao Zhang
> 
> 
>


Re: [OMPI users] MPI test suite

2020-07-24 Thread Christoph Niethammer via users
Hello,

What do you want to test in detail?

If you are interested in testing combinations of datatypes and communicators, 
the mpi_test_suite [1] may be of interest to you.

Best
Christoph Niethammer

[1] https://projects.hlrs.de/projects/mpitestsuite/
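
(A rough sketch of how it can be built and run against an arbitrary MPI 
implementation, assuming an autotools-style build and reusing the invocation 
that appears earlier in this archive; the exact configure switches may differ 
between releases:

    ./configure CC=mpicc
    make
    mpirun -np 4 ./mpi_test_suite -d MPI_TYPE_MIX_ARRAY -c 0 -t collective
)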



- Original Message -
From: "Open MPI Users" 
To: "Open MPI Users" 
Cc: "Zhang, Junchao" 
Sent: Thursday, 23 July, 2020 22:25:18
Subject: Re: [OMPI users] MPI test suite

I know OSU micro-benchmarks.  But it is not an extensive test suite.

Thanks
--Junchao Zhang



> On Jul 23, 2020, at 2:00 PM, Marco Atzeri via users 
>  wrote:
> 
> On 23.07.2020 20:28, Zhang, Junchao via users wrote:
>> Hello,
>>   Does OMPI have a test suite that can let me validate MPI implementations 
>> from other vendors?
>>   Thanks
>> --Junchao Zhang
> 
> Have you considered the OSU Micro-Benchmarks ?
> 
> http://mvapich.cse.ohio-state.edu/benchmarks/


Re: [OMPI users] MPI test suite

2020-07-24 Thread Joseph Schuchart via users

You may want to look into MTT: https://github.com/open-mpi/mtt

Cheers
Joseph

On 7/23/20 8:28 PM, Zhang, Junchao via users wrote:

Hello,
   Does OMPI have a test suite that can let me validate MPI 
implementations from other vendors?


   Thanks
--Junchao Zhang





Re: [OMPI users] MPI test suite

2020-07-23 Thread Zhang, Junchao via users
I know OSU micro-benchmarks.  But it is not an extensive test suite.

Thanks
--Junchao Zhang



> On Jul 23, 2020, at 2:00 PM, Marco Atzeri via users 
>  wrote:
> 
> On 23.07.2020 20:28, Zhang, Junchao via users wrote:
>> Hello,
>>   Does OMPI have a test suite that can let me validate MPI implementations 
>> from other vendors?
>>   Thanks
>> --Junchao Zhang
> 
> Have you considered the OSU Micro-Benchmarks ?
> 
> http://mvapich.cse.ohio-state.edu/benchmarks/



Re: [OMPI users] MPI test suite

2020-07-23 Thread Marco Atzeri via users

On 23.07.2020 20:28, Zhang, Junchao via users wrote:

Hello,
   Does OMPI have a test suite that can let me validate MPI 
implementations from other vendors?


   Thanks
--Junchao Zhang


Have you considered the OSU Micro-Benchmarks ?

http://mvapich.cse.ohio-state.edu/benchmarks/
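
(For a quick smoke test with these benchmarks, assuming the usual autotools 
build of the OSU suite; the location of the binaries in the build tree varies 
between releases:

    ./configure CC=mpicc CXX=mpicxx
    make
    mpirun -np 2 ./osu_latency

osu_latency is a two-rank point-to-point latency test; osu_bw and the 
collective benchmarks are built elsewhere in the same tree.)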


[OMPI users] MPI test suite

2020-07-23 Thread Zhang, Junchao via users
Hello,
  Does OMPI have a test suite that can let me validate MPI implementations from 
other vendors?

  Thanks
--Junchao Zhang