Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning value 15

2022-03-14 Thread Gilles Gouaillardet via users
In order to exclude the coll/tuned component:

mpirun --mca coll ^tuned ...


Cheers,

Gilles
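Merged into the mpirun invocation quoted later in the thread, that would look something like the sketch below (the hostfile path, interface, and rank counts are taken from Ernesto's command; "your_app" is a placeholder for the real binary):

```shell
# Sketch: the original launch line with coll/tuned disabled, so Open MPI
# falls back to other collective components for MPI_Allreduce.
# "./your_app" is a placeholder; -x LD_LIBRARY_PATH forwards the current
# value instead of the elided explicit paths from the original command.
mpirun --mca coll ^tuned \
       --mca btl_tcp_if_include eth0 \
       -x LD_LIBRARY_PATH \
       -hostfile /tmp/hostfile.txt \
       -np 140 -npernode 4 \
       ./your_app
```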

Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning value 15

2022-03-14 Thread Ernesto Prudencio via users
Thanks for the hint on "mpirun ldd". I will try it. The problem is that I am
running on the cloud, and it is trickier to get into a node at run time or to
save information to be retrieved later.

Sorry for my ignorance on MCA stuff, but what exactly would be the suggested
mpirun command-line options for coll/tuned?

Cheers,

Ernesto.

Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning value 15

2022-03-14 Thread Gilles Gouaillardet via users
Ernesto,

you can
mpirun ldd 

and double check it uses the library you expect.


you might want to try adapting your trick to use Open MPI 4.1.2 with your
binary built with Open MPI 4.0.3 and see how it goes.
I'd try disabling coll/tuned first, though.


Keep in mind PETSc might call MPI_Allreduce under the hood with matching
but different signatures.


Cheers,

Gilles
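As a concrete sketch of that check ("./your_app" is a placeholder for the real binary):

```shell
# Run ldd under mpirun so the libraries are resolved in the same remote
# environment the MPI ranks will see; grep shows which libmpi gets picked
# up (e.g. /opt/openmpi_4.0.3/lib vs /appl-third-parties/openmpi-4.1.2/lib).
mpirun -np 2 -npernode 1 ldd ./your_app | grep -i libmpi
```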

Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning value 15

2022-03-14 Thread Ernesto Prudencio via users
Thanks, Gilles.

In the case of the application I am working on, all ranks call MPI with the 
same signature / types of variables.

I do not think there is a code error anywhere. I think this is "just" a
configuration error on my part.

Regarding the idea of changing just one item at a time: that would be the next 
step, but first I would like to check if my suspicion that the presence of both 
"/opt/openmpi_4.0.3" and "/appl-third-parties/openmpi-4.1.2" at run time could 
be an issue:

  *   It is an issue in situation 2, when I explicitly point the runtime MPI
to 4.1.2 (also used in compilation)
  *   It is not an issue in situation 3, when I explicitly point the runtime
MPI to 4.0.3 compiled with INTEL (even though I compiled the application and
openmpi 4.1.2 with GNU, and I link the application with openmpi 4.1.2)

Best,

Ernesto.
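One way to probe that suspicion directly (a sketch, assuming the same hostfile as the main run; "./your_app" is a placeholder) is to print what each node actually resolves at launch time:

```shell
# On each node: show the hostname, the LD_LIBRARY_PATH the rank sees, and
# which libmpi the binary resolves, to confirm the 4.0.3 and 4.1.2
# installations are not being mixed at run time.
mpirun -np 2 -npernode 1 -hostfile /tmp/hostfile.txt sh -c \
  'hostname; echo "$LD_LIBRARY_PATH"; ldd ./your_app | grep -i libmpi'
```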

Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning value 15

2022-03-14 Thread Gilles Gouaillardet via users
Ernesto,

the coll/tuned module (which handles collective subroutines by default)
has a known issue when matching but non-identical signatures are used:
for example, one rank uses one vector of n bytes, and another rank uses n
bytes.
Is there a chance your application might use this pattern?

You can try disabling this component with
mpirun --mca coll ^tuned ...
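Before disabling it, you can also confirm the component is present in your build and inspect its tunables with ompi_info (shipped with the Open MPI install):

```shell
# List the collective components built into this Open MPI installation;
# "tuned" should appear among the "MCA coll" lines if it is available.
ompi_info | grep "MCA coll"
# Show the coll/tuned MCA parameters (algorithm selection knobs, etc.).
ompi_info --param coll tuned --level 9
```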


I noted that between the successful a) case and the unsuccessful b) case, you
changed 3 parameters:
 - compiler vendor
 - Open MPI version
 - PETSc version
so at this stage, it is not obvious which should be blamed for the failure.


In order to get a better picture, I would first try
 - Intel compilers
 - Open MPI 4.1.2
 - PETSc 3.10.4

=> a failure would suggest a regression in Open MPI

And then
 - Intel compilers
 - Open MPI 4.0.3
 - PETSc 3.16.5

=> a failure would either suggest a regression in PETSc, or PETSc doing
something different but legitimate that exposes a bug in Open MPI.

If you have time, you can also try
 - Intel compilers
 - MPICH (or a derivative such as Intel MPI)
 - PETSc 3.16.5

=> a success would strongly point to Open MPI


Cheers,

Gilles

On Mon, Mar 14, 2022 at 2:56 PM Ernesto Prudencio via users <
users@lists.open-mpi.org> wrote:

> Forgot to mention that in all 3 situations, mpirun is called as follows
> (35 nodes, 4 MPI ranks per node):
>
>
>
> mpirun -x LD_LIBRARY_PATH=:::… -hostfile /tmp/hostfile.txt
> -np 140 -npernode 4 --mca btl_tcp_if_include eth0 
> 
>
>
>
> So I have a question 3) Should I add some extra option in the mpirun
> command line in order to make situation 2 successful?
>
>
>
> Thanks,
>
>
>
> Ernesto.
>
>
>
>
>
> Schlumberger-Private
>
> From: users On Behalf Of Ernesto Prudencio via users
> Sent: Monday, March 14, 2022 12:39 AM
> To: Open MPI Users
> Cc: Ernesto Prudencio
> Subject: Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning
> value 15
>
>
>
> Thank you for the quick answer, George. I wanted to investigate the
> problem further before replying.
>
>
>
> Below I show 3 situations of my C++ (and Fortran) application, which runs
> on top of PETSc, OpenMPI, and MKL. All 3 situations use MKL 2019.0.5
> compiled with INTEL.
>
>
>
> At the end, I have 2 questions.
>
>
>
> Note: all codes are compiled in a certain set of nodes, and the execution
> happens at _another_ set of nodes.
>
>
> +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - -
>
>
>
> Situation 1) It has been successful for months now:
>
>
>
> a) Use INTEL compilers for OpenMPI 4.0.3, PETSc 3.10.4 , and application.
> The configuration options for OpenMPI are:
>
>
>
> '--with-flux-pmi=no' '--enable-orterun-prefix-by-default'
> '--prefix=/mnt/disks/intel-2018-3-222-blade-runtime-env-2018-1-07-08-2018-132838/openmpi_4.0.3_intel2019.5_gcc7.3.1'
> 'FC=ifort' 'CC=gcc'
>
>
>
> b) At run time, each MPI rank prints this info:
>
>
>
> PATH =
> /opt/openmpi_4.0.3/bin:/opt/openmpi_4.0.3/bin:/opt/openmpi_4.0.3/bin:/opt/rh/devtoolset-7/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>
>
>
> LD_LIBRARY_PATH  =
> /opt/openmpi_4.0.3/lib::/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7:/opt/petsc/lib:/opt/2019.5/compilers_and_libraries/linux/mkl/lib/intel64:/opt/openmpi_4.0.3/lib:/lib64:/lib:/usr/lib64:/usr/lib
>
>
>
> MPI version (compile time)   = 4.0.3
>
> MPI_Get_library_version()= Open MPI v4.0.3, package: Open MPI 
> root@
> Distribution, ident: 4.0.3, repo rev: v4.0.3, Mar 03, 2020
>
> PETSc version (compile time) = 3.10.4
>
>
>
> c) A test of 20 minutes with 14 nodes, 4 MPI ranks per node, runs ok.
>
>
>
> d) A test of 2 hours with 35 nodes, 4 MPI ranks per node, runs ok.
>
>
>
> +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - -
>
>
>
> Situation 2) This situation is the one failing during execution.
>
>
>
> a) Use GNU compilers for OpenMPI 4.1.2, PETSc 3.16.5 , and application.
> The configuration options for OpenMPI are:
>
>
>
> '--with-flux-pmi=no' '--prefix=/appl-third-parties/openmpi-4.1.2'
> '--enable-orterun-prefix-by-default'
>
>
>
> b) At run time, each MPI rank prints this info:
>
>
>
> PATH  = /appl-third-parties/openmpi-4.1.2/bin
> :/appl-third-parties/openmpi-4.1.2/bin:/appl-third-parties/openmpi-4.1.2/bin:/opt/rh/devtoolset-7/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
>
>
>
> LD_LIBRARY_PATH = /appl-third-parties/openmpi-4.1.2/lib
> ::/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7:/appl-third-parties/petsc-3.16.5/lib
>
>
> :/opt/2019.5/compilers_and_libraries/linux/mkl/lib/intel64:/appl-third-parties/openmpi-4.1.2/lib:/lib64:/lib:/usr/lib64:/usr/lib
>
>
>
> MPI version (compile time)= 4.1.2
>
>