Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning value 15
in order to exclude the coll/tuned component:

mpirun --mca coll ^tuned ...

Cheers,

Gilles

On Mon, Mar 14, 2022 at 5:37 PM Ernesto Prudencio via users <users@lists.open-mpi.org> wrote:

Thanks for the hint on "mpirun ldd". I will try it. The problem is that I am running on the cloud, and it is trickier to get into a node at run time or to save information to be retrieved later.

Sorry for my ignorance on mca stuff, but what exactly would be the suggested mpirun command-line options for coll/tuned?

Cheers,

Ernesto.

From: users On Behalf Of Gilles Gouaillardet via users
Sent: Monday, March 14, 2022 2:22 AM
To: Open MPI Users
Cc: Gilles Gouaillardet
Subject: Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning value 15

Ernesto,

you can

mpirun ldd

and double-check it uses the library you expect.

You might want to try adapting your trick to use Open MPI 4.1.2 with your binary built with Open MPI 4.0.3 and see how it goes. I'd try disabling coll/tuned first, though.

Keep in mind PETSc might call MPI_Allreduce under the hood with matching but different signatures.

Cheers,

Gilles

On Mon, Mar 14, 2022 at 4:09 PM Ernesto Prudencio via users <users@lists.open-mpi.org> wrote:

Thanks, Gilles.

In the case of the application I am working on, all ranks call MPI with the same signature / types of variables.

I do not think there is a code error anywhere. I think this is "just" a configuration error on my part.

Regarding the idea of changing just one item at a time: that would be the next step, but first I would like to check my suspicion that the presence of both "/opt/openmpi_4.0.3" and "/appl-third-parties/openmpi-4.1.2" at run time could be an issue:

- It is an issue in situation 2, when I explicitly point the runtime MPI to 4.1.2 (also used in compilation).
- It is not an issue in situation 3, when I explicitly point the runtime MPI to 4.0.3 compiled with INTEL (even though I compiled the application and OpenMPI 4.1.2 with GNU, and I link the application with OpenMPI 4.1.2).

Best,

Ernesto.

From: Gilles Gouaillardet
Sent: Monday, March 14, 2022 1:37 AM
To: Open MPI Users
Cc: Ernesto Prudencio
Subject: Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning value 15

Ernesto,

the coll/tuned module (which handles collective subroutines by default) has a known issue when matching but non-identical signatures are used: for example, one rank uses one vector of n bytes, and another rank uses n bytes. Is there a chance your application might use this pattern?

You can try disabling this component with

mpirun --mca coll ^tuned ...

I noted that between the successful a) case and the unsuccessful b) case, you changed 3 parameters:

- compiler vendor
- Open MPI version
- PETSc version

so at this stage, it is not obvious which one should be blamed for the failure.

In order to get a better picture, I would first try

- Intel compilers
- Open MPI 4.1.2
- PETSc 3.10.4

=> a failure would suggest a regression in Open MPI

And then

- Intel compilers
- Open MPI 4.0.3
- PETSc 3.16.5

=> a failure would suggest either a regression in PETSc, or PETSc doing something different but legitimate that evidences a bug in Open MPI.

If you have time, you can also try

- Intel compilers
- MPICH (or a derivative such as Intel MPI)
- PETSc 3.16.5

=> a success would strongly point to Open MPI

Cheers,

Gilles

On Mon, Mar 14, 2022 at 2:56 PM Ernesto Prudencio via users <users@lists.open-mpi.org> wrote:

Forgot to mention that in all 3 situations, mpirun is called as follows (35 nodes, 4 MPI ranks per node):

mpirun -x LD_LIBRARY_PATH=:::… -hostfile /tmp/hostfile.txt -np 140 -npernode 4 --mca btl_tcp_if_include eth0

So I have a question 3) Should I add some extra option to the mpirun command line in order to make situation 2 successful?

Thanks,

Ernesto.

Schlumberger-Private
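Putting Gilles's suggestion together with Ernesto's launch line, the assembled command would look like the sketch below. This is a sketch only: the library-path value and the ./app binary name are placeholders for values the thread elides, and the line is printed rather than executed so it can be inspected first.

```shell
# Sketch: Ernesto's mpirun line with Gilles's "--mca coll ^tuned" added.
# LIB_PATHS and ./app are placeholders (the thread elides the real values).
LIB_PATHS="..."                      # real colon-separated library path here
CMD="mpirun -x LD_LIBRARY_PATH=$LIB_PATHS -hostfile /tmp/hostfile.txt \
 -np 140 -npernode 4 --mca coll ^tuned --mca btl_tcp_if_include eth0 ./app"
echo "$CMD"                          # print, don't run, for inspection
```

The ^ prefix tells the MCA framework to exclude the named component, so every other coll component stays available.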
From: users On Behalf Of Ernesto Prudencio via users
Sent: Monday, March 14, 2022 12:39 AM
To: Open MPI Users
Cc: Ernesto Prudencio
Subject: Re: [OMPI users] [Ext] Re: Call to MPI_Allreduce() returning value 15

Thank you for the quick answer, George. I wanted to investigate the problem further before replying.

Below I show 3 situations of my C++ (and Fortran) application, which runs on top of PETSc, OpenMPI, and MKL. All 3 situations use MKL 2019.0.5 compiled with INTEL.

At the end, I have 2 questions.

Note: all codes are compiled in a certain set of nodes, and the execution happens at _another_ set of nodes.

+- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Situation 1) It has been successful for months now:

a) Use INTEL compilers for OpenMPI 4.0.3, PETSc 3.10.4, and application.
The configuration options for OpenMPI are:

'--with-flux-pmi=no' '--enable-orterun-prefix-by-default' '--prefix=/mnt/disks/intel-2018-3-222-blade-runtime-env-2018-1-07-08-2018-132838/openmpi_4.0.3_intel2019.5_gcc7.3.1' 'FC=ifort' 'CC=gcc'

b) At run time, each MPI rank prints this info:

PATH = /opt/openmpi_4.0.3/bin:/opt/openmpi_4.0.3/bin:/opt/openmpi_4.0.3/bin:/opt/rh/devtoolset-7/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

LD_LIBRARY_PATH = /opt/openmpi_4.0.3/lib::/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7:/opt/petsc/lib:/opt/2019.5/compilers_and_libraries/linux/mkl/lib/intel64:/opt/openmpi_4.0.3/lib:/lib64:/lib:/usr/lib64:/usr/lib

MPI version (compile time) = 4.0.3
MPI_Get_library_version() = Open MPI v4.0.3, package: Open MPI root@ Distribution, ident: 4.0.3, repo rev: v4.0.3, Mar 03, 2020
PETSc version (compile time) = 3.10.4

c) A test of 20 minutes with 14 nodes, 4 MPI ranks per node, runs ok.

d) A test of 2 hours with 35 nodes, 4 MPI ranks per node, runs ok.

+- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
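Given that both MPI prefixes exist on the image, Gilles's "mpirun ldd" tip is the quickest way to confirm which libmpi each rank actually loads. A minimal sketch follows; the ldd line inside printf is simulated sample output (the real check runs ldd against the actual binary under mpirun), and ./app is a placeholder name.

```shell
# Real check, run on the cluster so every rank reports (placeholder ./app):
#   mpirun -hostfile /tmp/hostfile.txt -np 140 -npernode 4 ldd ./app | grep libmpi
# Simulated ldd output line, to show what to extract from it:
RESOLVED=$(printf 'libmpi.so.40 => /opt/openmpi_4.0.3/lib/libmpi.so.40 (0x7f..)\n' |
  grep -o '/[^ ]*libmpi\.so[^ ]*')
echo "$RESOLVED"    # -> /opt/openmpi_4.0.3/lib/libmpi.so.40
```

If the printed prefix does not match the Open MPI version the binary was built against, that mismatch, rather than the application code, would be the first suspect.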
+- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Situation 2) This situation is the one failing during execution.

a) Use GNU compilers for OpenMPI 4.1.2, PETSc 3.16.5, and application. The configuration options for OpenMPI are:

'--with-flux-pmi=no' '--prefix=/appl-third-parties/openmpi-4.1.2' '--enable-orterun-prefix-by-default'

b) At run time, each MPI rank prints this info:

PATH = /appl-third-parties/openmpi-4.1.2/bin:/appl-third-parties/openmpi-4.1.2/bin:/appl-third-parties/openmpi-4.1.2/bin:/opt/rh/devtoolset-7/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

LD_LIBRARY_PATH = /appl-third-parties/openmpi-4.1.2/lib::/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7:/appl-third-parties/petsc-3.16.5/lib:/opt/2019.5/compilers_and_libraries/linux/mkl/lib/intel64:/appl-third-parties/openmpi-4.1.2/lib:/lib64:/lib:/usr/lib64:/usr/lib

MPI version (compile time) = 4.1.2
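Regarding the suspicion that the two installs coexist at run time: the entries above can be audited directly, since the dynamic linker searches LD_LIBRARY_PATH left to right. The sketch below splits the Situation 2 value (copied verbatim from the listing above) one entry per line; note that the :: produces an empty entry, which the loader treats as the current working directory per documented ld.so behavior.

```shell
# LD_LIBRARY_PATH exactly as printed by the ranks in Situation 2:
S2='/appl-third-parties/openmpi-4.1.2/lib::/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib/gcc/x86_64-redhat-linux/7:/appl-third-parties/petsc-3.16.5/lib:/opt/2019.5/compilers_and_libraries/linux/mkl/lib/intel64:/appl-third-parties/openmpi-4.1.2/lib:/lib64:/lib:/usr/lib64:/usr/lib'
# One entry per line, numbered; keep only the MPI-related entries:
echo "$S2" | tr ':' '\n' | grep -n 'openmpi'
# -> 1:/appl-third-parties/openmpi-4.1.2/lib
# -> 7:/appl-third-parties/openmpi-4.1.2/lib
```

Only the 4.1.2 prefix shows up here, so if 4.0.3 components were still being picked up in situation 2, it would have to happen through another channel (rpath, ld.so.conf, or the mpirun/orted resolved via PATH), which may be worth checking next.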