Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
FYI, just noticed this post from the HDF Group forum: https://forum.hdfgroup.org/t/hdf5-and-openmpi/5437

/Peter K
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
> On Feb 20, 2019, at 7:14 PM, Gilles Gouaillardet wrote:
>
> Ryan,
>
> That being said, the "Alarm clock" message looks a bit suspicious.
>
> Does it always occur at 20+ minutes elapsed ?
>
> Is there some mechanism that automatically kills a job if it does not write anything to stdout for some time ?
>
> A quick way to rule that out is to
>
> srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
>
> and see if that completes or gets killed with the same error message.

FWIW, the “sleep” completes just fine:

[novosirj@amarel-test2 testpar]$ sacct -j 84173276 -M perceval -o jobid,jobname,start,end,node,state
       JobID    JobName               Start                 End    NodeList      State
------------ ---------- ------------------- ------------------- ----------- ----------
84173276          sleep 2019-02-21T14:46:03 2019-02-21T15:16:03     node077  COMPLETED
84173276.ex+     extern 2019-02-21T14:46:03 2019-02-21T15:16:03     node077  COMPLETED
84173276.0        sleep 2019-02-21T14:46:03 2019-02-21T15:16:03     node077  COMPLETED

--
 || \\UTGERS,      |---------------------------*O*---------------------------
 ||_// the State   |         Ryan Novosielski - novos...@rutgers.edu
 || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ   | Office of Advanced Research Computing - MSB C630, Newark
      `'
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Related to this or not, I also get a hang on MVAPICH2 2.3 compiled with GCC 8.2, but on t_filters_parallel, not t_mpi. With that combo, though, I get a segfault, or at least a message about one. It’s only “Alarm clock” on the GCC 4.8 with OpenMPI 3.1.3 combo. It also happens at the ~20 minute mark, FWIW.

Testing  t_filters_parallel
t_filters_parallel  Test Log
srun: job 84117363 queued and waiting for resources
srun: job 84117363 has been allocated resources
[slepner063.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Segmentation fault (signal 11)
srun: error: slepner063: task 0: Segmentation fault
srun: error: slepner063: tasks 1-3: Alarm clock
0.01user 0.01system 20:01.44elapsed 0%CPU (0avgtext+0avgdata 5144maxresident)k
0inputs+0outputs (0major+1524minor)pagefaults 0swaps
make[4]: *** [t_filters_parallel.chkexe_] Error 1
make[4]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make[2]: *** [test] Error 2
make[2]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make[1]: *** [check-am] Error 2
make[1]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make: *** [check-recursive] Error 1

> On Feb 21, 2019, at 3:03 PM, Gabriel, Edgar wrote:
>
> Yes, I was talking about the same thing, although for me it was not t_mpi, but t_shapesame that was hanging. It might be an indication of the same issue however.
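As a generic way to get more than the one-line signal report out of a segfault like this (not something suggested in the thread; the path is a placeholder and core-file handling depends on the cluster's Slurm and ulimit configuration):

    ulimit -c unlimited        # allow core files, if the batch environment propagates the limit
    srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 ./t_filters_parallel
    gdb ./t_filters_parallel core.12345    # "core.12345" stands in for whatever core file the crashed rank leaves
    # then typing "bt" at the gdb prompt prints the backtrace of the crashing rank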
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Yes, I was talking about the same thing, although for me it was not t_mpi, but t_shapesame that was hanging. It might be an indication of the same issue however.

> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan Novosielski
> Sent: Thursday, February 21, 2019 1:59 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
>
> Just to be clear, the hanging test I have is t_mpi from HDF5 1.10.4. The OpenMPI 3.1.3 make check passes just fine on all of our builds. But I don’t believe it ever launches any jobs or anything like that.
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
> On Feb 21, 2019, at 2:52 PM, Gabriel, Edgar wrote:
>
>> -----Original Message-----
>>> Does it always occur at 20+ minutes elapsed ?
>>
>> Aha! Yes, you are right: every time it fails, it’s at the 20 minute and a couple of seconds mark. For comparison, every time it runs, it runs for 2-3 seconds total. So it seems like what might actually be happening here is a hang, and not a failure of the test per se.
>>
>
> I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8 (although this was OpenSuSE, not Redhat), and it looked to me like one of the tests was hanging, but I didn't have time to investigate it further.

Just to be clear, the hanging test I have is t_mpi from HDF5 1.10.4. The OpenMPI 3.1.3 make check passes just fine on all of our builds. But I don’t believe it ever launches any jobs or anything like that.

--
 || \\UTGERS,      |---------------------------*O*---------------------------
 ||_// the State   |         Ryan Novosielski - novos...@rutgers.edu
 || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ   | Office of Advanced Research Computing - MSB C630, Newark
      `'
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Ryan,

As Edgar explained, that could be a compiler issue (fwiw, I am unable to reproduce the bug). You can build Open MPI again and pass --disable-builtin-atomics to the configure command line.

That being said, the "Alarm clock" message looks a bit suspicious.

Does it always occur at 20+ minutes elapsed ?

Is there some mechanism that automatically kills a job if it does not write anything to stdout for some time ?

A quick way to rule that out is to

srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800

and see if that completes or gets killed with the same error message.

You can also use mpirun instead of srun, and even run mpirun outside of SLURM (if your cluster policy allows it, you can for example use mpirun and run on the frontend node).

Cheers,

Gilles

On 2/21/2019 3:01 AM, Ryan Novosielski wrote:
> Does it make any sense that it seems to work fine when OpenMPI and HDF5 are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 build, I did try an XFS filesystem and it didn’t help. GPFS works fine for either of the 7.4 and 8.2 builds.
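For reference, a rebuild with that flag might look like the following. This is only a sketch based on the configure line Ryan posts later in the thread; the prefix and -j value are site-specific:

    ../openmpi-3.1.3/configure --prefix=/opt/sw/packages/gcc-4_8/openmpi/3.1.3 \
        --with-pmi --disable-builtin-atomics && \
    make -j32 && make install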
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
This is what I did for my build — not much going on there:

../openmpi-3.1.3/configure --prefix=/opt/sw/packages/gcc-4_8/openmpi/3.1.3 --with-pmi && \
make -j32

We have a mixture of types of Infiniband, using the RHEL-supplied Infiniband packages.

--
 || \\UTGERS,      |---------------------------*O*---------------------------
 ||_// the State   |         Ryan Novosielski - novos...@rutgers.edu
 || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ   | Office of Advanced Research Computing - MSB C630, Newark
      `'

> On Feb 20, 2019, at 1:46 PM, Gabriel, Edgar wrote:
>
> Well, the way you describe it, it sounds to me like maybe an atomic issue with this compiler version. What was your configure line of Open MPI, and what network interconnect are you using?
>
> An easy way to test this theory would be to force OpenMPI to use the tcp interfaces (everything will be slow however). You can do that by creating in your home directory a directory called .openmpi, and add there a file called mca-params.conf
>
> The file should look something like this:
>
> btl = tcp,self
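To see which byte transfer layer (btl) components that particular install actually provides, a generic check (not something requested in the thread) is to ask ompi_info:

    /opt/sw/packages/gcc-4_8/openmpi/3.1.3/bin/ompi_info | grep "MCA btl"
    # prints one line per available component, e.g. self, vader, tcp, openib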
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Well, the way you describe it, it sounds to me like maybe an atomic issue with this compiler version. What was your configure line of Open MPI, and what network interconnect are you using?

An easy way to test this theory would be to force OpenMPI to use the tcp interfaces (everything will be slow however). You can do that by creating in your home directory a directory called .openmpi, and add there a file called mca-params.conf

The file should look something like this:

btl = tcp,self

Thanks
Edgar

> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan Novosielski
> Sent: Wednesday, February 20, 2019 12:02 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
>
> Does it make any sense that it seems to work fine when OpenMPI and HDF5 are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 build, I did try an XFS filesystem and it didn’t help. GPFS works fine for either of the 7.4 and 8.2 builds.
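A minimal sketch of setting that up from a shell, assuming the per-user file location Edgar describes:

    mkdir -p ~/.openmpi
    cat > ~/.openmpi/mca-params.conf <<'EOF'
    btl = tcp,self
    EOF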
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Does it make any sense that it seems to work fine when OpenMPI and HDF5 are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 build, I did try an XFS filesystem and it didn’t help. GPFS works fine for either of the 7.4 and 8.2 builds.

Just as a reminder, since it was reasonably far back in the thread, what I’m doing is running the “make check” tests in HDF5 1.10.4, in part because users use it, but also because it seems to have a good test suite and I can therefore verify the compiler and MPI stack installs. I get very little information, apart from it not working and getting that “Alarm clock” message.

I originally suspected I’d somehow built some component of this with a host-specific optimization that wasn’t working on some compute nodes. But I controlled for that and it didn’t seem to make any difference.

--
 || \\UTGERS,      |---------------------------*O*---------------------------
 ||_// the State   |         Ryan Novosielski - novos...@rutgers.edu
 || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ   | Office of Advanced Research Computing - MSB C630, Newark
      `'

> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski wrote:
>
> It didn’t work any better with XFS, as it happens. Must be something else. I’m going to test some more and see if I can narrow it down any, as it seems to me that it did work with a different compiler.
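A quick, generic way to double-check which compiler a given Open MPI install was built with, and which compiler its wrapper invokes (not something asked for in the thread):

    ompi_info | grep -i "compiler"   # reports the C compiler name and version Open MPI was configured with
    mpicc --showme                   # prints the underlying compiler command line the wrapper runs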
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Edgar,

t_pflush1 does not call MPI_Finalize(), that is why there is an error message regardless of whether ompio or romio is used.

I naively tried to call MPI_Finalize(), but it causes the program to hang.

Cheers,

Gilles

On 2/19/2019 2:23 AM, Gabriel, Edgar wrote:
> The one test that officially fails (t_pflush1) actually reports that it passed, but then throws a message that indicates that MPI_Abort has been called, for both ompio and romio. I will try to investigate this test to see what is going on.
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
It didn’t work any better with XFS, as it happens. Must be something else. I’m going to test some more and see if I can narrow it down any, as it seems to me that it did work with a different compiler.

--
 || \\UTGERS,      |---------------------------*O*---------------------------
 ||_// the State   |         Ryan Novosielski - novos...@rutgers.edu
 || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
 ||  \\    of NJ   | Office of Advanced Research Computing - MSB C630, Newark
      `'

> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar wrote:
>
> While I was working on something else, I let the tests run with Open MPI master (which is for parallel I/O equivalent to the upcoming v4.0.1 release), and here is what I found for the HDF5 1.10.4 tests on my local desktop:
>
> In the testpar directory, there is in fact one test that fails for both ompio and romio321 in exactly the same manner.
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
While I was working on something else, I let the tests run with Open MPI master (which is for parallel I/O equivalent to the upcoming v4.0.1 release), and here is what I found for the HDF5 1.10.4 tests on my local desktop:

In the testpar directory, there is in fact one test that fails for both ompio and romio321 in exactly the same manner. I used 6 processes as you did (although I used mpirun directly instead of srun...). Of the 13 tests in the testpar directory, 12 pass correctly (t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).

The one test that officially fails (t_pflush1) actually reports that it passed, but then throws a message that indicates that MPI_Abort has been called, for both ompio and romio. I will try to investigate this test to see what is going on.

That being said, your report shows an issue in t_mpi, which passes without problems for me. This is however not GPFS, this was an XFS local file system. Running the tests on GPFS is on my todo list as well.

Thanks
Edgar

> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gabriel, Edgar
> Sent: Sunday, February 17, 2019 10:34 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
>
> I will also run our testsuite and the HDF5 testsuite on GPFS, I have access to a GPFS file system since recently, and will report back on that, but it will take a few days.
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
I will also run our testsuite and the HDF5 testsuite on GPFS, I have access to a GPFS file system since recently, and will report back on that, but it will take a few days.

Thanks
Edgar

> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan Novosielski
> Sent: Sunday, February 17, 2019 2:37 AM
> To: users@lists.open-mpi.org
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
>
> This is on GPFS. I'll try it on XFS to see if it makes any difference.
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
This is on GPFS. I'll try it on XFS to see if it makes any difference.

On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
> Ryan,
>
> What filesystem are you running on ?
>
> Open MPI defaults to the ompio component, except on Lustre filesystem where ROMIO is used. (if the issue is related to ROMIO, that can explain why you did not see any difference, in that case, you might want to try an other filesystem (local filesystem or NFS for example))
>
> Cheers,
>
> Gilles
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Ryan,

What filesystem are you running on ?

Open MPI defaults to the ompio component, except on Lustre filesystems where ROMIO is used. (If the issue is related to ROMIO, that can explain why you did not see any difference; in that case, you might want to try another filesystem (local filesystem or NFS for example).)

Cheers,

Gilles

On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski wrote:
>
> I verified that it makes it through to a bash prompt, but I’m a little less confident that something make test does doesn’t clear it. Any recommendation for a way to verify?
>
> In any case, no change, unfortunately.
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
I verified that it makes it through to a bash prompt, but I’m a little less confident that something make test does doesn’t clear it. Any recommendation for a way to verify? In any case, no change, unfortunately. Sent from my iPhone > On Feb 16, 2019, at 08:13, Gabriel, Edgar wrote: > > What file system are you running on? > > I will look into this, but it might be later next week. I just wanted to > emphasize that we are regularly running the parallel hdf5 tests with ompio, > and I am not aware of any outstanding items that do not work (and are > supposed to work). That being said, I run the tests manually, and not the > 'make test' commands. Will have to check which tests are being run by that. > > Edgar > >> -Original Message- >> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles >> Gouaillardet >> Sent: Saturday, February 16, 2019 1:49 AM >> To: Open MPI Users >> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI >> 3.1.3 >> >> Ryan, >> >> Can you >> >> export OMPI_MCA_io=^ompio >> >> and try again after you made sure this environment variable is passed by srun >> to the MPI tasks ? >> >> We have identified and fixed several issues specific to the (default) ompio >> component, so that could be a valid workaround until the next release. >> >> Cheers, >> >> Gilles >> >> Ryan Novosielski wrote: >>> Hi there, >>> >>> Honestly don’t know which piece of this puzzle to look at or how to get more >> information for troubleshooting. I successfully built HDF5 1.10.4 with RHEL >> system GCC 4.8.5 and OpenMPI 3.1.3. Running the “make check” in HDF5 is >> failing at the below point; I am using a value of RUNPARALLEL='srun -- >> mpi=pmi2 -p main -t 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise >> properly configured. >>> >>> Thanks for any help you can provide. >>> >>> make[4]: Entering directory >>> `/scratch/novosirj/install-files/hdf5-1.10.4-build- >> gcc-4.8-openmpi-3.1.3/testpar' >>> >>> Testing t_mpi >>> >>> t_mpi Test Log >>> >>> srun: job 84126610 queued and waiting for resources >>> srun: job 84126610 has been allocated resources >>> srun: error: slepner023: tasks 0-5: Alarm clock 0.01user 0.00system >>> 20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k >>> 0inputs+0outputs (0major+1529minor)pagefaults 0swaps >>> make[4]: *** [t_mpi.chkexe_] Error 1 >>> make[4]: Leaving directory >>> `/scratch/novosirj/install-files/hdf5-1.10.4-build- >> gcc-4.8-openmpi-3.1.3/testpar' >>> make[3]: *** [build-check-p] Error 1 >>> make[3]: Leaving directory >>> `/scratch/novosirj/install-files/hdf5-1.10.4-build- >> gcc-4.8-openmpi-3.1.3/testpar' >>> make[2]: *** [test] Error 2 >>> make[2]: Leaving directory >>> `/scratch/novosirj/install-files/hdf5-1.10.4-build- >> gcc-4.8-openmpi-3.1.3/testpar' >>> make[1]: *** [check-am] Error 2 >>> make[1]: Leaving directory >>> `/scratch/novosirj/install-files/hdf5-1.10.4-build- >> gcc-4.8-openmpi-3.1.3/testpar' >>> make: *** [check-recursive] Error 1 >>> >>> -- >>> >>> || \\UTGERS, >>> |---*O*--- >>> ||_// the State | Ryan Novosielski - novos...@rutgers.edu >>> || \\ University | Sr. 
Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus >>> || \\of NJ | Office of Advanced Research Computing - MSB C630, >>> Newark >>> `' >> ___ >> users mailing list >> users@lists.open-mpi.org >> https://lists.open-mpi.org/mailman/listinfo/users > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
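Regarding a way to verify: a minimal check, assuming the same partition and srun options used elsewhere in this thread (the short time limit is only illustrative), is to launch a trivial job that prints the environment each task actually sees:

   export OMPI_MCA_io=^ompio
   # Each of the six tasks should print the override; if nothing comes back,
   # srun is not propagating the variable to the MPI tasks.
   srun --mpi=pmi2 -p main -t 0:05:00 -n6 -N1 env | grep OMPI_MCA_io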
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
What file system are you running on? I will look into this, but it might be later next week. I just wanted to emphasize that we are regularly running the parallel hdf5 tests with ompio, and I am not aware of any outstanding items that do not work (and are supposed to work). That being said, I run the tests manually, and not the 'make test' commands. Will have to check which tests are being run by that. Edgar > -Original Message- > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles > Gouaillardet > Sent: Saturday, February 16, 2019 1:49 AM > To: Open MPI Users > Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI > 3.1.3 > > Ryan, > > Can you > > export OMPI_MCA_io=^ompio > > and try again after you made sure this environment variable is passed by srun > to the MPI tasks ? > > We have identified and fixed several issues specific to the (default) ompio > component, so that could be a valid workaround until the next release. > > Cheers, > > Gilles > > Ryan Novosielski wrote: > >Hi there, > > > >Honestly don’t know which piece of this puzzle to look at or how to get more > information for troubleshooting. I successfully built HDF5 1.10.4 with RHEL > system GCC 4.8.5 and OpenMPI 3.1.3. Running the “make check” in HDF5 is > failing at the below point; I am using a value of RUNPARALLEL='srun -- > mpi=pmi2 -p main -t 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise > properly configured. > > > >Thanks for any help you can provide. > > > >make[4]: Entering directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > > > >Testing t_mpi > > > >t_mpi Test Log > > > >srun: job 84126610 queued and waiting for resources > >srun: job 84126610 has been allocated resources > >srun: error: slepner023: tasks 0-5: Alarm clock 0.01user 0.00system > >20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k > >0inputs+0outputs (0major+1529minor)pagefaults 0swaps > >make[4]: *** [t_mpi.chkexe_] Error 1 > >make[4]: Leaving directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > >make[3]: *** [build-check-p] Error 1 > >make[3]: Leaving directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > >make[2]: *** [test] Error 2 > >make[2]: Leaving directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > >make[1]: *** [check-am] Error 2 > >make[1]: Leaving directory > >`/scratch/novosirj/install-files/hdf5-1.10.4-build- > gcc-4.8-openmpi-3.1.3/testpar' > >make: *** [check-recursive] Error 1 > > > >-- > > > >|| \\UTGERS, > >|---*O*--- > >||_// the State | Ryan Novosielski - novos...@rutgers.edu > >|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > >|| \\of NJ | Office of Advanced Research Computing - MSB C630, > >Newark > > `' > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
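To answer the file system question, one quick check, assuming the tests run out of the /scratch tree shown in the logs, is to ask what that directory is mounted on:

   # Show the mount point and file system type (ext4, xfs, nfs, gpfs, lustre, ...)
   # backing the directory the parallel tests write into.
   df -hT /scratch/novosirj/install-files
   stat -f /scratch/novosirj/install-files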
Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Ryan, Can you export OMPI_MCA_io=^ompio and try again after you made sure this environment variable is passed by srun to the MPI tasks ? We have identified and fixed several issues specific to the (default) ompio component, so that could be a valid workaround until the next release. Cheers, Gilles Ryan Novosielski wrote: >Hi there, > >Honestly don’t know which piece of this puzzle to look at or how to get more >information for troubleshooting. I successfully built HDF5 1.10.4 with RHEL >system GCC 4.8.5 and OpenMPI 3.1.3. Running the “make check” in HDF5 is >failing at the below point; I am using a value of RUNPARALLEL='srun --mpi=pmi2 >-p main -t 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise properly >configured. > >Thanks for any help you can provide. > >make[4]: Entering directory >`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' > >Testing t_mpi > >t_mpi Test Log > >srun: job 84126610 queued and waiting for resources >srun: job 84126610 has been allocated resources >srun: error: slepner023: tasks 0-5: Alarm clock >0.01user 0.00system 20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k >0inputs+0outputs (0major+1529minor)pagefaults 0swaps >make[4]: *** [t_mpi.chkexe_] Error 1 >make[4]: Leaving directory >`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' >make[3]: *** [build-check-p] Error 1 >make[3]: Leaving directory >`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' >make[2]: *** [test] Error 2 >make[2]: Leaving directory >`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' >make[1]: *** [check-am] Error 2 >make[1]: Leaving directory >`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' >make: *** [check-recursive] Error 1 > >-- > >|| \\UTGERS,|---*O*--- >||_// the State | Ryan Novosielski - novos...@rutgers.edu >|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus >|| \\of NJ | Office of Advanced Research Computing - MSB C630, >Newark > `' ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
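A sketch of that workaround, assuming t_mpi has already been built in the testpar directory; --export=ALL is Slurm's default behavior but makes the propagation explicit:

   # Exclude the ompio component so Open MPI falls back to ROMIO for MPI-IO.
   export OMPI_MCA_io=^ompio
   cd testpar
   # Re-run the hanging test directly with the same launcher options
   # used for the rest of the thread.
   srun --mpi=pmi2 --export=ALL -p main -t 1:00:00 -n6 -N1 ./t_mpi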
[OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
Hi there, Honestly don’t know which piece of this puzzle to look at or how to get more information for troubleshooting. I successfully built HDF5 1.10.4 with RHEL system GCC 4.8.5 and OpenMPI 3.1.3. Running the “make check” in HDF5 is failing at the below point; I am using a value of RUNPARALLEL='srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise properly configured. Thanks for any help you can provide. make[4]: Entering directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' Testing t_mpi t_mpi Test Log srun: job 84126610 queued and waiting for resources srun: job 84126610 has been allocated resources srun: error: slepner023: tasks 0-5: Alarm clock 0.01user 0.00system 20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k 0inputs+0outputs (0major+1529minor)pagefaults 0swaps make[4]: *** [t_mpi.chkexe_] Error 1 make[4]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' make[3]: *** [build-check-p] Error 1 make[3]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' make[2]: *** [test] Error 2 make[2]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' make[1]: *** [check-am] Error 2 make[1]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar' make: *** [check-recursive] Error 1 -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `' signature.asc Description: Message signed with OpenPGP ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
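For reference, a sketch of the build-and-test sequence described above; the compiler wrapper and -j value are assumptions, while RUNPARALLEL is the variable HDF5's configure picks up for launching the parallel tests:

   # Configure parallel HDF5 against the Open MPI compiler wrapper.
   export CC=mpicc
   export RUNPARALLEL='srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1'
   ./configure --enable-parallel
   make -j4
   # The parallel tests in testpar/ (t_mpi, t_shapesame, t_filters_parallel, ...)
   # are launched through RUNPARALLEL during "make check".
   make check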