> On Feb 20, 2019, at 7:14 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ryan,
>
> as Edgar explained, that could be a compiler issue (fwiw, I am unable to
> reproduce the bug)
Same thing: OpenMPI 3.1.3, GCC 4.8.5, and HDF5 1.10.4 "make check"? Just
making sure, because that makes it seem like there's something else going on
for me here. Just for comparison's sake:

[novosirj@amarel-test2 testpar]$ rpm -qa gcc
gcc-4.8.5-28.el7_5.1.x86_64

> You can build Open MPI again and pass --disable-builtin-atomics to the
> configure command line.

Thanks, I'll look into that (didn't know the implications).

> That being said, the "Alarm clock" message looks a bit suspicious.
>
> Does it always occur at 20+ minutes elapsed ?

Aha! Yes, you are right: every time it fails, it's at the 20-minute mark,
plus a couple of seconds. For comparison, every time it passes, it runs for
2-3 seconds total. So it seems like what might actually be happening here is
a hang, and not a failure of the test per se.

> Is there some mechanism that automatically kills a job if it does not
> write anything to stdout for some time ?
>
> A quick way to rule that out is to
>
> srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
>
> and see if that completes or gets killed with the same error message.

I was not aware of anything like that, but I'll look into it now (running
your suggestion). I guess we don't run across this sort of thing very often;
most stuff at least prints output when it starts.

> You can also use mpirun instead of srun, and even run mpirun outside of
> Slurm
>
> (if your cluster policy allows it, you can for example use mpirun and run
> on the frontend node)

I'm on the team that manages the cluster, so we can try various things.
Every piece of software we ever run, though, runs via srun; we don't provide
mpirun as a matter of course, except in some corner cases.

> On 2/21/2019 3:01 AM, Ryan Novosielski wrote:
>> Does it make any sense that it seems to work fine when OpenMPI and HDF5
>> are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with
>> RHEL-supplied GCC 4.8.5? That appears to be the scenario.
>> For the GCC 4.8.5 build, I did try an XFS filesystem and it didn't help.
>> GPFS works fine for either of the 7.4 and 8.2 builds.
>>
>> Just as a reminder, since it was reasonably far back in the thread, what
>> I'm doing is running the "make check" tests in HDF5 1.10.4, in part
>> because users use it, but also because it seems to have a good test suite
>> and I can therefore verify the compiler and MPI stack installs. I get
>> very little information, apart from it not working and getting that
>> "Alarm clock" message.
>>
>> I originally suspected I'd somehow built some component of this with a
>> host-specific optimization that wasn't working on some compute nodes.
>> But I controlled for that and it didn't seem to make any difference.
>>
>> --
>> ____
>> || \\UTGERS,     |---------------------------*O*---------------------------
>> ||_// the State  |  Ryan Novosielski - novos...@rutgers.edu
>> || \\ University |  Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\    of NJ  |  Office of Advanced Research Computing - MSB C630, Newark
>>      `'
>>
>>> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski <novos...@rutgers.edu> wrote:
>>>
>>> It didn't work any better with XFS, as it happens. Must be something
>>> else. I'm going to test some more and see if I can narrow it down any,
>>> as it seems to me that it did work with a different compiler.
>>>
>>>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar <egabr...@central.uh.edu> wrote:
>>>>
>>>> While I was working on something else, I let the tests run with Open
>>>> MPI master (which is, for parallel I/O, equivalent to the upcoming
>>>> v4.0.1 release), and here is what I found for the HDF5 1.10.4 tests on
>>>> my local desktop:
>>>>
>>>> In the testpar directory, there is in fact one test that fails for
>>>> both ompio and romio321 in exactly the same manner. I used 6 processes
>>>> as you did (although I used mpirun directly instead of srun...). Of
>>>> the 13 tests in the testpar directory, 12 pass correctly (t_bigio,
>>>> t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term,
>>>> t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).
>>>>
>>>> The one test that officially fails (t_pflush1) actually reports that
>>>> it passed, but then throws a message indicating that MPI_Abort has
>>>> been called, for both ompio and romio. I will try to investigate this
>>>> test to see what is going on.
>>>>
>>>> That being said, your report shows an issue in t_mpi, which passes
>>>> without problems for me. This was, however, not GPFS but a local XFS
>>>> file system. Running the tests on GPFS is on my todo list as well.
>>>>
>>>> Thanks
>>>> Edgar
>>>>
>>>>> -----Original Message-----
>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>>>>> Gabriel, Edgar
>>>>> Sent: Sunday, February 17, 2019 10:34 AM
>>>>> To: Open MPI Users <users@lists.open-mpi.org>
>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
>>>>>
>>>>> I will also run our testsuite and the HDF5 testsuite on GPFS; I
>>>>> recently got access to a GPFS file system and will report back on
>>>>> that, but it will take a few days.
>>>>> Thanks
>>>>> Edgar
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>>>>>> Ryan Novosielski
>>>>>> Sent: Sunday, February 17, 2019 2:37 AM
>>>>>> To: users@lists.open-mpi.org
>>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3
>>>>>>
>>>>>> This is on GPFS. I'll try it on XFS to see if it makes any difference.
>>>>>>
>>>>>> On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
>>>>>>> Ryan,
>>>>>>>
>>>>>>> What filesystem are you running on ?
>>>>>>>
>>>>>>> Open MPI defaults to the ompio component, except on the Lustre
>>>>>>> filesystem, where ROMIO is used. (If the issue is related to ROMIO,
>>>>>>> that could explain why you did not see any difference; in that
>>>>>>> case, you might want to try another filesystem, such as a local
>>>>>>> filesystem or NFS.)
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski
>>>>>>> <novos...@rutgers.edu> wrote:
>>>>>>>> I verified that it makes it through to a bash prompt, but I'm a
>>>>>>>> little less confident that something "make test" does doesn't
>>>>>>>> clear it. Any recommendation for a way to verify?
>>>>>>>>
>>>>>>>> In any case, no change, unfortunately.
>>>>>>>>
>>>>>>>> Sent from my iPhone
>>>>>>>>
>>>>>>>>> On Feb 16, 2019, at 08:13, Gabriel, Edgar <egabr...@central.uh.edu> wrote:
>>>>>>>>>
>>>>>>>>> What file system are you running on?
>>>>>>>>>
>>>>>>>>> I will look into this, but it might be later next week. I just
>>>>>>>>> wanted to emphasize that we are regularly running the parallel
>>>>>>>>> hdf5 tests with ompio, and I am not aware of any outstanding
>>>>>>>>> items that do not work (and are supposed to work). That being
>>>>>>>>> said, I run the tests manually, and not the 'make test' commands.
>>>>>>>>> Will have to check which tests are being run by that.
>>>>>>>>>
>>>>>>>>> Edgar
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf
>>>>>>>>>> Of Gilles Gouaillardet
>>>>>>>>>> Sent: Saturday, February 16, 2019 1:49 AM
>>>>>>>>>> To: Open MPI Users <users@lists.open-mpi.org>
>>>>>>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems
>>>>>>>>>> w/OpenMPI 3.1.3
>>>>>>>>>>
>>>>>>>>>> Ryan,
>>>>>>>>>>
>>>>>>>>>> Can you
>>>>>>>>>>
>>>>>>>>>> export OMPI_MCA_io=^ompio
>>>>>>>>>>
>>>>>>>>>> and try again after you made sure this environment variable is
>>>>>>>>>> passed by srun to the MPI tasks ?
>>>>>>>>>>
>>>>>>>>>> We have identified and fixed several issues specific to the
>>>>>>>>>> (default) ompio component, so that could be a valid workaround
>>>>>>>>>> until the next release.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> Gilles
>>>>>>>>>>
>>>>>>>>>> Ryan Novosielski <novos...@rutgers.edu> wrote:
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> Honestly, I don't know which piece of this puzzle to look at or
>>>>>>>>>>> how to get more information for troubleshooting. I successfully
>>>>>>>>>>> built HDF5 1.10.4 with the RHEL system GCC 4.8.5 and OpenMPI
>>>>>>>>>>> 3.1.3. Running the "make check" in HDF5 fails at the point
>>>>>>>>>>> below; I am using a value of
>>>>>>>>>>> RUNPARALLEL='srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1'
>>>>>>>>>>> and have a Slurm setup that's otherwise properly configured.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for any help you can provide.
>>>>>>>>>>> make[4]: Entering directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> ============================
>>>>>>>>>>>  Testing  t_mpi
>>>>>>>>>>> ============================
>>>>>>>>>>>  t_mpi  Test Log
>>>>>>>>>>> ============================
>>>>>>>>>>> srun: job 84126610 queued and waiting for resources
>>>>>>>>>>> srun: job 84126610 has been allocated resources
>>>>>>>>>>> srun: error: slepner023: tasks 0-5: Alarm clock
>>>>>>>>>>> 0.01user 0.00system 20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k
>>>>>>>>>>> 0inputs+0outputs (0major+1529minor)pagefaults 0swaps
>>>>>>>>>>> make[4]: *** [t_mpi.chkexe_] Error 1
>>>>>>>>>>> make[4]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> make[3]: *** [build-check-p] Error 1
>>>>>>>>>>> make[3]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> make[2]: *** [test] Error 2
>>>>>>>>>>> make[2]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> make[1]: *** [check-am] Error 2
>>>>>>>>>>> make[1]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> make: *** [check-recursive] Error 1
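Worth noting about the log above: "Alarm clock" is the standard report for a process terminated by an unhandled SIGALRM (signal 14), i.e. an expired alarm(2) watchdog timer, which fits a test that always dies at the 20-minute mark rather than failing outright. A minimal sketch of the same termination, independent of HDF5, Open MPI, and Slurm:

```shell
# Deliver SIGALRM to a shell, exactly what an expired alarm(2) watchdog
# would do; the default disposition for SIGALRM terminates the process.
sh -c 'kill -s ALRM $$; sleep 60'   # the sleep is never reached
status=$?
echo "exit status: $status"         # 142 = 128 + SIGALRM (signal 14) on Linux
```

The 128 + signal-number exit status is the usual shell encoding for death by signal; srun reports the same condition per task as "Alarm clock".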
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
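For anyone retracing this thread, the two workarounds suggested above can be sketched together as shell commands. The partition, time limit, and task counts are taken from the srun invocations quoted in the thread; the srun calls are guarded so the sketch is a no-op on a machine without Slurm:

```shell
# Workaround 1: exclude the (default) ompio component so ROMIO is used.
export OMPI_MCA_io=^ompio

if command -v srun >/dev/null 2>&1; then
    # Re-run the failing test with the variable exported to the tasks
    # (srun passes the caller's environment by default; --export=ALL
    # just makes that explicit).
    srun --export=ALL --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 ./t_mpi

    # Workaround 2 / diagnosis: a do-nothing canary job. If this plain
    # sleep is killed with the same "Alarm clock" message, the watchdog
    # is external to HDF5 and MPI.
    srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
fi

echo "io MCA setting: $OMPI_MCA_io"
```

The `^` prefix in `OMPI_MCA_io=^ompio` means "every io component except ompio", which is how Open MPI MCA component exclusion is spelled.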