> On Feb 20, 2019, at 7:14 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Ryan,
> 
> As Edgar explained, that could be a compiler issue (FWIW, I am unable to
> reproduce the bug).

The same setup: Open MPI 3.1.3, GCC 4.8.5, and HDF5 1.10.4 "make check"? Just 
making sure, because that makes it seem like there's something else going on 
for me here. Just for comparison's sake:

[novosirj@amarel-test2 testpar]$ rpm -qa gcc
gcc-4.8.5-28.el7_5.1.x86_64

> You can build Open MPI again and pass --disable-builtin-atomics to the 
> configure command line.

Thanks, I'll look into that (didn't know the implications).
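
If I understand it right, the rebuild would look roughly like the following (a 
sketch only; the install prefix and the --with-slurm/--with-pmi options are 
placeholders standing in for whatever our original configure line used, with 
only --disable-builtin-atomics added per your suggestion):

  # rebuild Open MPI 3.1.3 without the built-in atomics
  ./configure --prefix=/opt/sw/openmpi/3.1.3-gcc-4.8.5 \
              --with-slurm --with-pmi \
              --disable-builtin-atomics
  make -j8 && make install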

> That being said, the "Alarm clock" message looks a bit suspicious.
> 
> Does it always occur at 20+ minutes elapsed?

Aha! Yes, you are right: every time it fails, it is at the 20-minute (plus a 
couple of seconds) mark. For comparison, every time it passes, it runs for 2-3 
seconds total. So it seems like what might actually be happening here is a 
hang, rather than a failure of the test per se.
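
In case it helps narrow down where the hang is, I may also try grabbing stack 
traces from the stuck ranks during that 20-minute window, along these lines 
(the process name is the test binary; the PID is obviously a placeholder):

  # on the compute node where the t_mpi ranks appear hung
  pgrep -f t_mpi
  gdb -p <pid> -batch -ex 'thread apply all bt'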

> Is there some mechanism that automatically kills a job if it does not write 
> anything to stdout for some time?
> 
> A quick way to rule that out is to
> 
> srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
> 
> and see if that completes or gets killed with the same error message.

I was not aware of anything like that, but I'll look into it now (running your 
suggestion). I guess we don't run across this sort of thing very often; most 
software at least prints some output when it starts.
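
For the record, here is roughly what I am running for that check (the srun 
options are the same ones from my failing RUNPARALLEL value; the time wrapper 
and exit-status echo are just my additions to see whether anything kills the 
sleep):

  # "Alarm clock" is what gets printed when a process dies from SIGALRM,
  # so the question is whether something sends that signal after ~20 minutes
  time srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
  echo "srun exit status: $?"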

> You can also use mpirun instead of srun, and even run mpirun outside of 
> Slurm.
> 
> (If your cluster policy allows it, you can, for example, use mpirun and run 
> on the frontend node.)

I’m on the team that manages the cluster, so we can try various things. Every 
piece of software we ever run, though, runs via srun — we don’t provide mpirun 
as a matter of course, except in some corner cases.
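
If we do try it outside of Slurm, I assume something like the following from 
the build tree would exercise the same failing test (a sketch; mpirun options 
kept minimal, and the path is just my build directory from earlier in the 
thread):

  cd /scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar
  mpirun -np 6 ./t_mpi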

> On 2/21/2019 3:01 AM, Ryan Novosielski wrote:
>> Does it make any sense that it seems to work fine when Open MPI and HDF5 are 
>> built with GCC 7.4 and GCC 8.2, but /not/ when they are built with the 
>> RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 
>> build, I did try an XFS filesystem and it didn't help; GPFS works fine for 
>> both the 7.4 and 8.2 builds.
>> 
>> Just as a reminder, since it was reasonably far back in the thread, what I’m 
>> doing is running the “make check” tests in HDF5 1.10.4, in part because 
>> users use it, but also because it seems to have a good test suite and I can 
>> therefore verify the compiler and MPI stack installs. I get very little 
>> information, apart from it not working and getting that “Alarm clock” 
>> message.
>> 
>> I originally suspected I’d somehow built some component of this with a 
>> host-specific optimization that wasn’t working on some compute nodes. But I 
>> controlled for that and it didn’t seem to make any difference.
>> 
>> 
>>> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski <novos...@rutgers.edu> wrote:
>>> 
>>> It didn’t work any better with XFS, as it happens. Must be something else. 
>>> I’m going to test some more and see if I can narrow it down any, as it 
>>> seems to me that it did work with a different compiler.
>>> 
>>> 
>>>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar <egabr...@central.uh.edu> 
>>>> wrote:
>>>> 
>>>> While I was working on something else, I let the tests run with Open MPI 
>>>> master (which, for parallel I/O, is equivalent to the upcoming v4.0.1 
>>>> release), and here is what I found for the HDF5 1.10.4 tests on my local 
>>>> desktop:
>>>> 
>>>> In the testpar directory, there is in fact one test that fails for both 
>>>> ompio and romio321 in exactly the same manner.
>>>> I used 6 processes as you did (although I used mpirun directly instead of 
>>>> srun...). Of the 13 tests in the testpar directory, 12 pass correctly 
>>>> (t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel, 
>>>> t_init_term, t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, 
>>>> t_shapesame).
>>>> 
>>>> The one test that officially fails (t_pflush1) actually reports that it 
>>>> passed, but then throws a message indicating that MPI_Abort has been 
>>>> called, for both ompio and romio. I will try to investigate this test to 
>>>> see what is going on.
>>>> 
>>>> That being said, your report shows an issue in t_mpi, which passes without 
>>>> problems for me. This was, however, not on GPFS but on a local XFS file 
>>>> system. Running the tests on GPFS is on my todo list as well.
>>>> 
>>>> Thanks
>>>> Edgar
>>>> 
>>>> 
>>>> 
>>>>> -----Original Message-----
>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>>>>> Gabriel, Edgar
>>>>> Sent: Sunday, February 17, 2019 10:34 AM
>>>>> To: Open MPI Users <users@lists.open-mpi.org>
>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>>>>> 3.1.3
>>>>> 
>>>>> I will also run our test suite and the HDF5 test suite on GPFS; I recently 
>>>>> got access to a GPFS file system and will report back on that, but it will 
>>>>> take a few days.
>>>>> 
>>>>> Thanks
>>>>> Edgar
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>>>>>> Ryan Novosielski
>>>>>> Sent: Sunday, February 17, 2019 2:37 AM
>>>>>> To: users@lists.open-mpi.org
>>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>>>>>> 3.1.3
>>>>>> 
>>>>>> This is on GPFS. I'll try it on XFS to see if it makes any difference.
>>>>>> 
>>>>>> On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
>>>>>>> Ryan,
>>>>>>> 
>>>>>>> What filesystem are you running on?
>>>>>>> 
>>>>>>> Open MPI defaults to the ompio component, except on Lustre filesystems, 
>>>>>>> where ROMIO is used. If the issue is related to ROMIO, that could 
>>>>>>> explain why you did not see any difference; in that case, you might 
>>>>>>> want to try another filesystem (a local filesystem or NFS, for example).
>>>>>>> 
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> Gilles
>>>>>>> 
>>>>>>> On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski
>>>>>>> <novos...@rutgers.edu> wrote:
>>>>>>>> I verified that the variable makes it through to a bash prompt, but I'm 
>>>>>>>> a little less confident that something "make test" does doesn't clear 
>>>>>>>> it. Any recommendation for a way to verify?
>>>>>>>> 
>>>>>>>> In any case, no change, unfortunately.
>>>>>>>> 
>>>>>>>>> On Feb 16, 2019, at 08:13, Gabriel, Edgar
>>>>>>>>> <egabr...@central.uh.edu>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> What file system are you running on?
>>>>>>>>> 
>>>>>>>>> I will look into this, but it might be later next week. I just wanted 
>>>>>>>>> to emphasize that we are regularly running the parallel HDF5 tests 
>>>>>>>>> with ompio, and I am not aware of any outstanding items that do not 
>>>>>>>>> work (and are supposed to work). That being said, I run the tests 
>>>>>>>>> manually, not the 'make test' commands. I will have to check which 
>>>>>>>>> tests are being run by that.
>>>>>>>>> 
>>>>>>>>> Edgar
>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of 
>>>>>>>>>> Gilles Gouaillardet
>>>>>>>>>> Sent: Saturday, February 16, 2019 1:49 AM
>>>>>>>>>> To: Open MPI Users <users@lists.open-mpi.org>
>>>>>>>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 
>>>>>>>>>> 3.1.3
>>>>>>>>>> 
>>>>>>>>>> Ryan,
>>>>>>>>>> 
>>>>>>>>>> Can you
>>>>>>>>>> 
>>>>>>>>>> export OMPI_MCA_io=^ompio
>>>>>>>>>> 
>>>>>>>>>> and try again, after making sure this environment variable is passed 
>>>>>>>>>> by srun to the MPI tasks?
>>>>>>>>>> 
>>>>>>>>>> We have identified and fixed several issues specific to the
>>>>>>>>>> (default) ompio component, so that could be a valid workaround
>>>>>>>>>> until the next release.
>>>>>>>>>> 
>>>>>>>>>> Cheers,
>>>>>>>>>> 
>>>>>>>>>> Gilles
>>>>>>>>>> 
>>>>>>>>>> Ryan Novosielski <novos...@rutgers.edu> wrote:
>>>>>>>>>>> Hi there,
>>>>>>>>>>> 
>>>>>>>>>>> I honestly don't know which piece of this puzzle to look at or how 
>>>>>>>>>>> to get more information for troubleshooting. I successfully built 
>>>>>>>>>>> HDF5 1.10.4 with the RHEL system GCC 4.8.5 and Open MPI 3.1.3. 
>>>>>>>>>>> Running "make check" in HDF5 fails at the point below; I am using 
>>>>>>>>>>> RUNPARALLEL='srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1' and have a 
>>>>>>>>>>> Slurm setup that is otherwise properly configured.
>>>>>>>>>>> Thanks for any help you can provide.
>>>>>>>>>>> 
>>>>>>>>>>> make[4]: Entering directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> ============================
>>>>>>>>>>> Testing  t_mpi
>>>>>>>>>>> ============================
>>>>>>>>>>>  t_mpi  Test Log
>>>>>>>>>>> ============================
>>>>>>>>>>> srun: job 84126610 queued and waiting for resources
>>>>>>>>>>> srun: job 84126610 has been allocated resources
>>>>>>>>>>> srun: error: slepner023: tasks 0-5: Alarm clock
>>>>>>>>>>> 0.01user 0.00system 20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k
>>>>>>>>>>> 0inputs+0outputs (0major+1529minor)pagefaults 0swaps
>>>>>>>>>>> make[4]: *** [t_mpi.chkexe_] Error 1
>>>>>>>>>>> make[4]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> make[3]: *** [build-check-p] Error 1
>>>>>>>>>>> make[3]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> make[2]: *** [test] Error 2
>>>>>>>>>>> make[2]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> make[1]: *** [check-am] Error 2
>>>>>>>>>>> make[1]: Leaving directory `/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
>>>>>>>>>>> make: *** [check-recursive] Error 1
>>>>>>>>>>> 
