Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-08-11 Thread Ryan Novosielski via users
Thanks. That /is/ one solution, and what I’ll do in the interim since this has 
to work in at least some fashion, but I would actually like to use UCX if 
OpenIB is going to be deprecated. How do I find out what’s actually wrong?
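
A minimal sketch of how one might probe that, assuming the stock ucx_info tool and
standard UCX environment variables are available on the node (the transports listed
will differ per system):

# list the transports/devices UCX itself can open on this node
ucx_info -d | grep -E 'Transport|Device'

# re-run the failing case with UCX's own logging turned up; UCX_LOG_LEVEL is a
# UCX setting, separate from OMPI_MCA_pml_ucx_verbose
export UCX_LOG_LEVEL=debug
srun -n 2 --mpi=pmi2 ./mpihello-gcc-8-openmpi-4.0.6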

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jul 29, 2021, at 11:35 AM, Ralph Castain via users 
>  wrote:
> 
> So it _is_ UCX that is the problem! Try using OMPI_MCA_pml=ob1 instead
> 
>> On Jul 29, 2021, at 8:33 AM, Ryan Novosielski  wrote:
>> 
>> Thanks, Ralph. This /does/ change things, but not very much. I was not under 
>> the impression that I needed to do that, since when I ran without having 
>> built against UCX, it warned me about the openib method being deprecated. By 
>> default, does OpenMPI not use either anymore, and I need to specifically 
>> call for UCX? Seems strange.
>> 
>> Anyhow, I’ve got some variables defined still, in addition to your 
>> suggestion, for verbosity:
>> 
>> [novosirj@amarel-test2 ~]$ env | grep ^OMPI
>> OMPI_MCA_pml=ucx
>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>> OMPI_MCA_pml_ucx_verbose=100
>> 
>> Here goes:
>> 
>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
>> ./mpihello-gcc-8-openmpi-4.0.6
>> srun: job 13995650 queued and waiting for resources
>> srun: job 13995650 has been allocated resources
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>> 
>> Local host:   gpu004
>> Local device: mlx4_0
>> --
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>> 
>> Local host:   gpu004
>> Local device: mlx4_0
>> --
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
>> memory hooks as external events
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
>> memory hooks as external events
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 
>> mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 
>> mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>> rc/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>> ud/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29823] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support 
>> level is none
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:29824] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ryan Novosielski via users
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:  gpu004
  Framework: pml
--
[gpu004.amarel.rutgers.edu:29824] PML ucx cannot be selected
slurmstepd: error: *** STEP 13995650.0 ON gpu004 CANCELLED AT 
2021-07-29T11:31:19 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: gpu004: tasks 0-1: Exited with exit code 1
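
A hedged sketch of the ob1 fallback mentioned at the top of this digest, plus a way to
confirm which PML components this build actually installed (component names depend on
how OpenMPI was configured):

# list the PML components ompi_info knows about in this install
ompi_info | grep "MCA pml"

# fall back to the ob1 PML instead of UCX for this run
export OMPI_MCA_pml=ob1
srun -n 2 --mpi=pmi2 ./mpihello-gcc-8-openmpi-4.0.6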

--
#BlackLivesMatter

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Jul 29, 2021, at 8:34 AM, Ralph Castain via users 
>  wrote:
> 
> Ryan - I suspect what Sergey was trying to say was that you need to ensure 
> OMPI doesn't try to use the OpenIB driver, or at least that it doesn't 
> attempt to initialize it. Try adding
> 
> OMPI_MCA_pml=ucx
> 
> to your environment.
> 
> 
>> On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users 
>>  wrote:
>> 
>> Hi
>>  
>> This issue arises from the openib BTL; it is not related to UCX
>>  
>> From: users  on behalf of Ryan Novosielski 
>> via users 
>> Date: Thursday, 29 July 2021, 08:25
>> To: users@lists.open-mpi.org 
>> Cc: Ryan Novosielski 
>> Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There 
>> was an error initializing an OpenFabrics device."
>> 
>> Hi there,
>> 
>> New to using UCX; I ended up here after building OpenMPI without it, running 
>> tests, and getting deprecation warnings. I installed UCX from the distribution:
>> 
>> [novosirj@amarel-test2 ~]$ rpm -qa ucx
>> ucx-1.5.2-1.el7.x86_64
>> 
>> …and rebuilt OpenMPI. Built fine. However, I’m getting some pretty unhelpful 
>> messages about not using the IB card. I looked around the internet some and 
>> set a couple of environment variables to get a little more information:
>> 
>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>> export OMPI_MCA_pml_ucx_verbose=100
>> 
>> Here’s what happens:
>> 
>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc  --reservation=UCX 
>> ./mpihello-gcc-8-openmpi-4.0.6 
>> srun: job 13993927 queued and waiting for resources
>> srun: job 13993927 has been allocated resources
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>> 
>>  Local host:   gpu004
>>  Local device: mlx4_0
>> --
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>> 
>>  Local host:   gpu004
>>  Local device: mlx4_0
>> --
>> [gpu004.amarel.rutgers.edu:02327] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
>> memory hooks as external events
>> [gpu004.amarel.rutgers.edu:02327] 
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 
>> mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL 
>> memory hooks as external events
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 
>> mca_pml_ucx_open: UCX version 1.5.2
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02327] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: 
>> did not match transport list
>> [gpu004.amarel.rutgers.edu:02326] 
>> ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 
>> rc/mlx4_0:1: did not match transport list
>> [gpu004.amarel.rutgers.ed

[OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-28 Thread Ryan Novosielski via users
latency: 80 nsec
# overhead: 10 nsec
#put_short: <= 4294967295
#put_bcopy: unlimited
#get_bcopy: unlimited
# am_short: <= 92
# am_bcopy: <= 8k
#   domain: cpu
#   atomic_add: 32, 64 bit
#   atomic_and: 32, 64 bit
#atomic_or: 32, 64 bit
#   atomic_xor: 32, 64 bit
#  atomic_fadd: 32, 64 bit
#  atomic_fand: 32, 64 bit
#   atomic_for: 32, 64 bit
#  atomic_fxor: 32, 64 bit
#  atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
#   connection: to iface
# priority: 0
#   device address: 8 bytes
#iface address: 16 bytes
#   error handling: none
#
#
# Memory domain: cma
#component: cma
# register: unlimited, cost: 9 nsec
#
#   Transport: cma
#
#   Device: cma
#
#  capabilities:
#bandwidth: 11145.00 MB/sec
#  latency: 80 nsec
# overhead: 400 nsec
#put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#put_align_mtu: <= 1
#get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#get_align_mtu: <= 1
#   connection: to iface
# priority: 0
#   device address: 8 bytes
#iface address: 4 bytes
#   error handling: none
#

[novosirj@gpu004 ~]$ ucx_info -p -u t
#
# UCP context
#
#md 0  :  self
#md 1  :  tcp
#md 2  :  ib/mlx4_0
#md 3  :  rdmacm
#md 4  :  sysv
#md 5  :  posix
#md 6  :  cma
#
#  resource 0  :  md 0  dev 0  flags -- self/self
#  resource 1  :  md 1  dev 1  flags -- tcp/eno1
#  resource 2  :  md 1  dev 2  flags -- tcp/ib0
#  resource 3  :  md 2  dev 3  flags -- rc/mlx4_0:1
#  resource 4  :  md 2  dev 3  flags -- ud/mlx4_0:1
#  resource 5  :  md 3  dev 4  flags -s rdmacm/sockaddr
#  resource 6  :  md 4  dev 5  flags -- mm/sysv
#  resource 7  :  md 5  dev 6  flags -- mm/posix
#  resource 8  :  md 6  dev 7  flags -- cma/cma
#
# memory: 0.84MB, file descriptors: 2
# create time: 5.032 ms
#

Thanks for any help you can offer. What am I missing?
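
One hedged experiment that can narrow down the "did not match transport list" messages
seen in the replies earlier in this digest, assuming the standard UCX_TLS /
UCX_NET_DEVICES environment variables in this UCX release (the values must correspond
to the resources ucx_info lists above):

# pin UCX to the verbs and shared-memory transports reported above, then re-run
export UCX_TLS=rc,ud,mm,self
export UCX_NET_DEVICES=mlx4_0:1
srun -n 2 --mpi=pmi2 ./mpihello-gcc-8-openmpi-4.0.6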

--
#BlackLivesMatter
____
|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
`'



Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-21 Thread Ryan Novosielski
> On Feb 20, 2019, at 7:14 PM, Gilles Gouaillardet  wrote:
> 
> Ryan,
> 
> That being said, the "Alarm clock" message looks a bit suspicious.
> 
> Does it always occur at 20+ minutes elapsed ?
> 
> Is there some mechanism that automatically kills a job if it does not write 
> anything to stdout for some time ?
> 
> A quick way to rule that out is to
> 
> srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
> 
> and see if that completes or get killed with the same error message.

FWIW, the “sleep” completes just fine:

[novosirj@amarel-test2 testpar]$ sacct -j 84173276 -M perceval -o 
jobid,jobname,start,end,node,state
       JobID    JobName               Start                 End   NodeList      State
------------ ---------- ------------------- ------------------- ---------- ----------
84173276          sleep 2019-02-21T14:46:03 2019-02-21T15:16:03    node077  COMPLETED
84173276.ex+     extern 2019-02-21T14:46:03 2019-02-21T15:16:03    node077  COMPLETED
84173276.0        sleep 2019-02-21T14:46:03 2019-02-21T15:16:03    node077  COMPLETED

--

|| \\UTGERS, |---*O*-------
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-21 Thread Ryan Novosielski
Related to this or not, I also get a hang on MVAPICH2 2.3 compiled with GCC 
8.2, but on t_filters_parallel, not t_mpi. With that combo, though, I get a 
segfault, or at least a message about one. It’s only “Alarm clock” on the GCC 
4.8 with OpenMPI 3.1.3 combo. It also happens at the ~20 minute mark, FWIW.

Testing  t_filters_parallel

 t_filters_parallel  Test Log

srun: job 84117363 queued and waiting for resources
srun: job 84117363 has been allocated resources
[slepner063.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: 
Segmentation fault (signal 11)
srun: error: slepner063: task 0: Segmentation fault
srun: error: slepner063: tasks 1-3: Alarm clock
0.01user 0.01system 20:01.44elapsed 0%CPU (0avgtext+0avgdata 5144maxresident)k
0inputs+0outputs (0major+1524minor)pagefaults 0swaps
make[4]: *** [t_filters_parallel.chkexe_] Error 1
make[4]: Leaving directory 
`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory 
`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make[2]: *** [test] Error 2
make[2]: Leaving directory 
`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make[1]: *** [check-am] Error 2
make[1]: Leaving directory 
`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-mvapich2-2.3/testpar'
make: *** [check-recursive] Error 1

> On Feb 21, 2019, at 3:03 PM, Gabriel, Edgar  wrote:
> 
> Yes, I was talking about the same thing, although for me it was not t_mpi, 
> but t_shapesame that was hanging. It might be an indication of the same issue 
> however.
> 
>> -Original Message-
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan
>> Novosielski
>> Sent: Thursday, February 21, 2019 1:59 PM
>> To: Open MPI Users 
>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>> 3.1.3
>> 
>> 
>>> On Feb 21, 2019, at 2:52 PM, Gabriel, Edgar 
>> wrote:
>>> 
>>>> -Original Message-
>>>>> Does it always occur at 20+ minutes elapsed ?
>>>> 
>>>> Aha! Yes, you are right: every time it fails, it’s at the 20 minute
>>>> and a couple of seconds mark. For comparison, every time it runs, it
>>>> runs for 2-3 seconds total. So it seems like what might actually be
>>>> happening here is a hang, and not a failure of the test per se.
>>>> 
>>> 
>>> I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8
>> (although this was OpenSuSE, not Redhat), and it looked to me like one of the
>> tests was hanging, but I didn't have time to investigate it further.
>> 
>> Just to be clear, the hanging test I have is t_mpi from HDF5 1.10.4. The
>> OpenMPI 3.1.3 make check passes just fine on all of our builds. But I don’t
>> believe it ever launches any jobs or anything like that.
>> 
>> --
>> 
>> || \\UTGERS,  
>> |---*O*---
>> ||_// the State   | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\of NJ   | Office of Advanced Research Computing - MSB C630,
>> Newark
>> `'
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-21 Thread Ryan Novosielski

> On Feb 21, 2019, at 2:52 PM, Gabriel, Edgar  wrote:
> 
>> -Original Message-
>>> Does it always occur at 20+ minutes elapsed ?
>> 
>> Aha! Yes, you are right: every time it fails, it’s at the 20 minute and a 
>> couple
>> of seconds mark. For comparison, every time it runs, it runs for 2-3 seconds
>> total. So it seems like what might actually be happening here is a hang, and
>> not a failure of the test per se.
>> 
> 
> I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8 
> (although this was OpenSuSE, not Redhat), and it looked to me like one of the 
> tests was hanging, but I didn't have time to investigate it further.

Just to be clear, the hanging test I have is t_mpi from HDF5 1.10.4. The 
OpenMPI 3.1.3 make check passes just fine on all of our builds. But I don’t 
believe it ever launches any jobs or anything like that.

--

|| \\UTGERS, |-------*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'



___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-21 Thread Ryan Novosielski
> On Feb 20, 2019, at 7:14 PM, Gilles Gouaillardet  wrote:
> 
> Ryan,
> 
> as Edgar explained, that could be a compiler issue (fwiw, I am unable to 
> reproduce the bug)

Same thing, OpenMPI 3.1.3, GCC 4.8.5, and HDF5 1.10.4 make check? Just making 
sure — that makes it seem like there’s something else going on for me here. 
Just for comparison’s sake:

[novosirj@amarel-test2 testpar]$ rpm -qa gcc
gcc-4.8.5-28.el7_5.1.x86_64

> You can build Open MPI again and pass --disable-builtin-atomics to the 
> configure command line.

Thanks, I’ll look into that (didn’t know the implications).
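
For reference, a hedged sketch of that rebuild; everything except the added atomics
switch is taken from the configure line shown further down this digest:

../openmpi-3.1.3/configure --prefix=/opt/sw/packages/gcc-4_8/openmpi/3.1.3 \
    --with-pmi --disable-builtin-atomics && \
make -j32 && make install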

> That being said, the "Alarm clock" message looks a bit suspicious.
> 
> Does it always occur at 20+ minutes elapsed ?

Aha! Yes, you are right: every time it fails, it’s at the 20 minute and a 
couple of seconds mark. For comparison, every time it runs, it runs for 2-3 
seconds total. So it seems like what might actually be happening here is a 
hang, and not a failure of the test per se.
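
A possibly relevant detail, hedged because it comes from memory of the HDF5 test
harness rather than from anything in this thread: "Alarm clock" is the shell's message
for a process killed by SIGALRM, and the parallel HDF5 tests arm a watchdog alarm of
roughly 1200 seconds, which would line up with a hung test being killed at the
20-minute mark. A quick way to test that theory, assuming the 1.10.4 harness honors
the variable:

# stretch the test watchdog; if the run then sits far past 20 minutes, it is a hang
export HDF5_ALARM_SECONDS=7200
make check RUNPARALLEL='srun --mpi=pmi2 -p main -t 3:00:00 -n6 -N1'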

> Is there some mechanism that automatically kills a job if it does not write 
> anything to stdout for some time ?
> 
> A quick way to rule that out is to
> 
> srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
> 
> and see if that completes or get killed with the same error message.

I was not aware of anything like that, but I’ll look into it now (running your 
suggestion). I guess we don’t run across this sort of thing very often — most 
stuff at least prints output when it starts.

> You can also run use mpirun instead of srun, and even run mpirun outside of 
> slurm
> 
> (if your cluster policy allows it, you can for example use mpirun and run on 
> the frontend node)

I’m on the team that manages the cluster, so we can try various things. Every 
piece of software we ever run, though, runs via srun — we don’t provide mpirun 
as a matter of course, except in some corner cases.

> On 2/21/2019 3:01 AM, Ryan Novosielski wrote:
>> Does it make any sense that it seems to work fine when OpenMPI and HDF5 are 
>> built with GCC 7.4 and GCC 8.2, but /not/ when they are built with 
>> RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 
>> build, I did try an XFS filesystem and it didn’t help. GPFS works fine for 
>> either of the 7.4 and 8.2 builds.
>> 
>> Just as a reminder, since it was reasonably far back in the thread, what I’m 
>> doing is running the “make check” tests in HDF5 1.10.4, in part because 
>> users use it, but also because it seems to have a good test suite and I can 
>> therefore verify the compiler and MPI stack installs. I get very little 
>> information, apart from it not working and getting that “Alarm clock” 
>> message.
>> 
>> I originally suspected I’d somehow built some component of this with a 
>> host-specific optimization that wasn’t working on some compute nodes. But I 
>> controlled for that and it didn’t seem to make any difference.
>> 
>> --
>> 
>> || \\UTGERS,  
>> |---*O*---
>> ||_// the State   | Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\of NJ   | Office of Advanced Research Computing - MSB C630, 
>> Newark
>>  `'
>> 
>>> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski  wrote:
>>> 
>>> It didn’t work any better with XFS, as it happens. Must be something else. 
>>> I’m going to test some more and see if I can narrow it down any, as it 
>>> seems to me that it did work with a different compiler.
>>> 
>>> --
>>> 
>>> || \\UTGERS, 
>>> |---*O*---
>>> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>> ||  \\of NJ  | Office of Advanced Research Computing - MSB C630, 
>>> Newark
>>> `'
>>> 
>>>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar  
>>>> wrote:
>>>> 
>>>> While I was working on something else, I let the tests run with Open MPI 
>>>> master (which is for parallel I/O equivalent to the upcoming v4.0.1  
>>>> release), and here is what I found for the HDF5 1.10.4 tests on my local 
>>>> desktop:
>>>> 
>>>> In the testpar directory, there is in fact one test that fails for both 
>>>> ompio and romio321 in exactly the same manner.
>>>> I used 6 processes as you did (although I used mpirun directly  instead of 
>>>>

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-20 Thread Ryan Novosielski
This is what I did for my build — not much going on there:

../openmpi-3.1.3/configure --prefix=/opt/sw/packages/gcc-4_8/openmpi/3.1.3 
--with-pmi && \
make -j32

We have a mixture of types of Infiniband, using the RHEL-supplied Infiniband 
packages.
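
For the record, a minimal sketch of the tcp-only test Edgar describes below, assuming
per-user MCA defaults are read from ~/.openmpi/mca-params.conf (slow, but it takes the
Infiniband stack out of the picture):

mkdir -p ~/.openmpi
printf 'btl = tcp,self\n' > ~/.openmpi/mca-params.conf
# then re-run the failing check as before, e.g.:
make check RUNPARALLEL='srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1'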

--

|| \\UTGERS, |---*O*---
||_// the State  |     Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Feb 20, 2019, at 1:46 PM, Gabriel, Edgar  wrote:
> 
> Well, the way you describe it, it sounds to me like maybe an atomic issue 
> with this compiler version. What was your configure line of Open MPI, and 
> what network interconnect are you using?
> 
> An easy way to test this theory would be to force OpenMPI to use the tcp 
> interfaces (everything will be slow however). You can do that by creating a 
> directory called .openmpi in your home directory and adding a file called 
> mca-params.conf there.
> 
> The file should look something like this:
> 
> btl = tcp,self
> 
> 
> 
> Thanks
> Edgar
> 
> 
> 
>> -Original Message-
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan
>> Novosielski
>> Sent: Wednesday, February 20, 2019 12:02 PM
>> To: Open MPI Users 
>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>> 3.1.3
>> 
>> Does it make any sense that it seems to work fine when OpenMPI and HDF5
>> are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with RHEL-
>> supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 build,
>> I did try an XFS filesystem and it didn’t help. GPFS works fine for either 
>> of the
>> 7.4 and 8.2 builds.
>> 
>> Just as a reminder, since it was reasonably far back in the thread, what I’m
>> doing is running the “make check” tests in HDF5 1.10.4, in part because users
>> use it, but also because it seems to have a good test suite and I can 
>> therefore
>> verify the compiler and MPI stack installs. I get very little information, 
>> apart
>> from it not working and getting that “Alarm clock” message.
>> 
>> I originally suspected I’d somehow built some component of this with a host-
>> specific optimization that wasn’t working on some compute nodes. But I
>> controlled for that and it didn’t seem to make any difference.
>> 
>> --
>> 
>> || \\UTGERS,  
>> |---*O*---
>> ||_// the State   |     Ryan Novosielski - novos...@rutgers.edu
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\of NJ   | Office of Advanced Research Computing - MSB C630,
>> Newark
>> `'
>> 
>>> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski 
>> wrote:
>>> 
>>> It didn’t work any better with XFS, as it happens. Must be something else.
>> I’m going to test some more and see if I can narrow it down any, as it seems
>> to me that it did work with a different compiler.
>>> 
>>> --
>>> 
>>> || \\UTGERS, 
>>> |---*O*---
>>> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS
>> Campus
>>> ||  \\of NJ  | Office of Advanced Research Computing - MSB C630,
>> Newark
>>>`'
>>> 
>>>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar 
>> wrote:
>>>> 
>>>> While I was working on something else, I let the tests run with Open MPI
>> master (which is for parallel I/O equivalent to the upcoming v4.0.1  
>> release),
>> and here is what I found for the HDF5 1.10.4 tests on my local desktop:
>>>> 
>>>> In the testpar directory, there is in fact one test that fails for both 
>>>> ompio
>> and romio321 in exactly the same manner.
>>>> I used 6 processes as you did (although I used mpirun directly  instead of
>> srun...) From the 13 tests in the testpar directory, 12 pass correctly 
>> (t_bigio,
>> t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, t_mpi,
>> t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).
>>>> 
>>>> The one tests that officially fails ( t_pflush1) actually reports that it 
>>>> passed,
>> but then throws message that indicates that MPI_Abort has been

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-20 Thread Ryan Novosielski
Does it make any sense that it seems to work fine when OpenMPI and HDF5 are 
built with GCC 7.4 and GCC 8.2, but /not/ when they are built with 
RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 
build, I did try an XFS filesystem and it didn’t help. GPFS works fine for 
either of the 7.4 and 8.2 builds.

Just as a reminder, since it was reasonably far back in the thread, what I’m 
doing is running the “make check” tests in HDF5 1.10.4, in part because users 
use it, but also because it seems to have a good test suite and I can therefore 
verify the compiler and MPI stack installs. I get very little information, 
apart from it not working and getting that “Alarm clock” message.

I originally suspected I’d somehow built some component of this with a 
host-specific optimization that wasn’t working on some compute nodes. But I 
controlled for that and it didn’t seem to make any difference.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski  wrote:
> 
> It didn’t work any better with XFS, as it happens. Must be something else. 
> I’m going to test some more and see if I can narrow it down any, as it seems 
> to me that it did work with a different compiler.
> 
> --
> 
> || \\UTGERS,   
> |---*O*---
> ||_// the State    | Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ| Office of Advanced Research Computing - MSB C630, 
> Newark
> `'
> 
>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar  wrote:
>> 
>> While I was working on something else, I let the tests run with Open MPI 
>> master (which is for parallel I/O equivalent to the upcoming v4.0.1  
>> release), and here is what I found for the HDF5 1.10.4 tests on my local 
>> desktop:
>> 
>> In the testpar directory, there is in fact one test that fails for both 
>> ompio and romio321 in exactly the same manner.
>> I used 6 processes as you did (although I used mpirun directly  instead of 
>> srun...) From the 13 tests in the testpar directory, 12 pass correctly 
>> (t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel, 
>> t_init_term, t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, 
>> t_shapesame).
>> 
>> The one test that officially fails (t_pflush1) actually reports that it 
>> passed, but then throws a message indicating that MPI_Abort has been 
>> called, for both ompio and romio. I will try to investigate this test to see 
>> what is going on.
>> 
>> That being said, your report shows an issue in t_mpi, which passes without 
>> problems for me. This is however not GPFS, this was an XFS local file 
>> system. Running the tests on GPFS are on my todo list as well.
>> 
>> Thanks
>> Edgar
>> 
>> 
>> 
>>> -Original Message-
>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>>> Gabriel, Edgar
>>> Sent: Sunday, February 17, 2019 10:34 AM
>>> To: Open MPI Users 
>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>>> 3.1.3
>>> 
>>> I will also run our testsuite and the HDF5 testsuite on GPFS, I have access 
>>> to a
>>> GPFS file system since recently, and will report back on that, but it will 
>>> take a
>>> few days.
>>> 
>>> Thanks
>>> Edgar
>>> 
>>>> -Original Message-
>>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>>>> Ryan Novosielski
>>>> Sent: Sunday, February 17, 2019 2:37 AM
>>>> To: users@lists.open-mpi.org
>>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>>>> 3.1.3
>>>> 
>>>> -BEGIN PGP SIGNED MESSAGE-
>>>> Hash: SHA1
>>>> 
>>>> This is on GPFS. I'll try it on XFS to see if it makes any difference.
>>>> 
>>>> On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
>>>>> Ryan,
>>>>> 
>>>>> What filesystem are you running on ?
>>>>> 
>>>>> Open MPI defaults to the ompio component, except on Lustre
>>>>> filesystem where ROMIO is used. (if the issue is related to ROMIO,
>>&

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-18 Thread Ryan Novosielski
It didn’t work any better with XFS, as it happens. Must be something else. I’m 
going to test some more and see if I can narrow it down any, as it seems to me 
that it did work with a different compiler.

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'

> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar  wrote:
> 
> While I was working on something else, I let the tests run with Open MPI 
> master (which is for parallel I/O equivalent to the upcoming v4.0.1  
> release), and here is what I found for the HDF5 1.10.4 tests on my local 
> desktop:
> 
> In the testpar directory, there is in fact one test that fails for both ompio 
> and romio321 in exactly the same manner.
> I used 6 processes as you did (although I used mpirun directly  instead of 
> srun...) From the 13 tests in the testpar directory, 12 pass correctly 
> (t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, 
> t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).
> 
> The one test that officially fails (t_pflush1) actually reports that it 
> passed, but then throws a message indicating that MPI_Abort has been 
> called, for both ompio and romio. I will try to investigate this test to see 
> what is going on.
> 
> That being said, your report shows an issue in t_mpi, which passes without 
> problems for me. This is however not GPFS, this was an XFS local file system. 
> Running the tests on GPFS are on my todo list as well.
> 
> Thanks
> Edgar
> 
> 
> 
>> -Original Message-
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>> Gabriel, Edgar
>> Sent: Sunday, February 17, 2019 10:34 AM
>> To: Open MPI Users 
>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>> 3.1.3
>> 
>> I will also run our testsuite and the HDF5 testsuite on GPFS, I have access 
>> to a
>> GPFS file system since recently, and will report back on that, but it will 
>> take a
>> few days.
>> 
>> Thanks
>> Edgar
>> 
>>> -Original Message-
>>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>>> Ryan Novosielski
>>> Sent: Sunday, February 17, 2019 2:37 AM
>>> To: users@lists.open-mpi.org
>>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>>> 3.1.3
>>> 
>>> -BEGIN PGP SIGNED MESSAGE-
>>> Hash: SHA1
>>> 
>>> This is on GPFS. I'll try it on XFS to see if it makes any difference.
>>> 
>>> On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
>>>> Ryan,
>>>> 
>>>> What filesystem are you running on ?
>>>> 
>>>> Open MPI defaults to the ompio component, except on Lustre
>>>> filesystem where ROMIO is used. (if the issue is related to ROMIO,
>>>> that can explain why you did not see any difference, in that case,
>>>> you might want to try an other filesystem (local filesystem or NFS
>>>> for example)\
>>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski
>>>>  wrote:
>>>>> 
>>>>> I verified that it makes it through to a bash prompt, but I’m a
>>>>> little less confident that something make test does doesn’t clear it.
>>>>> Any recommendation for a way to verify?
>>>>> 
>>>>> In any case, no change, unfortunately.
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On Feb 16, 2019, at 08:13, Gabriel, Edgar
>>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>> What file system are you running on?
>>>>>> 
>>>>>> I will look into this, but it might be later next week. I just
>>>>>> wanted to emphasize that we are regularly running the parallel
>>>>>> hdf5 tests with ompio, and I am not aware of any outstanding items
>>>>>> that do not work (and are supposed to work). That being said, I
>>>>>> run the tests manually, and not the 'make test'
>>>>>> commands. Will have to check which tests are being run by that.
>>>>>> 
>>>>>> Edgar
>>>>>> 
>&g

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-17 Thread Ryan Novosielski
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

This is on GPFS. I'll try it on XFS to see if it makes any difference.

On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
> Ryan,
> 
> What filesystem are you running on ?
> 
> Open MPI defaults to the ompio component, except on Lustre
> filesystem where ROMIO is used. (if the issue is related to ROMIO,
> that can explain why you did not see any difference, in that case,
> you might want to try an other filesystem (local filesystem or NFS
> for example)\
> 
> 
> Cheers,
> 
> Gilles
> 
> On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski
>  wrote:
>> 
>> I verified that it makes it through to a bash prompt, but I’m a
>> little less confident that something make test does doesn’t clear
>> it. Any recommendation for a way to verify?
>> 
>> In any case, no change, unfortunately.
>> 
>> Sent from my iPhone
>> 
>>> On Feb 16, 2019, at 08:13, Gabriel, Edgar
>>>  wrote:
>>> 
>>> What file system are you running on?
>>> 
>>> I will look into this, but it might be later next week. I just
>>> wanted to emphasize that we are regularly running the parallel
>>> hdf5 tests with ompio, and I am not aware of any outstanding
>>> items that do not work (and are supposed to work). That being
>>> said, I run the tests manually, and not the 'make test'
>>> commands. Will have to check which tests are being run by
>>> that.
>>> 
>>> Edgar
>>> 
>>>> -Original Message- From: users
>>>> [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
>>>> Gilles Gouaillardet Sent: Saturday, February 16, 2019 1:49
>>>> AM To: Open MPI Users  Subject: Re:
>>>> [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 
>>>> 3.1.3
>>>> 
>>>> Ryan,
>>>> 
>>>> Can you
>>>> 
>>>> export OMPI_MCA_io=^ompio
>>>> 
>>>> and try again after you made sure this environment variable
>>>> is passed by srun to the MPI tasks ?
>>>> 
>>>> We have identified and fixed several issues specific to the
>>>> (default) ompio component, so that could be a valid
>>>> workaround until the next release.
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> Ryan Novosielski  wrote:
>>>>> Hi there,
>>>>> 
>>>>> Honestly don’t know which piece of this puzzle to look at
>>>>> or how to get more
>>>> information for troubleshooting. I successfully built HDF5
>>>> 1.10.4 with RHEL system GCC 4.8.5 and OpenMPI 3.1.3. Running
>>>> the “make check” in HDF5 is failing at the below point; I am
>>>> using a value of RUNPARALLEL='srun -- mpi=pmi2 -p main -t
>>>> 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise properly
>>>> configured.
>>>>> 
>>>>> Thanks for any help you can provide.
>>>>> 
>>>>> make[4]: Entering directory
>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>>>> gcc-4.8-openmpi-3.1.3/testpar'
>>>>>  Testing  t_mpi 
>>>>>  t_mpi  Test Log 
>>>>>  srun: job 84126610 queued and
>>>>> waiting for resources srun: job 84126610 has been allocated
>>>>> resources srun: error: slepner023: tasks 0-5: Alarm clock
>>>>> 0.01user 0.00system 20:03.95elapsed 0%CPU
>>>>> (0avgtext+0avgdata 5152maxresident)k 0inputs+0outputs
>>>>> (0major+1529minor)pagefaults 0swaps make[4]: ***
>>>>> [t_mpi.chkexe_] Error 1 make[4]: Leaving directory
>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>>>> gcc-4.8-openmpi-3.1.3/testpar'
>>>>> make[3]: *** [build-check-p] Error 1 make[3]: Leaving
>>>>> directory
>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>>>> gcc-4.8-openmpi-3.1.3/testpar'
>>>>> make[2]: *** [test] Error 2 make[2]: Leaving directory
>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>>>> gcc-4.8-openmpi-3.1.3/testpar'
>>>>> make[1]: *** [check-am] Error 2 make[1]: Leaving directory
>>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>>>> gcc-4.8-openmpi-3.1.3/testpar'
>>>>> make: *** [check-recursive] Error 1
>>>>> 
>>>

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-16 Thread Ryan Novosielski
I verified that it makes it through to a bash prompt, but I’m a little less 
confident that nothing “make test” does clears it along the way. Any recommendation 
for a way to verify?

In any case, no change, unfortunately. 

Sent from my iPhone
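
A quick, hedged way to check the propagation question (the variable comes from
Gilles's suggestion quoted below; partition and time flags copied from the earlier
commands):

# run env on the compute node and see whether the setting survives srun
OMPI_MCA_io='^ompio' srun --mpi=pmi2 -p main -t 0:10:00 -n1 env | grep OMPI_MCA_io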

> On Feb 16, 2019, at 08:13, Gabriel, Edgar  wrote:
> 
> What file system are you running on?
> 
> I will look into this, but it might be later next week. I just wanted to 
> emphasize that we are regularly running the parallel hdf5 tests with ompio, 
> and I am not aware of any outstanding items that do not work (and are 
> supposed to work). That being said, I run the tests manually, and not the 
> 'make test' commands. Will have to check which tests are being run by that.
> 
> Edgar
> 
>> -Original Message-
>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles
>> Gouaillardet
>> Sent: Saturday, February 16, 2019 1:49 AM
>> To: Open MPI Users 
>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
>> 3.1.3
>> 
>> Ryan,
>> 
>> Can you
>> 
>> export OMPI_MCA_io=^ompio
>> 
>> and try again after you made sure this environment variable is passed by srun
>> to the MPI tasks ?
>> 
>> We have identified and fixed several issues specific to the (default) ompio
>> component, so that could be a valid workaround until the next release.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Ryan Novosielski  wrote:
>>> Hi there,
>>> 
>>> Honestly don’t know which piece of this puzzle to look at or how to get more
>> information for troubleshooting. I successfully built HDF5 1.10.4 with RHEL
>> system GCC 4.8.5 and OpenMPI 3.1.3. Running the “make check” in HDF5 is
>> failing at the below point; I am using a value of RUNPARALLEL='srun --
>> mpi=pmi2 -p main -t 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise
>> properly configured.
>>> 
>>> Thanks for any help you can provide.
>>> 
>>> make[4]: Entering directory 
>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>> gcc-4.8-openmpi-3.1.3/testpar'
>>> 
>>> Testing  t_mpi
>>> 
>>> t_mpi  Test Log
>>> 
>>> srun: job 84126610 queued and waiting for resources
>>> srun: job 84126610 has been allocated resources
>>> srun: error: slepner023: tasks 0-5: Alarm clock 0.01user 0.00system
>>> 20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k
>>> 0inputs+0outputs (0major+1529minor)pagefaults 0swaps
>>> make[4]: *** [t_mpi.chkexe_] Error 1
>>> make[4]: Leaving directory 
>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>> gcc-4.8-openmpi-3.1.3/testpar'
>>> make[3]: *** [build-check-p] Error 1
>>> make[3]: Leaving directory 
>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>> gcc-4.8-openmpi-3.1.3/testpar'
>>> make[2]: *** [test] Error 2
>>> make[2]: Leaving directory 
>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>> gcc-4.8-openmpi-3.1.3/testpar'
>>> make[1]: *** [check-am] Error 2
>>> make[1]: Leaving directory 
>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
>> gcc-4.8-openmpi-3.1.3/testpar'
>>> make: *** [check-recursive] Error 1
>>> 
>>> --
>>> 
>>> || \\UTGERS,   
>>> |---*O*---
>>> ||_// the State | Ryan Novosielski - novos...@rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>> ||  \\of NJ | Office of Advanced Research Computing - MSB C630, 
>>> Newark
>>>  `'
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-15 Thread Ryan Novosielski
Hi there,

Honestly don’t know which piece of this puzzle to look at or how to get more 
information for troubleshooting. I successfully built HDF5 1.10.4 with RHEL 
system GCC 4.8.5 and OpenMPI 3.1.3. Running the “make check” in HDF5 is failing 
at the below point; I am using a value of RUNPARALLEL='srun --mpi=pmi2 -p main 
-t 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise properly configured.
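
For context, a hedged sketch of the build-and-check sequence this implies (directory
layout and job parameters are illustrative; RUNPARALLEL is the launcher variable the
HDF5 parallel tests use, settable at configure time or on the make command line):

CC=mpicc ../hdf5-1.10.4/configure --enable-parallel
make -j8
make check RUNPARALLEL='srun --mpi=pmi2 -p main -t 1:00:00 -n6 -N1'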

Thanks for any help you can provide.

make[4]: Entering directory 
`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'

Testing  t_mpi

t_mpi  Test Log

srun: job 84126610 queued and waiting for resources
srun: job 84126610 has been allocated resources
srun: error: slepner023: tasks 0-5: Alarm clock
0.01user 0.00system 20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k
0inputs+0outputs (0major+1529minor)pagefaults 0swaps
make[4]: *** [t_mpi.chkexe_] Error 1
make[4]: Leaving directory 
`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory 
`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
make[2]: *** [test] Error 2
make[2]: Leaving directory 
`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
make[1]: *** [check-am] Error 2
make[1]: Leaving directory 
`/scratch/novosirj/install-files/hdf5-1.10.4-build-gcc-4.8-openmpi-3.1.3/testpar'
make: *** [check-recursive] Error 1

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
   `'


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-18 Thread Ryan Novosielski
; Elapsed time is:  5.422948
>>> Total time is:  59.668622
>>> 
>>> Thanks,
>>> -- bennet
>>> 
>>> 
>>> make check results
>>> --
>>> 
>>> make  check-TESTS
>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
>>> PASS: predefined_gap_test
>>> PASS: predefined_pad_test
>>> SKIP: dlopen_test
>>> 
>>> Testsuite summary for Open MPI 3.1.0
>>> 
>>> # TOTAL: 3
>>> # PASS:  2
>>> # SKIP:  1
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> 
>>> [ elided ]
>>> PASS: atomic_cmpset_noinline
>>>   - 5 threads: Passed
>>> PASS: atomic_cmpset_noinline
>>>   - 8 threads: Passed
>>> 
>>> Testsuite summary for Open MPI 3.1.0
>>> 
>>> # TOTAL: 8
>>> # PASS:  8
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> 
>>> [ elided ]
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
>>> PASS: ompi_rb_tree
>>> PASS: opal_bitmap
>>> PASS: opal_hash_table
>>> PASS: opal_proc_table
>>> PASS: opal_tree
>>> PASS: opal_list
>>> PASS: opal_value_array
>>> PASS: opal_pointer_array
>>> PASS: opal_lifo
>>> PASS: opal_fifo
>>> 
>>> Testsuite summary for Open MPI 3.1.0
>>> 
>>> # TOTAL: 10
>>> # PASS:  10
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> 
>>> [ elided ]
>>> make  opal_thread opal_condition
>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>> CC   opal_thread.o
>>> CCLD opal_thread
>>> CC   opal_condition.o
>>> CCLD opal_condition
>>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
>>> make  check-TESTS
>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>> 
>>> Testsuite summary for Open MPI 3.1.0
>>> 
>>> # TOTAL: 0
>>> # PASS:  0
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> 
>>> [ elided ]
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
>>> PASS: opal_datatype_test
>>> PASS: unpack_hetero
>>> PASS: checksum
>>> PASS: position
>>> PASS: position_noncontig
>>> PASS: ddt_test
>>> PASS: ddt_raw
>>> PASS: unpack_ooo
>>> PASS: ddt_pack
>>> PASS: external32
>>> 
>>> Testsuite summary for Open MPI 3.1.0
>>> 
>>> # TOTAL: 10
>>> # PASS:  10
>>> # SKIP:  0
>>> # XFAIL: 0
>>> # FAIL:  0
>>> # XPASS: 0
>>> # ERROR: 0
>>> 
>>> [ elided ]
>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
>>> PASS: opal_bit_ops
>>> PASS: opal_path_nfs
>>> PASS: bipartite_graph
>>> 
>>> Testsuite summary for Open MPI 3.1.0
>>> 
>>> # TOTAL: 3
>>> # PASS:  3
>>

Re: [OMPI users] OpenMPI 1.6.5 on CentOS 7.1, silence ib-locked-pages?

2016-05-24 Thread Ryan Novosielski
> On May 18, 2016, at 6:59 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> On May 18, 2016, at 6:16 PM, Ryan Novosielski <novos...@rutgers.edu> wrote:
>> 
>> I’m pretty sure this is no longer relevant (having read Roland’s messages 
>> about it from a couple of years ago now). Can you please confirm that for 
>> me, and then let me know if there is any way that I can silence this old 
>> copy of OpenMPI that I need to use with some software that depends on it for 
>> some reason? It is causing my users to report it as an issue pretty 
>> regularly.
> 
> The message cites that only 32MB is able to be registered out of a total of 
> 128MB.  That seems low to me.

That’s 32768 MiB, 32GB (out of 128GB).

> Did you look at the FAQ item and see if there are system limits that you 
> should increase?

I did, but again, I’ve seen other messages about this that indicate that it’s 
not required, such as this one:

https://www.open-mpi.org/community/lists/users/2014/08/25090.php

What can happen in these circumstances where you set something that’s not 
required is that you end up finding out down the road that the default has 
changed appropriately but you’re still hard-coding the wrong settings. I’d 
prefer to avoid laying those sorts of traps for myself whenever possible. :)

I’m not as experienced with OpenMPI as I could be; is this not the same issue? 
I am using the CentOS-supplied Mellanox drivers.

[novosirj@perceval2 profile.d]$ modinfo mlx4_en
filename:   
/lib/modules/3.10.0-229.20.1.el7.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko
version:2.2-1 (Feb 2014)
license:Dual BSD/GPL
description:Mellanox ConnectX HCA Ethernet driver
author: Liran Liss, Yevgeny Petrilin
rhelversion:7.1
srcversion: DC68737527B57AD77CD3AD6
depends:mlx4_core,ptp,vxlan
intree: Y
vermagic:   3.10.0-229.20.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:38:C3:70:0F:5B:84:90:11:D3:72:15:7D:E5:CD:06:17:C8:15:DE:03
sig_hashalgo:   sha256
parm:   udp_rss:Enable RSS for incoming UDP traffic or disabled (0) 
(uint)
parm:   pfctx:Priority based Flow Control policy on TX[7:0]. Per 
priority bit mask (uint)
parm:   pfcrx:Priority based Flow Control policy on RX[7:0]. Per 
priority bit mask (uint)
parm:   inline_thold:Threshold for using inline data (range: 17-104, 
default: 104) (uint)
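
For completeness, a hedged sketch of how the 32768 MiB figure is usually derived for
mlx4 hardware, following the FAQ item the warning points at (parameter names belong to
the mlx4_core module and may not be exposed by every driver release):

# registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
getconf PAGE_SIZE

# the FAQ's fix is to size those so the product is at least twice physical RAM,
# e.g. in /etc/modprobe.d/mlx4_core.conf (values here are examples for 128 GB):
#   options mlx4_core log_num_mtt=24 log_mtts_per_seg=2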

--

|| \\UTGERS, |-------*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
   `'

--

|| \\UTGERS, |-------*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
 `'





[OMPI users] OpenMPI 1.6.5 on CentOS 7.1, silence ib-locked-pages?

2016-05-18 Thread Ryan Novosielski
Hi there,

I’m getting the following message:

---
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

   http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

 Local host:  node045.cluster.example.com
 Registerable memory: 32768 MiB
 Total memory:130636 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
---

I’m pretty sure this is no longer relevant (having read Roland’s messages about 
it from a couple of years ago now). Can you please confirm that for me, and 
then let me know if there is any way that I can silence this old copy of 
OpenMPI that I need to use with some software that depends on it for some 
reason? It is causing my users to report it as an issue pretty regularly.

Thanks!

--

|| \\UTGERS, |---*O*---
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ  | Office of Advanced Research Computing - MSB C630, Newark
`'


