Re: [OMPI users] mca_sharedfp_lockfile issues

2021-11-02 Thread Gabriel, Edgar via users
What file system are you running your code on? And is the same directory 
shared across all nodes? I have seen this error when users try to use a 
non-shared directory for MPI I/O operations (e.g. /tmp, which is a different 
drive/folder on each node).

Thanks
Edgar

-Original Message-
From: users  On Behalf Of bend linux4ms.net 
via users
Sent: Tuesday, November 2, 2021 3:33 PM
To: Open MPI Open MPI 
Cc: bend linux4ms.net 
Subject: [OMPI users] mca_sharedfp_lockfile issues

Ok, I got more issues. Maybe someone on the list can help me:

Open MPI version: 4.1.1, built from the GitHub source, compiled on CentOS 8.4 
using GCC 8.4.1. The configure line is:

./configure --enable-shared --enable-static \
   --without-tm \
   --enable-mpi-cxx \
   --enable-wrapper-runpath \
   --enable-mpirun-prefix-by-default \
   --enable-mpi-thread-multiple \
   --enable-mpi-fortran=yes \
   --prefix=/p/app/compilers/mpi/openmpi/4.1.1 2>&1 | tee config.log

Intel HPC system, 850 nodes, trying to launch the IOR benchmark.

Top portion of the mpi command:
-

export OMPI_MCA_btl_openib_allow_ib=1
export OMPI_MCA_btl_openib_if_include="mlx5_0:1"

mpirun -machinefile ${hostlist} \
   --mca opal_common_ucx_opal_mem_hooks 1 \
   -np ${NP} \
   --map-by node \
   -N ${rpn} \
   -vv \
-
I am getting the message "[##] mca_sharedfp_lockedfile_file_open: Error 
during file open" on all the nodes.

I've tried it with --mca sharedfp lockedfile and without; I still get the 
errors.

What have I done wrong?

Thanks ..

Ben Duncan - 





Re: [OMPI users] Status of pNFS, CephFS and MPI I/O

2021-09-23 Thread Gabriel, Edgar via users
let me amend my last email by making clear that I do not recommend using NFS 
for parallel I/O. But if you have to, make sure your code does not do things 
like read-after-write, or have multiple processes writing data that ends up in 
the same file system block (this can often be avoided by using collective I/O, 
for example).
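
As a rough illustration (a made-up sketch, not code from this thread; the file 
name and block size are arbitrary), a collective write where every rank owns a 
disjoint, contiguous region looks like this:

#include <mpi.h>
#include <stdlib.h>

#define BLOCK (1024 * 1024)   /* 1 MiB per rank, arbitrary */

int main(int argc, char **argv) {
    int rank;
    char *buf;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(BLOCK);   /* fill with real data in a real code */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* collective call: all ranks participate and the offsets are disjoint,
       so the aggregation inside the MPI library decides which file system
       blocks actually get touched */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}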



-Original Message-
From: users  On Behalf Of Gabriel, Edgar via 
users
Sent: Thursday, September 23, 2021 5:31 PM
To: Eric Chamberland ; Open MPI Users 

Cc: Gabriel, Edgar ; Louis Poirel 
; Vivien Clauzon 
Subject: Re: [OMPI users] Status of pNFS, CephFS and MPI I/O

-Original Message-
From: Eric Chamberland  

Thanks for your answer Edgar!

In fact, we are able to use NFS and certainly any POSIX file system on a 
single-node basis.

I should have been asking: what are the supported file systems for 
*multiple-node* read/write access to files?

-> We have tested it on BeeGFS, GPFS, Lustre, PVFS2/OrangeFS, and NFS, but 
again, if a parallel file system has POSIX functions I would expect it to work 
(and yes, I am aware that strict POSIX semantics are not necessarily available 
in parallel file systems). Internally, we are using open, close, (p)readv, 
(p)writev, lock, unlock, and seek.

For NFS, MPI I/O is known to *not* work when using multiple nodes ... 
except for NFS v3 with the "noac" mount option (we are about to test with the 
"actimeo=0" option to see if it works).

-> well, it depends on how you define that. I would suspect that our largest 
user base is actually using NFS v3/v4 with the noac option + an nfslock server. 
Most I/O patterns will probably work in this scenario, and in fact we are 
actually passing our entire testsuite on a multi-node NFS setup (which does 
some nasty things). However, it is true that there could be corner cases that 
fail. In addition, parallel I/O on multi-node NFS can be outrageously slow, 
since we lock the *entire* file before every operation (in contrast to ROMIO, 
which only locks the file range that is currently being accessed).

Btw, does Open MPI's MPI I/O have some "hidden" (MCA?) options to make a 
multi-node NFS cluster work?

-> OMPIO recognizes the NFS file system automatically, without requiring an MCA 
parameter. I usually recommend that users try to relax the locking options and 
see whether they still produce correct data, in order to improve the 
performance of their code, since most I/O patterns do not require this 
super-strict locking behavior. This is the fs_ufs_lock_algorithm parameter.

Thanks
Edgar


Thanks,

Eric

On 2021-09-23 1:57 p.m., Gabriel, Edgar wrote:
> Eric,
>
> generally speaking, ompio should be able to operate correctly on all file 
> systems that have support for POSIX functions. The generic ufs component is, 
> for example, being used on BeeGFS parallel file systems without problems; we 
> are using that on a daily basis. For GPFS, the only reason we handle that 
> file system separately is because of some custom info objects that can be 
> used to configure the file during file_open. If one did not use these info 
> objects, the generic ufs component would be as good as the GPFS-specific 
> component.
>
> Note, the generic ufs component is also being used for NFS; it has logic 
> built in to recognize an NFS file system and handle some operations slightly 
> differently (but still relying on POSIX functions). The one big exception is 
> Lustre: due to its different file locking strategy we are required to use a 
> different collective I/O component (dynamic_gen2 vs. vulcan). Generic ufs 
> would work on Lustre, too, but it would be horribly slow.
>
> I cannot comment on CephFS and pNFS since I do not have access to those file 
> systems; it would come down to testing them.
>
> Thanks
> Edgar
>
>
> -Original Message-
> From: users  On Behalf Of Eric 
> Chamberland via users
> Sent: Thursday, September 23, 2021 9:28 AM
> To: Open MPI Users 
> Cc: Eric Chamberland ; Vivien 
> Clauzon 
> Subject: [OMPI users] Status of pNFS, CephFS and MPI I/O
>
> Hi,
>
> I am looking around for information about parallel filesystems supported for 
> MPI I/O.
>
> Clearly, GPFS and Lustre are fully supported, but what about others?
>
> - CephFS
>
> - pNFS
>
> - Other?
>
> when I "grep" for "pnfs\|cephfs" in the ompi source code, I found nothing...
>
> Otherwise I found this in ompi/mca/common/ompio/common_ompio.h:
>
> enum ompio_fs_type
> {
>       NONE = 0,
>       UFS = 1,
>       PVFS2 = 2,
>       LUSTRE = 3,
>       PLFS = 4,
>       IME = 5,
>       GPFS = 6
> };
>
> Does that mean that other fs types (pNFS, CephFS) do not need special 
> treatment, or are not supported, or are not optimally supported?
>
> Thanks,
>
> Eric
>
> --
> Eric Chamberland, ing., M. Ing
> Professionnel de recherche
> GIREF/Université Laval
> (418) 656-2131 poste 41 22 42
>
--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42



Re: [OMPI users] Status of pNFS, CephFS and MPI I/O

2021-09-23 Thread Gabriel, Edgar via users
Eric,

generally speaking, ompio should be able to operate correctly on all file 
systems that have support for POSIX functions. The generic ufs component is, 
for example, being used on BeeGFS parallel file systems without problems; we 
are using that on a daily basis. For GPFS, the only reason we handle that file 
system separately is because of some custom info objects that can be used to 
configure the file during file_open. If one did not use these info objects, 
the generic ufs component would be as good as the GPFS-specific component.
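
For illustration (my own sketch, not taken from the GPFS component; the keys 
shown below are the reserved hints from the MPI standard rather than 
GPFS-specific ones, and the file name is made up), passing such info objects 
at open time looks like this:

#include <mpi.h>

int main(int argc, char **argv) {
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* hints are plain key/value strings attached to the open call */
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size",  "33554432");   /* 32 MiB */
    MPI_Info_set(info, "striping_factor", "8");
    MPI_Info_set(info, "striping_unit",   "1048576");

    MPI_File_open(MPI_COMM_WORLD, "hints.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}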

Note, the generic ufs component is also being used for NFS; it has logic built 
in to recognize an NFS file system and handle some operations slightly 
differently (but still relying on POSIX functions). The one big exception is 
Lustre: due to its different file locking strategy we are required to use a 
different collective I/O component (dynamic_gen2 vs. vulcan). Generic ufs would 
work on Lustre, too, but it would be horribly slow.

I cannot comment on CephFS and pNFS since I do not have access to those file 
systems; it would come down to testing them.

Thanks
Edgar


-Original Message-
From: users  On Behalf Of Eric Chamberland 
via users
Sent: Thursday, September 23, 2021 9:28 AM
To: Open MPI Users 
Cc: Eric Chamberland ; Vivien Clauzon 

Subject: [OMPI users] Status of pNFS, CephFS and MPI I/O

Hi,

I am looking around for information about parallel filesystems supported for 
MPI I/O.

Clearly, GPFS and Lustre are fully supported, but what about others?

- CephFS

- pNFS

- Other?

when I "grep" for "pnfs\|cephfs" in the ompi source code, I found nothing...

Otherwise I found this in ompi/mca/common/ompio/common_ompio.h:

enum ompio_fs_type
{
     NONE = 0,
     UFS = 1,
     PVFS2 = 2,
     LUSTRE = 3,
     PLFS = 4,
     IME = 5,
     GPFS = 6
};

Does that mean that other fs types (pNFS, CephFS) do not need special 
treatment, or are not supported, or are not optimally supported?

Thanks,

Eric

--
Eric Chamberland, ing., M. Ing
Professionnel de recherche
GIREF/Université Laval
(418) 656-2131 poste 41 22 42



Re: [OMPI users] 4.1 mpi-io test failures on lustre

2021-01-19 Thread Gabriel, Edgar via users
ok, so what I get from this conversation is the following todo list:

1. check out the tests in src/mpi/romio/test
2. revisit the atomicity issue. You are right that there are scenarios where it 
might be required; the fact that we were not able to hit the issues in our 
tests is no evidence.
3. I will work on an update of the FAQ section.



-Original Message-
From: users  On Behalf Of Dave Love via users
Sent: Monday, January 18, 2021 11:14 AM
To: Gabriel, Edgar via users 
Cc: Dave Love 
Subject: Re: [OMPI users] 4.1 mpi-io test failures on lustre

"Gabriel, Edgar via users"  writes:

>> How should we know that's expected to fail?  It at least shouldn't fail like 
>> that; set_atomicity doesn't return an error (which the test is prepared for 
>> on a filesystem like pvfs2).  
>> I assume doing nothing, but appearing to, can lead to corrupt data, and I'm 
>> surprised that isn't being seen already.
>> HDF5 requires atomicity -- at least to pass its tests -- so presumably 
>> anyone like us who needs it should use something mpich-based with recent or 
>> old romio, and that sounds like most general HPC systems.  
>> Am I missing something?
>> With the current romio everything I tried worked, but we don't get that 
>> option with openmpi.
>
> First of all, it is mentioned on the FAQ site of Open MPI, although 
> admittedly it is not entirely up to date (it also lists external32 support 
> as missing, which is however available since 4.1).

Yes, the FAQ was full of confusing obsolete material when I last looked.
Anyway, users can't be expected to check whether any particular operation is 
expected to fail silently.  I should have said that
MPI_File_set_atomicity(3) explicitly says the default is true for multiple 
nodes, and doesn't say the call is a no-op with the default implementation.  I 
don't know whether the MPI spec allows not implementing it, but I at least 
expect an error return if it doesn't.
As far as I remember, that's what romio does on a filesystem like pvfs2 (or 
lustre when people know better than implementers and insist on noflock); I 
mis-remembered from before, thinking that ompio would be changed to do the 
same.  From that thread, I did think atomicity was on its way.

Presumably an application requests atomicity for good reason, and can take 
appropriate action if the status indicates it's not available on that 
filesystem.
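
Something along these lines is what I have in mind (a rough sketch of mine, 
not tested against any particular implementation): request atomic mode, read 
it back with MPI_File_get_atomicity, and fall back to an explicit 
MPI_File_sync/barrier protocol (or give up) if the request was not honoured:

#include <mpi.h>
#include <stdio.h>

/* open a file collectively and insist on atomic mode */
static int open_with_atomicity(const char *path, MPI_File *fh)
{
    int err, flag = 0;

    err = MPI_File_open(MPI_COMM_WORLD, path,
                        MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, fh);
    if (err != MPI_SUCCESS) return err;

    /* ask for atomic mode ... */
    err = MPI_File_set_atomicity(*fh, 1);
    /* ... and verify that the library actually switched it on */
    if (err == MPI_SUCCESS)
        err = MPI_File_get_atomicity(*fh, &flag);

    if (err != MPI_SUCCESS || !flag) {
        fprintf(stderr, "atomic mode not available on this file system\n");
        MPI_File_close(fh);
        return MPI_ERR_OTHER;
    }
    return MPI_SUCCESS;
}

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Init(&argc, &argv);
    if (open_with_atomicity("atomicity_probe.tmp", &fh) == MPI_SUCCESS)
        MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

The default error handler on files is MPI_ERRORS_RETURN, so the error codes 
above can actually be inspected instead of aborting the job.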

> You don't need atomicity for the HDF5 tests, we are passing all of them to 
> the best of my knowledge, and this is one of the testsuites that we do run 
> regularly as part of our standard testing process.

I guess we're just better at breaking things.

> I am aware that they have an atomicity test - which we pass for whatever 
> reason. This also highlights, btw, the issue(s) that I am having with the 
> atomicity option in MPI I/O.

I don't know what the application of atomicity is in HDF5.  Maybe it isn't 
required for typical operations, but I assume it's not used blithely.  However, 
I'd have thought HDF5 should be prepared for something like pvfs2, and at least 
not abort the test at that stage.

I've learned to be wary of declaring concurrent systems working after a few 
tests.  In fact, the phdf5 test failed for me like this when I tried across 
four lustre client nodes with 4.1's defaults.  (I'm confused about the striping 
involved, because I thought I set it to four, and now it shows as one on that 
directory.)

  ...
  Testing  -- dataset atomic updates (atomicity)
  Proc 9: *** Parallel ERRProc 54: *** Parallel ERROR ***
  VRFY (H5Sset_hyperslab succeeded) failed at line 4293 in t_dset.c
  aborting MPI proceProc 53: *** Parallel ERROR ***

Unfortunately I hadn't turned on backtracing, and I wouldn't get another job 
through for a while.

> The entire infrastructure to enforce atomicity is actually in place in ompio, 
> and I can give you the option to enforce strict atomic behavior for all 
> files in ompio (just not on a per-file basis), just be aware that the 
> performance will nose-dive. This is not just the case with ompio, but also 
> with romio; you can read up on various discussion boards on that topic, look 
> at NFS-related posts (where you need the atomicity for correctness in 
> basically all scenarios).

I'm fairly sure I accidentally ran tests successfully on NFS4, at least 
single-node.  I never found a good discussion of the topic, and what I have 
seen about "NFS" was probably specific to NFS3 and non-POSIX compliance, though 
I don't actually care about parallel i/o on NFS.  The information we got about 
lustre was direct from Rob Latham, as nothing showed up online.

I don't like fast-but-wrong, so I think there should be the option of 
correctness, especially as it's the documented default.

> Just as another data point, in the 8+ years that ompio has been available, 
> there was not one issue reported related to correctness due to missing the 
> atomicity option.

Re: [OMPI users] 4.1 mpi-io test failures on lustre

2021-01-15 Thread Gabriel, Edgar via users
I would like to correct one of my statements:

-Original Message-
From: users  On Behalf Of Gabriel, Edgar via 
users
Sent: Friday, January 15, 2021 7:58 AM
To: Open MPI Users 
Cc: Gabriel, Edgar 
Subject: Re: [OMPI users] 4.1 mpi-io test failures on lustre

> The entire infrastructure to enforce atomicity is actually in place in ompio, 
> and I can give you the option to enforce strict atomic behavior for 
> all files in ompio (just not on a per-file basis), just be aware that the 
> performance will nose-dive.

I realized that this statement is not entirely true; we are missing one aspect 
needed to provide full atomicity.

Thanks
Edgar



Re: [OMPI users] 4.1 mpi-io test failures on lustre

2021-01-15 Thread Gabriel, Edgar via users
-Original Message-
From: users  On Behalf Of Dave Love via users
Sent: Friday, January 15, 2021 4:48 AM
To: Gabriel, Edgar via users 
Cc: Dave Love 
Subject: Re: [OMPI users] 4.1 mpi-io test failures on lustre

> How should we know that's expected to fail?  It at least shouldn't fail like 
> that; set_atomicity doesn't return an error (which the test is prepared for 
> on a filesystem like pvfs2).  
> I assume doing nothing, but appearing to, can lead to corrupt data, and I'm 
> surprised that isn't being seen already.
> HDF5 requires atomicity -- at least to pass its tests -- so presumably anyone 
> like us who needs it should use something mpich-based with recent or old 
> romio, and that sounds like most general HPC systems.  
> Am I missing something?
> With the current romio everything I tried worked, but we don't get that 
> option with openmpi.

First of all, it is mentioned on the FAQ site of Open MPI, although admittedly 
it is not entirely up to date (it also lists external32 support as missing, 
which is however available since 4.1). You don't need atomicity for the HDF5 
tests, we are passing all of them to the best of my knowledge, and this is one 
of the testsuites that we do run regularly as part of our standard testing 
process. I am aware that they have an atomicity test - which we pass for 
whatever reason. This also highlights, btw, the issue(s) that I am having with 
the atomicity option in MPI I/O. 

The entire infrastructure to enforce atomicity is actually in place in ompio, 
and I can give you the option to enforce strict atomic behavior for all files 
in ompio (just not on a per-file basis), just be aware that the performance 
will nose-dive. This is not just the case with ompio, but also with romio; you 
can read up on various discussion boards on that topic, look at NFS-related 
posts (where you need the atomicity for correctness in basically all 
scenarios).

Just as another data point, in the 8+ years that ompio has been available, 
there was not one issue reported related to correctness due to missing the 
atomicity option.

That being said, if you feel more comfortable using romio, it is completely up 
to you. Open MPI offers this option, and it is incredibly easy to set the 
default parameters on a platform for all users such that romio is being used.
We are doing the best we can with our limited resources, and while ompio is by 
no means perfect, we try to be responsive to issues reported by users and value 
constructive feedback and discussion.

Thanks
Edgar 



Re: [OMPI users] 4.1 mpi-io test failures on lustre

2021-01-14 Thread Gabriel, Edgar via users
I will have a look at those tests. The recent fixes were not correctness fixes 
but performance fixes.
Nevertheless, we used to pass the mpich tests, but I admit that it is not a 
testsuite that we run regularly; I will have a look at them. The atomicity 
tests are expected to fail, since this is the one chapter of MPI I/O that is 
not implemented in ompio.

Thanks
Edgar

-Original Message-
From: users  On Behalf Of Dave Love via users
Sent: Thursday, January 14, 2021 5:46 AM
To: users@lists.open-mpi.org
Cc: Dave Love 
Subject: [OMPI users] 4.1 mpi-io test failures on lustre

I tried mpi-io tests from mpich 4.3 with openmpi 4.1 on the ac922 system that I 
understand was used to fix ompio problems on lustre.  I'm puzzled that I still 
see failures.

I don't know why there are disjoint sets in mpich's test/mpi/io and 
src/mpi/romio/test, but I ran all the non-Fortran ones with MCA io defaults 
across two nodes.  In src/mpi/romio/test, atomicity failed (ignoring error and 
syshints); in test/mpi/io, the failures were setviewcur, tst_fileview, 
external32_derived_dtype, i_bigtype, and i_setviewcur.  tst_fileview was 
probably killed by the 100s timeout.

It may be that some are only appropriate for romio, but no-one said so before 
and they presumably shouldn't segv or report libc errors.

I built against ucx 1.9 with cuda support.  I realize that has problems on 
ppc64le, with no action on the issue, but there's a limit to what I can do.  
cuda looks relevant since one test crashes while apparently trying to register 
cuda memory; that's presumably not ompio's fault, but we need cuda.


Re: [OMPI users] Parallel HDF5 low performance

2020-12-03 Thread Gabriel, Edgar via users
the reasons for potential performance issues on NFS are very different from 
Lustre. Basically, depending on your use-case and the NFS configuration, you 
have to enforce a different locking policy to ensure correct output files. The 
default value chosen for ompio is the most conservative setting, since this 
was the only setting that we found that would result in a correct output file 
for all of our tests. You can change the setting to see whether other options 
would work for you.

The parameter that you need to work with is fs_ufs_lock_algorithm. Setting it 
to 1 will completely disable locking (and most likely lead to the best 
performance); setting it to 3 is a middle ground (lock specific ranges), 
similar to what ROMIO does. So e.g.

mpiexec -n 16 --mca fs_ufs_lock_algorithm 1 ./mytests

That being said, if you google NFS + MPI I/O, you will find a ton of documents 
and reasons for potential problems, so using MPI I/O on top of NFS (whether 
OMPIO or ROMIO) is always at your own risk.
Thanks

Edgar

-Original Message-
From: users  On Behalf Of Gilles Gouaillardet 
via users
Sent: Thursday, December 3, 2020 4:46 AM
To: Open MPI Users 
Cc: Gilles Gouaillardet 
Subject: Re: [OMPI users] Parallel HDF5 low performance

Patrick,

glad to hear you will upgrade Open MPI thanks to this workaround!

ompio has known performance issues on Lustre (this is why ROMIO is still the 
default on this filesystem) but I do not remember such performance issues 
having been reported on an NFS filesystem.

Sharing a reproducer will be very much appreciated in order to improve ompio

Cheers,

Gilles

On Thu, Dec 3, 2020 at 6:05 PM Patrick Bégou via users 
 wrote:
>
> Thanks Gilles,
>
> this is the solution.
> I will set OMPI_MCA_io=^ompio automatically when loading the parallel
> hdf5 module on the cluster.
>
> I was tracking this problem for several weeks but not looking in the 
> right direction (testing NFS server I/O, network bandwidth.)
>
> I think we will now move definitively to modern OpenMPI implementations.
>
> Patrick
>
> Le 03/12/2020 à 09:06, Gilles Gouaillardet via users a écrit :
> > Patrick,
> >
> >
> > In recent Open MPI releases, the default component for MPI-IO is 
> > ompio (and no more romio)
> >
> > unless the file is on a Lustre filesystem.
> >
> >
> > You can force romio with
> >
> > mpirun --mca io ^ompio ...
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> > On 12/3/2020 4:20 PM, Patrick Bégou via users wrote:
> >> Hi,
> >>
> >> I'm using an old (but required by the codes) version of hdf5 
> >> (1.8.12) in parallel mode in 2 fortran applications. It relies on 
> >> MPI/IO. The storage is NFS mounted on the nodes of a small cluster.
> >>
> >> With OpenMPI 1.7 it runs fine but using modern OpenMPI 3.1 or 4.0.5 
> >> the I/Os are 10x to 100x slower. Are there fundamental changes in 
> >> MPI/IO for these new releases of OpenMPI and a solution to get back 
> >> to the IO performance with this parallel HDF5 release?
> >>
> >> Thanks for your advices
> >>
> >> Patrick
> >>
>


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-26 Thread Gabriel, Edgar via users
I will have a look at the t_bigio tests on Lustre with ompio. We had some 
reports from collaborators about performance problems similar to the one that 
you mention here (which was the reason we were hesitant to make ompio the 
default on Lustre), but part of the problem is that we were not able to 
reproduce it reliably on the systems that we had access to, which makes 
debugging and fixing the issue very difficult. Lustre is a very unforgiving 
file system: if you get something wrong with the settings, the performance is 
not just a bit off, but often off by orders of magnitude (as in your 
measurements).

Thanks!
Edgar

-Original Message-
From: users  On Behalf Of Mark Dixon via users
Sent: Thursday, November 26, 2020 9:38 AM
To: Dave Love via users 
Cc: Mark Dixon ; Dave Love 

Subject: Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

On Wed, 25 Nov 2020, Dave Love via users wrote:

>> The perf test says romio performs a bit better.  Also -- from overall 
>> time -- it's faster on IMB-IO (which I haven't looked at in detail, 
>> and ran with suboptimal striping).
>
> I take that back.  I can't reproduce a significant difference for 
> total IMB-IO runtime, with both run in parallel on 16 ranks, using 
> either the system default of a single 1MB stripe or using eight 
> stripes.  I haven't teased out figures for different operations yet.  
> That must have been done elsewhere, but I've never seen figures.

But remember that IMB-IO doesn't cover everything. For example, hdf5's t_bigio 
parallel test appears to be a pathological case and OMPIO is 2 orders of 
magnitude slower on a Lustre filesystem:

- OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds
- OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds

End users seem to have the choice of:

- use openmpi 4.x and have some things broken (romio)
- use openmpi 4.x and have some things slow (ompio)
- use openmpi 3.x and everything works

My concern is that openmpi 3.x is near, or at, end of life.

Mark


t_bigio runs on centos 7, gcc 4.8.5, ppc64le, openmpi 4.0.5, hdf5 1.10.7, 
Lustre 2.12.5:

[login testpar]$ time mpirun -np 6 ./t_bigio

Testing Dataset1 write by ROW

Testing Dataset2 write by COL

Testing Dataset3 write select ALL proc 0, NONE others

Testing Dataset4 write point selection

Read Testing Dataset1 by COL

Read Testing Dataset2 by ROW

Read Testing Dataset3 read select ALL proc 0, NONE others

Read Testing Dataset4 with Point selection ***Express test mode on.  Several 
tests are skipped

real    0m21.141s
user    2m0.318s
sys     0m3.289s


[login testpar]$ export OMPI_MCA_io=ompio
[login testpar]$ time mpirun -np 6 ./t_bigio

Testing Dataset1 write by ROW

Testing Dataset2 write by COL

Testing Dataset3 write select ALL proc 0, NONE others

Testing Dataset4 write point selection

Read Testing Dataset1 by COL

Read Testing Dataset2 by ROW

Read Testing Dataset3 read select ALL proc 0, NONE others

Read Testing Dataset4 with Point selection ***Express test mode on.  Several 
tests are skipped

real    42m34.103s
user    213m22.925s
sys     8m6.742s



Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-16 Thread Gabriel, Edgar via users
hm, I think this sounds like a different issue; somebody who is more invested 
in the ROMIO Open MPI work should probably have a look.

Regarding compiling Open MPI with Lustre support for ROMIO, I cannot test it 
right now for various reasons, but if I recall correctly the trick was to 
provide the --with-lustre option twice, once inside of the 
"--with-io-romio-flags=" (along with the option that you provided), and once 
outside (for ompio).

Thanks
Edgar

-Original Message-
From: Mark Dixon  
Sent: Monday, November 16, 2020 8:19 AM
To: Gabriel, Edgar via users 
Cc: Gabriel, Edgar 
Subject: Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

Hi Edgar,

Thanks for this - good to know that ompio is an option, despite the reference 
to potential performance issues.

I'm using openmpi 4.0.5 with ucx 1.9.0 and see the hdf5 1.10.7 test "testphdf5" 
timeout (with the timeout set to an hour) using romio. Is it a known issue 
there, please?

When it times out, the last few lines to be printed are these:

Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)

The other thing I note is that openmpi doesn't configure romio's lustre driver, 
even when given "--with-lustre". Regardless, I see the same result whether or 
not I add "--with-io-romio-flags=--with-file-system=lustre+ufs"

Cheers,

Mark

On Mon, 16 Nov 2020, Gabriel, Edgar via users wrote:

> this is in theory still correct, the default MPI I/O library used by 
> Open MPI on Lustre file systems is ROMIO in all release versions. That 
> being said, ompio does have support for Lustre as well starting from 
> the
> 2.1 series, so you can use that as well. The main reason that we did 
> not switch to ompio for Lustre as the default MPI I/O library is a 
> performance issue that can arise under certain circumstances.
>
> Which version of Open MPI are you using? There was a bug fix in the 
> Open MPI to ROMIO integration layer sometime in the 4.0 series that 
> fixed a datatype problem, which caused some problems in the HDF5 
> tests. You might be hitting that problem.
>
> Thanks
> Edgar
>
> -Original Message-
> From: users  On Behalf Of Mark Dixon 
> via users
> Sent: Monday, November 16, 2020 4:32 AM
> To: users@lists.open-mpi.org
> Cc: Mark Dixon 
> Subject: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
>
> Hi all,
>
> I'm confused about how openmpi supports mpi-io on Lustre these days, 
> and am hoping that someone can help.
>
> Back in the openmpi 2.0.0 release notes, it said that OMPIO is the 
> default MPI-IO implementation on everything apart from Lustre, where 
> ROMIO is used. Those release notes are pretty old, but it still 
> appears to be true.
>
> However, I cannot get HDF5 1.10.7 to pass its MPI-IO tests unless I 
> tell openmpi to use OMPIO (OMPI_MCA_io=ompio) and tell UCX not to 
> print warning messages (UCX_LOG_LEVEL=ERROR).
>
> Can I just check: are we still supposed to be using ROMIO?
>
> Thanks,
>
> Mark
>


Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-16 Thread Gabriel, Edgar via users
this is in theory still correct, the default MPI I/O library used by Open MPI 
on Lustre file systems is ROMIO in all release versions. That being said, ompio 
does have support for Lustre as well starting from the 2.1 series, so you can 
use that as well. The main reason that we did not switch to ompio for Lustre as 
the default MPI I/O library is a performance issue that can arise under certain 
circumstances.

Which version of Open MPI are you using? There was a bug fix in the Open MPI to 
ROMIO integration layer sometime in the 4.0 series that fixed a datatype 
problem, which caused some problems in the HDF5 tests. You might be hitting 
that problem.

Thanks
Edgar

-Original Message-
From: users  On Behalf Of Mark Dixon via users
Sent: Monday, November 16, 2020 4:32 AM
To: users@lists.open-mpi.org
Cc: Mark Dixon 
Subject: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

Hi all,

I'm confused about how openmpi supports mpi-io on Lustre these days, and am 
hoping that someone can help.

Back in the openmpi 2.0.0 release notes, it said that OMPIO is the default 
MPI-IO implementation on everything apart from Lustre, where ROMIO is used. 
Those release notes are pretty old, but it still appears to be true.

However, I cannot get HDF5 1.10.7 to pass its MPI-IO tests unless I tell 
openmpi to use OMPIO (OMPI_MCA_io=ompio) and tell UCX not to print warning 
messages (UCX_LOG_LEVEL=ERROR).

Can I just check: are we still supposed to be using ROMIO?

Thanks,

Mark


Re: [OMPI users] ompe support for filesystems

2020-11-04 Thread Gabriel, Edgar via users
the ompio software infrastructure has multiple frameworks. 

fs framework: abstracts out file system level operations (open, close, etc.)

fbtl framework: provides the abstractions and implementations of *individual* 
file I/O operations (seek, read, write, iread, iwrite)

fcoll framework: provides the abstractions and implementations of *collective* 
file I/O operations (read_all, write_all, etc.)

sharedfp framework: provides the abstractions and implementations of *shared 
file pointer* file I/O operations (read_shared, write_shared, read_ordered, 
write_ordered).
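
To make this concrete, here is a tiny made-up example (not from any test 
suite; file name and offsets are arbitrary, and the resulting file content is 
meaningless) showing which user-level call ends up in which framework:

#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    int buf[4] = {0, 1, 2, 3};
    MPI_Offset off;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    off = (MPI_Offset)rank * 2 * sizeof(buf);

    /* fs framework: file system level open (and close below) */
    MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* fbtl framework: individual read/write/iread/iwrite */
    MPI_File_write_at(fh, off, buf, 4, MPI_INT, MPI_STATUS_IGNORE);

    /* fcoll framework: collective read_all/write_all */
    MPI_File_write_at_all(fh, off + (MPI_Offset)sizeof(buf), buf, 4, MPI_INT,
                          MPI_STATUS_IGNORE);

    /* sharedfp framework: shared file pointer operations
       (lands wherever the shared pointer currently is) */
    MPI_File_write_shared(fh, buf, 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}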

Feel free to ping me also directly if you need more assistance. If you are 
looking for a reference and more explanations, please have a look at the 
following paper:
  
Mohamad Chaarawi, Edgar Gabriel, Rainer Keller, Richard Graham, George Bosilca 
and Jack Dongarra, 'OMPIO: A Modular Software Architecture for MPI I/O', in Y. 
Cotronis, A. Danalis, D. Nikolopoulos, J. Dongarra, (Eds.) 'Recent Advances in 
Message Passing Interface', LNCS vol. 6960, pp. 81-89, Springer, 2011.

http://www2.cs.uh.edu/~gabriel/publications/EuroMPI11_OMPIO.pdf 

Best regards
Edgar

-Original Message-
From: users  On Behalf Of Ognen Duzlevski via 
users
Sent: Monday, November 2, 2020 7:54 AM
To: Open MPI Users 
Cc: Ognen Duzlevski 
Subject: Re: [OMPI users] ompe support for filesystems

Gilles,

Thank you for replying.

I took a look at the code and am curious to understand where the actual 
read/write/seek etc. operations are implemented. From what I can see/understand 
- what you pointed me to implements file open/close etc. operations that 
pertain to particular filesystems.

I then tried to figure out the read/write/seek etc. operations and can see that 
an MPI_File structure appears to have an f_io_selected_module member, whose 
v_2_0_0 member seems to have the list of pointers to all the functions dealing 
with the actual file write/read/seek functionality. Is this correct?

What I would like to figure out is where the actual writes or reads happen (as 
in the underlying filesystem's implementations). I imagine for some filesystems 
a write, for example, is not just a simple call to the write onto disk but 
involves a bit more logic/magic.

Thanks!
Ognen


Gilles Gouaillardet via users writes:

> Hi Ognen,
>
> MPI-IO is implemented by two components:
>  - ROMIO (from MPICH)
>  - ompio ("native" Open MPI MPI-IO, default component unless running 
> on Lustre)
>
> Assuming you want to add support for a new filesystem in ompio, first 
> step is to implement a new component in the fs framework the framework 
> is in /ompi/mca/fs, and each component is in its own directory (for 
> example ompi/mca/fs/gpfs)
>
> There are a some configury tricks (create a configure.m4, add Makefile 
> to autoconf, ...) to make sure your component is even compiled.
> If you are struggling with these, feel free to open a Pull Request to 
> get some help fixing the missing bits.
>
> Cheers,
>
> Gilles
>
> On Sun, Nov 1, 2020 at 12:18 PM Ognen Duzlevski via users 
>  wrote:
>>
>> Hello!
>>
>> If I wanted to support a specific filesystem in open mpi, how is this 
>> done? What code in the source tree does it?
>>
>> Thanks!
>> Ognen



Re: [OMPI users] MPI I/O question using MPI_File_write_shared

2020-06-05 Thread Gabriel, Edgar via users
Your code looks correct, and based on your output I would actually suspect that 
the I/O part finished correctly; the error message that you see is not an I/O 
error, but comes from the btl (which is communication related).

What version of Open MPI are you using, and on what file system?
Thanks
Edgar

-Original Message-
From: users  On Behalf Of Stephen Siegel via 
users
Sent: Friday, June 5, 2020 5:35 PM
To: users@lists.open-mpi.org
Cc: Stephen Siegel 
Subject: [OMPI users] MPI I/O question using MPI_File_write_shared

I posted this question on StackOverflow and someone suggested I write to the 
OpenMPI community.

https://stackoverflow.com/questions/62223698/mpi-i-o-why-does-my-program-hang-or-misbehave-when-one-process-writes-using-mpi

Below is a little MPI program.  It is a simple use of MPI I/O.   Process 0 
writes an int to the file using MPI_File_write_shared; no other process writes 
anything.   It works correctly using an MPICH installation, but on two 
different machines using OpenMPI, it either hangs in the middle of the call to 
MPI_File_write_shared, or it reports an error at the end.  Not sure if it is my 
misunderstanding of the MPI Standard or a bug or configuration problem with my 
OpenMPI.

Thanks in advance if anyone can look at it, Steve


#include <mpi.h>
#include <stdio.h>
#include <assert.h>

int nprocs, rank;

int main() {
  MPI_File fh;
  int err, count;
  MPI_Status status;

  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  err = MPI_File_open(MPI_COMM_WORLD, "io_byte_shared.tmp",
  MPI_MODE_CREATE | MPI_MODE_WRONLY,
  MPI_INFO_NULL, &fh);
  assert(err==0);
  err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
  assert(err==0);
  printf("Proc %d: file has been opened.\n", rank); fflush(stdout);
  // Proc 0 only writes header using shared file pointer...
  MPI_Barrier(MPI_COMM_WORLD);
  if (rank == 0) {
    int x = 0;  /* placeholder: the original value was lost in the archive */
    printf("Proc 0: About to write to file.\n"); fflush(stdout);
    err = MPI_File_write_shared(fh, &x, 1, MPI_INT, &status);
printf("Proc 0: Finished writing.\n"); fflush(stdout);
assert(err == 0);
  }
  MPI_Barrier(MPI_COMM_WORLD);
  printf("Proc %d: about to close file.\n", rank); fflush(stdout);
  err = MPI_File_close(&fh);
  assert(err==0);
  MPI_Finalize();
}

Example run:

$ mpicc io_byte_shared.c
$ mpiexec -n 4 ./a.out
Proc 0: file has been opened.
Proc 0: About to write to file.
Proc 0: Finished writing.
Proc 1: file has been opened.
Proc 2: file has been opened.
Proc 3: file has been opened.
Proc 0: about to close file.
Proc 1: about to close file.
Proc 2: about to close file.
Proc 3: about to close file.
[ilyich:12946] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[ilyich:12946] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages




Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Gabriel, Edgar via users
The one test that would give you a good idea of the upper bound for your 
scenario would be to write a benchmark where each process writes to a separate 
file, and look at the overall bandwidth achieved across all processes. The MPI 
I/O performance will be less than or equal to the bandwidth achieved in this 
scenario, as long as the number of processes is moderate.
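
A bare-bones sketch of such a test could look like the following (file names, 
sizes and the timing scheme are arbitrary choices, not tested code): every 
rank writes its own file through MPI_COMM_SELF, and the slowest rank defines 
the aggregate bandwidth.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK   (32 * 1024 * 1024)   /* 32 MiB per write, arbitrary */
#define NCHUNKS 32                   /* 1 GiB per rank in total */

int main(int argc, char **argv) {
    int rank, nprocs, i;
    char fname[64];
    char *buf;
    double t, tmax;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(CHUNK);   /* contents irrelevant for the measurement */
    snprintf(fname, sizeof(fname), "bwtest.%d.tmp", rank);

    /* one file per process: no locking, no coordination */
    MPI_File_open(MPI_COMM_SELF, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime();
    for (i = 0; i < NCHUNKS; i++)
        MPI_File_write_at(fh, (MPI_Offset)i * CHUNK, buf, CHUNK,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);   /* close before stopping the clock */
    t = MPI_Wtime() - t;

    /* the slowest rank determines the sustained aggregate bandwidth */
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("aggregate: %.1f MB/s\n",
               (double)nprocs * NCHUNKS * CHUNK / tmax / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}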

Thanks
Edgar

From: Dong-In Kang 
Sent: Monday, April 6, 2020 9:34 AM
To: Collin Strassburger 
Cc: Open MPI Users ; Gabriel, Edgar 

Subject: Re: [OMPI users] Slow collective MPI File IO

Hi Collin,

It is written in C.
So, I think it is OK.

Thank you,
David


On Mon, Apr 6, 2020 at 10:19 AM Collin Strassburger wrote:
Hello,

Just a quick comment on this; is your code written in C/C++ or Fortran?  
Fortran has issues with writing at a decent speed regardless of MPI setup and 
as such should be avoided for file IO (yet I still occasionally see it 
implemented).

Collin

From: users  On Behalf Of Dong-In Kang via users
Sent: Monday, April 6, 2020 10:02 AM
To: Gabriel, Edgar 
Cc: Dong-In Kang ; Open MPI Users 
Subject: Re: [OMPI users] Slow collective MPI File IO


Thank you Edgar for the information.

I also tried MPI_File_write_at_all(), but it usually makes the performance 
worse.
My program is very simple.
Each MPI process writes a consecutive portion of a file.
No interleaving among the MPI processes.
I think in this case I can use MPI_File_write_at().

I tested the maximum bandwidth of the target devices and they are at least a 
few times bigger than what a single process can achieve.
I tested it using the same program but opening individual files using 
MPI_COMM_SELF.
I tested a 32MB chunk, but it didn't show noticeable changes. I also tried a 
512MB chunk, but saw no noticeable difference.
(There are performance differences between using a 32MB chunk and using a 
512MB chunk, but they still don't make multi-process file I/O exceed the 
performance of single-process file I/O.)
As for the local disk, it is at least 2 times faster than what a single MPI 
process can achieve.
As for the ramdisk, at least 5 times faster.
For Lustre, I know that it is at least 7-8 times or more faster, depending on 
the configuration.

About the caching effect, that would be the case for MPI_File_read().
I can see very high bandwidth for MPI_File_read(), which I believe comes from 
caches in RAM.
But as for MPI_File_write(), I think it isn't affected by caching.
And I create a new file for each test and remove the file at the end of the 
testing.

I may be making a very simple mistake, but I don't know what it is.
From a few reports on the internet, I saw that MPI file I/O could achieve a 
speedup of several times over single-process file I/O when a faster file 
system like Lustre is used.
I started this experiment because I couldn't get a speedup on the Lustre file 
system, and then I moved the experiment to the ramdisk and local disk, because 
that removes the issue of Lustre configuration.

Any comments are welcome.

David











On Mon, Apr 6, 2020 at 9:03 AM Gabriel, Edgar wrote:
Hi,

A couple of comments. First, if you use MPI_File_write_at, this is usually not 
considered collective I/O, even if executed by multiple processes. 
MPI_File_write_at_all would be collective I/O.

Second, MPI I/O cannot do 'magic', but is bound by the hardware that you are 
providing. If a single process is already able to saturate the bandwidth of 
your file system and hardware, you will not be able to see performance 
improvements from multiple processes (with some minor exceptions, maybe, due to 
caching effects, but that is only for smaller problem sizes; the larger the 
amount of data that you try to write, the lesser the caching effects become in 
file I/O). So the first question that you have to answer is: what is the 
sustained bandwidth of your hardware, and are you able to saturate it already 
with a single process? If you are using a single hard drive (or even 2 or 3 
hard drives in a RAID 0 configuration), this is almost certainly the case.

Lastly, the configuration parameters of your tests also play a major role. As a 
general rule, the larger amounts of data you are able to provide per file I/O 
call, the better the performance will be. 1MB of data per call is probably on 
the smaller side. The ompio implementation of MPI I/O breaks large individual 
I/O operations (e.g. MPI_File_write_at) into chunks of 512MB for performance 
reasons internally. Large collective I/O operations (e.g. 
MPI_File_write_at_all) are broken into chunks of 32 MB. This gives you some 
hints on the quantities of data that you would have to use for performance 
reasons.

Along the same lines, one final comment. You say you did 1000 writes of 1MB 
each. For a single process that is about 1GB of data. Depending on how much 
main memory your PC has, this amount of data can still be cached in modern 
systems, and you might have an unrealistically high bandwidth value for the 
1 process case that you are comparing against.

Re: [OMPI users] Slow collective MPI File IO

2020-04-06 Thread Gabriel, Edgar via users
Hi,

A couple of comments. First, if you use MPI_File_write_at, this is usually not 
considered collective I/O, even if executed by multiple processes. 
MPI_File_write_at_all would be collective I/O.

Second, MPI I/O cannot do 'magic', but is bound by the hardware that you are 
providing. If a single process is already able to saturate the bandwidth of 
your file system and hardware, you will not be able to see performance 
improvements from multiple processes (with some minor exceptions, maybe, due to 
caching effects, but that is only for smaller problem sizes; the larger the 
amount of data that you try to write, the lesser the caching effects become in 
file I/O). So the first question that you have to answer is: what is the 
sustained bandwidth of your hardware, and are you able to saturate it already 
with a single process? If you are using a single hard drive (or even 2 or 3 
hard drives in a RAID 0 configuration), this is almost certainly the case.

Lastly, the configuration parameters of your tests also play a major role. As a 
general rule, the larger amounts of data you are able to provide per file I/O 
call, the better the performance will be. 1MB of data per call is probably on 
the smaller side. The ompio implementation of MPI I/O breaks large individual 
I/O operations (e.g. MPI_File_write_at) into chunks of 512MB for performance 
reasons internally. Large collective I/O operations (e.g. 
MPI_File_write_at_all) are broken into chunks of 32 MB. This gives you some 
hints on the quantities of data that you would have to use for performance 
reasons.

Along the same lines, one final comment. You say you did 1000 writes of 1MB 
each. For a single process that is about 1GB of data. Depending on how much 
main memory  your PC has, this amount of data can still be cached in modern 
systems, and you might have an unrealistically high bandwidth value for the 1 
process case that you are comparing against (it depends a bit on what your 
benchmark does, and whether you force flushing the data to disk inside of your 
measurement loop).

Hope this gives you some pointers on where to start to look.
Thanks
Edgar

From: users  On Behalf Of Dong-In Kang via 
users
Sent: Monday, April 6, 2020 7:14 AM
To: users@lists.open-mpi.org
Cc: Dong-In Kang 
Subject: [OMPI users] Slow collective MPI File IO

Hi,

I am running an MPI program where N processes write to a single file on a 
single shared memory machine.
I’m using OpenMPI v.4.0.2.
Each MPI process writes a 1MB chunk of data 1K times sequentially.
There is no overlap in the file between any of the two MPI processes.
I ran the program for -np = {1, 2, 4, 8}.
I am seeing that the speed of the collective write to a file for -np = {2, 4, 
8} never exceeds the speed of -np = {1}.
I did the experiment with a few different file systems {local disk, ram disk, 
Lustre FS}.
For all of them, I see similar results.
The speed of collective write to a single shared file never exceeds the speed 
of single MPI process case.
Any tips or suggestions?

I used the MPI_File_write_at() routine with the proper offset for each MPI 
process.
(I also tried the MPI_File_write_at_all() routine, which makes the performance 
worse as np gets bigger.)
Before writing, MPI_Barrier() is used.
The start time is taken right after MPI_Barrier() using MPI_Wtime().
The end time is taken right after another MPI_Barrier().
The speed of the collective write is calculated as
(total amount of data written to the file) / (time between the first 
MPI_Barrier() and the second MPI_Barrier()).
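
In pseudocode-ish C, the structure of the test is roughly the following (a 
simplified sketch, not the actual benchmark source; the file name and buffer 
handling are placeholders):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1024 * 1024)   /* 1 MB per write */
#define NITER 1024            /* 1K writes per rank */

int main(int argc, char **argv) {
    int rank, nprocs, i;
    char *buf;
    double t0, t1;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    buf = malloc(CHUNK);   /* contents irrelevant for the measurement */

    MPI_File_open(MPI_COMM_WORLD, "shared.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < NITER; i++) {
        /* each rank owns one consecutive, non-overlapping region */
        MPI_Offset off = ((MPI_Offset)rank * NITER + i) * CHUNK;
        MPI_File_write_at(fh, off, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    MPI_File_close(&fh);

    if (rank == 0)
        printf("%.1f MB/s aggregate\n",
               (double)nprocs * NITER * CHUNK / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}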

Any idea to increase the speed?

Thanks,
David



Re: [OMPI users] How to prevent linking in GPFS when it is present

2020-03-30 Thread Gabriel, Edgar via users
ompio only recently added support for GPFS, and it is only available in master 
(so far). If you are using any of the released versions of Open MPI (2.x, 3.x, 
4.x) you will not find this feature in ompio yet. Thus, the issue is only how 
to disable gpfs in romio. I could not find an option for that right away, but I 
will keep looking.

Thanks
Edgar

-Original Message-
From: users  On Behalf Of Jonathon A Anderson 
via users
Sent: Monday, March 30, 2020 4:36 PM
To: users@lists.open-mpi.org
Cc: Jonathon A Anderson 
Subject: Re: [OMPI users] How to prevent linking in GPFS when it is present

I'm going to try ac_cv_header_gpfs_h=no; but --without-gpfs doesn't seem to 
exist. I tried it on both 3.1.5 and 2.1.6

[joan5896@admin2 openmpi-3.1.5]$ ./configure --without-gpfs
configure: WARNING: unrecognized options: --without-gpfs


From: users  on behalf of Gilles Gouaillardet 
via users 
Sent: Sunday, March 29, 2020 6:17 PM
To: users@lists.open-mpi.org
Cc: Gilles Gouaillardet
Subject: Re: [OMPI users] How to prevent linking in GPFS when it is present

Jonathon,


GPFS is used by both the ROMIO component (that comes from MPICH) and the 
fs/gpfs component that is used by ompio

(native Open MPI MPI-IO so to speak).

you should be able to disable both by running

ac_cv_header_gpfs_h=no configure --without-gpfs ...


Note that Open MPI is modular by default (e.g. unless you configure 
--disable-dlopen), and if you run it on

a node that does not have libgpfs.so[.version], you might only see a warning 
and Open MPI will use ompio

(note that might not apply on Lustre since only ROMIO is used on this
filesystem)


Cheers,


Gilles


On 3/30/2020 8:25 AM, Jonathon A Anderson via users wrote:
> We are trying to build Open MPI on a system that happens to have GPFS 
> installed. This appears to cause Open MPI to detect gpfs.h and link against 
> libgpfs.so. We are trying to build a central software stack for use on 
> multiple clusters, some of which do not have GPFS. (It is our experience that 
> this provokes an error, as libgpfs.so is not found on these clusters.) To 
> accommodate this I want to build openmpi explicitly without linking against 
> GPFS.
>
> I tried to accomplish this with
>
> ./configure --with-io-romio-flags='--with-file-system=ufs+nfs'
>
> But gpfs was still linked.
>
> configure:397895: result: -lhwloc -ldl -lz -lpmi2 -lrt -lgpfs -lutil 
> -lm -lfabric
>
> How can I tell Open MPI to not link against GPFS?
>
> ~jonathon
>
>
> p.s., I realize that I could just build on a system that does not have GPFS 
> installed; but I am trying to genericize this to encapsulate in the Spack 
> package. I also don't understand why the Spack package is detecting gpfs.h in 
> the first place, as I thought Spack tries to isolate its build environment 
> from the host system; but I'll ask them that in a separate message.


Re: [OMPI users] Read from file performance degradation when increasing number of processors in some cases

2020-03-06 Thread Gabriel, Edgar via users
How is the performance if you leave a few cores for the OS, e.g. running with 
60 processes instead of 64? The reasoning being that the file read operation is 
really executed by the OS, and could potentially be quite resource intensive.

Thanks
Edgar

From: users  On Behalf Of Ali Cherry via users
Sent: Friday, March 6, 2020 8:06 AM
To: Open MPI Users 
Cc: Ali Cherry 
Subject: Re: [OMPI users] Read from file performance degradation when 
increasing number of processors in some cases

Hello,

Thank you for your replies.
Yes, it is only a single node with 64 cores.
The input file is copied from nfs to a tmpfs when I start the node.
The mpirun command lines were:
$  mpirun -np 64 --mca btl vader,self pms.out /run/user/10002/bigarray.in > 
pms-vader-64.log 2>&1
$ mpirun -np 32 --mca btl vader,self pms.out /run/user/10002/bigarray.in > 
pms-vader-32.log 2>&1
$  mpirun -np 32 --mca btl tcp,self pms.out /run/user/10002/bigarray.in > 
pms-tcp-32.log 2>&1
$  mpirun -np 64 --mca btl tcp,self pms.out /run/user/10002/bigarray.in > 
pms-tcp-64.log 2>&1
$  mpirun -np 32 --mca btl vader,self mpijr.out /run/user/10002/bigarray.in > 
mpijr-vader-32.log 2>&1
$  mpirun -np 64 --mca btl vader,self mpijr.out /run/user/10002/bigarray.in > 
mpijr-vader-64.log 2>&1

I added mpi_just_read_barrier.c: 
https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-mpi_just_read_barrier-c
Unfortunately, despite running mpi_just_read_barrier with 32 cores and 
--bind-to core set, I was unable to run it with 64 cores for the following 
reason:
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:compute-0
   #processes:  2
   #cpus:   1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
—


I will solve this and get back to you soon.


Best regards,
Ali Cherry.



On Mar 6, 2020, at 3:24 PM, Gilles Gouaillardet via users wrote:

 Also, in mpi_just_read.c, what if you add
MPI_Barrier(MPI_COMM_WORLD);
right before invoking
MPI_Finalize();

can you observe a similar performance degradation when moving from 32 to 64 
tasks ?

Cheers,

Gilles
- Original Message -
 Hi,

The log filenames suggest you are always running on a single node, is that 
correct?
Do you create the input file on the tmpfs once for all, or before each run?
Can you please post your mpirun command lines?
If you did not bind the tasks, can you try again
mpirun --bind-to core ...

Cheers,

Gilles
- Original Message -
Hi,

We faced an issue when testing the scalability of a parallel merge sort using a 
reduction tree on an array of size 1024^3.
Currently, only the master opens the input file, parses it into an array 
using fscanf, and then distributes the array to the other processors.
When using 32 processors, it took ~109 seconds to read from the file.
When using 64 processors, it took ~216 seconds to read from the file.
Despite varying the number of processors, only one processor (the master) reads 
the file.
The input file is stored in a tmpfs; it is made up of 1024^3 + 1 numbers (where 
the first number is the array size).

Additionally, I ran a C program that only read the file; it took ~104 seconds.
However, I also ran an MPI program that only read the file; it took ~116 and 
~118 seconds on 32 and 64 processors respectively.

Code at  https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf
parallel_ms.c:  
https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-parallel_ms-c
mpi_just_read.c:  
https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-mpi_just_read-c
just_read.c:  
https://gist.github.com/alichry/84a9721bac741ffdf891e70b82274aaf#file-just_read-c

Clearly, increasing the number of processors for mpi_just_read.c did not 
severely affect the elapsed time.
For parallel_ms.c, is it possible that 63 processors being in a blocking-read 
state from processor 0 is somehow affecting the read-from-file elapsed time?

Any assistance or clarification would be appreciated.
Ali.











Re: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, Fails in Open MPI

2020-02-24 Thread Gabriel, Edgar via users
I am not an expert on the one-sided code in Open MPI, but I wanted to comment 
briefly on the potential MPI-IO related item. As far as I can see, the error 
message

“Read -1, expected 48, errno = 1”

does not stem from MPI I/O, at least not from the ompio library. What file 
system did you use for these tests?

Thanks
Edgar

From: users  On Behalf Of Matt Thompson via 
users
Sent: Monday, February 24, 2020 1:20 PM
To: users@lists.open-mpi.org
Cc: Matt Thompson 
Subject: [OMPI users] Help with One-Sided Communication: Works in Intel MPI, 
Fails in Open MPI

All,

My guess is this is an "I built Open MPI incorrectly" sort of issue, but I'm not 
sure how to fix it. Namely, I'm currently trying to get an MPI project's CI 
working on CircleCI using Open MPI to run some unit tests (on a single node, so 
I need some oversubscription). I can build everything just fine, but when I try to 
run, things just...blow up:

[root@3796b115c961 build]# /opt/openmpi-4.0.2/bin/mpirun -np 18 -oversubscribe 
/root/project/MAPL/build/bin/pfio_ctest_io.x -nc 6 -nsi 6 -nso 6 -ngo 1 -ngi 1 
-v T,U -s mpi
 start app rank:   0
 start app rank:   1
 start app rank:   2
 start app rank:   3
 start app rank:   4
 start app rank:   5
[3796b115c961:03629] Read -1, expected 48, errno = 1
[3796b115c961:03629] *** An error occurred in MPI_Get
[3796b115c961:03629] *** reported by process [2144600065,12]
[3796b115c961:03629] *** on win rdma window 5
[3796b115c961:03629] *** MPI_ERR_OTHER: known error not in list
[3796b115c961:03629] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
abort,
[3796b115c961:03629] ***and potentially your MPI job)

I'm currently more concerned about the MPI_Get error, though I'm not sure what 
that "Read -1, expected 48, errno = 1" bit is about (MPI-IO error?). Now this 
code is fairly fancy MPI code, so I decided to try a simpler one. Searched the 
internet and found an example program here:

https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication

and when I build and run with Intel MPI it works:

(1027)(master) $ mpirun -V
Intel(R) MPI Library for Linux* OS, Version 2018 Update 4 Build 20180823 (id: 
18555)
Copyright 2003-2018 Intel Corporation.
(1028)(master) $ mpiicc rma_test.c
(1029)(master) $ mpirun -np 2 ./a.out
srun.slurm: cluster configuration lacks support for cpu binding
Rank 0 running on borgj001
Rank 1 running on borgj001
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
Rank 0 gets data from the shared memory: 10 11 12 13
Rank 1 gets data from the shared memory: 00 01 02 03
Rank 0 has new data in the shared memory:Rank 1 has new data in the shared 
memory: 10 11 12 13
 00 01 02 03

So, I have some confidence it was written correctly. Now on the same system I 
try with Open MPI (building with gcc, not Intel C):

(1032)(master) $ mpirun -V
mpirun (Open MPI) 4.0.1

Report bugs to http://www.open-mpi.org/community/help/
(1033)(master) $ mpicc rma_test.c
(1034)(master) $ mpirun -np 2 ./a.out
Rank 0 running on borgj001
Rank 1 running on borgj001
Rank 0 sets data in the shared memory: 00 01 02 03
Rank 1 sets data in the shared memory: 10 11 12 13
[borgj001:22668] *** An error occurred in MPI_Get
[borgj001:22668] *** reported by process [2514223105,1]
[borgj001:22668] *** on win rdma window 3
[borgj001:22668] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[borgj001:22668] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[borgj001:22668] ***and potentially your MPI job)
[borgj001:22642] 1 more process has sent help message help-mpi-errors.txt / 
mpi_errors_are_fatal
[borgj001:22642] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages

This is a similar failure to above. Any ideas what I might be doing wrong here? 
I don't doubt I'm missing something, but I'm not sure what. Open MPI was built 
pretty boringly:

Configure command line: '--with-slurm' '--enable-shared' 
'--disable-wrapper-rpath' '--disable-wrapper-runpath' 
'--enable-mca-no-build=btl-usnic' '--prefix=...'

And I'm not sure if we need those disable-wrapper bits anymore, but long ago we 
needed them, and so they've lived on in "how to build" READMEs until something 
breaks. This btl-usnic is a bit unknown to me (this was built by sysadmins on a 
cluster), but this is pretty close to how I build on my desktop and it has the 
same issue.
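
For reference, the one-sided pattern being exercised boils down to something
like the following minimal sketch (this is not the exact program from the Intel
page, just the generic window-allocate / fence / MPI_Get sequence):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *buf;
    MPI_Win win;
    /* every rank exposes 4 ints through an RMA window */
    MPI_Win_allocate(4 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    for (int i = 0; i < 4; i++)
        buf[i] = rank * 10 + i;

    int remote[4];
    int peer = (rank + 1) % size;

    MPI_Win_fence(0, win);
    /* read the neighbour's 4 ints -- the MPI_Get call is what aborts above */
    MPI_Get(remote, 4, MPI_INT, peer, 0, 4, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("Rank %d got %d %d %d %d from rank %d\n",
           rank, remote[0], remote[1], remote[2], remote[3], peer);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}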

Any ideas from the experts?

--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton


Re: [OMPI users] Deadlock in netcdf tests

2019-10-26 Thread Gabriel, Edgar via users
Orion,
It might be a good idea. This bug is triggered from the fcoll/two_phase 
component (and having spent just two minutes looking at it, I have a 
suspicion about what triggers it, namely an int vs. long conversion issue), so it is 
probably unrelated to the other one.

I need to add running the netcdf test cases to my list of standard testsuites, 
but we didn't use to have any problems with them :-(
Thanks for the report, we will be working on them!

Edgar


> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Orion
> Poplawski via users
> Sent: Friday, October 25, 2019 10:21 PM
> To: Open MPI Users 
> Cc: Orion Poplawski 
> Subject: Re: [OMPI users] Deadlock in netcdf tests
> 
> Thanks for the response, the workaround helps.
> 
> With that out of the way I see:
> 
> + mpiexec -n 4 ./tst_parallel4
> Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >=
> num_aggregators(1)fd_size=461172966257152 off=4156705856
> Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >=
> num_aggregators(1)fd_size=4611731477435006976 off=4157193280
> 
> Should I file issues for both of these?
> 
> On 10/25/19 2:29 AM, Gilles Gouaillardet via users wrote:
> > Orion,
> >
> >
> > thanks for the report.
> >
> >
> > I can confirm this is indeed an Open MPI bug.
> >
> > FWIW, a workaround is to disable the fcoll/vulcan component.
> >
> > That can be achieved by
> >
> > mpirun --mca fcoll ^vulcan ...
> >
> > or
> >
> > OMPI_MCA_fcoll=^vulcan mpirun ...
> >
> >
> > I also noted the tst_parallel3 program crashes with the ROMIO component.
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> > On 10/25/2019 12:55 PM, Orion Poplawski via users wrote:
> >> On 10/24/19 9:28 PM, Orion Poplawski via users wrote:
> >>> Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are
> >>> seeing a test hang with openmpi 4.0.2. Backtrace:
> >>>
> >>> (gdb) bt
> >>> #0  0x7f90c197529b in sched_yield () from /lib64/libc.so.6
> >>> #1  0x7f90c1ac8a05 in ompi_request_default_wait () from
> >>> /usr/lib64/openmpi/lib/libmpi.so.40
> >>> #2  0x7f90c1b2b35c in ompi_coll_base_sendrecv_actual () from
> >>> /usr/lib64/openmpi/lib/libmpi.so.40
> >>> #3  0x7f90c1b2bb73 in
> >>> ompi_coll_base_allreduce_intra_recursivedoubling () from
> >>> /usr/lib64/openmpi/lib/libmpi.so.40
> >>> #4  0x7f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from
> >>> /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
> >>> #5  0x7f90be9fada0 in mca_common_ompio_file_write_at_all () from
> >>> /usr/lib64/openmpi/lib/libmca_common_ompio.so.41
> >>> #6  0x7f90beb0610b in mca_io_ompio_file_write_at_all () from
> >>> /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
> >>> #7  0x7f90c1af033f in PMPI_File_write_at_all () from
> >>> /usr/lib64/openmpi/lib/libmpi.so.40
> >>> #8  0x7f90c1627d7b in H5FD_mpio_write () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #9  0x7f90c14636ee in H5FD_write () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #10 0x7f90c1442eb3 in H5F__accum_write () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #11 0x7f90c1543729 in H5PB_write () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #12 0x7f90c144d69c in H5F_block_write () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #13 0x7f90c161cd10 in H5C_apply_candidate_list () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #14 0x7f90c161ad02 in H5AC__run_sync_point () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #15 0x7f90c161bd4f in H5AC__flush_entries () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #16 0x7f90c13b154d in H5AC_flush () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #17 0x7f90c1446761 in H5F__flush_phase2.part.0 () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #18 0x7f90c1448e64 in H5F__flush () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #19 0x7f90c144dc08 in H5F_flush_mounts_recurse () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #20 0x7f90c144f171 in H5F_flush_mounts () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #21 0x7f90c143e3a5 in H5Fflush () from
> >>> /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #22 0x7f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at
> >>> ../../libhdf5/hdf5file.c:222
> >>> #23 0x7f90c1c1816e in NC4_enddef (ncid=) at
> >>> ../../libhdf5/hdf5file.c:544
> >>> #24 0x7f90c1bd94f3 in nc_enddef (ncid=65536) at
> >>> ../../libdispatch/dfile.c:1004
> >>> #25 0x56527d0def27 in test_pio (flag=0) at
> >>> ../../nc_test4/tst_parallel3.c:206
> >>> #26 0x56527d0de62c in main (argc=,
> argv= >>> out>) at ../../nc_test4/tst_parallel3.c:91
> >>>
> >>> processes are running full out.
> >>>
> >>> Suggestions for debugging this would be greatly appreciated.
> >>>
> >>
> >> Some more info - I think now it is more dependent on openmpi versions
> >> than netcdf itself:
> >>
> >> - last successful build was with netcdf 

Re: [OMPI users] Deadlock in netcdf tests

2019-10-25 Thread Gabriel, Edgar via users
Never mind, I see it in the backtrace :-)
Will look into it, but I am currently traveling. Until then, Gilles' suggestion is 
probably the right approach.
Thanks
Edgar

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gabriel,
> Edgar via users
> Sent: Friday, October 25, 2019 7:43 AM
> To: Open MPI Users 
> Cc: Gabriel, Edgar 
> Subject: Re: [OMPI users] Deadlock in netcdf tests
> 
> Orion,
>  I will look into this problem, is there a specific code or testcase that 
> triggers
> this problem?
> Thanks
> Edgar
> 
> > -Original Message-
> > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> > Orion Poplawski via users
> > Sent: Thursday, October 24, 2019 11:56 PM
> > To: Open MPI Users 
> > Cc: Orion Poplawski 
> > Subject: Re: [OMPI users] Deadlock in netcdf tests
> >
> > On 10/24/19 9:28 PM, Orion Poplawski via users wrote:
> > > Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are
> > > seeing a test hang with openmpi 4.0.2.  Backtrace:
> > >
> > > (gdb) bt
> > > #0  0x7f90c197529b in sched_yield () from /lib64/libc.so.6
> > > #1  0x7f90c1ac8a05 in ompi_request_default_wait () from
> > > /usr/lib64/openmpi/lib/libmpi.so.40
> > > #2  0x7f90c1b2b35c in ompi_coll_base_sendrecv_actual () from
> > > /usr/lib64/openmpi/lib/libmpi.so.40
> > > #3  0x7f90c1b2bb73 in
> > > ompi_coll_base_allreduce_intra_recursivedoubling () from
> > > /usr/lib64/openmpi/lib/libmpi.so.40
> > > #4  0x7f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from
> > > /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
> > > #5  0x7f90be9fada0 in mca_common_ompio_file_write_at_all () from
> > > /usr/lib64/openmpi/lib/libmca_common_ompio.so.41
> > > #6  0x7f90beb0610b in mca_io_ompio_file_write_at_all () from
> > > /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
> > > #7  0x7f90c1af033f in PMPI_File_write_at_all () from
> > > /usr/lib64/openmpi/lib/libmpi.so.40
> > > #8  0x7f90c1627d7b in H5FD_mpio_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #9  0x7f90c14636ee in H5FD_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #10 0x7f90c1442eb3 in H5F__accum_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #11 0x7f90c1543729 in H5PB_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #12 0x7f90c144d69c in H5F_block_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #13 0x7f90c161cd10 in H5C_apply_candidate_list () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #14 0x7f90c161ad02 in H5AC__run_sync_point () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #15 0x7f90c161bd4f in H5AC__flush_entries () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #16 0x7f90c13b154d in H5AC_flush () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #17 0x7f90c1446761 in H5F__flush_phase2.part.0 () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #18 0x7f90c1448e64 in H5F__flush () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #19 0x7f90c144dc08 in H5F_flush_mounts_recurse () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #20 0x7f90c144f171 in H5F_flush_mounts () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #21 0x7f90c143e3a5 in H5Fflush () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #22 0x7f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at
> > > ../../libhdf5/hdf5file.c:222
> > > #23 0x7f90c1c1816e in NC4_enddef (ncid=) at
> > > ../../libhdf5/hdf5file.c:544
> > > #24 0x7f90c1bd94f3 in nc_enddef (ncid=65536) at
> > > ../../libdispatch/dfile.c:1004
> > > #25 0x56527d0def27 in test_pio (flag=0) at
> > > ../../nc_test4/tst_parallel3.c:206
> > > #26 0x56527d0de62c in main (argc=,
> > > argv= > > out>) at ../../nc_test4/tst_parallel3.c:91
> > >
> > > processes are running full out.
> > >
> > > Suggestions for debugging this would be greatly appreciated.
> > >
> >
> > Some more info - I think now it is more dependent on openmpi versions
> > than netcdf itself:
> >
> > - last successful build was with netcdf 4.7.0, openmpi 4.0.1, ucx
> > 1.5.2, pmix-3.1.4.  Possible start of the failure was with openmpi
> > 4.0.2-rc1 and ucx 1.6.0.
> >
> > - netcdf 4.7.0 test hangs on Fedora Rawhide (F32) with openmpi 4.0.2,
> > ucx 1.6.1, pmix 3.1.4
> >
> > - netcdf 4.7.0 test hangs on Fedora F31 with openmpi 4.0.2rc2 with
> > internal UCX.
> >
> > --
> > Orion Poplawski
> > Manager of NWRA Technical Systems  720-772-5637
> > NWRA, Boulder/CoRA Office FAX: 303-415-9702
> > 3380 Mitchell Lane   or...@nwra.com
> > Boulder, CO 80301 https://www.nwra.com/



Re: [OMPI users] Deadlock in netcdf tests

2019-10-25 Thread Gabriel, Edgar via users
Orion,
 I will look into this problem, is there a specific code or testcase that 
triggers this problem?
Thanks
Edgar

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Orion
> Poplawski via users
> Sent: Thursday, October 24, 2019 11:56 PM
> To: Open MPI Users 
> Cc: Orion Poplawski 
> Subject: Re: [OMPI users] Deadlock in netcdf tests
> 
> On 10/24/19 9:28 PM, Orion Poplawski via users wrote:
> > Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are seeing a
> > test hang with openmpi 4.0.2.  Backtrace:
> >
> > (gdb) bt
> > #0  0x7f90c197529b in sched_yield () from /lib64/libc.so.6
> > #1  0x7f90c1ac8a05 in ompi_request_default_wait () from
> > /usr/lib64/openmpi/lib/libmpi.so.40
> > #2  0x7f90c1b2b35c in ompi_coll_base_sendrecv_actual () from
> > /usr/lib64/openmpi/lib/libmpi.so.40
> > #3  0x7f90c1b2bb73 in
> > ompi_coll_base_allreduce_intra_recursivedoubling () from
> > /usr/lib64/openmpi/lib/libmpi.so.40
> > #4  0x7f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from
> > /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
> > #5  0x7f90be9fada0 in mca_common_ompio_file_write_at_all () from
> > /usr/lib64/openmpi/lib/libmca_common_ompio.so.41
> > #6  0x7f90beb0610b in mca_io_ompio_file_write_at_all () from
> > /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
> > #7  0x7f90c1af033f in PMPI_File_write_at_all () from
> > /usr/lib64/openmpi/lib/libmpi.so.40
> > #8  0x7f90c1627d7b in H5FD_mpio_write () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #9  0x7f90c14636ee in H5FD_write () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #10 0x7f90c1442eb3 in H5F__accum_write () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #11 0x7f90c1543729 in H5PB_write () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #12 0x7f90c144d69c in H5F_block_write () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #13 0x7f90c161cd10 in H5C_apply_candidate_list () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #14 0x7f90c161ad02 in H5AC__run_sync_point () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #15 0x7f90c161bd4f in H5AC__flush_entries () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #16 0x7f90c13b154d in H5AC_flush () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #17 0x7f90c1446761 in H5F__flush_phase2.part.0 () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #18 0x7f90c1448e64 in H5F__flush () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #19 0x7f90c144dc08 in H5F_flush_mounts_recurse () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #20 0x7f90c144f171 in H5F_flush_mounts () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #21 0x7f90c143e3a5 in H5Fflush () from
> > /usr/lib64/openmpi/lib/libhdf5.so.103
> > #22 0x7f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at
> > ../../libhdf5/hdf5file.c:222
> > #23 0x7f90c1c1816e in NC4_enddef (ncid=) at
> > ../../libhdf5/hdf5file.c:544
> > #24 0x7f90c1bd94f3 in nc_enddef (ncid=65536) at
> > ../../libdispatch/dfile.c:1004
> > #25 0x56527d0def27 in test_pio (flag=0) at
> > ../../nc_test4/tst_parallel3.c:206
> > #26 0x56527d0de62c in main (argc=, argv= > out>) at ../../nc_test4/tst_parallel3.c:91
> >
> > processes are running full out.
> >
> > Suggestions for debugging this would be greatly appreciated.
> >
> 
> Some more info - I think now it is more dependent on openmpi versions
> than netcdf itself:
> 
> - last successful build was with netcdf 4.7.0, openmpi 4.0.1, ucx 1.5.2,
> pmix-3.1.4.  Possible start of the failure was with openmpi 4.0.2-rc1
> and ucx 1.6.0.
> 
> - netcdf 4.7.0 test hangs on Fedora Rawhide (F32) with openmpi 4.0.2,
> ucx 1.6.1, pmix 3.1.4
> 
> - netcdf 4.7.0 test hangs on Fedora F31 with openmpi 4.0.2rc2 with
> internal UCX.
> 
> --
> Orion Poplawski
> Manager of NWRA Technical Systems  720-772-5637
> NWRA, Boulder/CoRA Office FAX: 303-415-9702
> 3380 Mitchell Lane   or...@nwra.com
> Boulder, CO 80301 https://www.nwra.com/



Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-21 Thread Gabriel, Edgar
Yes, I was talking about the same thing, although for me it was not t_mpi, but 
t_shapesame that was hanging. It might be an indication of the same issue 
however.

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan
> Novosielski
> Sent: Thursday, February 21, 2019 1:59 PM
> To: Open MPI Users 
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
> 3.1.3
> 
> 
> > On Feb 21, 2019, at 2:52 PM, Gabriel, Edgar 
> wrote:
> >
> >> -Original Message-
> >>> Does it always occur at 20+ minutes elapsed ?
> >>
> >> Aha! Yes, you are right: every time it fails, it’s at the 20 minute
> >> and a couple of seconds mark. For comparison, every time it runs, it
> >> runs for 2-3 seconds total. So it seems like what might actually be
> >> happening here is a hang, and not a failure of the test per se.
> >>
> >
> > I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8
> (although this was OpenSuSE, not Redhat), and it looked to me like one of
> tests were hanging, but I didn't have time to investigate it further.
> 
> Just to be clear, the hanging test I have is t_mpi from HDF5 1.10.4. The
> OpenMPI 3.1.3 make check passes just fine on all of our builds. But I don’t
> believe it ever launches any jobs or anything like that.
> 
> --
> 
> || \\UTGERS,   
> |---*O*---
> ||_// the State| Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ| Office of Advanced Research Computing - MSB C630,
> Newark
>  `'

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-21 Thread Gabriel, Edgar
> -Original Message-
> > Does it always occur at 20+ minutes elapsed ?
> 
> Aha! Yes, you are right: every time it fails, it’s at the 20 minute and a 
> couple
> of seconds mark. For comparison, every time it runs, it runs for 2-3 seconds
> total. So it seems like what might actually be happening here is a hang, and
> not a failure of the test per se.
> 

I *think* I can confirm that. I compiled 3.1.3 yesterday with gcc 4.8 (although 
this was OpenSuSE, not Redhat), and it looked to me like one of the tests was 
hanging, but I didn't have time to investigate it further.

Thanks
Edgar

> > Is there some mechanism that automatically kills a job if it does not write
> anything to stdout for some time ?
> >
> > A quick way to rule that out is to
> >
> > srun -- mpi=pmi2 -p main -t 1:00:00 -n6 -N1 sleep 1800
> >
> > and see if that completes or get killed with the same error message.
> 
> I was not aware of anything like that, but I’ll look into it now (running your
> suggestion). I guess we don’t run across this sort of thing very often — most
> stuff at least prints output when it starts.
> 
> > You can also run use mpirun instead of srun, and even run mpirun
> > outside of slurm
> >
> > (if your cluster policy allows it, you can for example use mpirun and
> > run on the frontend node)
> 
> I’m on the team that manages the cluster, so we can try various things. Every
> piece of software we ever run, though, runs via srun — we don’t provide
> mpirun as a matter of course, except in some corner cases.
> 
> > On 2/21/2019 3:01 AM, Ryan Novosielski wrote:
> >> Does it make any sense that it seems to work fine when OpenMPI and
> HDF5 are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with
> RHEL-supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5
> build, I did try an XFS filesystem and it didn’t help. GPFS works fine for 
> either
> of the 7.4 and 8.2 builds.
> >>
> >> Just as a reminder, since it was reasonably far back in the thread, what
> I’m doing is running the “make check” tests in HDF5 1.10.4, in part because
> users use it, but also because it seems to have a good test suite and I can
> therefore verify the compiler and MPI stack installs. I get very little
> information, apart from it not working and getting that “Alarm clock”
> message.
> >>
> >> I originally suspected I’d somehow built some component of this with a
> host-specific optimization that wasn’t working on some compute nodes. But I
> controlled for that and it didn’t seem to make any difference.
> >>
> >> --
> >> 
> >> || \\UTGERS,
> >> |---*O*---
> >> ||_// the State | Ryan Novosielski - novos...@rutgers.edu
> >> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS
> Campus
> >> ||  \\of NJ | Office of Advanced Research Computing - MSB C630,
> Newark
> >>  `'
> >>
> >>> On Feb 18, 2019, at 1:34 PM, Ryan Novosielski 
> wrote:
> >>>
> >>> It didn’t work any better with XFS, as it happens. Must be something
> else. I’m going to test some more and see if I can narrow it down any, as it
> seems to me that it did work with a different compiler.
> >>>
> >>> --
> >>> 
> >>> || \\UTGERS,   
> >>> |---*O*---
> >>> ||_// the State| Ryan Novosielski - novos...@rutgers.edu
> >>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS
> Campus
> >>> ||  \\of NJ| Office of Advanced Research Computing - MSB
> C630, Newark
> >>> `'
> >>>
> >>>> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar
>  wrote:
> >>>>
> >>>> While I was working on something else, I let the tests run with Open
> MPI master (which is for parallel I/O equivalent to the upcoming v4.0.1
> release), and here is what I found for the HDF5 1.10.4 tests on my local
> desktop:
> >>>>
> >>>> In the testpar directory, there is in fact one test that fails for both
> ompio and romio321 in exactly the same manner.
> >>>> I used 6 processes as you did (although I used mpirun directly  instead
> of srun...) From the 13 tests in the testpar directory, 12 pass correctly
> (t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term,
> t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).
> >>>>
> >>>> The one tests that off

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-20 Thread Gabriel, Edgar
Well, the way you describe it, it sounds to me like maybe an atomics issue with 
this compiler version. What was your Open MPI configure line, and what 
network interconnect are you using?

An easy way to test this theory would be to force Open MPI to use the tcp 
interfaces (everything will be slow, however). You can do that by creating a 
directory called .openmpi in your home directory and adding a file called 
mca-params.conf there.

The file should look something like this:

btl = tcp,self
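
For a quick one-off test, the same selection can also be passed on the mpirun 
command line, e.g. something like:

mpirun --mca btl tcp,self -np 2 ./a.out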



Thanks
Edgar



> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan
> Novosielski
> Sent: Wednesday, February 20, 2019 12:02 PM
> To: Open MPI Users 
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
> 3.1.3
> 
> Does it make any sense that it seems to work fine when OpenMPI and HDF5
> are built with GCC 7.4 and GCC 8.2, but /not/ when they are built with RHEL-
> supplied GCC 4.8.5? That appears to be the scenario. For the GCC 4.8.5 build,
> I did try an XFS filesystem and it didn’t help. GPFS works fine for either of 
> the
> 7.4 and 8.2 builds.
> 
> Just as a reminder, since it was reasonably far back in the thread, what I’m
> doing is running the “make check” tests in HDF5 1.10.4, in part because users
> use it, but also because it seems to have a good test suite and I can 
> therefore
> verify the compiler and MPI stack installs. I get very little information, 
> apart
> from it not working and getting that “Alarm clock” message.
> 
> I originally suspected I’d somehow built some component of this with a host-
> specific optimization that wasn’t working on some compute nodes. But I
> controlled for that and it didn’t seem to make any difference.
> 
> --
> 
> || \\UTGERS,   
> |---*O*---
> ||_// the State| Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ| Office of Advanced Research Computing - MSB C630,
> Newark
>  `'
> 
> > On Feb 18, 2019, at 1:34 PM, Ryan Novosielski 
> wrote:
> >
> > It didn’t work any better with XFS, as it happens. Must be something else.
> I’m going to test some more and see if I can narrow it down any, as it seems
> to me that it did work with a different compiler.
> >
> > --
> > 
> > || \\UTGERS, 
> > |---*O*---
> > ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
> > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS
> Campus
> > ||  \\of NJ  | Office of Advanced Research Computing - MSB C630,
> Newark
> > `'
> >
> >> On Feb 18, 2019, at 12:23 PM, Gabriel, Edgar 
> wrote:
> >>
> >> While I was working on something else, I let the tests run with Open MPI
> master (which is for parallel I/O equivalent to the upcoming v4.0.1  release),
> and here is what I found for the HDF5 1.10.4 tests on my local desktop:
> >>
> >> In the testpar directory, there is in fact one test that fails for both 
> >> ompio
> and romio321 in exactly the same manner.
> >> I used 6 processes as you did (although I used mpirun directly  instead of
> srun...) From the 13 tests in the testpar directory, 12 pass correctly 
> (t_bigio,
> t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, t_mpi,
> t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame).
> >>
> >> The one tests that officially fails ( t_pflush1) actually reports that it 
> >> passed,
> but then throws message that indicates that MPI_Abort has been called, for
> both ompio and romio. I will try to investigate this test to see what is going
> on.
> >>
> >> That being said, your report shows an issue in t_mpi, which passes
> without problems for me. This is however not GPFS, this was an XFS local file
> system. Running the tests on GPFS are on my todo list as well.
> >>
> >> Thanks
> >> Edgar
> >>
> >>
> >>
> >>> -Original Message-
> >>> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> >>> Gabriel, Edgar
> >>> Sent: Sunday, February 17, 2019 10:34 AM
> >>> To: Open MPI Users 
> >>> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems
> >>> w/OpenMPI
> >>> 3.1.3
> >>>
> >>> I will also run our testsuite and the HDF5 testsuite on GPFS, I have
> >>> access to a GPFS file system since recently, and wil

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-18 Thread Gabriel, Edgar
While I was working on something else, I let the tests run with Open MPI master 
(which, for parallel I/O, is equivalent to the upcoming v4.0.1 release), and 
here is what I found for the HDF5 1.10.4 tests on my local desktop:

In the testpar directory, there is in fact one test that fails for both ompio 
and romio321 in exactly the same manner.
I used 6 processes as you did (although I used mpirun directly  instead of 
srun...) From the 13 tests in the testpar directory, 12 pass correctly 
(t_bigio, t_cache, t_cache_image, testphdf5, t_filters_parallel, t_init_term, 
t_mpi, t_pflush2, t_pread, t_prestart, t_pshutdown, t_shapesame). 

The one test that officially fails (t_pflush1) actually reports that it 
passed, but then throws a message that indicates that MPI_Abort has been called, 
for both ompio and romio. I will try to investigate this test to see what is 
going on.

That being said, your report shows an issue in t_mpi, which passes without 
problems for me. This was, however, not GPFS; this was a local XFS file system. 
Running the tests on GPFS is on my todo list as well.

Thanks
Edgar



> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> Gabriel, Edgar
> Sent: Sunday, February 17, 2019 10:34 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
> 3.1.3
> 
> I will also run our testsuite and the HDF5 testsuite on GPFS, I have access 
> to a
> GPFS file system since recently, and will report back on that, but it will 
> take a
> few days.
> 
> Thanks
> Edgar
> 
> > -Original Message-
> > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> > Ryan Novosielski
> > Sent: Sunday, February 17, 2019 2:37 AM
> > To: users@lists.open-mpi.org
> > Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
> > 3.1.3
> >
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA1
> >
> > This is on GPFS. I'll try it on XFS to see if it makes any difference.
> >
> > On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
> > > Ryan,
> > >
> > > What filesystem are you running on ?
> > >
> > > Open MPI defaults to the ompio component, except on Lustre
> > > filesystem where ROMIO is used. (if the issue is related to ROMIO,
> > > that can explain why you did not see any difference, in that case,
> > > you might want to try an other filesystem (local filesystem or NFS
> > > for example)\
> > >
> > >
> > > Cheers,
> > >
> > > Gilles
> > >
> > > On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski
> > >  wrote:
> > >>
> > >> I verified that it makes it through to a bash prompt, but I’m a
> > >> little less confident that something make test does doesn’t clear it.
> > >> Any recommendation for a way to verify?
> > >>
> > >> In any case, no change, unfortunately.
> > >>
> > >> Sent from my iPhone
> > >>
> > >>> On Feb 16, 2019, at 08:13, Gabriel, Edgar
> > >>> 
> > >>> wrote:
> > >>>
> > >>> What file system are you running on?
> > >>>
> > >>> I will look into this, but it might be later next week. I just
> > >>> wanted to emphasize that we are regularly running the parallel
> > >>> hdf5 tests with ompio, and I am not aware of any outstanding items
> > >>> that do not work (and are supposed to work). That being said, I
> > >>> run the tests manually, and not the 'make test'
> > >>> commands. Will have to check which tests are being run by that.
> > >>>
> > >>> Edgar
> > >>>
> > >>>> -Original Message- From: users
> > >>>> [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles
> > >>>> Gouaillardet Sent: Saturday, February 16, 2019 1:49 AM To: Open
> > >>>> MPI Users  Subject: Re:
> > >>>> [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
> > >>>> 3.1.3
> > >>>>
> > >>>> Ryan,
> > >>>>
> > >>>> Can you
> > >>>>
> > >>>> export OMPI_MCA_io=^ompio
> > >>>>
> > >>>> and try again after you made sure this environment variable is
> > >>>> passed by srun to the MPI tasks ?
> > >>>>
> > >>>> We have identified and fixed several issues specific t

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-17 Thread Gabriel, Edgar
I will also run our testsuite and the HDF5 testsuite on GPFS; I recently got 
access to a GPFS file system, and will report back on that, but it will 
take a few days.

Thanks
Edgar

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Ryan
> Novosielski
> Sent: Sunday, February 17, 2019 2:37 AM
> To: users@lists.open-mpi.org
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
> 3.1.3
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> This is on GPFS. I'll try it on XFS to see if it makes any difference.
> 
> On 2/16/19 11:57 PM, Gilles Gouaillardet wrote:
> > Ryan,
> >
> > What filesystem are you running on ?
> >
> > Open MPI defaults to the ompio component, except on Lustre filesystem
> > where ROMIO is used. (if the issue is related to ROMIO, that can
> > explain why you did not see any difference, in that case, you might
> > want to try an other filesystem (local filesystem or NFS for example)\
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > On Sun, Feb 17, 2019 at 3:08 AM Ryan Novosielski
> >  wrote:
> >>
> >> I verified that it makes it through to a bash prompt, but I’m a
> >> little less confident that something make test does doesn’t clear it.
> >> Any recommendation for a way to verify?
> >>
> >> In any case, no change, unfortunately.
> >>
> >> Sent from my iPhone
> >>
> >>> On Feb 16, 2019, at 08:13, Gabriel, Edgar 
> >>> wrote:
> >>>
> >>> What file system are you running on?
> >>>
> >>> I will look into this, but it might be later next week. I just
> >>> wanted to emphasize that we are regularly running the parallel
> >>> hdf5 tests with ompio, and I am not aware of any outstanding items
> >>> that do not work (and are supposed to work). That being said, I run
> >>> the tests manually, and not the 'make test'
> >>> commands. Will have to check which tests are being run by that.
> >>>
> >>> Edgar
> >>>
> >>>> -Original Message- From: users
> >>>> [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles
> >>>> Gouaillardet Sent: Saturday, February 16, 2019 1:49 AM To: Open MPI
> >>>> Users  Subject: Re:
> >>>> [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
> >>>> 3.1.3
> >>>>
> >>>> Ryan,
> >>>>
> >>>> Can you
> >>>>
> >>>> export OMPI_MCA_io=^ompio
> >>>>
> >>>> and try again after you made sure this environment variable is
> >>>> passed by srun to the MPI tasks ?
> >>>>
> >>>> We have identified and fixed several issues specific to the
> >>>> (default) ompio component, so that could be a valid workaround
> >>>> until the next release.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>> Ryan Novosielski  wrote:
> >>>>> Hi there,
> >>>>>
> >>>>> Honestly don’t know which piece of this puzzle to look at or how
> >>>>> to get more
> >>>> information for troubleshooting. I successfully built HDF5
> >>>> 1.10.4 with RHEL system GCC 4.8.5 and OpenMPI 3.1.3. Running the
> >>>> “make check” in HDF5 is failing at the below point; I am using a
> >>>> value of RUNPARALLEL='srun -- mpi=pmi2 -p main -t
> >>>> 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise properly
> >>>> configured.
> >>>>>
> >>>>> Thanks for any help you can provide.
> >>>>>
> >>>>> make[4]: Entering directory
> >>>>> `/scratch/novosirj/install-files/hdf5-1.10.4-build-
> >>>> gcc-4.8-openmpi-3.1.3/testpar'
> >>>>>  Testing  t_mpi
> >>>>>  t_mpi  Test Log
> >>>>>  srun: job 84126610 queued and
> waiting
> >>>>> for resources srun: job 84126610 has been allocated resources
> >>>>> srun: error: slepner023: tasks 0-5: Alarm clock 0.01user
> >>>>> 0.00system 20:03.95elapsed 0%CPU (0avgtext+0avgdata
> >>>>> 5152maxresident)k 0inputs+0outputs (0major+1529minor)pagefaults
> >>>>> 0swap

Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI 3.1.3

2019-02-16 Thread Gabriel, Edgar
What file system are you running on?

I will look into this, but it might be later next week. I just wanted to 
emphasize that we are regularly running the parallel hdf5 tests with ompio, and 
I am not aware of any outstanding items that do not work (but are supposed to 
work). That being said, I run the tests manually, not via the 'make test' 
commands. I will have to check which tests are being run by that.

Edgar

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gilles
> Gouaillardet
> Sent: Saturday, February 16, 2019 1:49 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] HDF5 1.10.4 "make check" problems w/OpenMPI
> 3.1.3
> 
> Ryan,
> 
> Can you
> 
> export OMPI_MCA_io=^ompio
> 
> and try again after you made sure this environment variable is passed by srun
> to the MPI tasks ?
> 
> We have identified and fixed several issues specific to the (default) ompio
> component, so that could be a valid workaround until the next release.
> 
> Cheers,
> 
> Gilles
> 
> Ryan Novosielski  wrote:
> >Hi there,
> >
> >Honestly don’t know which piece of this puzzle to look at or how to get more
> information for troubleshooting. I successfully built HDF5 1.10.4 with RHEL
> system GCC 4.8.5 and OpenMPI 3.1.3. Running the “make check” in HDF5 is
> failing at the below point; I am using a value of RUNPARALLEL='srun --
> mpi=pmi2 -p main -t 1:00:00 -n6 -N1’ and have a SLURM that’s otherwise
> properly configured.
> >
> >Thanks for any help you can provide.
> >
> >make[4]: Entering directory 
> >`/scratch/novosirj/install-files/hdf5-1.10.4-build-
> gcc-4.8-openmpi-3.1.3/testpar'
> >
> >Testing  t_mpi
> >
> >t_mpi  Test Log
> >
> >srun: job 84126610 queued and waiting for resources
> >srun: job 84126610 has been allocated resources
> >srun: error: slepner023: tasks 0-5: Alarm clock 0.01user 0.00system
> >20:03.95elapsed 0%CPU (0avgtext+0avgdata 5152maxresident)k
> >0inputs+0outputs (0major+1529minor)pagefaults 0swaps
> >make[4]: *** [t_mpi.chkexe_] Error 1
> >make[4]: Leaving directory 
> >`/scratch/novosirj/install-files/hdf5-1.10.4-build-
> gcc-4.8-openmpi-3.1.3/testpar'
> >make[3]: *** [build-check-p] Error 1
> >make[3]: Leaving directory 
> >`/scratch/novosirj/install-files/hdf5-1.10.4-build-
> gcc-4.8-openmpi-3.1.3/testpar'
> >make[2]: *** [test] Error 2
> >make[2]: Leaving directory 
> >`/scratch/novosirj/install-files/hdf5-1.10.4-build-
> gcc-4.8-openmpi-3.1.3/testpar'
> >make[1]: *** [check-am] Error 2
> >make[1]: Leaving directory 
> >`/scratch/novosirj/install-files/hdf5-1.10.4-build-
> gcc-4.8-openmpi-3.1.3/testpar'
> >make: *** [check-recursive] Error 1
> >
> >--
> >
> >|| \\UTGERS,  
> >|---*O*---
> >||_// the State   | Ryan Novosielski - novos...@rutgers.edu
> >|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> >||  \\of NJ   | Office of Advanced Research Computing - MSB C630, 
> >Newark
> >   `'
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Building OpenMPI with Lustre support using PGI fails

2018-11-27 Thread Gabriel, Edgar
Gilles submitted a patch for that, and I approved it a couple of days back; I 
*think* it has not been merged yet, however. This was a bug in the Open MPI Lustre 
configure logic and should be fixed once that patch is in.

https://github.com/open-mpi/ompi/pull/6080

Thanks
Edgar

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Latham,
> Robert J. via users
> Sent: Tuesday, November 27, 2018 2:03 PM
> To: users@lists.open-mpi.org
> Cc: Latham, Robert J. ; gi...@rist.or.jp
> Subject: Re: [OMPI users] Building OpenMPI with Lustre support using PGI fails
> 
> On Tue, 2018-11-13 at 21:57 -0600, gil...@rist.or.jp wrote:
> > Raymond,
> >
> > can you please compress and post your config.log ?
> 
> I didn't see the config.log in response to this.  Maybe Ray and Giles took the
> discusison off list?  As someone who might have introduced the offending
> configure-time checks, I'm particularly interested in fixing lustre detection.
> 
> ==rob
> 
> >
> >
> > Cheers,
> >
> > Gilles
> >
> > - Original Message -
> > > I am trying  to build OpenMPI with Lustre support using PGI 18.7 on
> > > CentOS 7.5 (1804).
> > >
> > > It builds successfully with Intel compilers, but fails to find the
> > > necessary  Lustre components with the PGI compiler.
> > >
> > > I have tried building  OpenMPI 4.0.0, 3.1.3 and 2.1.5.   I can
> > > build
> > > OpenMPI, but configure does not find the proper Lustre files.
> > >
> > > Lustre is installed from current client RPMS, version 2.10.5
> > >
> > > Include files are in /usr/include/lustre
> > >
> > > When specifying --with-lustre, I get:
> > >
> > > --- MCA component fs:lustre (m4 configuration macro) checking for
> > > MCA component fs:lustre compile mode... dso checking --with-lustre
> > > value... simple ok (unspecified value) looking for header without
> > > includes checking lustre/lustreapi.h usability... yes checking
> > > lustre/lustreapi.h presence... yes checking for
> > > lustre/lustreapi.h... yes checking for library containing
> > > llapi_file_create... -llustreapi checking if liblustreapi requires
> > > libnl v1 or v3...
> > > checking for required lustre data structures... no
> > > configure: error: Lustre support requested but not found. Aborting
> > >
> > >
> > > --
> > >
> > >   Ray Muno
> > >   IT Manager
> > >
> > >
> > >University of Minnesota
> > >   Aerospace Engineering and Mechanics Mechanical
> > > Engineering
> > >   110 Union St. S.E.  111 Church Street SE
> > >   Minneapolis, MN 55455   Minneapolis, MN 55455
> > >
> > > ___
> > > users mailing list
> > > users@lists.open-mpi.org
> > > https://lists.open-mpi.org/mailman/listinfo/users
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] ompio on Lustre

2018-10-15 Thread Gabriel, Edgar
Dave,
Thank you for your detailed report and testing, that is indeed very helpful. We 
will definitely have to do something.
Here is what I think would be potentially doable.

a) if we detect a Lustre file system without flock support, we can print out an 
error message. Completely disabling MPI I/O is not possible in the ompio 
architecture at the moment: the Lustre component can disqualify itself, but 
the generic Unix FS component would then kick in and execution would still 
continue. To be more precise, the query function of the Lustre component has 
no way to return anything other than "I am interested to run" or "I am not 
interested to run".

b) I can add an MCA parameter that would allow the Lustre component to abort 
execution of the job entirely. While this parameter would probably be set to 
'false' by default, a system administrator could configure it to be set to 
'true' on a particular platform. 

I will discuss this also with a couple of other people in the next couple of 
days.
Thanks
Edgar 

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave
> Love
> Sent: Monday, October 15, 2018 4:22 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] ompio on Lustre
> 
> For what it's worth, I found the following from running ROMIO's tests with
> OMPIO on Lustre mounted without flock (or localflock).  I used 48 processes
> on two nodes with Lustre for tests which don't require a specific number.
> 
> OMPIO fails tests atomicity, misc, and error on ext4; it additionally fails
> noncontig_coll2, fp, shared_fp, and ordered_fp on Lustre/noflock.
> 
> On Lustre/noflock, ROMIO fails on atomicity, i_noncontig, noncontig,
> shared_fp, ordered_fp, and error.
> 
> Please can OMPIO be changed to fail in the same way as ROMIO (with a clear
> message) for the operations it can't support without flock.
> Otherwise it looks as if you can potentially get invalid data, or at least 
> waste
> time debugging other errors.
> 
> I'd debug the common failure on the "error" test, but ptrace is disabled on 
> the
> system.
> 
> In case anyone else is in the same boat and can't get mounts changed, I
> suggested staging data to and from a PVFS2^WOrangeFS ephemeral
> filesystem on jobs' TMPDIR local mounts if they will fit.  Of course other
> libraries will potentially corrupt data on nolock mounts.
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] ompio on Lustre

2018-10-10 Thread Gabriel, Edgar
Well, good question. To be fair, the test passes if you run it with a lower 
number of processes. In addition, a couple of years back I had a discussion on 
that with one of the HDF5 developers, and it seemed to be ok to run it this way.

That being said, after thinking about it a bit, I think the fix to properly 
support it is at this point relatively easy; I will try to make it work in the 
next couple of days (a big chunk of code was brought in for another fix 
last fall, and I think we actually have everything in place to properly 
support the atomicity operations).

Edgar

> -Original Message-
> From: Dave Love [mailto:dave.l...@manchester.ac.uk]
> Sent: Wednesday, October 10, 2018 3:46 AM
> To: Gabriel, Edgar 
> Cc: Open MPI Users 
> Subject: Re: [OMPI users] ompio on Lustre
> 
> "Gabriel, Edgar"  writes:
> 
> > Ok, thanks. I usually run these test with 4 or 8, but the major item
> > is that atomicity is one of the areas that are not well supported in
> > ompio (along with data representations), so a failure in those tests
> > is not entirely surprising .
> 
> If it's not expected to work, could it be made to return a helpful error, 
> rather
> than just not working properly?
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] ompio on Lustre

2018-10-09 Thread Gabriel, Edgar
Ok, thanks. I usually run these tests with 4 or 8 processes, but the main point is that 
atomicity is one of the areas that are not well supported in ompio (along with 
data representations), so a failure in those tests is not entirely surprising. 
Most of the work to support atomicity properly is actually in place, but we 
didn't have the manpower (or the requests, to be honest) to finish that work.

Thanks
Edgar 


> -Original Message-
> From: Dave Love [mailto:dave.l...@manchester.ac.uk]
> Sent: Tuesday, October 9, 2018 7:05 AM
> To: Gabriel, Edgar 
> Cc: Open MPI Users 
> Subject: Re: [OMPI users] ompio on Lustre
> 
> "Gabriel, Edgar"  writes:
> 
> > Hm, thanks for the report, I will look into this. I did not run the
> > romio tests, but the hdf5 tests are run regularly and with 3.1.2 you
> > should not have any problems on a regular unix fs. How many processes
> > did you use, and which tests did you run specifically? The main tests
> > that I execute from their parallel testsuite are testphdf5 and
> > t_shapesame.
> 
> Using OMPI 3.1.2, in the hdf5 testpar directory I ran this as a 24-core SMP 
> job
> (so 24 processes), where $TMPDIR is on ext4:
> 
>   export HDF5_PARAPREFIX=$TMPDIR
>   make check RUNPARALLEL='mpirun'
> 
> It stopped after testphdf5 spewed "Atomicity Test Failed" errors.
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] ompio on Lustre

2018-10-08 Thread Gabriel, Edgar
Hm, thanks for the report, I will look into this. I did not run the romio 
tests, but the hdf5 tests are run regularly and with 3.1.2 you should not have 
any problems on a regular unix fs. How many processes did you use, and which 
tests did you run specifically? The main tests that I execute from their 
parallel testsuite are testphdf5 and t_shapesame.

I will also look into the testmpio that you mentioned in the next couple of 
days.
Thanks
Edgar


> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave
> Love
> Sent: Monday, October 8, 2018 10:20 AM
> To: Open MPI Users 
> Subject: Re: [OMPI users] ompio on Lustre
> 
> I said I'd report back about trying ompio on lustre mounted without flock.
> 
> I couldn't immediately figure out how to run MTT.  I tried the parallel
> hdf5 tests from the hdf5 1.10.3, but I got errors with that even with the
> relevant environment variable to put the files on (local) /tmp.
> Then it occurred to me rather late that romio would have tests.  Using the
> "runtests" script modified to use "--mca io ompio" in the romio/test directory
> from ompi 3.1.2 on no-flock-mounted Lustre, after building the tests with an
> installed ompi-3.1.2, it did this and apparently hung at the end:
> 
>    Testing simple.c 
>No Errors
>    Testing async.c 
>No Errors
>    Testing async-multiple.c 
>No Errors
>    Testing atomicity.c 
>   Process 3: readbuf[118] is 0, should be 10
>   Process 2: readbuf[65] is 0, should be 10
>   --
>   MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
>   with errorcode 1.
> 
>   NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>   You may or may not see output from other processes, depending on
>   exactly when Open MPI kills them.
>   --
>   Process 1: readbuf[145] is 0, should be 10
>    Testing coll_test.c 
>No Errors
>    Testing excl.c 
>   error opening file test
>   error opening file test
>   error opening file test
> 
> Then I ran on local /tmp as a sanity check and still got errors:
> 
>    Testing I/O functions 
>    Testing simple.c 
>No Errors
>    Testing async.c 
>No Errors
>    Testing async-multiple.c 
>No Errors
>    Testing atomicity.c 
>   Process 2: readbuf[155] is 0, should be 10
>   Process 1: readbuf[128] is 0, should be 10
>   Process 3: readbuf[128] is 0, should be 10
>   --
>   MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
>   with errorcode 1.
> 
>   NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>   You may or may not see output from other processes, depending on
>   exactly when Open MPI kills them.
>   --
>    Testing coll_test.c 
>No Errors
>    Testing excl.c 
>No Errors
>    Testing file_info.c 
>No Errors
>    Testing i_noncontig.c 
>No Errors
>    Testing noncontig.c 
>No Errors
>    Testing noncontig_coll.c 
>No Errors
>    Testing noncontig_coll2.c 
>No Errors
>    Testing aggregation1 
>No Errors
>    Testing aggregation2 
>No Errors
>    Testing hindexed 
>No Errors
>    Testing misc.c 
>   file pointer posn = 265, should be 10
> 
>   byte offset = 3020, should be 1080
> 
>   file pointer posn = 265, should be 10
> 
>   byte offset = 3020, should be 1080
> 
>   file pointer posn = 265, should be 10
> 
>   byte offset = 3020, should be 1080
> 
>   file pointer posn in bytes = 3280, should be 1000
> 
>   file pointer posn = 265, should be 10
> 
>   byte offset = 3020, should be 1080
> 
>   file pointer posn in bytes = 3280, should be 1000
> 
>   file pointer posn in bytes = 3280, should be 1000
> 
>   file pointer posn in bytes = 3280, should be 1000
> 
>   Found 12 errors
>    Testing shared_fp.c 
>No Errors
>    Testing ordered_fp.c 
>No Errors
>    Testing split_coll.c 
>No Errors
>    Testing psimple.c 
>No Errors
>    Testing error.c 
>   File set view did not return an error
>Found 1 errors
>    Testing status.c 
>No Errors
>    Testing types_with_zeros 
>No Errors
>    Testing darray_read 
>No Errors
> 
> I even got an error with romio on /tmp (modifying the script to use mpirun --
> mca io romi314):
> 
>    Testing error.c 
>   Unexpected error message MPI_ERR_ARG: invalid argument of some other
> kind
>Found 1 errors
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] ompio on Lustre

2018-10-05 Thread Gabriel, Edgar
It was originally for performance reasons, but this should be fixed at this 
point. I am not aware of correctness problems.

However, let me try to clarify your question: what do you mean precisely 
by "MPI-IO on Lustre mounts without flock"? Was the Lustre filesystem mounted 
without flock? If yes, that could lead to some problems; we had that on our 
Lustre installation for a while, and problems were occurring even without MPI 
I/O in that case (although I do not recall all the details, just that we had to 
change the mount options). Maybe just take a testsuite (either ours or HDF5's), 
make sure to run it in a multi-node configuration, and see whether it works 
correctly.
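
As far as I remember, flock is a client-side mount option, i.e. something along 
the lines of the following (the MGS node, fsname and mount point are placeholders 
here, they are of course site-specific):

mount -t lustre mgsnode@tcp0:/fsname /mnt/lustre -o flock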

Thanks
Edgar

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Dave
> Love
> Sent: Friday, October 5, 2018 5:15 AM
> To: users@lists.open-mpi.org
> Subject: [OMPI users] ompio on Lustre
> 
> Is romio preferred over ompio on Lustre for performance or correctness?
> If it's relevant, the context is MPI-IO on Lustre mounts without flock, which
> ompio doesn't seem to require.
> Thanks.
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users