Re: [OMPI users] 4.1 mpi-io test failures on lustre

2021-01-19 Thread Gabriel, Edgar via users
OK, so what I get from this conversation is the following to-do list:

1. check out the tests in src/mpi/romio/test
2. revisit the atomicity issue. You are right that there are scenarios where it
might be required; the fact that we were not able to hit the issue in our
tests is no proof that it cannot happen.
3. work on an update of the FAQ section.



-Original Message-
From: users  On Behalf Of Dave Love via users
Sent: Monday, January 18, 2021 11:14 AM
To: Gabriel, Edgar via users 
Cc: Dave Love 
Subject: Re: [OMPI users] 4.1 mpi-io test failures on lustre

"Gabriel, Edgar via users"  writes:

>> How should we know that's expected to fail?  It at least shouldn't fail like 
>> that; set_atomicity doesn't return an error (which the test is prepared for 
>> on a filesystem like pvfs2).  
>> I assume doing nothing, but appearing to, can lead to corrupt data, and I'm 
>> surprised that isn't being seen already.
>> HDF5 requires atomicity -- at least to pass its tests -- so presumably 
>> anyone like us who needs it should use something mpich-based with recent or 
>> old romio, and that sounds like most general HPC systems.  
>> Am I missing something?
>> With the current romio everything I tried worked, but we don't get that 
>> option with openmpi.
>
> First of all, it is mentioned on the Open MPI FAQ pages, although
> admittedly they are not entirely up to date (they also list external32
> support as missing, even though it has been available since 4.1).

Yes, the FAQ was full of confusing obsolete material when I last looked.
Anyway, users can't be expected to check whether any particular operation is 
expected to fail silently.  I should have said that
MPI_File_set_atomicity(3) explicitly says the default is true for multiple 
nodes, and doesn't say the call is a no-op with the default implementation.  I 
don't know whether the MPI spec allows not implementing it, but I at least 
expect an error return if it doesn't.
As far as I remember, that's what romio does on a filesystem like pvfs2 (or
lustre when people know better than implementers and insist on noflock); I
misremembered from before, thinking that ompio would be changed to do the
same.  From that thread, I did think atomicity was on its way.

Presumably an application requests atomicity for good reason, and can take 
appropriate action if the status indicates it's not available on that 
filesystem.
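
For concreteness, here is a minimal sketch (mine, not from this thread; the
file name is arbitrary) of how an application can request atomic mode and then
read the flag back, instead of trusting the call to either work or fail loudly:

  /* Sketch: request atomic mode, then verify it actually took effect.
   * File errors default to MPI_ERRORS_RETURN, so the return code can be
   * checked directly. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      int err, flag = 0;

      MPI_Init(&argc, &argv);
      MPI_File_open(MPI_COMM_WORLD, "atomic_test.dat",
                    MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

      err = MPI_File_set_atomicity(fh, 1);   /* ask for atomic mode */
      MPI_File_get_atomicity(fh, &flag);     /* did it actually take? */
      if (err != MPI_SUCCESS || !flag)
          fprintf(stderr, "atomic mode not available on this file system\n");

      MPI_File_close(&fh);
      MPI_Finalize();
      return 0;
  }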

> You don't need atomicity for the HDF5 tests, we are passing all of them to
> the best of my knowledge, and this is one of the test suites that we run
> regularly as part of our standard testing process.

I guess we're just better at breaking things.

> I am aware that they have an atomicity test - which we pass for whatever
> reason. This also highlights, by the way, the issue(s) that I have with the
> atomicity option in MPI I/O.

I don't know what the application of atomicity is in HDF5.  Maybe it isn't
required for typical operations, but I assume it's not used blithely.  However,
I'd have thought HDF5 should be prepared for something like pvfs2, and at least 
not abort the test at that stage.
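
For reference, parallel HDF5 does expose the MPI atomicity flag directly
through H5Fset_mpi_atomicity/H5Fget_mpi_atomicity, which are documented to map
onto the MPI_File atomicity calls.  A minimal sketch (mine, assuming an
MPI-enabled HDF5 build and an arbitrary file name):

  /* Sketch only: requires HDF5 built with --enable-parallel. */
  #include <hdf5.h>
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
      hid_t file = H5Fcreate("atomic.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      hbool_t flag = 0;
      if (H5Fset_mpi_atomicity(file, 1) < 0)   /* request atomic mode */
          fprintf(stderr, "could not enable MPI atomicity\n");
      H5Fget_mpi_atomicity(file, &flag);       /* check what we actually got */
      printf("atomicity is %s\n", flag ? "on" : "off");

      H5Fclose(file);
      H5Pclose(fapl);
      MPI_Finalize();
      return 0;
  }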

I've learned to be wary of declaring concurrent systems working after a few 
tests.  In fact, the phdf5 test failed for me like this when I tried across 
four lustre client nodes with 4.1's defaults.  (I'm confused about the striping 
involved, because I thought I set it to four, and now it shows as one on that 
directory.)

  ...
  Testing  -- dataset atomic updates (atomicity)
  Proc 9: *** Parallel ERRProc 54: *** Parallel ERROR ***
  VRFY (H5Sset_hyperslab succeeded) failed at line 4293 in t_dset.c
  aborting MPI proceProc 53: *** Parallel ERROR ***

Unfortunately I hadn't turned on backtracing, and I wouldn't get another job
through for a while.

> The entire infrastructure to enforce atomicity is actually in place in ompio,
> and I can give you an option to enforce strict atomic behavior for all files
> in ompio (just not on a per-file basis); just be aware that performance will
> nose-dive. This is not just the case with ompio but also with romio; you can
> read up on that topic on various discussion boards, e.g. in NFS-related posts
> (where you need atomicity for correctness in basically all scenarios).

I'm fairly sure I accidentally ran tests successfully on NFS4, at least 
single-node.  I never found a good discussion of the topic, and what I have 
seen about "NFS" was probably specific to NFS3 and non-POSIX compliance, though 
I don't actually care about parallel i/o on NFS.  The information we got about 
lustre was direct from Rob Latham, as nothing showed up online.

I don't like fast-but-wrong, so I think there should be the option of 
correctness, especially as it's the documented default.

> Just as another data point, in the 8+ years that ompio has been available,
> not one issue was reported related to correctness due to the missing
> atomicity option.

Yes, I forget some history over the years, like that 

Re: [OMPI users] Error with building OMPI with PGI

2021-01-19 Thread Passant A. Hafez via users
Yes, it is plus zero three (I confirmed that with different trials); I don't
know where it comes from either. I posted the configure line I used and didn't
make any changes beyond that.

I also applied the same workaround for this switch, as Gus suggested in
https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html

And now I'm getting

"param.c", line 369: error: missing closing quote
  orte_info_out("Configured on", "config:timestamp", OPAL_CONFIGURE_DATE);
 ^

"param.c", line 374: error: missing closing quote
  orte_info_out("Built on", "build:timestamp", OMPI_BUILD_DATE);
   ^

"param.c", line 410: error: missing closing quote
  orte_info_out("Build CFLAGS", "option:build:cflags", 
OMPI_BUILD_CFLAGS);
   ^

"param.c", line 411: error: missing closing quote
  orte_info_out("Build LDFLAGS", "option:build:ldflags", 
OMPI_BUILD_LDFLAGS);
 ^

"param.c", line 412: error: missing closing quote
  orte_info_out("Build LIBS", "option:build:libs", OMPI_BUILD_LIBS);
   ^

5 errors detected in the compilation of "param.c".
make[2]: *** [param.o] Error 2
make[2]: Leaving directory `/tmp/openmpi-4.0.3/orte/tools/orte-info'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/tmp/openmpi-4.0.3/orte'
make: *** [install-recursive] Error 1



Best,
Passant

From: users  on behalf of Gilles Gouaillardet 
via users 
Sent: Tuesday, January 19, 2021 1:48 PM
To: Open MPI Users
Cc: Gilles Gouaillardet
Subject: Re: [OMPI users] Error with building OMPI with PGI

Passant,

unless this is a copy-paste error, the last error message reads plus
zero three (+03), which is an unknown switch
(plus uppercase O three, +O3, is a known one)

At the end of the configure, make sure Fortran bindings are generated.

If the link error persists, you can run
ldd /.../libmpi_mpifh.so | grep igatherv
and confirm the symbol does indeed exist.


Cheers,

Gilles

On Tue, Jan 19, 2021 at 7:39 PM Passant A. Hafez via users
 wrote:
>
> Hello Gus,
>
>
> Thanks for your reply.
>
> Yes, I'd read multiple threads about very old versions of OMPI and PGI before
> posting; some said it would be patched, so I thought this was fixed in recent
> versions. And some of the fixes didn't work for me.
>
>
>
> Now I tried the first suggestion (CC="pgcc -noswitcherror", as the error is
> with pgcc).
>
> The OMPI build finished, but when I tried to use it to build the QE GPU
> version, I got:
>
> undefined reference to `ompi_igatherv_f'
>
>
>
> I tried the other workaround
> https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html
>
> to rebuild OMPI, I got
> pgcc-Error-Unknown switch: +03
>
>
> Please advise.
>
> All the best,
> Passant
> 
> From: users  on behalf of Gus Correa via 
> users 
> Sent: Friday, January 15, 2021 2:36 AM
> To: Open MPI Users
> Cc: Gus Correa
> Subject: Re: [OMPI users] Error with building OMPI with PGI
>
> Hi Passant, list
>
> This is an old problem with PGI.
> There are many threads in the OpenMPI mailing list archives about this,
> with workarounds.
> The simplest is to use FC="pgf90 -noswitcherror".
>
> Here are two out of many threads ... well,  not pthreads!  :)
> https://www.mail-archive.com/users@lists.open-mpi.org/msg08962.html
> https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html
>
> I hope this helps,
> Gus Correa
>
> On Thu, Jan 14, 2021 at 5:45 PM Passant A. Hafez via users 
>  wrote:
>>
>> Hello,
>>
>>
>> I'm having an error when trying to build OMPI 4.0.3 (also tried 4.1) with 
>> PGI 20.1
>>
>>
>> ./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC=pgf90 --prefix=$PREFIX 
>> --with-ucx=$UCX_HOME --with-slurm --with-pmi=/opt/slurm/cluster/ibex/install 
>> --with-cuda=$CUDATOOLKIT_HOME
>>
>>
>> in the make install step:
>>
>> make[4]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
>> make[3]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
>> make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
>> Making install in mca/pmix/s1
>> make[2]: Entering directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
>>   CCLD mca_pmix_s1.la
>> pgcc-Error-Unknown switch: -pthread
>> make[2]: *** [mca_pmix_s1.la] Error 1
>> make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
>> make[1]: *** [install-recursive] Error 1
>> make[1]: Leaving directory `/tmp/openmpi-4.0.3/opal'
>> make: *** [install-recursive] Error 1
>>
>> Please advise.
>>
>>
>>
>>
>> All the best,
>> Passant


Re: [OMPI users] Error with building OMPI with PGI

2021-01-19 Thread Gilles Gouaillardet via users
Passant,

unless this is a copy-paste error, the last error message reads plus
zero three (+03), which is an unknown switch
(plus uppercase O three, +O3, is a known one)

At the end of the configure, make sure Fortran bindings are generated.

If the link error persists, you can run
ldd /.../libmpi_mpifh.so | grep igatherv
and confirm the symbol does indeed exist.


Cheers,

Gilles

On Tue, Jan 19, 2021 at 7:39 PM Passant A. Hafez via users
 wrote:
>
> Hello Gus,
>
>
> Thanks for your reply.
>
> Yes, I'd read multiple threads about very old versions of OMPI and PGI before
> posting; some said it would be patched, so I thought this was fixed in recent
> versions. And some of the fixes didn't work for me.
>
>
>
> Now I tried the first suggestion (CC="pgcc -noswitcherror", as the error is
> with pgcc).
>
> The OMPI build finished, but when I tried to use it to build the QE GPU
> version, I got:
>
> undefined reference to `ompi_igatherv_f'
>
>
>
> I tried the other workaround
> https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html
>
> to rebuild OMPI, I got
> pgcc-Error-Unknown switch: +03
>
>
> Please advise.
>
> All the best,
> Passant
> 
> From: users  on behalf of Gus Correa via 
> users 
> Sent: Friday, January 15, 2021 2:36 AM
> To: Open MPI Users
> Cc: Gus Correa
> Subject: Re: [OMPI users] Error with building OMPI with PGI
>
> Hi Passant, list
>
> This is an old problem with PGI.
> There are many threads in the OpenMPI mailing list archives about this,
> with workarounds.
> The simplest is to use FC="pgf90 -noswitcherror".
>
> Here are two out of many threads ... well,  not pthreads!  :)
> https://www.mail-archive.com/users@lists.open-mpi.org/msg08962.html
> https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html
>
> I hope this helps,
> Gus Correa
>
> On Thu, Jan 14, 2021 at 5:45 PM Passant A. Hafez via users 
>  wrote:
>>
>> Hello,
>>
>>
>> I'm having an error when trying to build OMPI 4.0.3 (also tried 4.1) with 
>> PGI 20.1
>>
>>
>> ./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC=pgf90 --prefix=$PREFIX 
>> --with-ucx=$UCX_HOME --with-slurm --with-pmi=/opt/slurm/cluster/ibex/install 
>> --with-cuda=$CUDATOOLKIT_HOME
>>
>>
>> in the make install step:
>>
>> make[4]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
>> make[3]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
>> make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
>> Making install in mca/pmix/s1
>> make[2]: Entering directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
>>   CCLD mca_pmix_s1.la
>> pgcc-Error-Unknown switch: -pthread
>> make[2]: *** [mca_pmix_s1.la] Error 1
>> make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
>> make[1]: *** [install-recursive] Error 1
>> make[1]: Leaving directory `/tmp/openmpi-4.0.3/opal'
>> make: *** [install-recursive] Error 1
>>
>> Please advise.
>>
>>
>>
>>
>> All the best,
>> Passant


Re: [OMPI users] Error with building OMPI with PGI

2021-01-19 Thread Passant A. Hafez via users
Hello Gus,


Thanks for your reply.

Yes, I'd read multiple threads about very old versions of OMPI and PGI before
posting; some said it would be patched, so I thought this was fixed in recent
versions. And some of the fixes didn't work for me.



Now I tried the first suggestion (CC="pgcc -noswitcherror", as the error is
with pgcc).

The OMPI build finished, but when I tried to use it to build the QE GPU
version, I got:

undefined reference to `ompi_igatherv_f'


I tried the other workaround
https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html

to rebuild OMPI, I got
pgcc-Error-Unknown switch: +03


Please advise.

All the best,
Passant

From: users  on behalf of Gus Correa via 
users 
Sent: Friday, January 15, 2021 2:36 AM
To: Open MPI Users
Cc: Gus Correa
Subject: Re: [OMPI users] Error with building OMPI with PGI

Hi Passant, list

This is an old problem with PGI.
There are many threads in the OpenMPI mailing list archives about this,
with workarounds.
The simplest is to use FC="pgf90 -noswitcherror".

Here are two out of many threads ... well,  not pthreads!  :)
https://www.mail-archive.com/users@lists.open-mpi.org/msg08962.html
https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html

I hope this helps,
Gus Correa

On Thu, Jan 14, 2021 at 5:45 PM Passant A. Hafez via users 
wrote:

Hello,


I'm having an error when trying to build OMPI 4.0.3 (also tried 4.1) with PGI 
20.1


./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC=pgf90 --prefix=$PREFIX 
--with-ucx=$UCX_HOME --with-slurm --with-pmi=/opt/slurm/cluster/ibex/install 
--with-cuda=$CUDATOOLKIT_HOME


in the make install step:

make[4]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
make[3]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
Making install in mca/pmix/s1
make[2]: Entering directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
  CCLD mca_pmix_s1.la
pgcc-Error-Unknown switch: -pthread
make[2]: *** [mca_pmix_s1.la] Error 1
make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/tmp/openmpi-4.0.3/opal'
make: *** [install-recursive] Error 1

Please advise.




All the best,
Passant