Re: [OMPI users] Error with building OMPI with PGI

2021-01-14 Thread Gus Correa via users
Hi Passant, list

This is an old problem with PGI.
There are many threads in the OpenMPI mailing list archives about this,
with workarounds.
The simplest is to use FC="pgf90 -noswitcherror".
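
For example, adapting the configure line quoted below, the workaround would
look roughly like this (a sketch, not something I have re-tested here):

  ./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC="pgf90 -noswitcherror" \
      --prefix=$PREFIX --with-ucx=$UCX_HOME --with-slurm \
      --with-pmi=/opt/slurm/cluster/ibex/install --with-cuda=$CUDATOOLKIT_HOME

That makes pgf90 warn about switches it does not recognize (such as -pthread)
instead of aborting.  Since your log shows pgcc, not pgf90, rejecting -pthread
at the CCLD step, you may also need CC="pgcc -noswitcherror"; pgcc accepts the
same flag.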

Here are two out of many threads ... well,  not pthreads!  :)
https://www.mail-archive.com/users@lists.open-mpi.org/msg08962.html
https://www.mail-archive.com/users@lists.open-mpi.org/msg10375.html

I hope this helps,
Gus Correa

On Thu, Jan 14, 2021 at 5:45 PM Passant A. Hafez via users <
users@lists.open-mpi.org> wrote:

> Hello,
>
>
> I'm having an error when trying to build OMPI 4.0.3 (also tried 4.1) with
> PGI 20.1
>
>
> ./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC=pgf90
> --prefix=$PREFIX --with-ucx=$UCX_HOME --with-slurm
> --with-pmi=/opt/slurm/cluster/ibex/install --with-cuda=$CUDATOOLKIT_HOME
>
>
> in the make install step:
>
> make[4]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
> make[3]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
> make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
> Making install in mca/pmix/s1
> make[2]: Entering directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
>   CCLD mca_pmix_s1.la
> pgcc-Error-Unknown switch: -pthread
> make[2]: *** [mca_pmix_s1.la] Error 1
> make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
> make[1]: *** [install-recursive] Error 1
> make[1]: Leaving directory `/tmp/openmpi-4.0.3/opal'
> make: *** [install-recursive] Error 1
>
> Please advise.
>
>
>
>
> All the best,
> Passant
>


[OMPI users] Error with building OMPI with PGI

2021-01-14 Thread Passant A. Hafez via users
Hello,


I'm having an error when trying to build OMPI 4.0.3 (also tried 4.1) with PGI 
20.1


./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC=pgf90 --prefix=$PREFIX 
--with-ucx=$UCX_HOME --with-slurm --with-pmi=/opt/slurm/cluster/ibex/install 
--with-cuda=$CUDATOOLKIT_HOME


in the make install step:

make[4]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
make[3]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/pmix3x'
Making install in mca/pmix/s1
make[2]: Entering directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
  CCLD mca_pmix_s1.la
pgcc-Error-Unknown switch: -pthread
make[2]: *** [mca_pmix_s1.la] Error 1
make[2]: Leaving directory `/tmp/openmpi-4.0.3/opal/mca/pmix/s1'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/tmp/openmpi-4.0.3/opal'
make: *** [install-recursive] Error 1

Please advise.



All the best,
Passant


Re: [OMPI users] bad defaults with ucx

2021-01-14 Thread Dave Love via users
"Jeff Squyres (jsquyres)"  writes:

> Good question.  I've filed
> https://github.com/open-mpi/ompi/issues/8379 so that we can track
> this.

For the benefit of the list:  I mis-remembered that osc=ucx was general
advice.  The UCX docs just say you need to avoid the uct btl, which can
cause memory corruption, but OMPI 4.1 still builds and uses it by
default.  (The UCX doc also suggests other changes to parameters, but
for performance rather than correctness.)

Anyway, I can get at least IMB-RMA to run on this Summit-like hardware
just with --mca btl ^uct (though there are failures with other tests
which seem to be specific to UCX on ppc64le, and not to OMPI).
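
For concreteness, the kind of invocation that works here is roughly

  mpirun --mca btl ^uct -np 2 ./IMB-RMA

(the process count and binary path are only illustrative), optionally adding
--mca osc ucx, which is the setting I had mis-remembered as required.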



Re: [OMPI users] bad defaults with ucx

2021-01-14 Thread Jeff Squyres (jsquyres) via users
Good question.  I've filed https://github.com/open-mpi/ompi/issues/8379 so that 
we can track this.


> On Jan 14, 2021, at 7:53 AM, Dave Love via users  
> wrote:
> 
> Why does 4.1 still not use the right defaults with UCX?
> 
> Without specifying osc=ucx, IMB-RMA crashes just as 4.0.5 did.  I haven't
> checked what else UCX says you must set for Open MPI to avoid memory
> corruption, at least, but I guess those defaults won't be right either.
> Users surely shouldn't have to explore notes for a fundamental library
> to be able to run even IMB.


-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] 4.1 mpi-io test failures on lustre

2021-01-14 Thread Gabriel, Edgar via users
I will have a look at those tests. The recent fixes were performance fixes, not
correctness fixes.
Nevertheless, we used to pass the mpich tests, but I admit it is not a test
suite that we run regularly, so I will have a look at them. The atomicity tests
are expected to fail, since atomicity is the one chapter of MPI I/O that is not
implemented in ompio.
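
If you do need atomicity, the usual route would be to select the ROMIO-based
component instead of ompio, e.g. something along the lines of

  mpirun --mca io romio321 -np 2 ./atomicity -fname /path/on/lustre/testfile

assuming the romio321 component was built in your installation; the test name,
process count, and -fname argument here are only illustrative.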

Thanks
Edgar

-Original Message-
From: users  On Behalf Of Dave Love via users
Sent: Thursday, January 14, 2021 5:46 AM
To: users@lists.open-mpi.org
Cc: Dave Love 
Subject: [OMPI users] 4.1 mpi-io test failures on lustre

I tried mpi-io tests from mpich 4.3 with openmpi 4.1 on the ac922 system that I 
understand was used to fix ompio problems on lustre.  I'm puzzled that I still 
see failures.

I don't know why there are disjoint sets in mpich's test/mpi/io and 
src/mpi/romio/test, but I ran all the non-Fortran ones with MCA io defaults 
across two nodes.  In src/mpi/romio/test, atomicity failed (ignoring error and 
syshints); in test/mpi/io, the failures were setviewcur, tst_fileview, 
external32_derived_dtype, i_bigtype, and i_setviewcur.  tst_fileview was 
probably killed by the 100s timeout.

It may be that some are only appropriate for romio, but no-one said so before 
and they presumably shouldn't segv or report libc errors.

I built against ucx 1.9 with cuda support.  I realize that has problems on 
ppc64le, with no action on the issue, but there's a limit to what I can do.  
cuda looks relevant since one test crashes while apparently trying to register 
cuda memory; that's presumably not ompio's fault, but we need cuda.


[OMPI users] bad defaults with ucx

2021-01-14 Thread Dave Love via users
Why does 4.1 still not use the right defaults with UCX?

Without specifying osc=ucx, IMB-RMA crashes just as 4.0.5 did.  I haven't
checked what else UCX says you must set for Open MPI to avoid memory
corruption, at least, but I guess those defaults won't be right either.
Users surely shouldn't have to explore notes for a fundamental library
to be able to run even IMB.
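
(For the record, the workaround is to ask for the ucx one-sided component
explicitly, roughly

  mpirun --mca osc ucx -np 2 ./IMB-RMA

with the process count and binary path only as placeholders; see the follow-up
in this thread about --mca btl ^uct, which is what the UCX documentation
actually calls for.)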


[OMPI users] 4.1 mpi-io test failures on lustre

2021-01-14 Thread Dave Love via users
I tried mpi-io tests from mpich 4.3 with openmpi 4.1 on the ac922 system
that I understand was used to fix ompio problems on lustre.  I'm puzzled
that I still see failures.

I don't know why there are disjoint sets in mpich's test/mpi/io and
src/mpi/romio/test, but I ran all the non-Fortran ones with MCA io
defaults across two nodes.  In src/mpi/romio/test, atomicity failed
(ignoring error and syshints); in test/mpi/io, the failures were
setviewcur, tst_fileview, external32_derived_dtype, i_bigtype, and
i_setviewcur.  tst_fileview was probably killed by the 100s timeout.
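
(Roughly, each test was run in a pattern like this, with default MCA io
settings; the binary name, file path, and rank count are only illustrative:

  mpirun -np 2 --map-by node ./atomicity -fname /lustre/scratch/testfile

i.e. two ranks, one per node, against a file on Lustre.)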

It may be that some are only appropriate for romio, but no-one said so
before and they presumably shouldn't segv or report libc errors.

I built against ucx 1.9 with cuda support.  I realize that has problems
on ppc64le, with no action on the issue, but there's a limit to what I
can do.  cuda looks relevant since one test crashes while apparently
trying to register cuda memory; that's presumably not ompio's fault, but
we need cuda.