Hi Roland, thanks for your answer.
> The test suites are quick-running parfiles with small grids, so running
> them on large numbers of MPI ranks (they are designed for 1 or 2 MPI
> ranks) can lead to unexpected situations (such as an MPI rank having no
> grid points at all).
>
> Generally, if the tests work for 1, 2, 4 ranks (4 being the largest
> number of procs requested by any test.ccl file) then this is sufficient.

Frontera and Stampede2 run the tests with 24/28 MPI processes, and the tests
still pass there. I am particularly looking at the test ADMMass/tov_carpet.par,
where the numbers are off but no error is thrown. Another example is
Exact/de_Sitter.par. Other tests do fail because of Carpet errors, which might
be what you are describing.

> Can you create a pull request for the "linux" architecture file with
> the changes for the AMD compiler you found, please? So far it seems you
> mostly only changed the detection part; does it then not also require
> some changes in the "set values" part of the file? E.g. default values
> for optimization, preprocessor or so?

Where is the repo? I am not too familiar with what that file is supposed to
set; I only changed what was needed to at least start the compilation.

Gabriele

On Wed, Aug 18, 2021 at 8:20 AM Roland Haas <[email protected]> wrote:
> Hello Gabriele,
>
> Thank you for contributing these.
>
> The test suites are quick-running parfiles with small grids, so running
> them on large numbers of MPI ranks (they are designed for 1 or 2 MPI
> ranks) can lead to unexpected situations (such as an MPI rank having no
> grid points at all).
>
> Generally, if the tests work for 1, 2, 4 ranks (4 being the largest
> number of procs requested by any test.ccl file) then this is sufficient.
>
> In principle even running on more MPI ranks should work, so if you know
> which tests fail with the larger number of MPI ranks and were to list
> them in a ticket, maybe someone could look into this.
>
> Note that you can undersubscribe compute nodes, in particular for
> tests, if you do not need / want to use all cores.
>
> Can you create a pull request for the "linux" architecture file with
> the changes for the AMD compiler you found, please? So far it seems you
> mostly only changed the detection part; does it then not also require
> some changes in the "set values" part of the file? E.g. default values
> for optimization, preprocessor or so?
>
> Yours,
> Roland
>
> > Hello,
> >
> > Two days ago, I opened a PR to the simfactory repo to add Expanse,
> > the newest machine at the San Diego Supercomputing Center, based on
> > AMD Epyc "Rome" CPUs and part of XSEDE. In the meantime, I realized
> > that some tests are failing miserably, but I couldn't figure out why.
> >
> > Before I describe what I found, let me start with a side note on AMD
> > compilers.
> >
> > <side note>
> >
> > There are four compilers available on Expanse: GNU, Intel, AMD, and PGI.
> > I did not touch the PGI compilers. I briefly tried (and failed) to
> > compile with the AMD compilers (aocc and flang). I did not try hard,
> > and it seems that most of the libraries on Expanse are compiled with
> > gcc anyway.
> >
> > A first step to support these compilers is adding the lines:
> >
> > elif test "`$F90 --version 2>&1 | grep AMD`" ; then
> >   LINUX_F90_COMP=AMD
> >
> > elif test "`$CC --version 2>&1 | grep AMD`" ; then
> >   LINUX_C_COMP=AMD
> >
> > elif test "`$CXX --version 2>&1 | grep AMD`" ; then
> >   LINUX_CXX_COMP=AMD
> >
> > in the obvious places in flesh/lib/make/known-architectures/linux.
> >
> > </side note>
> >
> > I successfully compiled the Einstein Toolkit with
> > - gcc 10.2.0 and OpenMPI 4.0.4
> > - gcc 9.2.0 and OpenMPI 4.0.4
> > - Intel 2019 and Intel MPI 2019
> >
> > I noticed that some tests, like ADMMass/tov_carpet.par, gave
> > completely incorrect results. For example, the expected value is 1.3,
> > but I would find 1.6.
> >
> > I disabled all the optimizations, but the test would keep failing. In the
> > end, I noticed that if I ran with 8/16/32 MPI processes per node, and
> > the corresponding number of OpenMP threads (128/N_MPI), the test
> > would fail, but if I ran with 4/2/1 MPI processes, the test would pass.
> >
> > Most of my experiments were with gcc 10, but the test also fails with
> > the Intel suite.
> >
> > I tried increasing the OMP_STACK_SIZE to a very large value, but
> > it didn't help.
> >
> > Any idea of what the problem might be?
> >
> > Gabriele
>
> --
> My email is as private as my paper mail. I therefore support encrypting
> and signing email messages. Get my PGP key from http://pgp.mit.edu .
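
For reference, a run of the test suites with an explicit MPI/OpenMP split, on a
deliberately undersubscribed node, can be launched through simfactory roughly as
sketched below. The simulation name and the core counts are placeholders, not
Expanse-specific settings, and the option names are worth double-checking
against simfactory's built-in help for the installed version:

  # Run the test suites on 4 MPI ranks x 2 OpenMP threads (8 cores in total),
  # undersubscribing a 128-core Expanse node by limiting the cores used per
  # node with --ppn-used.  "expanse-tests" is just a placeholder name.
  ./simfactory/bin/sim create-submit expanse-tests \
      --testsuite \
      --procs 8 --num-threads 2 --ppn-used 8 \
      --walltime 1:00:00

Running the same split at 4/2/1 versus 8/16/32 ranks is then only a matter of
changing --procs and --num-threads.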
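
On the "set values" question: the per-compiler defaults Roland refers to might
look roughly like the sketch below, which simply mirrors for LINUX_*_COMP=AMD
what the file already does for the other compilers. The Cactus variable names
(C_OPTIMISE_FLAGS and friends) are the standard ones, but the flag choices are
generic clang-style guesses for AOCC/flang and have not been tested; the exact
placement and syntax has to follow the existing entries in
lib/make/known-architectures/linux.

  # Hypothetical defaults for the AMD (AOCC/flang) compilers; untested sketch.
  if test "$LINUX_C_COMP" = 'AMD' ; then
    : ${C_OPTIMISE_FLAGS='-O2'}
    : ${C_DEBUG_FLAGS='-g'}
    : ${C_OPENMP_FLAGS='-fopenmp'}
  fi
  if test "$LINUX_CXX_COMP" = 'AMD' ; then
    : ${CXX_OPTIMISE_FLAGS='-O2'}
    : ${CXX_DEBUG_FLAGS='-g'}
    : ${CXX_OPENMP_FLAGS='-fopenmp'}
  fi
  if test "$LINUX_F90_COMP" = 'AMD' ; then
    : ${F90_OPTIMISE_FLAGS='-O2'}
    : ${F90_DEBUG_FLAGS='-g'}
    : ${F90_OPENMP_FLAGS='-fopenmp'}
  fi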
_______________________________________________
Users mailing list
[email protected]
http://lists.einsteintoolkit.org/mailman/listinfo/users
