James
I looked at OpenMP performance in the Einstein Toolkit a few months ago, and I found that Carpet's prolongation operators are not well parallelized. There is a branch in Carpet (and a few related thorns) that applies a different OpenMP parallelization strategy, which seems to be more efficient. We are currently looking into cherry-picking the relevant changes from this branch (there are also many unrelated changes, since I experimented a lot) and putting them back into the master branch.
These changes only help with prolongation, which seems to be a major contributor to poor OpenMP scalability. I experimented with other changes as well. My findings (unfortunately without good solutions so far) are:
- The standard OpenMP parallelization of loops over grid functions is not good for data cache locality. I experimented with padding arrays, ensuring that loop boundaries align with cache line boundaries, etc., but this never worked quite satisfactorily: MPI parallelization is still faster than OpenMP. In effect, the only reason to use OpenMP is that once one encounters MPI's scalability limits, OpenMP's non-scalability becomes the lesser evil.
- We could overlap calculations with communication. To do so, I have experimental changes that break loops over grid functions into tiles. Outer tiles need to wait for communication (synchronization or prolongation) to finish, while inner tiles can be calculated right away. Unfortunately, OpenMP does not support open-ended threads like this, so I'm using Qthreads <https://github.com/Qthreads/qthreads> and FunHPC <https://bitbucket.org/eschnett/funhpc.cxx> for this. The respective changes to Carpet, the scheduler, and thorns are significant, and I haven't been able to demonstrate any performance improvements yet. However, once we remove other, more prominent causes of non-scalability, I hope that this approach will become interesting.
I haven't been attending the ET phone calls recently because Monday mornings aren't good for me schedule-wise. If you are interested, we can arrange to both attend the same call and discuss this. We need to make sure that Roland Haas is also attending then.
-erik
On Sat, Jan 20, 2018 at 10:21 AM, James Healy <jch...@rit.edu> wrote:
Hello all,
I am trying to run on the new Skylake processors on Stampede2, and while the run speeds we are obtaining are very good, we are concerned that we aren't optimizing properly when it comes to OpenMP. For instance, we see the best speeds when we use 8 MPI processes per node (with 6 threads each, for a total of 48 threads/node). Based on the architecture, we were expecting to see the best speeds with 2 MPI/node. Here is what I have tried:
1. Using the simfactory files for stampede2-skx (config file, run
and submit scripts, and modules loaded) I compiled a version
of ET_2017_06 using LazEv (RIT's evolution thorn) and
McLachlan and submitted a series of runs that change both the
number of nodes used, and how I distribute the 48 threads/node
between MPI processes.
2. I use a standard low resolution grid, with no IO or
regridding. Parameter file attached.
3. Run speeds are measured from Carpet::physical_time_per_hour at
iteration 256.
4. I tried both with and without hwloc/SystemTopology.
5. For both McLachlan and LazEv, I see similar results, with 2
MPI/node giving the worst results (see attached plot for
McLachlan) and a slight preference for 8 MPI/node.
So my questions are:
1. Have any tests been run by other users on stampede2-skx?
2. Should we expect 2 MPI/node to be the optimal choice?
3. If so, are there any other configurations we can try that
could help optimize?
Thanks in advance!
Jim Healy
_______________________________________________
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users
--
Erik Schnetter <schnet...@cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/