Hi Roland, all,

I tried the changes Roland made to the runscript on stampede2. The point was to see whether, by choosing a different OpenMP binding than the Stampede2 default, we can achieve better run speeds without enabling hwloc/SystemTopology.  The answer is yes.

I looked at the case with 2 OpenMP threads per MPI rank (24 MPI ranks per node) on 4, 8, 16, and 24 nodes in 4 different situations.  Attached is a plot showing the results.

The lines are labeled "Binding" with a Yes or No, and "h/ST" for hwloc/SystemTopology with either a Yes or No.  The runs with "Binding Y" include the following 2 lines in the runscript:

export OMP_PLACES=cores
export OMP_PROC_BIND=close

There is no noticeable difference between the two binding choices when hwloc/SystemTopology is active.  But when hwloc/SystemTopology isn't active, setting the bindings as above brings the run speeds in line with the hwloc/SystemTopology lines.
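
As a sanity check, a small standalone program like the sketch below (not part of simfactory; the file name and compile line are only illustrative) can confirm whether OMP_PLACES/OMP_PROC_BIND actually took effect on a compute node -- it just prints the binding policy and the CPU/place of every OpenMP thread:

// affinity_check.cc -- minimal sketch, not Cactus/simfactory code: report
// the binding policy and where every OpenMP thread ended up.
// Compile with e.g. "g++ -fopenmp affinity_check.cc" or "icpc -qopenmp ..."
// and run it on a compute node with the two exports above set or unset.
#include <cstdio>
#include <omp.h>
#include <sched.h>   // sched_getcpu(), a GNU/Linux extension

int main() {
  std::printf("proc_bind policy: %d, number of places: %d\n",
              static_cast<int>(omp_get_proc_bind()), omp_get_num_places());
#pragma omp parallel
  {
#pragma omp critical
    std::printf("thread %2d of %2d runs on cpu %3d in place %2d\n",
                omp_get_thread_num(), omp_get_num_threads(),
                sched_getcpu(), omp_get_place_num());
  }
  return 0;
}

With OMP_PLACES=cores and OMP_PROC_BIND=close one would expect consecutive thread numbers to report neighbouring places; without any binding the reported CPUs typically wander from run to run.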

Thanks,
Jim

On 02/21/2018 01:21 PM, Roland Haas wrote:
Hello Jim,

thank you for benchmarking these. I have just updated the defaults in
simfactory to be 2 OpenMP threads per MPI rank (i.e. 24 MPI ranks per
node), since this gave you the fastest simulation when using hwloc
(though not without it).

I suspect the hwloc requirement is due to a bad default layout of the
threads by TACC.

I have pushed (hopefully saner) settings for task binding into a branch
rhaas/stampede2 (git pull ; git checkout rhaas/stampede2). If you have
time, would you mind benchmarking using those settings as well, please?

Yours,
Roland

Very good! That looks like a 25% speed improvement in the mid-range of #MPI
processes per node.

It also looks as if the maximum speed is achieved by using between 8 and 24
MPI processes per node, i.e. between 2 and 6 OpenMP threads per MPI process.

-erik

On Mon, Feb 19, 2018 at 10:07 AM, James Healy <jch...@rit.edu> wrote:

Hello all,

I followed up on our discussion from a few weeks ago by redoing the
scaling tests with hwloc and SystemTopology turned on.  I attached a plot
showing the difference when using or not using the OpenMP-tasks changes
to prolongation.  I also attached the stdout files (including the
TimerReport output) for the ranks=24 runs, with tasks on and off, using
hwloc.

Thanks,
Jim


On 01/26/2018 10:26 AM, Roland Haas wrote:
Hello Jim,

thank you very much for giving this a spin.

Yours,
Roland

Hi Erik, Roland, all,
After our discussion on last week's telecon, I followed Roland's
instructions on how to get the branch which has changes to how Carpet
handles prolongation with respect to OpenMP.  I reran my simple scaling
test on Stampede Skylake nodes using this branch of Carpet
(rhaas/openmp-tasks) to test the scalability.

Attached is a plot showing the run speeds for a variety of node counts
and of ways to distribute the 48 threads per node between MPI processes
and OpenMP threads.  I did this for three versions of the ETK:

 1. a fresh checkout of ET_2017_06;
 2. ET_2017_06 with Carpet switched to rhaas/openmp-tasks (labelled
    "Test On");
 3. the same checkout as #2, but without the parameters that enable the
    new prolongation code (labelled "Test Off").

The run speeds were grabbed at iteration 256 from
Carpet::physical_time_per_hour.  No IO or regridding.

For 4 and 8 nodes (i.e. 192 and 384 cores), there wasn't much difference
between the three trials.  However, for 16 and 24 nodes (768 and 1152
cores), we see some improvement in run speed (10-15%) for many choices of
thread distribution, again with a slight preference for 8 ranks/node.

I also ran the previous test (not using the openmp-tasks branch) on
comet, and found similar results as before.

Thanks,
Jim

On 01/21/2018 01:07 PM, Erik Schnetter wrote:
James

I looked at OpenMP performance in the Einstein Toolkit a few months ago,
and I found that Carpet's prolongation operators are not well
parallelized. There is a branch in Carpet (and a few related thorns) that
applies a different OpenMP parallelization strategy, which seems to be
more efficient. We are currently looking into cherry-picking the relevant
changes from this branch (there are also many unrelated changes, since I
experimented a lot) and putting them back into the master branch.

These changes only help with prolongation, which seems to be a major
contributor to poor OpenMP scalability. I experimented with other changes
as well. My findings (unfortunately without good solutions so far) are:

- The standard OpenMP parallelization of loops over grid functions is not
good for data cache locality. I experimented with padding arrays,
ensuring that loop boundaries align with cache line boundaries, etc., but
this never worked quite satisfactorily -- MPI parallelization is still
faster than OpenMP. In effect, the only reason to use OpenMP is once one
encounters MPI's scalability limits, so that OpenMP's non-scalability
becomes the lesser evil.
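
As a rough illustration of the kind of per-loop parallelization meant here (a generic sketch, not Carpet's actual code; the routine and its arguments are invented):

// Generic sketch of a "standard" OpenMP loop over a 3D grid function.
// With static scheduling each thread gets a contiguous block of (k,j)
// rows; unless the array extents are padded so that these blocks start on
// cache-line boundaries, threads at block edges share cache lines.
#include <vector>

void update(std::vector<double>& u, const std::vector<double>& rhs,
            int ni, int nj, int nk, double dt) {
#pragma omp parallel for collapse(2)
  for (int k = 1; k < nk - 1; ++k)
    for (int j = 1; j < nj - 1; ++j)
      for (int i = 1; i < ni - 1; ++i) {
        const int idx = i + ni * (j + nj * k);
        u[idx] += dt * rhs[idx];
      }
}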

- We could overlap calculations with communication. To do so, I have
experimental changes that break loops over grid functions into tiles.
Outer tiles need to wait for communication (synchronization or
parallelization) to finish, while inner tiles can be calculated right
away. Unfortunately, OpenMP does not support open-ended threads like
this, so I'm using Qthreads <https://github.com/Qthreads/qthreads> and
FunHPC <https://bitbucket.org/eschnett/funhpc.cxx> for this. The
respective changes to Carpet, the scheduler, and thorns are significant,
and I couldn't prove any performance improvements yet. However, once we
remove other, more prominent causes of non-scalability, I hope that this
will become interesting.
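
The idea can be illustrated very roughly with plain OpenMP tasks (a sketch only -- the actual experiments use Qthreads/FunHPC, and compute_tile() and post_ghost_exchange() below are invented placeholders): interior tiles are computed while the ghost-zone communication is still in flight, and the boundary tiles are scheduled only once it has completed.

// Rough sketch of overlapping communication with tile-wise computation.
// NOT the Qthreads/FunHPC implementation; compute_tile() and
// post_ghost_exchange() are hypothetical placeholders.  MPI must have been
// initialized with at least MPI_THREAD_SERIALIZED for this pattern.
#include <cstddef>
#include <mpi.h>
#include <vector>

struct Tile { int imin, imax, jmin, jmax, kmin, kmax; bool touches_ghosts; };

void compute_tile(const Tile& t);            // hypothetical tile kernel
void post_ghost_exchange(MPI_Request* req);  // hypothetical non-blocking sync

void evolve_step(const std::vector<Tile>& tiles) {
  MPI_Request req;
  post_ghost_exchange(&req);  // start ghost-zone exchange asynchronously

#pragma omp parallel
#pragma omp single
  {
    // Interior tiles do not touch ghost zones: hand them to the team now.
    for (std::size_t n = 0; n < tiles.size(); ++n) {
      if (tiles[n].touches_ghosts) continue;
      Tile t = tiles[n];
#pragma omp task firstprivate(t)
      compute_tile(t);
    }

    // The single thread waits for the communication while the other
    // threads work on the interior tiles, then queues the boundary tiles.
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    for (std::size_t n = 0; n < tiles.size(); ++n) {
      if (!tiles[n].touches_ghosts) continue;
      Tile t = tiles[n];
#pragma omp task firstprivate(t)
      compute_tile(t);
    }
  }  // all tasks complete at the implicit barrier ending the parallel region
}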

I haven't been attending the ET phone calls recently because Monday
mornings aren't good for me schedule-wise. If you are interested, we can
make sure that we both attend at the same time and then discuss this. We
need to make sure that Roland Haas is then also attending.

-erik


On Sat, Jan 20, 2018 at 10:21 AM, James Healy <jch...@rit.edu> wrote:

      Hello all,

      I am trying to run on the new Skylake processors on Stampede2, and
      while the run speeds we are obtaining are very good, we are
      concerned that we aren't optimizing properly when it comes to
      OpenMP.  For instance, we see the best speeds when we use 8 MPI
      processes per node (with 6 threads each, for a total of 48
      threads/node).  Based on the architecture, we were expecting to
      see the best speeds with 2 MPI/node.  Here is what I have tried:

       1. Using the simfactory files for stampede2-skx (config file, run
          and submit scripts, and modules loaded) I compiled a version
          of ET_2017_06 using LazEv (RIT's evolution thorn) and
          McLachlan and submitted a series of runs that change both the
          number of nodes used, and how I distribute the 48 threads/node
          between MPI processes.
       2. I use a standard low resolution grid, with no IO or
          regridding.  Parameter file attached.
       3. Run speeds are measured from Carpet::physical_time_per_hour at
          iteration 256.
       4. I tried both with and without hwloc/SystemTopology.
        5. For both McLachlan and LazEv, I see similar results, with 2
           MPI/node giving the worst results (see attached plot for
           McLachlan) and a slight preference for 8 MPI/node.

      So my questions are:

        1. Have any tests been run by other users on Stampede2 SKX?
       2. Should we expect 2 MPI/node to be the optimal choice?
       3. If so, are there any other configurations we can try that
          could help optimize?

      Thanks in advance!

      Jim Healy




--
Erik Schnetter <schnet...@cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/




