Hello Jim,

Thank you very much for giving this a spin.

Yours,
Roland

> Hi Erik, Roland, all,
> 
> After our discussion on last week's telecon, I followed Roland's
> instructions for getting the branch that changes how Carpet handles
> prolongation with respect to OpenMP.  I reran my simple scaling test on
> Stampede2 Skylake nodes using this branch of Carpet (rhaas/openmp-tasks)
> to test the scalability.
> 
> Attached is a plot showing the run speeds for a variety of node counts
> and of distributions of the 48 threads/node between MPI processes and
> OpenMP threads.  I did this for three versions of the ETK:
> 
>  1. A fresh checkout of ET_2017_06.
>  2. ET_2017_06 with Carpet switched to rhaas/openmp-tasks (labelled
>     "Test On").
>  3. The same checkout as #2, but without the parameters that enable
>     the new prolongation code (labelled "Test Off").
> 
> Run speeds were grabbed at iteration 256 from
> Carpet::physical_time_per_hour.  No IO or regridding.
> 
> For 4 and 8 nodes (i.e. 192 and 384 cores), there wasn't much
> difference between the three trials.  However, for 16 and 24 nodes (768
> and 1152 cores), we see some improvement in run speed (10-15%) for many
> choices of thread distribution, again with a slight preference for
> 8 ranks/node.
> 
> I also ran the previous test (not using the openmp-tasks branch) on
> Comet, and found results similar to before.
> 
> Thanks,
> Jim
> 
> On 01/21/2018 01:07 PM, Erik Schnetter wrote:
> > James
> >
> > I looked at OpenMP performance in the Einstein Toolkit a few months
> > ago, and I found that Carpet's prolongation operators are not well
> > parallelized. There is a branch in Carpet (and a few related thorns)
> > that applies a different OpenMP parallelization strategy, which seems
> > to be more efficient. We are currently looking into cherry-picking
> > the relevant changes from this branch (there are also many unrelated
> > changes, since I experimented a lot) and putting them back into the
> > master branch.
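> >
> > As a concrete illustration (this is not the actual branch code; all
> > names here are hypothetical), a task-based strategy replaces one
> > small "parallel for" per interpolation loop with one task per
> > prolongation job, so idle threads can steal work across components
> > and variables:
> >
> >     #include <omp.h>
> >     #include <vector>
> >     #include <cstddef>
> >
> >     // Hypothetical work item: in Carpet this would identify one
> >     // component of one grid function to interpolate from a coarse
> >     // to a fine refinement level.
> >     struct ProlongJob { int var, component, level; };
> >
> >     void do_prolongate(const ProlongJob& job);  // hypothetical kernel
> >
> >     void prolongate_all(const std::vector<ProlongJob>& jobs) {
> >       // One task per job instead of one "parallel for" per small
> >       // interpolation loop; idle threads steal queued tasks.
> >       #pragma omp parallel
> >       #pragma omp single nowait
> >       for (std::size_t i = 0; i < jobs.size(); ++i) {
> >         #pragma omp task firstprivate(i) shared(jobs)
> >         do_prolongate(jobs[i]);
> >       }
> >       // implicit barrier at the end of the parallel region
> >       // drains the remaining tasks
> >     }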
> >
> > These changes only help with prolongation, which seems to be a major
> > contributor to poor OpenMP scalability. I experimented with other
> > changes as well. My findings (unfortunately without good solutions so
> > far) are:
> >
> > - The standard OpenMP parallelization of loops over grid functions is
> > not good for data cache locality. I experimented with padding arrays,
> > ensuring that loop boundaries align with cache line boundaries, etc.,
> > but this never worked quite satisfactorily -- MPI parallelization is
> > still faster than OpenMP. In effect, the only reason to use OpenMP at
> > all is that, once one runs into MPI's scalability limits, OpenMP's
> > non-scalability becomes the lesser problem. (A sketch of the
> > cache-line alignment idea follows after this list.)
> >
> > - We could overlap calculations with communication. To do so, I have
> > experimental changes that break loops over grid functions into tiles.
> > Outer tiles need to wait for communication (synchronization or
> > prolongation) to finish, while inner tiles can be calculated right
> > away. Unfortunately, OpenMP does not support open-ended threads like
> > this, so I'm using Qthreads <https://github.com/Qthreads/qthreads>
> > and FunHPC <https://bitbucket.org/eschnett/funhpc.cxx> for this. The
> > respective changes to Carpet, the scheduler, and thorns are
> > significant, and I couldn't demonstrate any performance improvements
> > yet. However, once we have removed other, more prominent causes of
> > non-scalability, I hope that this will become interesting. (A sketch
> > of the tiling idea also follows below.)
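> >
> > To illustrate the cache-line idea from the first point (a sketch
> > assuming 64-byte lines and 8-byte doubles, not Carpet's actual code):
> > parallelize over whole cache lines, so the chunk boundaries of the
> > static schedule never split a line between two threads.
> >
> >     #include <cstddef>
> >
> >     // Points per 64-byte cache line, assuming 8-byte doubles.
> >     constexpr std::size_t line_pts = 64 / sizeof(double);
> >
> >     // Assumes u and rhs come from padded, 64-byte-aligned
> >     // allocations, so index 0 starts on a cache-line boundary.
> >     void update(double* __restrict u, const double* __restrict rhs,
> >                 std::size_t n, double dt) {
> >       const std::size_t nlines = (n + line_pts - 1) / line_pts;
> >       // Parallelize over whole lines: no cache line is ever shared
> >       // between two threads' chunks (no false sharing).
> >     #pragma omp parallel for schedule(static)
> >       for (std::size_t l = 0; l < nlines; ++l) {
> >         const std::size_t i0 = l * line_pts;
> >         const std::size_t i1 = i0 + line_pts < n ? i0 + line_pts : n;
> >         for (std::size_t i = i0; i < i1; ++i)
> >           u[i] += dt * rhs[i];
> >       }
> >     }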
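> >
> > The tiling idea from the second point can be approximated even
> > without Qthreads/FunHPC, at whole-phase granularity rather than with
> > per-tile suspension (all names below are hypothetical stand-ins for
> > Carpet structures):
> >
> >     #include <mpi.h>
> >
> >     struct Tile { int imin[3], imax[3]; };
> >     struct GridPatch;  // grid data plus an inner/outer tiling
> >
> >     // Post nonblocking ghost-zone sends/receives; returns request count.
> >     int exchange_ghosts_begin(GridPatch& p, MPI_Request* reqs);
> >     int num_inner_tiles(const GridPatch& p);
> >     int num_outer_tiles(const GridPatch& p);
> >     Tile inner_tile(const GridPatch& p, int t);
> >     Tile outer_tile(const GridPatch& p, int t);
> >     void compute_tile(GridPatch& p, const Tile& t);
> >
> >     enum { MAX_REQS = 52 };  // 26 neighbor directions, send+recv each
> >
> >     void evolve_step(GridPatch& p) {
> >       MPI_Request reqs[MAX_REQS];
> >       const int nreq = exchange_ghosts_begin(p, reqs);
> >
> >       // Inner tiles do not touch ghost zones: compute them right
> >       // away, overlapping with the communication above.
> >       const int nin = num_inner_tiles(p);
> >     #pragma omp parallel for schedule(dynamic)
> >       for (int t = 0; t < nin; ++t)
> >         compute_tile(p, inner_tile(p, t));
> >
> >       MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  // ghosts now valid
> >
> >       // Outer tiles needed the fresh ghost data.
> >       const int nout = num_outer_tiles(p);
> >     #pragma omp parallel for schedule(dynamic)
> >       for (int t = 0; t < nout; ++t)
> >         compute_tile(p, outer_tile(p, t));
> >     }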
> >
> > I haven't been attending the ET phone calls recently because Monday
> > mornings aren't good for me schedule-wise. If you are interested,
> > then we can ensure that we both attend at the same time and then
> > discuss this. We need to make sure that Roland Haas is also attending.
> >
> > -erik
> >
> >
> > On Sat, Jan 20, 2018 at 10:21 AM, James Healy <[email protected]> wrote:
> >
> >     Hello all,
> >
> >     I am trying to run on the new Skylake processors on Stampede2,
> >     and while the run speeds we are obtaining are very good, we are
> >     concerned that we aren't optimizing properly when it comes to
> >     OpenMP.  For instance, we see the best speeds when we use 8 MPI
> >     processes per node (with 6 threads each, for a total of 48
> >     threads/node).  Based on the architecture, we were expecting the
> >     best speeds with 2 MPI/node.  Here is what I have tried:
> >
> >      1. Using the simfactory files for stampede2-skx (config file,
> >         run and submit scripts, and modules loaded), I compiled a
> >         version of ET_2017_06 using LazEv (RIT's evolution thorn) and
> >         McLachlan, and submitted a series of runs that vary both the
> >         number of nodes used and how the 48 threads/node are
> >         distributed between MPI processes and OpenMP threads.
> >      2. I use a standard low-resolution grid, with no IO or
> >         regridding.  Parameter file attached.
> >      3. Run speeds are measured from Carpet::physical_time_per_hour at
> >         iteration 256.
> >      4. I tried both with and without hwloc/SystemTopology.
> >      5. For both McLachlan and LazEv, I see similar results, with 2
> >         MPI/node giving the worst speeds (see attached plot for
> >         McLachlan) and a slight preference for 8 MPI/node.  (A small
> >         rank/thread placement probe is sketched after this list.)
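> >
> >     One quick way to double-check where the ranks and threads
> >     actually land on the cores is a tiny MPI+OpenMP probe (not part
> >     of the test above; all calls below are standard MPI/OpenMP/glibc,
> >     compiled with e.g. "mpicxx -fopenmp affinity.cc"):
> >
> >         #include <mpi.h>
> >         #include <omp.h>
> >         #include <sched.h>   // sched_getcpu (glibc)
> >         #include <cstdio>
> >
> >         int main(int argc, char** argv) {
> >           MPI_Init(&argc, &argv);
> >           int rank, len;
> >           char host[MPI_MAX_PROCESSOR_NAME];
> >           MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >           MPI_Get_processor_name(host, &len);
> >           // One line per thread: host, rank, thread id, and the
> >           // core the thread is currently running on.
> >           #pragma omp parallel
> >           std::printf("host %s rank %d thread %d/%d cpu %d\n",
> >                       host, rank, omp_get_thread_num(),
> >                       omp_get_num_threads(), sched_getcpu());
> >           MPI_Finalize();
> >           return 0;
> >         }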
> >
> >     So my questions are:
> >
> >      1. Have any tests been run by other users on Stampede2 SKX?
> >      2. Should we expect 2 MPI/node to be the optimal choice?
> >      3. If so, are there any other configurations we can try that
> >         could help optimize?
> >
> >     Thanks in advance!
> >
> >     Jim Healy
> >
> >
> > --
> > Erik Schnetter <[email protected]>
> > http://www.perimeterinstitute.ca/personal/eschnetter/
> 



-- 
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .
