Hello Jim,

thank you for benchmarking these. I have just updated the defaults in simfactory to be 2 threads per MPI rank (i.e. 24 MPI ranks per node), since this gave you the fastest simulation when using hwloc (though not without it).
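In case it is useful for further tests: a tiny MPI+OpenMP probe along the lines of the sketch below (purely illustrative, not part of simfactory) prints which core each thread ends up on, which makes it easy to compare the layouts you get with and without hwloc.

// Minimal MPI+OpenMP placement probe (illustrative only, not part of
// simfactory): every OpenMP thread of every MPI rank reports the core it is
// currently running on, so a bad default layout is easy to spot.
// Build e.g. with: mpicxx -fopenmp probe.cc -o probe
#include <mpi.h>
#include <omp.h>
#include <sched.h>   // sched_getcpu (Linux)
#include <unistd.h>  // gethostname
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char host[256];
  gethostname(host, sizeof host);
#pragma omp parallel
  {
    // With 24 ranks x 2 threads per Skylake node one would expect each
    // rank's two threads to land on neighbouring cores of one socket.
    std::printf("host %s rank %d thread %d/%d core %d\n", host, rank,
                omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  MPI_Finalize();
  return 0;
}

Launching this with the same launcher and OMP settings as the benchmark shows immediately whether threads share cores or spill across sockets.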
I suspect the hwloc requirement is due to a bad default layout of the threads by TACC. I have pushed (hopefully saner) settings for task binding into a branch rhaas/stampede2 (git pull ; git checkout rhaas/stampede2). If you have time, would you mind benchmarking using those settings as well, please?

Yours,
Roland

> Very good! That looks like a 25% speed improvement in the mid-range of #MPI processes per node.
>
> It also looks as if the maximum speed is achieved by using between 8 and 24 MPI processes per node, i.e. between 2 and 6 OpenMP threads per MPI process.
>
> -erik
>
> On Mon, Feb 19, 2018 at 10:07 AM, James Healy <[email protected]> wrote:
>
> > Hello all,
> >
> > I followed up our discussion from a few weeks ago by redoing the scaling tests with hwloc and SystemTopology turned on. I attached a plot showing the difference when using or not using the OpenMP tasks changes to prolongation. I also attached the stdout files (including the TimerReport output) for the ranks=24 runs with tasks on and off, both with hwloc.
> >
> > Thanks,
> > Jim
> >
> > On 01/26/2018 10:26 AM, Roland Haas wrote:
> >
> >> Hello Jim,
> >>
> >> thank you very much for giving this a spin.
> >>
> >> Yours,
> >> Roland
> >>
> >>> Hi Erik, Roland, all,
> >>>
> >>> After our discussion on last week's telecon, I followed Roland's instructions on how to get the branch that changes how Carpet handles prolongation with respect to OpenMP. I reran my simple scaling test on Stampede2 Skylake nodes using this branch of Carpet (rhaas/openmp-tasks) to test the scalability.
> >>>
> >>> Attached is a plot showing the speeds for a variety of numbers of nodes and of ways to distribute the 48 threads on each node between MPI processes and OpenMP threads. I did this for three versions of the ETK: (1) a fresh checkout of ET_2017_06; (2) ET_2017_06 with Carpet switched to rhaas/openmp-tasks (labelled "Test On"); (3) the checkout from (2), but without the parameters that enable the new prolongation code (labelled "Test Off"). The run speeds were taken from Carpet::physical_time_per_hour at iteration 256. No IO or regridding.
> >>>
> >>> For 4 and 8 nodes (i.e. 192 and 384 cores), there wasn't much difference between the three trials. However, for 16 and 24 nodes (768 and 1152 cores), we see some improvement in run speed (10-15%) for many choices of thread distribution, again with a slight preference for 8 ranks/node.
> >>>
> >>> I also ran the previous test (not using the openmp-tasks branch) on Comet and found similar results as before.
> >>>
> >>> Thanks,
> >>> Jim
> >>>
> >>> On 01/21/2018 01:07 PM, Erik Schnetter wrote:
> >>>
> >>>> James
> >>>>
> >>>> I looked at OpenMP performance in the Einstein Toolkit a few months ago, and I found that Carpet's prolongation operators are not well parallelized. There is a branch in Carpet (and a few related thorns) that applies a different OpenMP parallelization strategy, which seems to be more efficient. We are currently looking into cherry-picking the relevant changes from this branch (there are also many unrelated changes, since I experimented a lot) and putting them back into the master branch.
> >>>>
> >>>> These changes only help with prolongation, which seems to be a major contributor to the lack of OpenMP scalability. I experimented with other changes as well.
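My understanding of the tasking strategy, in rough sketch form (Region and prolongate_point are placeholders, not the actual Carpet code): instead of running a parallel loop inside each (often small) prolongation region, one OpenMP task is spawned per region, so that many small regions can keep all threads busy at once.

// Rough sketch of the two parallelization strategies for prolongation
// (illustrative only; Region and prolongate_point are placeholders, not the
// actual Carpet code).
#include <omp.h>
#include <cstddef>
#include <vector>

struct Region {
  int npoints;  // number of fine-grid points in this prolongation patch
  // ... bounding box, refinement level, component, ...
};

void prolongate_point(const Region &r, int i) {
  // interpolate one fine-grid point from the coarse grid (stub)
}

// Master-branch style: a parallel loop inside each region. Threads idle when
// npoints is small, and there is an implicit barrier after every region.
void prolongate_loop_based(const std::vector<Region> &regions) {
  for (const Region &r : regions) {
#pragma omp parallel for
    for (int i = 0; i < r.npoints; ++i)
      prolongate_point(r, i);
  }
}

// openmp-tasks style (as I understand it): one task per region, so many small
// regions can be interpolated concurrently without per-region barriers.
void prolongate_task_based(const std::vector<Region> &regions) {
#pragma omp parallel
#pragma omp single nowait
  for (std::size_t k = 0; k < regions.size(); ++k) {
#pragma omp task firstprivate(k)
    {
      const Region &r = regions[k];
      for (int i = 0; i < r.npoints; ++i)
        prolongate_point(r, i);
    }
  }
}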
> >>>> My findings (unfortunately without good solutions so far) are:
> >>>>
> >>>> - The standard OpenMP parallelization of loops over grid functions is not good for data cache locality. I experimented with padding arrays, ensuring that loop boundaries align with cache line boundaries, etc., but this never worked quite satisfactorily -- MPI parallelization is still faster than OpenMP. In effect, the only reason to use OpenMP is that once one encounters MPI's scalability limits, OpenMP's poorer scalability becomes the lesser evil.
> >>>>
> >>>> - We could overlap calculations with communication. To do so, I have experimental changes that break loops over grid functions into tiles. Outer tiles need to wait for communication (synchronization or prolongation) to finish, while inner tiles can be calculated right away. Unfortunately, OpenMP does not support open-ended threads like this, so I'm using Qthreads <https://github.com/Qthreads/qthreads> and FunHPC <https://bitbucket.org/eschnett/funhpc.cxx> for this. The respective changes to Carpet, the scheduler, and thorns are significant, and I couldn't demonstrate any performance improvements yet. However, once we have removed other, more prominent causes of non-scalability, I hope that this will become interesting.
> >>>>
> >>>> I haven't been attending the ET phone calls recently because Monday mornings aren't good for me schedule-wise. If you are interested, then we can make sure that we both attend at the same time and discuss this. We need to make sure that Roland Haas is then also attending.
> >>>>
> >>>> -erik
> >>>>
> >>>> On Sat, Jan 20, 2018 at 10:21 AM, James Healy <[email protected]> wrote:
> >>>>
> >>>>     Hello all,
> >>>>
> >>>>     I am trying to run on the new Skylake processors on Stampede2, and while the run speeds we are obtaining are very good, we are concerned that we aren't optimizing properly when it comes to OpenMP. For instance, we see the best speeds when we use 8 MPI processes per node (with 6 threads each, for a total of 48 threads/node). Based on the architecture, we were expecting to see the best speeds with 2 MPI/node. Here is what I have tried:
> >>>>
> >>>>     1. Using the simfactory files for stampede2-skx (config file, run and submit scripts, and modules loaded), I compiled a version of ET_2017_06 using LazEv (RIT's evolution thorn) and McLachlan, and submitted a series of runs that vary both the number of nodes used and how the 48 threads/node are distributed between MPI processes and OpenMP threads.
> >>>>     2. I use a standard low-resolution grid, with no IO or regridding. Parameter file attached.
> >>>>     3. Run speeds are measured from Carpet::physical_time_per_hour at iteration 256.
> >>>>     4. I tried both with and without hwloc/SystemTopology.
> >>>>     5. For both McLachlan and LazEv, I see similar results, with 2 MPI/node giving the worst results (see attached plot for McLachlan) and a slight preference for 8 MPI/node.
> >>>>
> >>>>     So my questions are:
> >>>>
> >>>>     1. Have any tests been run by other users on Stampede2 SKX?
> >>>>     2. Should we expect 2 MPI/node to be the optimal choice?
> >>>>     3. If so, are there any other configurations we can try that could help optimize?
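Regarding Erik's second point above (overlapping computation with communication): the concept can also be mocked up with plain OpenMP task dependencies, roughly as in the sketch below. This is only an illustration of the idea with placeholder names (Tile, post_ghost_exchange, wait_ghost_exchange, compute_tile); Erik's experimental Carpet changes use Qthreads and FunHPC rather than OpenMP tasks.

// Sketch of overlapping ghost-zone communication with interior work using
// OpenMP task dependencies (concept only; all names are placeholders, and the
// experimental Carpet changes use Qthreads/FunHPC instead of OpenMP tasks).
#include <omp.h>
#include <cstddef>
#include <vector>

struct Tile {
  bool touches_ghosts;  // does this tile read ghost or prolongation points?
  // ... index range of the tile ...
};

void post_ghost_exchange() { /* start non-blocking synchronization (stub) */ }
void wait_ghost_exchange() { /* finish it, e.g. MPI_Waitall (stub) */ }
void compute_tile(const Tile &t) { /* apply the update on one tile (stub) */ }

void evolve_step(const std::vector<Tile> &tiles) {
  int ghosts_ready = 0;  // dummy object carrying the "communication done" dependency

#pragma omp parallel
#pragma omp single nowait
  {
    post_ghost_exchange();

    // One task completes the communication; outer tiles depend on it.
#pragma omp task depend(out : ghosts_ready)
    wait_ghost_exchange();

    for (std::size_t k = 0; k < tiles.size(); ++k) {
      if (tiles[k].touches_ghosts) {
        // Outer tiles must wait until the ghost zones have arrived.
#pragma omp task firstprivate(k) depend(in : ghosts_ready)
        compute_tile(tiles[k]);
      } else {
        // Inner tiles can run immediately and hide the communication time.
#pragma omp task firstprivate(k)
        compute_tile(tiles[k]);
      }
    }
  }
}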
> >>>>
> >>>>     Thanks in advance!
> >>>>
> >>>>     Jim Healy
> >>>>
> >>>> --
> >>>> Erik Schnetter <[email protected]>
> >>>> http://www.perimeterinstitute.ca/personal/eschnetter/

--
My email is as private as my paper mail. I therefore support encrypting and signing email messages. Get my PGP key from http://pgp.mit.edu .
