Very good! That looks like a 25% speed improvement in the mid-range of the
number of MPI processes per node.

It also looks as if the maximum speed is achieved by using between 8 and 24
MPI processes per node, i.e. (with 48 cores per node) between 6 and 2
OpenMP threads per MPI process, respectively.

-erik

On Mon, Feb 19, 2018 at 10:07 AM, James Healy <jch...@rit.edu> wrote:

> Hello all,
>
> I followed up on our discussion from a few weeks ago by redoing the
> scaling tests, this time with hwloc and SystemTopology turned on.  I
> attached a plot showing the difference when using or not using the
> OpenMP tasks changes to prolongation.  I also attached the stdout files
> for the ranks=24 runs, with tasks on and off, with hwloc, including the
> TimerReport output.
>
> Thanks,
> Jim
>
>
> On 01/26/2018 10:26 AM, Roland Haas wrote:
>
>> Hello Jim,
>>
>> thank you very much for giving this a spin.
>>
>> Yours,
>> Roland
>>
>>> Hi Erik, Roland, all,
>>>
>>> After our discussion on last week's telecon, I followed Roland's
>>> instructions for getting the branch that changes how Carpet handles
>>> prolongation with respect to OpenMP.  I reran my simple scaling test
>>> on Stampede2 Skylake nodes using this branch of Carpet
>>> (rhaas/openmp-tasks) to test the scalability.
>>>
>>> Attached is a plot showing the speeds for a variety of node counts
>>> and of ways of distributing each node's 48 threads between MPI
>>> processes and OpenMP threads.  I did this for three versions of the
>>> ETK: 1. a fresh checkout of ET_2017_06; 2. ET_2017_06 with Carpet
>>> switched to rhaas/openmp-tasks (labelled "Test On"); 3. the checkout
>>> from #2 again, but without the parameters that enable the new
>>> prolongation code (labelled "Test Off").  The run speeds were grabbed
>>> at iteration 256 from Carpet::physical_time_per_hour.  No IO or
>>> regridding.
>>>
>>> For 4 and 8 nodes (i.e. 192 and 384 cores), there wasn't much
>>> difference between the three trials.  However, for 16 and 24 nodes
>>> (768 and 1152 cores), we see some improvement in run speed (10-15%)
>>> for many choices of thread distribution, again with a slight
>>> preference for 8 ranks/node.
>>>
>>> I also ran the previous test (not using the openmp-tasks branch) on
>>> Comet, and found results similar to before.
>>>
>>> Thanks,
>>> Jim
>>>
>>> On 01/21/2018 01:07 PM, Erik Schnetter wrote:
>>>
>>>> James
>>>>
>>>> I looked at OpenMP performance in the Einstein Toolkit a few months
>>>> ago, and I found that Carpet's prolongation operators are not well
>>>> parallelized. There is a branch in Carpet (and a few related thorns)
>>>> that applies a different OpenMP parallelization strategy, which seems
>>>> to be more efficient. We are currently looking into cherry-picking
>>>> the relevant changes from this branch (there are also many unrelated
>>>> changes, since I experimented a lot) and putting them back into the
>>>> master branch.
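>>>>
>>>> Roughly, the strategy is the following (a minimal sketch with made-up
>>>> names, not the actual Carpet code): instead of parallelizing the
>>>> point loop inside each prolongation region -- which leaves threads
>>>> idle when regions are small -- spawn one OpenMP task per region, so
>>>> that many small, independent regions are processed concurrently:
>>>>
>>>>     // Sketch only: "Region" and "prolongate" are hypothetical
>>>>     // stand-ins for Carpet's per-component prolongation work.
>>>>     #include <cstddef>
>>>>     #include <cstdio>
>>>>     #include <vector>
>>>>     #include <omp.h>
>>>>
>>>>     struct Region { int npoints; };
>>>>
>>>>     static void prolongate(const Region &r) {
>>>>       volatile double acc = 0.0;  // placeholder interpolation work
>>>>       for (int i = 0; i < r.npoints; ++i) acc += 0.5 * i;
>>>>     }
>>>>
>>>>     int main() {
>>>>       std::vector<Region> regions(64, Region{100000});
>>>>       #pragma omp parallel   // one team of threads...
>>>>       #pragma omp single     // ...one thread creates the tasks
>>>>       for (std::size_t i = 0; i < regions.size(); ++i) {
>>>>         #pragma omp task firstprivate(i) shared(regions)
>>>>         prolongate(regions[i]);  // idle threads pick up tasks
>>>>       }  // all tasks complete at the implicit barrier here
>>>>       std::printf("ran with %d threads\n", omp_get_max_threads());
>>>>     }
>>>>
>>>> Whether this wins depends on having enough regions per process to
>>>> keep all threads busy.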
>>>>
>>>> These changes only help with prolongation, which seems to be a major
>>>> contributor to non-OpenMP-scalability. I experimented with other
>>>> changes as well. My findings (unfortunately without good solutions so
>>>> far) are:
>>>>
>>>> - The standard OpenMP parallelization of loops over grid functions is
>>>> not good for data cache locality. I experimented with padding arrays,
>>>> ensuring that loop boundaries align with cache line boundaries, etc.
>>>> (first sketch below), but this never worked quite satisfactorily --
>>>> MPI parallelization is still faster than OpenMP. In effect, the only
>>>> reason to use OpenMP is that, once one encounters MPI's scalability
>>>> limits, OpenMP's non-scalability is the lesser problem.
>>>>
>>>> - We could overlap calculations with communication. To do so, I have
>>>> experimental changes that break loops over grid functions into tiles.
>>>> Outer tiles need to wait for communication (synchronization or
>>>> prolongation) to finish, while inner tiles can be calculated right
>>>> away (second sketch below). Unfortunately, OpenMP does not support
>>>> open-ended threads like this, so I'm using Qthreads
>>>> <https://github.com/Qthreads/qthreads> and FunHPC
>>>> <https://bitbucket.org/eschnett/funhpc.cxx> for this. The respective
>>>> changes to Carpet, the scheduler, and thorns are significant, and I
>>>> couldn't prove any performance improvements yet. However, once we
>>>> have removed other, more prominent causes of non-scalability, I hope
>>>> that this will become interesting.
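>>>>
>>>> To make those two points concrete, here are two toy sketches; names,
>>>> extents, and layouts are assumptions for illustration, not code from
>>>> the branch. First, the padding idea: round the contiguous i-extent of
>>>> each array up to a whole number of 64-byte cache lines, so that the
>>>> chunks handed to different OpenMP threads do not share cache lines:
>>>>
>>>>     // Assumed parameters: 64-byte cache lines, double-precision
>>>>     // grid functions, and a 64-byte-aligned allocation base.
>>>>     #include <cstddef>
>>>>     #include <vector>
>>>>
>>>>     constexpr std::size_t line_bytes   = 64;
>>>>     constexpr std::size_t line_doubles = line_bytes / sizeof(double);
>>>>
>>>>     // Round the fastest-varying extent up to whole cache lines.
>>>>     constexpr std::size_t padded(std::size_t ni) {
>>>>       return (ni + line_doubles - 1) / line_doubles * line_doubles;
>>>>     }
>>>>
>>>>     int main() {
>>>>       const std::size_t ni = 67, nj = 67, nk = 67;  // grid extents
>>>>       const std::size_t nip = padded(ni);           // 72, not 67
>>>>       std::vector<double> u(nip * nj * nk, 0.0);
>>>>
>>>>       // Split work along the slowest dimension so each thread owns
>>>>       // contiguous, cache-line-aligned slabs.
>>>>       #pragma omp parallel for
>>>>       for (std::size_t k = 0; k < nk; ++k)
>>>>         for (std::size_t j = 0; j < nj; ++j)
>>>>           for (std::size_t i = 0; i < ni; ++i)
>>>>             u[(k * nj + j) * nip + i] = 1.0;
>>>>     }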
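>>>>
>>>> Second, the tiling/overlap idea, approximated here with plain
>>>> non-blocking MPI plus OpenMP rather than Qthreads/FunHPC (a 1D toy
>>>> problem with one ghost point per side):
>>>>
>>>>     #include <mpi.h>
>>>>     #include <vector>
>>>>
>>>>     int main(int argc, char **argv) {
>>>>       MPI_Init(&argc, &argv);
>>>>       int rank, size;
>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>
>>>>       const int n = 1 << 20;  // interior points per rank
>>>>       std::vector<double> u(n + 2, 1.0), v(n + 2, 0.0);
>>>>       const int left  = (rank + size - 1) % size;
>>>>       const int right = (rank + 1) % size;
>>>>
>>>>       // Start the ghost-point exchange without blocking.
>>>>       MPI_Request r[4];
>>>>       MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &r[0]);
>>>>       MPI_Irecv(&u[n + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &r[1]);
>>>>       MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &r[2]);
>>>>       MPI_Isend(&u[n],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &r[3]);
>>>>
>>>>       // Inner tiles touch no ghost points: compute them right away.
>>>>       #pragma omp parallel for
>>>>       for (int i = 2; i <= n - 1; ++i)
>>>>         v[i] = 0.5 * (u[i - 1] + u[i + 1]);
>>>>
>>>>       // Outer tiles must wait for the exchange to finish.
>>>>       MPI_Waitall(4, r, MPI_STATUSES_IGNORE);
>>>>       v[1] = 0.5 * (u[0] + u[2]);
>>>>       v[n] = 0.5 * (u[n - 1] + u[n + 1]);
>>>>
>>>>       MPI_Finalize();
>>>>     }
>>>>
>>>> With OpenMP alone the MPI_Waitall is a hard join; with open-ended
>>>> tasks the outer tiles could instead be scheduled as soon as the
>>>> messages arrive.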
>>>>
>>>> I haven't been attending the ET phone calls recently because Monday
>>>> mornings aren't good for me schedule-wise. If you are interested, we
>>>> can make sure that we both attend at the same time and then discuss
>>>> this. We should make sure that Roland Haas is also attending.
>>>>
>>>> -erik
>>>>
>>>>
>>>> On Sat, Jan 20, 2018 at 10:21 AM, James Healy <jch...@rit.edu> wrote:
>>>>
>>>>      Hello all,
>>>>
>>>>      I am trying to run on the new Skylake processors on Stampede2,
>>>>      and while the run speeds we are obtaining are very good, we are
>>>>      concerned that we aren't optimizing properly when it comes to
>>>>      OpenMP.  For instance, we see the best speeds when we use 8 MPI
>>>>      processes per node (with 6 threads each, for a total of 48
>>>>      threads/node).  Based on the architecture, we were expecting to
>>>>      see the best speeds with 2 MPI processes/node.  Here is what I
>>>>      have tried:
>>>>
>>>>       1. Using the simfactory files for stampede2-skx (config file,
>>>>          run and submit scripts, and modules loaded), I compiled a
>>>>          version of ET_2017_06 using LazEv (RIT's evolution thorn) and
>>>>          McLachlan, and submitted a series of runs that vary both the
>>>>          number of nodes used and how the 48 threads/node are
>>>>          distributed between MPI processes and OpenMP threads.
>>>>       2. I use a standard low-resolution grid, with no IO or
>>>>          regridding.  Parameter file attached.
>>>>       3. Run speeds are measured from Carpet::physical_time_per_hour
>>>>          at iteration 256.
>>>>       4. I tried both with and without hwloc/SystemTopology (the
>>>>          placement check sketched after this list shows where threads
>>>>          actually land).
>>>>       5. For both McLachlan and LazEv, I see similar results, with 2
>>>>          MPI/node giving the worst results (see attached plot for
>>>>          McLachlan) and a slight preference for 8 MPI/node.
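>>>>
>>>>      (The placement check from item 4 is a toy MPI+OpenMP diagnostic,
>>>>      not part of the toolkit; sched_getcpu() is Linux-specific.  It
>>>>      prints which core each thread of each rank lands on:
>>>>
>>>>          #include <mpi.h>
>>>>          #include <omp.h>
>>>>          #include <sched.h>   // sched_getcpu(), Linux/glibc
>>>>          #include <cstdio>
>>>>
>>>>          int main(int argc, char **argv) {
>>>>            MPI_Init(&argc, &argv);
>>>>            int rank;
>>>>            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>            // Each OpenMP thread reports its rank, id, and core.
>>>>            #pragma omp parallel
>>>>            std::printf("rank %d thread %d/%d on cpu %d\n", rank,
>>>>                        omp_get_thread_num(), omp_get_num_threads(),
>>>>                        sched_getcpu());
>>>>            MPI_Finalize();
>>>>          }
>>>>
>>>>      Built with e.g. "mpicxx -fopenmp" and run with OMP_NUM_THREADS=6
>>>>      and 8 ranks/node, it shows whether the 48 threads are pinned the
>>>>      way we intend.)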
>>>>
>>>>      So my questions are:
>>>>
>>>>       1. Have any tests been run by other users on Stampede2 SKX?
>>>>       2. Should we expect 2 MPI/node to be the optimal choice?
>>>>       3. If so, are there any other configurations we can try that
>>>>          could help optimize?
>>>>
>>>>      Thanks in advance!
>>>>
>>>>      Jim Healy
>>>>
>>>>
>>>> --
>>>> Erik Schnetter <schnet...@cct.lsu.edu>
>>>> http://www.perimeterinstitute.ca/personal/eschnetter/
>>>>
>>>>
>>>
>>
>>
>


-- 
Erik Schnetter <schnet...@cct.lsu.edu>
http://www.perimeterinstitute.ca/personal/eschnetter/
_______________________________________________
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users
