On 24 Jul 2015, at 23:01, Erik Schnetter <[email protected]> wrote:
> On Fri, Jul 24, 2015 at 3:43 PM, Ian Hinder <[email protected]> wrote:
>
> On 24 Jul 2015, at 20:39, Erik Schnetter <[email protected]> wrote:
>
>> On Fri, Jul 24, 2015 at 1:58 PM, Ian Hinder <[email protected]> wrote:
>>
>> On 24 Jul 2015, at 19:42, Erik Schnetter <[email protected]> wrote:
>>
>>> On Fri, Jul 24, 2015 at 1:39 PM, Ian Hinder <[email protected]> wrote:
>>>
>>> On 24 Jul 2015, at 19:15, Erik Schnetter <[email protected]> wrote:
>>>
>>>> On Fri, Jul 24, 2015 at 11:57 AM, Ian Hinder <[email protected]> wrote:
>>>>
>>>> On 8 Jul 2015, at 16:53, Ian Hinder <[email protected]> wrote:
>>>>>
>>>>> On 8 Jul 2015, at 15:14, Erik Schnetter <[email protected]> wrote:
>>>>>
>>>>>> I added a second benchmark, using a Thornburg04 patch system, 8th order
>>>>>> finite differencing, and 4th order patch interpolation. The results are
>>>>>>
>>>>>> original: 8.53935e-06 sec
>>>>>> rewrite: 8.55188e-06 sec
>>>>>>
>>>>>> this time with 1 thread per MPI process, since that was most efficient
>>>>>> in both cases. Most of the time is spent in inter-patch interpolation,
>>>>>> which is much more expensive than in a "regular" case since this
>>>>>> benchmark is run on a single node and hence with very small grids.
>>>>>>
>>>>>> With these numbers under our belt, can we merge the rewrite branch?
>>>>>
>>>>> The "jacobian" benchmark that I gave you was still a pure kernel
>>>>> benchmark, involving no interpatch interpolation. It just measured the
>>>>> speed of the RHSs when Jacobians were included. I would also not use a
>>>>> single-threaded benchmark with very small grid sizes; this might have
>>>>> been fastest in this artificial case, but in practice I don't think we
>>>>> would use that configuration. The benchmark you have now run seems to be
>>>>> more of a "complete system" benchmark, which is useful, but different.
>>>>>
>>>>> I think it is important that the kernel itself has not gotten slower,
>>>>> even if the kernel is not currently a major contributor to runtime. We
>>>>> specifically split out the advection derivatives because they made the
>>>>> code with 8th order and Jacobians a fair bit slower. I would just like
>>>>> to see that this is not still the case with the new version, which has
>>>>> changed the way this is handled.
>>>>
>>>> I have now run my benchmarks on both the original and the rewritten
>>>> McLachlan. I seem to find that the ML_BSSN_* functions in
>>>> Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns, excluding the constraint
>>>> calculations, are between 11% and 15% slower with the rewrite branch,
>>>> depending on the details of the evolution. See attached plot. This is on
>>>> Datura with quite old CPUs (Intel Xeon CPU X5650 2.67GHz).
>>>>
>>>> What exactly do you measure -- which bins or routines? Does this involve
>>>> communication? Are you using thorn Dissipation?
>>>
>>> I take all the timers in Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns that
>>> start with ML_BSSN_ and eliminate the ones containing "constraints" (case
>>> insensitive). This is running on two processes, one node, 6 threads per
>>> node. Threads are correctly bound to cores. There is ghostzone exchange
>>> between the processes, so yes, there is communication in the
>>> ML_BSSN_SelectBCs SYNC calls, but it is node-local.
>>>
>>> Can you include thorn Dissipation in the "before" case, and use McLachlan's
>>> dissipation in the "after" case?
>>
>> There is no dissipation in either case.
>>
>> The output data is in
>>
>> http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/orig/20150724-174334
>>
>> http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/rewrite/20150724-170542
>>
>> including the parameter files.
>>
>> Actually, what I said before was wrong; the timers I am using are under
>> "thorns", not "syncs", so even the node-local communication should not be
>> counted.
>>
>> McLachlan has not been optimized for runs without dissipation. If you think
>> this is important, then we can introduce a special case. I expect this to
>> improve performance. However, running BSSN without dissipation is not what
>> one would do in production, so I didn't investigate this case.
>
> I agree that runs without dissipation are not relevant, but since I usually
> use the Dissipation thorn, I didn't include it in the benchmark, which was a
> benchmark of McLachlan. I assume that McLachlan now always calculates the
> dissipation term, even when it is zero, and that is what you mean by "not
> optimised"? This will introduce a performance regression (if this is the
> reason for the increased benchmark time, then presumably only on the level of
> ~15% for the kernel, hence less for a whole simulation) for any simulation
> which uses dissipation from the Dissipation thorn. Since McLachlan's
> dissipation was previously very slow, this is presumably what most existing
> parameter files use.
>
> Regarding switching to McLachlan's dissipation: it is a bit more limited than
> the Dissipation thorn; it looks like McLachlan is hard-coded to use
> dissipation of order 1+fdOrder, rather than allowing the dissipation order to
> be chosen separately. Sometimes lower orders are used as an optimisation (the
> effect on convergence being judged to be minimal). And, critically, there is
> no way to specify different dissipation orders on different refinement
> levels, which is typically done in production binary simulations.
>
> In other words, you are asking for a version of ML_BSSN where it is efficient
> to not use dissipation. Currently, that means that dissipation is disabled.
> The question is -- should this be the default?
>
> Do you think it is faster to use dissipation from McLachlan than to use that
> provided by Dissipation?
>
> Yes, I think so.

I don't know. Without knowing performance numbers, it is difficult to judge.
Since people may be using McLachlan's dissipation in their parameter files
(even though it is slow), it's probably not a good idea to disable it by
default.

Is it possible to make McLachlan efficient when dissipation is disabled, but
keep the code for it there, e.g. by wrapping it in a conditional? If the
condition is a scalar, this should be fine even with vectorisation, no?

-- 
Ian Hinder
http://members.aei.mpg.de/ianhin
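PS: To illustrate the kind of thing I mean, here is a rough sketch in plain C
(this is not McLachlan's actual generated code; the function and all names in
it are made up). Because epsdiss is an ordinary scalar, the test is
loop-invariant and is taken once per kernel call, so the loop body that the
compiler vectorises is unchanged, and when the coefficient is zero the whole
dissipation kernel is skipped:

    #include <stddef.h>

    /* Hypothetical RHS kernel: add a Kreiss-Oliger dissipation term
       (here 5th order, i.e. the undivided 6th difference) only when
       the scalar coefficient is nonzero.  "idx" is 1/dx. */
    void add_optional_dissipation(double *restrict rhs,
                                  const double *restrict u,
                                  size_t n, double idx, double epsdiss)
    {
      if (epsdiss != 0.0) {   /* scalar condition: one branch per call */
        for (size_t i = 3; i < n - 3; ++i) {
          /* (D_+ D_-)^3 u times h^6, a 7-point stencil */
          const double d6 = u[i-3] - 6.0*u[i-2] + 15.0*u[i-1] - 20.0*u[i]
                          + 15.0*u[i+1] - 6.0*u[i+2] + u[i+3];
          rhs[i] += epsdiss * idx * d6 / 64.0;  /* eps h^5 (D+D-)^3 u / 2^6 */
        }
      }
      /* epsdiss == 0: no per-point cost at all */
    }

If Kranc can emit the guard at this level, around the whole dissipation
kernel rather than per grid point, then runs without dissipation should cost
essentially nothing extra while the code stays in place.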
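PPS: For concreteness, by "dissipation of order q" above I mean the usual
Kreiss-Oliger operator added to each right-hand side; in one common
convention (signs and normalisations vary between codes),

    \partial_t u \mathrel{+}= \epsilon \,
        \frac{(-1)^{r+1}}{2^{2r}} \, h^{2r-1} \, D_+^r D_-^r u ,
    \qquad q = 2r - 1 ,

with D_+ and D_- the divided forward and backward differences. For
fdOrder = 8 this gives q = 1 + fdOrder = 9, i.e. r = 5; using a lower
dissipation order simply means a smaller r, and hence a narrower and
cheaper stencil.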
