On 24 Jul 2015, at 19:42, Erik Schnetter <[email protected]> wrote:

> On Fri, Jul 24, 2015 at 1:39 PM, Ian Hinder <[email protected]> wrote:
> 
> On 24 Jul 2015, at 19:15, Erik Schnetter <[email protected]> wrote:
> 
>> On Fri, Jul 24, 2015 at 11:57 AM, Ian Hinder <[email protected]> wrote:
>> 
>> On 8 Jul 2015, at 16:53, Ian Hinder <[email protected]> wrote:
>> 
>>> 
>>> On 8 Jul 2015, at 15:14, Erik Schnetter <[email protected]> wrote:
>>> 
>>>> I added a second benchmark, using a Thornburg04 patch system, 8th order 
>>>> finite differencing, and 4th order patch interpolation. The results are:
>>>> 
>>>> original: 8.53935e-06 sec
>>>> rewrite:  8.55188e-06 sec
>>>> 
>>>> This time I used 1 thread per MPI process, since that was the most
>>>> efficient configuration in both cases. Most of the time is spent in
>>>> inter-patch interpolation, which is much more expensive than in a
>>>> "regular" case, since this benchmark is run on a single node and hence
>>>> with very small grids.
>>>> 
>>>> With these numbers under our belt, can we merge the rewrite branch?
>>> 
>>> The "jacobian" benchmark that I gave you was still a pure kernel benchmark, 
>>> involving no inter-patch interpolation.  It just measured the speed of the
>>> RHSs when Jacobians were included.  I would also not use a single-threaded 
>>> benchmark with very small grid sizes; this might have been fastest in this 
>>> artificial case, but in practice I don't think we would use that 
>>> configuration.  The benchmark you have now run seems to be more of a 
>>> "complete system" benchmark, which is useful, but different.
>>> 
>>> I think it is important that the kernel itself has not gotten slower, even 
>>> if the kernel is not currently a major contributor to runtime.  We 
>>> specifically split out the advection derivatives because they made the code 
>>> with 8th order and Jacobians a fair bit slower.  I would just like to see 
>>> that this is not still the case with the new version, which has changed the 
>>> way this is handled.
>> 
>> I have now run my benchmarks on both the original and the rewritten 
>> McLachlan.  I seem to find that the ML_BSSN_* functions in
>> Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns, excluding the constraint 
>> calculations, are between 11% and 15% slower with the rewrite branch, 
>> depending on the details of the evolution.  See attached plot.  This is on 
>> Datura with quite old CPUs (Intel Xeon X5650 @ 2.67 GHz).
>> 
>> What exactly do you measure -- which bins or routines? Does this involve 
>> communication? Are you using thorn Dissipation?
> 
> 
> I take all the timers in Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns that 
> start with ML_BSSN_ and eliminate the ones containing "constraints" (case 
> insensitive).  This is running on two processes, one node, 6 threads per 
> node.  Threads are correctly bound to cores.  There is ghostzone exchange 
> between the processes, so yes, there is communication in the 
> ML_BSSN_SelectBCs SYNC calls, but it is node-local.
> 
> Can you include thorn Dissipation in the "before" case, and use McLachlan's 
> dissipation in the "after" case?

There is no dissipation in either case.

The output data, including the parameter files, is at

        http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/orig/20150724-174334
        http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/rewrite/20150724-170542

Actually, what I said before was wrong; the timers I am using are under 
"thorns", not "syncs", so even the node-local communication should not be 
counted.
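
For concreteness, the timer selection and comparison amount to roughly the
following (an illustrative Python sketch, not the script I actually use; the
dictionaries stand in for the per-timer seconds read from the Cactus timer
output of the two runs):

    # Illustrative sketch only.  Each dict maps a timer name under
    # Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns to its accumulated
    # time in seconds.

    def select_ml_bssn(timers):
        """Keep the ML_BSSN_* timers, dropping the constraint calculations."""
        return {name: t for name, t in timers.items()
                if name.startswith("ML_BSSN_")
                and "constraints" not in name.lower()}

    def relative_slowdown(orig, rewrite):
        """Fractional slowdown of the rewrite relative to the original."""
        t_orig = sum(select_ml_bssn(orig).values())
        t_rewrite = sum(select_ml_bssn(rewrite).values())
        return (t_rewrite - t_orig) / t_orig

    # For the runs linked above, this comes out between about 0.11 and
    # 0.15, depending on the details of the evolution.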

-- 
Ian Hinder
http://members.aei.mpg.de/ianhin

