On 24 Jul 2015, at 23:01, Erik Schnetter <[email protected]> wrote:
> On Fri, Jul 24, 2015 at 3:43 PM, Ian Hinder <[email protected]> wrote:
>
> On 24 Jul 2015, at 20:39, Erik Schnetter <[email protected]> wrote:
>
>> On Fri, Jul 24, 2015 at 1:58 PM, Ian Hinder <[email protected]> wrote:
>>
>> On 24 Jul 2015, at 19:42, Erik Schnetter <[email protected]> wrote:
>>
>>> On Fri, Jul 24, 2015 at 1:39 PM, Ian Hinder <[email protected]> wrote:
>>>
>>> On 24 Jul 2015, at 19:15, Erik Schnetter <[email protected]> wrote:
>>>
>>>> On Fri, Jul 24, 2015 at 11:57 AM, Ian Hinder <[email protected]> wrote:
>>>>
>>>> On 8 Jul 2015, at 16:53, Ian Hinder <[email protected]> wrote:
>>>>>
>>>>> On 8 Jul 2015, at 15:14, Erik Schnetter <[email protected]> wrote:
>>>>>
>>>>>> I added a second benchmark, using a Thornburg04 patch system, 8th order
>>>>>> finite differencing, and 4th order patch interpolation. The results are
>>>>>>
>>>>>> original: 8.53935e-06 sec
>>>>>> rewrite: 8.55188e-06 sec
>>>>>>
>>>>>> this time with 1 thread per MPI process, since that was most efficient
>>>>>> in both cases. Most of the time is spent in inter-patch interpolation,
>>>>>> which is much more expensive than in a "regular" case since this
>>>>>> benchmark is run on a single node and hence with very small grids.
>>>>>>
>>>>>> With these numbers under our belt, can we merge the rewrite branch?
>>>>>
>>>>> The "jacobian" benchmark that I gave you was still a pure kernel
>>>>> benchmark, involving no interpatch interpolation. It just measured the
>>>>> speed of the RHSs when Jacobians were included. I would also not use a
>>>>> single-threaded benchmark with very small grid sizes; this might have
>>>>> been fastest in this artificial case, but in practice I don't think we
>>>>> would use that configuration. The benchmark you have now run seems to be
>>>>> more of a "complete system" benchmark, which is useful, but different.
>>>>>
>>>>> I think it is important that the kernel itself has not gotten slower,
>>>>> even if the kernel is not currently a major contributor to runtime. We
>>>>> specifically split out the advection derivatives because they made the
>>>>> code with 8th order and Jacobians a fair bit slower. I would just like
>>>>> to see that this is not still the case with the new version, which has
>>>>> changed the way this is handled.
>>>>
>>>> I have now run my benchmarks on both the original and the rewritten
>>>> McLachlan. I seem to find that the ML_BSSN_* functions in
>>>> Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns, excluding the constraint
>>>> calculations, are between 11% and 15% slower with the rewrite branch,
>>>> depending on the details of the evolution. See attached plot. This is on
>>>> Datura with quite old CPUs (Intel Xeon CPU X5650 2.67GHz).
>>>>
>>>> What exactly do you measure -- which bins or routines? Does this involve
>>>> communication? Are you using thorn Dissipation?
>>>
>>> I take all the timers in Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns that
>>> start with ML_BSSN_ and eliminate the ones containing "constraints" (case
>>> insensitive). This is running on two processes, one node, 6 threads per
>>> node. Threads are correctly bound to cores. There is ghostzone exchange
>>> between the processes, so yes, there is communication in the
>>> ML_BSSN_SelectBCs SYNC calls, but it is node-local.
>>>
>>> Can you include thorn Dissipation in the "before" case, and use McLachlan's
>>> dissipation in the "after" case?
>>
>> There is no dissipation in either case.
>>
>> The output data is in
>>
>> http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/orig/20150724-174334
>>
>> http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/rewrite/20150724-170542
>>
>> including the parameter files.
>>
>> Actually, what I said before was wrong; the timers I am using are under
>> "thorns", not "syncs", so even the node-local communication should not be
>> counted.
>>
>> McLachlan has not been optimized for runs without dissipation. If you think
>> this is important, then we can introduce a special case. I expect this to
>> improve performance. However, running BSSN without dissipation is not what
>> one would do in production, so I didn't investigate this case.
>
> I agree that runs without dissipation are not relevant, but since I usually
> use the Dissipation thorn, I didn't include it in the benchmark, which was a
> benchmark of McLachlan. I assume that McLachlan now always calculates the
> dissipation term, even when it is zero, and that is what you mean by "not
> optimised"? This will introduce a performance regression (if this is the
> reason for the increased benchmark time, then presumably only on the level of
> ~15% for the kernel, hence less for a whole simulation) for any simulation
> which uses dissipation from the Dissipation thorn. Since McLachlan's
> dissipation was previously very slow, this is presumably what most existing
> parameter files use.
>
> Regarding switching to McLachlan's dissipation: it is a bit more limited than
> the Dissipation thorn; it looks like McLachlan is hard-coded to use
> dissipation of order 1+fdOrder, rather than allowing the dissipation order to
> be chosen separately. Sometimes lower orders are used as an optimisation (the
> effect on convergence being judged to be minimal). And, critically, there is
> no way to specify different dissipation orders on different refinement
> levels, which is typically done in production binary simulations.
>
> In other words, you are asking for a version of ML_BSSN where it is efficient
> to not use dissipation. Currently, that means that dissipation is disabled.
> The question is -- should this be the default?
>
> Do you think it is faster to use dissipation from McLachlan than to use that
> provided by Dissipation?
>
> Yes, I think so.

I don't know. Without knowing performance numbers, it is difficult to judge.
Since people may be using McLachlan's dissipation in their parameter files
(even though it is slow), it's probably not a good idea to disable it by
default.

Is it possible to make McLachlan efficient when dissipation is disabled, but
keep the code for it there, e.g. by wrapping it in a conditional? If the
condition is a scalar, this should be fine even with vectorisation, no?

-- 
Ian Hinder
http://members.aei.mpg.de/ianhin
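PS: To illustrate the kind of thing I mean, here is a rough sketch in plain C
(this is not McLachlan's actual generated code; the function and all names in
it are made up). Because epsdiss is an ordinary scalar, the test is
loop-invariant and is taken once per kernel call, so the loop body that the
compiler vectorises is unchanged, and when the coefficient is zero the whole
dissipation kernel is skipped:

    #include <stddef.h>

    /* Hypothetical RHS kernel: add a Kreiss-Oliger dissipation term
       (here 5th order, i.e. the undivided 6th difference) only when
       the scalar coefficient is nonzero.  "idx" is 1/dx. */
    void add_optional_dissipation(double *restrict rhs,
                                  const double *restrict u,
                                  size_t n, double idx, double epsdiss)
    {
      if (epsdiss != 0.0) {   /* scalar condition: one branch per call */
        for (size_t i = 3; i < n - 3; ++i) {
          /* (D_+ D_-)^3 u times h^6, a 7-point stencil */
          const double d6 = u[i-3] - 6.0*u[i-2] + 15.0*u[i-1] - 20.0*u[i]
                          + 15.0*u[i+1] - 6.0*u[i+2] + u[i+3];
          rhs[i] += epsdiss * idx * d6 / 64.0;  /* eps h^5 (D+D-)^3 u / 2^6 */
        }
      }
      /* epsdiss == 0: no per-point cost at all */
    }

If Kranc can emit the guard at this level, around the whole dissipation
kernel rather than per grid point, then runs without dissipation should cost
essentially nothing extra while the code stays in place.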
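PPS: For concreteness, by "dissipation of order q" above I mean the usual
Kreiss-Oliger operator added to each right-hand side; in one common
convention (signs and normalisations vary between codes),

    \partial_t u \mathrel{+}= \epsilon \,
        \frac{(-1)^{r+1}}{2^{2r}} \, h^{2r-1} \, D_+^r D_-^r u ,
    \qquad q = 2r - 1 ,

with D_+ and D_- the divided forward and backward differences. For
fdOrder = 8 this gives q = 1 + fdOrder = 9, i.e. r = 5; using a lower
dissipation order simply means a smaller r, and hence a narrower and
cheaper stencil.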
