On Fri, Jul 24, 2015 at 3:43 PM, Ian Hinder <[email protected]> wrote:
> On 24 Jul 2015, at 20:39, Erik Schnetter <[email protected]> wrote:
>
> On Fri, Jul 24, 2015 at 1:58 PM, Ian Hinder <[email protected]> wrote:
>
>> On 24 Jul 2015, at 19:42, Erik Schnetter <[email protected]> wrote:
>>
>> On Fri, Jul 24, 2015 at 1:39 PM, Ian Hinder <[email protected]> wrote:
>>
>>> On 24 Jul 2015, at 19:15, Erik Schnetter <[email protected]> wrote:
>>>
>>> On Fri, Jul 24, 2015 at 11:57 AM, Ian Hinder <[email protected]> wrote:
>>>
>>>> On 8 Jul 2015, at 16:53, Ian Hinder <[email protected]> wrote:
>>>>
>>>> On 8 Jul 2015, at 15:14, Erik Schnetter <[email protected]> wrote:
>>>>
>>>> I added a second benchmark, using a Thornburg04 patch system, 8th-order
>>>> finite differencing, and 4th-order patch interpolation. The results are
>>>>
>>>>     original: 8.53935e-06 sec
>>>>     rewrite:  8.55188e-06 sec
>>>>
>>>> this time with 1 thread per MPI process, since that was most efficient
>>>> in both cases. Most of the time is spent in inter-patch interpolation,
>>>> which is much more expensive than in a "regular" case, since this
>>>> benchmark is run on a single node and hence with very small grids.
>>>>
>>>> With these numbers under our belt, can we merge the rewrite branch?
>>>>
>>>> The "jacobian" benchmark that I gave you was still a pure kernel
>>>> benchmark, involving no interpatch interpolation. It just measured the
>>>> speed of the RHSs when Jacobians were included. I would also not use a
>>>> single-threaded benchmark with very small grid sizes; this might have
>>>> been fastest in this artificial case, but in practice I don't think we
>>>> would use that configuration. The benchmark you have now run seems to
>>>> be more of a "complete system" benchmark, which is useful, but
>>>> different.
>>>>
>>>> I think it is important that the kernel itself has not gotten slower,
>>>> even if the kernel is not currently a major contributor to runtime.
>>>> We specifically split out the advection derivatives because they made
>>>> the code with 8th order and Jacobians a fair bit slower. I would just
>>>> like to see that this is not still the case with the new version,
>>>> which has changed the way this is handled.
>>>>
>>>> I have now run my benchmarks on both the original and the rewritten
>>>> McLachlan. I seem to find that the ML_BSSN_* functions in
>>>> Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns, excluding the
>>>> constraint calculations, are between 11% and 15% slower with the
>>>> rewrite branch, depending on the details of the evolution. See
>>>> attached plot. This is on Datura, with quite old CPUs (Intel Xeon
>>>> X5650, 2.67 GHz).
>>>
>>> What exactly do you measure -- which bins or routines? Does this
>>> involve communication? Are you using thorn Dissipation?
>>>
>>> I take all the timers in Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns
>>> that start with ML_BSSN_ and eliminate the ones containing
>>> "constraints" (case-insensitive). This is running on two processes,
>>> one node, 6 threads per node. Threads are correctly bound to cores.
>>> There is ghostzone exchange between the processes, so yes, there is
>>> communication in the ML_BSSN_SelectBCs SYNC calls, but it is
>>> node-local.
>>
>> Can you include thorn Dissipation in the "before" case, and use
>> McLachlan's dissipation in the "after" case?
>>
>> There is no dissipation in either case.
>>
>> The output data is in
>>
>> http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/orig/20150724-174334
>> http://git.barrywardell.net/?p=McLachlanBenchmarks.git;h=refs/runs/rewrite/20150724-170542
>>
>> including the parameter files.
>>
>> Actually, what I said before was wrong; the timers I am using are under
>> "thorns", not "syncs", so even the node-local communication should not
>> be counted.
>
> McLachlan has not been optimized for runs without dissipation.
> If you think this is important, then we can introduce a special case. I
> expect this to improve performance. However, running BSSN without
> dissipation is not what one would do in production, so I didn't
> investigate this case.
>
> I agree that runs without dissipation are not relevant, but since I
> usually use the Dissipation thorn, I didn't include it in the benchmark,
> which was a benchmark of McLachlan. I assume that McLachlan now always
> calculates the dissipation term, even when it is zero, and that is what
> you mean by "not optimised"? This will introduce a performance
> regression for any simulation which uses dissipation from the
> Dissipation thorn (if this is the reason for the increased benchmark
> time, then presumably only on the level of ~15% for the kernel, hence
> less for a whole simulation). Since McLachlan's dissipation was
> previously very slow, this is presumably what most existing parameter
> files use.
>
> Regarding switching to use McLachlan for dissipation: McLachlan's
> dissipation is a bit more limited than the Dissipation thorn; it looks
> like McLachlan is hard-coded to use dissipation of order 1+fdOrder,
> rather than the dissipation order being chosen separately. Sometimes
> lower orders are used as an optimisation (the effect on convergence
> being judged to be minimal). And, critically, there is no way to
> specify different dissipation orders on different refinement levels.
> This is typically used in production binary simulations.

In other words, you are asking for a version of ML_BSSN where it is
efficient to not use dissipation. Currently, that means that dissipation
is disabled. The question is -- should this be the default?

> Do you think it is faster to use dissipation from McLachlan than to use
> that provided by Dissipation?

Yes, I think so.

-erik

--
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
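[Editor's note: Ian's timer selection described above ("all timers under
Evolve/CallEvol/CCTK_EVOL/CallFunction/thorns that start with ML_BSSN_,
excluding those containing 'constraints', case-insensitive") can be
sketched as follows. This is a hypothetical illustration, not part of the
benchmark scripts; the timer names and values below are invented and are
not taken from the linked benchmark output.]

```python
# Sketch of the timer-filtering rule described in the thread.
# Input: a dict mapping timer names (as they would appear under the
# .../CallFunction/thorns timer tree) to accumulated times in seconds.
# The names and numbers are illustrative only.
timers = {
    "ML_BSSN_RHS1": 1.23,
    "ML_BSSN_RHS2": 1.17,
    "ML_BSSN_EvolutionInterior": 2.05,
    "ML_BSSN_ConstraintsInterior": 0.80,   # excluded: constraints
    "ML_ADMQuantities": 0.10,              # excluded: not ML_BSSN_*
}

def select_evolution_timers(timers):
    """Keep ML_BSSN_* timers, dropping constraint calculations
    (case-insensitive match on 'constraints')."""
    return {
        name: t
        for name, t in timers.items()
        if name.startswith("ML_BSSN_") and "constraints" not in name.lower()
    }

selected = select_evolution_timers(timers)
total = sum(selected.values())
```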
_______________________________________________
Users mailing list
[email protected]
http://lists.einsteintoolkit.org/mailman/listinfo/users
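[Editor's note: for readers unfamiliar with the "dissipation of order
1+fdOrder" convention mentioned above, here is a minimal 1D sketch of a
Kreiss-Oliger dissipation operator illustrating that relation. It is not
McLachlan's implementation; the function name and interface are invented,
and it assumes a uniform grid spacing h with boundary points left at zero.]

```python
from math import comb

def ko_dissipation(u, h, eps, fd_order):
    """Kreiss-Oliger dissipation term at interior points of a 1D grid
    function u with uniform spacing h.

    A scheme of finite-differencing order fd_order pairs naturally with
    dissipation of order p = fd_order + 1 (odd). With m = (p+1)/2, the
    term is  eps * (-1)**(m+1) / (2**(p+1) * h)  times the centered
    undivided (p+1)-th difference of u.
    """
    p = fd_order + 1          # dissipation order, e.g. 5 for fd_order = 4
    m = (p + 1) // 2          # stencil half-width
    coeff = eps * (-1) ** (m + 1) / (2 ** (p + 1) * h)
    out = [0.0] * len(u)
    for i in range(m, len(u) - m):
        # Centered (p+1)-th difference: sum_k (-1)^k C(p+1, k) u[i-m+k]
        diff = sum((-1) ** k * comb(p + 1, k) * u[i - m + k]
                   for k in range(p + 2))
        out[i] = coeff * diff
    return out
```

For fd_order = 4 this reproduces the familiar 5th-order operator with
stencil (1, -6, 15, -20, 15, -6, 1)/64; the operator annihilates any
polynomial of degree at most p, so smooth fields are barely affected while
high-frequency noise is damped.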
