[Python-Dev] Fwd: Python 3.11 performance with frame pointers

2023-01-04 Thread Daan De Meyer
Hi,

As part of the proposal to enable frame pointers by default in Fedora
(https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer), we
did some benchmarking to figure out the expected performance impact.
The performance impact was generally minimal, except for the
pyperformance benchmark suite where we noticed a more substantial
difference between a system built with frame pointers and a system
built without frame pointers. The results can be found here:
https://github.com/DaanDeMeyer/fpbench (look at the mean difference
column for the pyperformance results where the percentage is the
slowdown compared to a system built without frame pointers). One of
the biggest slowdowns was on the scimark_sparse_mat_mult benchmark,
which slowed down by 9.5% when the system (including Python) was built
with frame pointers. Note that these benchmarks were run against
Python 3.11 on Fedora 37 x86_64 systems (one built with frame
pointers, one built without). The benchmarks were run on an Amazon
EC2 machine.
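
For background, -fno-omit-frame-pointer mainly does two things to the
generated code: every function gets a prologue that saves and sets up %rbp,
and %rbp is no longer available as a general-purpose register. A minimal,
hypothetical C example (not from CPython or the Fedora change page) to see
this locally:

    /* fp_demo.c - compile both ways and compare the generated assembly:
     *   gcc -O2 -fomit-frame-pointer    -S fp_demo.c -o omit.s
     *   gcc -O2 -fno-omit-frame-pointer -S fp_demo.c -o keep.s
     * With frame pointers enabled the prologue typically becomes
     *   push %rbp; mov %rsp,%rbp
     * and %rbp can no longer be used to hold a local variable. */
    long sum(const long *a, long n)
    {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }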

We did look a bit into the reasons behind this slowdown. I'll quote
the investigation by Andrii on the Fesco issue thread here
(https://pagure.io/fesco/issue/2817):

> So I did look a bit at Python with and without frame pointers trying to
> understand pyperformance regressions.

> First, perf data suggests that a big chunk of CPU time is spent in
> _PyEval_EvalFrameDefault, so I looked specifically into it (also, we had to
> use DWARF mode for perf for an apples-to-apples comparison, and a bunch of
> stack traces weren't symbolized properly, which again reminds us why having
> frame pointers is important).

> perf annotation of _PyEval_EvalFrameDefault didn't show any obvious hot
> spots; the work seemed to be distributed pretty similarly with or without
> frame pointers. Scrolling through the _PyEval_EvalFrameDefault disassembly
> also showed that the instruction patterns of the fp and no-fp versions are
> very similar.

> But there were a few interesting observations.

> The size of the _PyEval_EvalFrameDefault function specifically (none of the
> other functions changed much in that regard) increased very significantly,
> from 46104 to 53592 bytes, a considerable increase of roughly 16%. Looking
> deeper, I believe it's all due to more stack spills and reloads, because one
> fewer register is available to keep local variables in registers instead of
> on the stack.
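
As a rough illustration of that effect, here is a contrived sketch (not
CPython code): with many values live at the same time, reserving %rbp for the
frame pointer pushes one more of them out to the stack, so the compiler has
to emit extra spill/reload mov instructions.

    /* spill_demo.c - hypothetical example; compare
     *   gcc -O2 -S spill_demo.c
     *   gcc -O2 -fno-omit-frame-pointer -S spill_demo.c
     * and count the mov instructions that touch (%rsp)/(%rbp) offsets. */
    long many_locals(const long *p)
    {
        /* 16 values kept live across the loop, plus the loop counter, exceed
         * the usable x86-64 general-purpose registers; reserving %rbp for the
         * frame pointer makes the shortage one register worse. */
        long a = p[0], b = p[1], c = p[2], d = p[3];
        long e = p[4], f = p[5], g = p[6], h = p[7];
        long i = p[8], j = p[9], k = p[10], l = p[11];
        long m = p[12], n = p[13], o = p[14], q = p[15];
        for (long it = 0; it < 1000; it++) {
            a += b; b += c; c += d; d += e;
            e += f; f += g; g += h; h += i;
            i += j; j += k; k += l; l += m;
            m += n; n += o; o += q; q += a;
        }
        return a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + q;
    }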

> Looking at the _PyEval_EvalFrameDefault C code, it is one humongous function
> with a gigantic switch statement that implements the Python instruction
> handling logic. So the function itself is big, and it has a lot of local
> state in different branches, which to me explains why there is so much stack
> spilling and reloading.
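
For readers who haven't looked at that code: here is a heavily simplified
sketch of the shape being described (a toy example, not the actual CPython
source). Each case keeps its own temporaries live, which is where the
register pressure comes from.

    /* Toy bytecode interpreter in the style of _PyEval_EvalFrameDefault;
     * the real function has hundreds of cases and far more per-case state. */
    typedef struct { int opcode; long arg; } instr;

    long run(const instr *code, long *stack)
    {
        long *sp = stack;                  /* value stack pointer */
        for (const instr *ip = code; ; ip++) {
            switch (ip->opcode) {
            case 0:                        /* push a constant */
                *sp++ = ip->arg;
                break;
            case 1: {                      /* binary add */
                long rhs = *--sp;          /* per-case temporaries like   */
                long lhs = *--sp;          /* these add register pressure */
                *sp++ = lhs + rhs;
                break;
            }
            case 2:                        /* return top of stack */
                return *--sp;
            /* ... many more cases ... */
            }
        }
    }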

> Grepping for instructions of the form mov -0xf0(%rbp),%rcx or
> mov 0x50(%rsp),%r10 (and their reverse variants), I see that there is a
> substantial amount of stack spill/reload traffic in the
> _PyEval_EvalFrameDefault disassembly already in the default no-frame-pointer
> variant (1870 out of 11181 total instructions in that function, 16.7%), and
> it increases further in the frame-pointer version (2341 out of 11733
> instructions, 20%).

> One more interesting observation: with no frame pointers, GCC generates
> stack accesses using %rsp with small positive offsets, which results in a
> pretty compact instruction encoding, e.g.:

> 0x001cce40 <+44160>: 4c 8b 54 24 50          mov    0x50(%rsp),%r10

> This uses 5 bytes. But if frame pointers are enabled, GCC switches to using
> %rbp-relative offsets, which are all negative, and that seems to result in
> much bigger instructions, now taking 7 bytes instead of 5:

> 0x001d3969 <+53065>: 48 8b 8d 10 ff ff ff    mov    -0xf0(%rbp),%rcx

> I found it pretty interesting. I'd imagine GCC should be able to keep using
> %rsp-relative addressing just fine regardless of %rbp and save on
> instruction sizes, but apparently it doesn't; I'm not sure why. But this
> instruction-size increase, coupled with the increase in the number of
> spills/reloads, actually explains the huge increase in the byte size of
> _PyEval_EvalFrameDefault: (2341 - 1870) * 7 + 1870 * 2 = 7037 (2 extra bytes
> for each of the existing 1870 instructions that were switched from %rsp plus
> a positive offset to %rbp plus a negative offset, plus 7 bytes for each of
> the 471 new instructions). I'm no compiler expert, but it would be nice for
> someone from the GCC community to check this as well (please CC relevant
> folks, if you know them).
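
(A possible explanation for the 5-byte vs 7-byte difference, with the caveat
that we haven't confirmed this with anyone on the GCC side: the extra size
seems to come from the displacement width rather than from %rbp itself.

    4c 8b 54 24 50           mov    0x50(%rsp),%r10
        REX + opcode + ModRM + SIB + 1-byte displacement
        (0x50 fits in a signed byte)
    48 8b 8d 10 ff ff ff     mov    -0xf0(%rbp),%rcx
        REX + opcode + ModRM + 4-byte displacement
        (-0xf0 is outside the signed-byte range)

%rsp-based addressing always needs the extra SIB byte, but as long as the
offset fits in a signed byte the instruction still encodes in 5 bytes.)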

> In summary, to put it bluntly, there is just more work for the CPU to do
> saving and restoring state to and from the stack. But I don't think the
> _PyEval_EvalFrameDefault example is typical of how application code is
> written, nor is it, generally speaking, a good idea to do so much within a
> single gigantic function. So I believe it's more of an outlier than a
> typical case.

We have a few questions:
- Is this slowdown expected when Python is built with frame
pointers? Has the Pytho

[Python-Dev] Re: Fwd: Python 3.11 performance with frame pointers

2023-01-04 Thread Gregory P. Smith
I suggest re-posting this on discuss.python.org, as more of the engaged,
active core devs will pay attention to it there.
