[Python-Dev] Fwd: Python 3.11 performance with frame pointers
Hi,

As part of the proposal to enable frame pointers by default in Fedora
(https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer), we did some
benchmarking to figure out the expected performance impact. The impact was generally
minimal, except for the pyperformance benchmark suite, where we noticed a more
substantial difference between a system built with frame pointers and one built
without. The results can be found at https://github.com/DaanDeMeyer/fpbench (look at
the mean difference column for the pyperformance results; the percentage is the
slowdown compared to the system built without frame pointers). One of the biggest
slowdowns was on the scimark_sparse_mat_mult benchmark, which slowed down by 9.5%
when the system (including Python) was built with frame pointers. Note that these
benchmarks were run against Python 3.11 on Fedora 37 x86_64 systems (one built with
frame pointers, another built without). The system used to run the benchmarks was an
Amazon EC2 machine.

We did look a bit into the reasons behind this slowdown. I'll quote Andrii's
investigation from the Fesco issue thread (https://pagure.io/fesco/issue/2817):

> So I did look a bit at Python with and without frame pointers, trying to understand
> the pyperformance regressions.
>
> First, perf data suggests that a big chunk of CPU time is spent in
> _PyEval_EvalFrameDefault, so I looked specifically into it (we also had to use DWARF
> mode for perf for an apples-to-apples comparison, and a bunch of stack traces
> weren't symbolized properly, which is another reminder of why having frame pointers
> is important).
>
> perf annotation of _PyEval_EvalFrameDefault didn't show any obvious hot spots; the
> work seemed to be distributed pretty similarly with or without frame pointers.
> Scrolling through the _PyEval_EvalFrameDefault disassembly also showed that the
> instruction patterns in the fp and no-fp versions are very similar.
>
> But there are a few interesting observations.
>
> The size of the _PyEval_EvalFrameDefault function specifically (all the other
> functions didn't change much in that regard) increased very significantly, from
> 46104 to 53592 bytes, a considerable 15% increase. Looking deeper, I believe it's
> all due to more stack spills and reloads, caused by having one less register
> available to keep local variables in registers instead of on the stack.
>
> Looking at the _PyEval_EvalFrameDefault C code, it is one humongous function with a
> gigantic switch statement that implements the Python instruction handling logic. So
> the function itself is big and it has a lot of local state in its different
> branches, which to me explains why there is so much stack spill/reload.
>
> Grepping for instructions of the form mov -0xf0(%rbp),%rcx or mov 0x50(%rsp),%r10
> (and their reverse variants), I see that there is already a substantial amount of
> stack spill/reload in the _PyEval_EvalFrameDefault disassembly in the default
> no-frame-pointer variant (1870 out of 11181 total instructions in that function,
> 16.7%), and it increases further in the frame pointer version (2341 out of 11733
> instructions, 20%).
>
> One more interesting observation. With no frame pointers, GCC generates stack
> accesses using %rsp with small positive offsets, which results in a pretty compact
> instruction encoding, e.g.:
>
>   0x001cce40 <+44160>: 4c 8b 54 24 50          mov    0x50(%rsp),%r10
>
> This uses 5 bytes. But if frame pointers are enabled, GCC switches to using
> %rbp-relative offsets, which are all negative, and that results in much bigger
> instructions, now taking 7 bytes instead of 5:
>
>   0x001d3969 <+53065>: 48 8b 8d 10 ff ff ff    mov    -0xf0(%rbp),%rcx
>
> I found this pretty interesting. I'd imagine GCC should be able to keep using
> %rsp-relative addressing regardless of %rbp and save on instruction size, but
> apparently it doesn't; I'm not sure why. This increase in instruction size, coupled
> with the increase in the number of spills/reloads, actually accounts for the huge
> increase in the byte size of _PyEval_EvalFrameDefault:
> (2341 - 1870) * 7 + 1870 * 2 = 7037 (2 extra bytes for each of the existing 1870
> instructions that switched from %rsp + positive offset to %rbp + negative offset,
> plus 7 bytes for each of the 471 new instructions). I'm no compiler expert, but it
> would be nice for someone from the GCC community to check this as well (please CC
> relevant folks if you know them).
>
> In summary, to put it bluntly, there is just more work for the CPU saving and
> restoring state to/from the stack. But I don't think _PyEval_EvalFrameDefault is
> typical of how application code is written, nor is it, generally speaking, a good
> idea to do so much within a single gigantic function. So I believe it's more of an
> outlier than a typical case.
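To make the shape of the code Andrii is describing concrete, here is a minimal,
self-contained sketch of a switch-based interpreter loop. This is an illustration
only; the opcodes and structure are invented, not CPython's actual
_PyEval_EvalFrameDefault. The point is that a single function with a big switch and
several values that stay live across iterations is exactly the kind of code where
losing %rbp to the frame pointer tends to show up as extra spills and reloads.

#include <stdio.h>

/* Toy bytecode: each instruction is one int opcode, optionally followed by
 * one int operand.  This mimics (very loosely) the shape of a big interpreter
 * loop: one function, one switch, lots of state live across iterations. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_DUP, OP_PRINT, OP_HALT };

static long run(const int *code)
{
    long stack[64];
    int sp = 0;          /* toy VM stack pointer                      */
    int pc = 0;          /* program counter                           */
    long acc = 0;        /* extra locals that stay live across        */
    long steps = 0;      /* iterations, loosely like the cached state */
    long last = 0;       /* the real interpreter keeps around         */

    for (;;) {
        int op = code[pc++];
        steps++;
        switch (op) {
        case OP_PUSH:
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_DUP:
            stack[sp] = stack[sp - 1];
            sp++;
            break;
        case OP_PRINT:
            last = stack[sp - 1];
            acc += last;
            printf("%ld\n", last);
            break;
        case OP_HALT:
            fprintf(stderr, "executed %ld instructions, acc=%ld\n", steps, acc);
            return last;
        default:
            return -1;
        }
    }
}

int main(void)
{
    /* (2 + 3) * 4, printed, then halt. */
    const int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD,
                         OP_PUSH, 4, OP_MUL, OP_PRINT, OP_HALT };
    return run(prog) == 20 ? 0 : 1;
}

CPython's real evaluation loop is of course orders of magnitude larger than this,
with far more live state per opcode, which is presumably why the effect is so visible
there.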
We have a few questions:

- Is this slowdown when Python is built with frame pointers to be expected? Has the Pytho
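Returning to the codegen observation in the quoted analysis: for anyone who wants to
reproduce the %rsp-versus-%rbp addressing difference locally, a small register-hungry
function like the sketch below is one way to do it. The compile commands in the
comment are a suggestion, not something taken from the original benchmarks, and the
exact spill counts and addressing modes will depend on the GCC version, target, and
optimisation level.

/* spill_demo.c
 *
 * A deliberately register-hungry loop, as a rough stand-in for a function with a
 * lot of live local state.  One possible way to compare the two code generation
 * modes (adjust for your toolchain):
 *
 *   gcc -O2 -fomit-frame-pointer    -S -o fp_off.s spill_demo.c
 *   gcc -O2 -fno-omit-frame-pointer -S -o fp_on.s  spill_demo.c
 *   grep -c '(%rsp)' fp_off.s
 *   grep -c '(%rbp)' fp_on.s
 *
 * With frame pointers enabled, %rbp is no longer available to hold one of the
 * accumulators, so on x86-64 one would typically expect a few more stack accesses,
 * and %rbp-relative ones at that.
 */
#include <stdio.h>

long spill_demo(const long *a, long n)
{
    /* More live accumulators than x86-64 has general-purpose registers. */
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    long s8 = 0, s9 = 0, s10 = 0, s11 = 0, s12 = 0, s13 = 0, s14 = 0, s15 = 0;
    long s16 = 0, s17 = 0;

    for (long i = 0; i + 18 <= n; i += 18) {
        s0  += a[i];      s1  += a[i + 1];  s2  += a[i + 2];
        s3  += a[i + 3];  s4  += a[i + 4];  s5  += a[i + 5];
        s6  += a[i + 6];  s7  += a[i + 7];  s8  += a[i + 8];
        s9  += a[i + 9];  s10 += a[i + 10]; s11 += a[i + 11];
        s12 += a[i + 12]; s13 += a[i + 13]; s14 += a[i + 14];
        s15 += a[i + 15]; s16 += a[i + 16]; s17 += a[i + 17];
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8 + s9 +
           s10 + s11 + s12 + s13 + s14 + s15 + s16 + s17;
}

int main(void)
{
    long data[36];
    for (int i = 0; i < 36; i++)
        data[i] = i;
    printf("%ld\n", spill_demo(data, 36));  /* 0 + 1 + ... + 35 = 630 */
    return 0;
}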
[Python-Dev] Re: Fwd: Python 3.11 performance with frame pointers
I suggest re-posting this on discuss.python.org, as more engaged, active core devs
will pay attention to it there.

On Wed, Jan 4, 2023 at 11:12 AM Daan De Meyer wrote:
> Hi,
>
> As part of the proposal to enable frame pointers by default in Fedora
> (https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer), we
> did some benchmarking to figure out the expected performance impact.
> [...]
