On Tue, May 05, 2026 at 05:09:34PM +0000, Dmitry Ilvokhin wrote:
> Use the arch-overridable queued_spin_release(), introduced in the
> previous commit, to ensure the tracepoint works correctly across all
> architectures, including those with custom unlock implementations (e.g.
> x86 paravirt).
>
> When the tracepoint is disabled, the only addition to the hot path is a
> single NOP instruction (the static branch). When enabled, the contention
> check, trace call, and unlock are combined in an out-of-line function to
> minimize hot path impact, avoiding the compiler needing to preserve the
> lock pointer in a callee-saved register across the trace call.
>
> Binary size impact (x86_64, defconfig):
>   uninlined unlock (common case):   +680 bytes (+0.00%)
>   inlined unlock (worst case):    +83659 bytes (+0.21%)
>
> The inlined unlock case could not be achieved through Kconfig options on
> x86_64 as PREEMPT_BUILD unconditionally selects UNINLINE_SPIN_UNLOCK on
> x86_64. The UNINLINE_SPIN_UNLOCK guards were manually inverted to force
> inline the unlock path and estimate the worst case binary size increase.
>
> In practice, configurations with UNINLINE_SPIN_UNLOCK=n have already
> opted against binary size optimization, so the inlined worst case is
> unlikely to be a concern.
This is not quite accurate. You add the (5 byte) NOP for the static
branch, but then you also add another 5 bytes for the CALL and at least
another 2 bytes (possibly 5) for the JMP back into the previous
instruction stream. That is 12-15 bytes added to what was a single MOV
instruction. That is quite ludicrous.

I also disagree that UNINLINE_SPIN_UNLOCK=n opts against binary size;
on x86 the inlined unlock is *smaller* than a function call.

I really don't see how this is worth it.
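For reference, the shape being discussed is roughly the following
userspace sketch. This is not the kernel code: trace_enabled stands in
for the static key (the branch the kernel would patch between NOP and
JMP), trace_unlock_slowpath stands in for the proposed out-of-line
contention-check + trace + unlock function, and an atomic char models
the qspinlock's locked byte. The point is that the fast path's single
release store grows a guarded branch plus an out-of-line call:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the static key; in the kernel this would be a patched
 * branch site (NOP when disabled), not a load-and-test. */
static bool trace_enabled;

/* Out-of-line slow path: in the proposal this would do the contention
 * check and emit the tracepoint before releasing the lock, so the hot
 * path need not keep the lock pointer live across the trace call. */
static void __attribute__((noinline)) trace_unlock_slowpath(atomic_char *lock)
{
	/* ... tracepoint emission would go here ... */
	atomic_store_explicit(lock, 0, memory_order_release);
}

static inline void queued_spin_unlock(atomic_char *lock)
{
	if (__builtin_expect(trace_enabled, 0)) {
		trace_unlock_slowpath(lock);	/* CALL + JMP back */
		return;
	}
	/* Baseline unlock: a single release store (one MOV on x86). */
	atomic_store_explicit(lock, 0, memory_order_release);
}
```

Either way the lock word ends up released; the dispute above is purely
about the extra bytes the guarded path adds to every inlined unlock site.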
