On 4/22/2025 2:57 AM, Jürgen Groß wrote:
On 22.04.25 10:22, Xin Li (Intel) wrote:
The story started from tglx's reply in [1]:
For actual performance relevant code the current PV ops mechanics
are a horrorshow when the op defaults to the native instruction.
look at wrmsrl():
wrmsrl(msr, val
wrmsr(msr, (u32)val, (u32)val >> 32))
paravirt_write_msr(msr, low, high)
PVOP_VCALL3(cpu.write_msr, msr, low, high)
Which results in
mov $msr, %edi
mov $val, %rdx
mov %edx, %esi
shr $0x20, %rdx
call native_write_msr
and native_write_msr() does at minimum:
mov %edi,%ecx
mov %esi,%eax
wrmsr
ret
In the worst case 'ret' is going through the return thunk. Not to
talk about function prologues and whatever.
This becomes even more silly for trivial instructions like STI/CLI
or in the worst case paravirt_nop().
This is nonsense.
In the non-Xen case the initial indirect call is directly replaced with
STI/CLI via alternative patching, while for Xen it is replaced by a direct
call.
The paravirt_nop() case is handled in alt_replace_call() by replacing the
indirect call with a nop in case the target of the call was paravirt_nop()
(which is in fact no_func()).
The call makes only sense, when the native default is an actual
function, but for the trivial cases it's a blatant engineering
trainwreck.
The trivial cases are all handled as stated above: a direct replacement
instruction is placed at the indirect call position.
The above comment was given in 2023 IIRC, and you have addressed it.
Later a consensus was reached to utilize the alternatives mechanism to
eliminate the indirect call overhead introduced by the pv_ops APIs:
1) When built with !CONFIG_XEN_PV, X86_FEATURE_XENPV becomes a
disabled feature, preventing the Xen code from being built
and ensuring the native code is executed unconditionally.
This is the case today already. There is no need for any change to have
this in place.
2) When built with CONFIG_XEN_PV:
2.1) If not running on the Xen hypervisor (!X86_FEATURE_XENPV),
the kernel runtime binary is patched to unconditionally
jump to the native MSR write code.
2.2) If running on the Xen hypervisor (X86_FEATURE_XENPV), the
kernel runtime binary is patched to unconditionally jump
to the Xen MSR write code.
I can't see what is different here compared to today's state.
The alternatives mechanism is also used to choose the new immediate
form MSR write instruction when it's available.
Yes, this needs to be added.
Consequently, remove the pv_ops MSR write APIs and the Xen callbacks.
I still don't see a major difference to today's solution.
The existing code generates:
...
bf e0 06 00 00 mov $0x6e0,%edi
89 d6 mov %edx,%esi
48 c1 ea 20 shr $0x20,%rdx
ff 15 07 48 8c 01 call *0x18c4807(%rip) # <pv_ops+0xb8>
31 c0 xor %eax,%eax
...
And on native, the indirect call instruction is patched to a direct call
as you mentioned:
...
bf e0 06 00 00 mov $0x6e0,%edi
89 d6 mov %edx,%esi
48 c1 ea 20 shr $0x20,%rdx
e8 60 3e 01 00 call <{native,xen}_write_msr> # direct
90 nop
31 c0 xor %eax,%eax
...
This patch set generates assembly w/o CALL on native:
...
e9 e6 22 c6 01 jmp 1f # on native or nop on Xen
b9 e0 06 00 00 mov $0x6e0,%ecx
e8 91 d4 fa ff call ffffffff8134ee80 <asm_xen_write_msr>
e9 a4 9f eb 00 jmp ffffffff8225b9a0 <__x86_return_thunk>
...
1: b9 e0 06 00 00 mov $0x6e0,%ecx # immediate form here
48 89 c2 mov %rax,%rdx
48 c1 ea 20 shr $0x20,%rdx
3e 0f 30 ds wrmsr
...
It's not a major change, but when it is patched to use the immediate
form MSR write instruction, it's straightforwardly streamlined.
Only the "paravirt" term has been eliminated.
Yes.
But a PV guest doesn't operate at the highest privilege level, which
means MSR instructions typically result in a #GP fault. I actually
think the pv_ops MSR APIs are unnecessary because of this inherent
limitation.
Looking at the Xen MSR code, except PMU and just a few MSRs, it falls
back to executes native MSR instructions. As MSR instructions trigger
#GP, Xen takes control and handles them in 2 ways:
1) emulate (or ignore) a MSR operation and skip the guest instruction.
2) inject the #GP back to guest OS and let its #GP handler handle it.
But Linux MSR exception handler just ignores the MSR instruction
(MCE MSR exception will panic).
So why not let Xen handle all the details which it already tries to do?
(Linux w/ such a change may not be able to run on old Xen hypervisors.)
BTW, if performance is a concern, writes to MSR_KERNEL_GS_BASE and
MSR_GS_BASE anyway are hpyercalls into Xen.
Thanks!
Xin