Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-06-03 Thread Sean Christopherson
On Mon, Jun 03, 2024, Mickaël Salaün wrote:
> On Wed, May 15, 2024 at 01:32:24PM -0700, Sean Christopherson wrote:
> > On Tue, May 14, 2024, Mickaël Salaün wrote:
> > > On Fri, May 10, 2024 at 10:07:00AM +, Nicolas Saenz Julienne wrote:
> > > > Development happens
> > > > https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> > > > branch, but I'd advise against looking into it until we add some order
> > > > to the rework. Regardless, feel free to get in touch.
> > > 
> > > Thanks for the update.
> > > 
> > > Could we schedule a PUCK meeting to synchronize and help each other?
> > > What about June 12?
> > 
> > June 12th works on my end.
> 
> Can you please send an invite?

I think this is all the info?

Time:  6am PDT
Video: https://meet.google.com/vdb-aeqo-knk
Phone: https://tel.meet/vdb-aeqo-knk?pin=3003112178656

Calendar: 
https://calendar.google.com/calendar/u/0?cid=Y182MWE1YjFmNjQ0NzM5YmY1YmVkN2U1ZWE1ZmMzNjY5Y2UzMmEyNTQ0YzVkYjFjN2M4OTE3MDJjYTUwOTBjN2Q1QGdyb3VwLmNhbGVuZGFyLmdvb2dsZS5jb20
Drive:
https://drive.google.com/drive/folders/1aTqCrvTsQI9T4qLhhLs_l986SngGlhPH?resourcekey=0-FDy0ykM3RerZedI8R-zj4A=drive_link



Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-06-03 Thread Sean Christopherson
On Tue, May 14, 2024, Mickaël Salaün wrote:
> On Tue, May 07, 2024 at 09:16:06AM -0700, Sean Christopherson wrote:
> > On Tue, May 07, 2024, Mickaël Salaün wrote:
> > > If yes, that would indeed require a *lot* of work for something we're not
> > > sure will be accepted later on.
> > 
> > Yes and no.  The AWS folks are pursuing VSM support in KVM+QEMU, and SVSM
> > support is trending toward the paired VM+vCPU model.  IMO, it's entirely
> > feasible to design KVM support such that much of the development load can be
> > shared between the projects.  And having 2+ use cases for a feature (set)
> > makes it _much_ more likely that the feature(s) will be accepted.
> > 
> > And similar to what Paolo said regarding HEKI not having a complete story, I
> > don't see a clear line of sight for landing host-defined policy enforcement,
> > as there are many open, non-trivial questions that need answers.  I.e.
> > upstreaming HEKI in its current form is also far from a done deal, and isn't
> > guaranteed to be substantially less work when all is said and done.
> 
> I'm not sure I understand the "Heki not having a complete story" point.  The
> goal is the same as the current kernel self-protection mechanisms.

HEKI doesn't have a complete story for how it's going to play nice with kexec(),
emulated RESET, etc.  The kernel's existing self-protection mechanisms Just Work
because the protections are automatically disabled/lost on such transitions.
There are obviously significant drawbacks to that behavior, but they are accepted
drawbacks, i.e. solving those problems isn't in scope (yet) for the kernel.  And
the "failure" mode is also loss of hardening, not an unusable guest.

In other words, the kernel's hardening is firmly best effort at this time,
whereas HEKI likely needs to be much more than "best effort" in order to justify
the extra complexity.  And that means having answers to the various
interoperability questions.



Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-15 Thread Sean Christopherson
On Tue, May 14, 2024, Mickaël Salaün wrote:
> On Fri, May 10, 2024 at 10:07:00AM +, Nicolas Saenz Julienne wrote:
> > Development happens
> > https://github.com/vianpl/{linux,qemu,kvm-unit-tests} and the vsm-next
> > branch, but I'd advise against looking into it until we add some order
> > to the rework. Regardless, feel free to get in touch.
> 
> Thanks for the update.
> 
> Could we schedule a PUCK meeting to synchronize and help each other?
> What about June 12?

June 12th works on my end.



Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-07 Thread Sean Christopherson
On Tue, May 07, 2024, Mickaël Salaün wrote:
> > > Actually, potential bad/crazy idea.  Why does the _host_ need to define
> > > policy?  Linux already knows what assets it wants to (un)protect and when.
> > > What's missing is a way for the guest kernel to effectively deprivilege
> > > and re-authenticate itself as needed.  We've been tossing around the idea
> > > of paired VMs+vCPUs to support VTLs and SEV's VMPLs, what if we
> > > usurped/piggybacked those ideas, with a bit of pKVM mixed in?
> > > 
> > > Borrowing VTL terminology, where VTL0 is the least privileged, userspace
> > > launches the VM at VTL0.  At some point, the guest triggers the
> > > deprivileging sequence and userspace creates VTL1.  Userspace also
> > > provides a way for VTL0 to restrict access to its memory, e.g. to
> > > effectively make the page tables for the kernel's direct map writable only
> > > from VTL1, to make kernel text RO (or XO), etc.  And VTL0 could then also
> > > completely remove its access to code that changes CR0/CR4.
> > > 
> > > It would obviously require a _lot_ more upfront work, e.g. to isolate the
> > > kernel text that modifies CR0/CR4 so that it can be removed from VTL0, but
> > > that should be doable with annotations, e.g. tag relevant functions with
> > > __magic or whatever, throw them in a dedicated section, and then
> > > free/protect the section(s) at the appropriate time.
> > > 
> > > KVM would likely need to provide the ability to switch VTLs (or whatever
> > > they get called), and host userspace would need to provide a decent amount
> > > of the backend mechanisms and "core" policies, e.g. to manage VTL0 memory,
> > > teardown (turn off?) VTL1 on kexec(), etc.  But everything else could live
> > > in the guest kernel itself.  E.g. to have CR pinning play nice with
> > > kexec(), toss the relevant kexec() code into VTL1.  That way VTL1 can
> > > verify the kexec() target and tear itself down before jumping into the new
> > > kernel.
> > > 
> > > This is very off the cuff and hand-wavy, e.g. I don't have much of an idea
> > > what it would take to harden kernel text patching, but keeping the policy
> > > in the guest seems like it'd make everything more tractable than trying to
> > > define an ABI between Linux and a VMM that is rich and flexible enough to
> > > support all the fancy things Linux does (and will do in the future).
> 
> Yes, we agree that the guest needs to manage its own policy.  That's why
> we implemented Heki for KVM this way, but without VTLs because KVM
> doesn't support them.
> 
> To sum up, is the VTL approach the only one that would be acceptable for
> KVM?  

Heh, that's not a question you want to be asking.  You're effectively asking me
to make an authoritative, "final" decision on a topic which I am only passingly
familiar with.

But since you asked it... :-)  Probably?

I see a lot of advantages to a VTL/VSM-like approach:

 1. Provides Linux-as-a-guest the flexibility it needs to meaningfully advance
    its security, with the least amount of policy built into the guest/host ABI.

 2. Largely decouples guest policy from the host, i.e. should allow the guest to
    evolve/update its policy without needing to coordinate changes with the
    host.

 3. The KVM implementation can be generic enough to be reusable for other
    features.

 4. Other groups are already working on VTL-like support in KVM, e.g. for VSM
    itself, and potentially for VMPL/SVSM support.

IMO, #2 is a *huge* selling point.  Not having to coordinate changes across
multiple code bases and/or organizations and/or maintainers is a big win for
velocity, long term maintenance, and probably the very viability of HEKI.

Providing the guest with the tools to define and implement its own policy means
end users don't have to wait for some third party, e.g. CSPs, to deploy the
accompanying host-side changes, because there are no host-side changes.

And encapsulating everything in the guest drastically reduces the friction with
changes in the kernel that interact with hardening, both from a technical and a
social perspective.  I.e. giving the kernel (near) complete control over its
destiny minimizes the number of moving parts, and will be far, far easier to
sell to maintainers.  I would expect maintainers to react much more favorably
to being handed tools to harden the kernel, as opposed to being presented a set
of APIs that can be used to make the kernel compliant with _someone else's_
vision of what kernel hardening should look like.

E.g. imagine a new feature comes along that requires overriding CR0/CR4 pinning
in a way that doesn't fit into existing policy.  If the VMM is involved in
defining/enforcing the CR pinning policy, then supporting said new feature would
require new guest/host ABI and an updated host VMM in order to make the new
feature compatible with HEKI.  Inevitably, even if everything goes smoothly from
an upstreaming perspective, that will result in guests that have to choose 

Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-06 Thread Sean Christopherson
On Mon, May 06, 2024, Mickaël Salaün wrote:
> On Fri, May 03, 2024 at 07:03:21AM GMT, Sean Christopherson wrote:
> > > ---
> > > 
> > > Changes since v1:
> > > * New patch. Making user space aware of Heki properties was requested by
> > >   Sean Christopherson.
> > 
> > No, I suggested having userspace _control_ the pinning[*], not merely be
> > notified of pinning.
> > 
> >  : IMO, manipulation of protections, both for memory (this patch) and CPU
> >  : state (control registers in the next patch) should come from userspace.
> >  : I have no objection to KVM providing plumbing if necessary, but I think
> >  : userspace needs to have full control over the actual state.
> >  : 
> >  : One of the things that caused Intel's control register pinning series to
> >  : stall out was how to handle edge cases like kexec() and reboot.
> >  : Deferring to userspace means the kernel doesn't need to define policy,
> >  : e.g. when to unprotect memory, and avoids questions like "should
> >  : userspace be able to overwrite pinned control registers".
> >  : 
> >  : And like the confidential VM use case, keeping userspace in the loop is a
> >  : big benefit, e.g. the guest can't circumvent protections by coercing
> >  : userspace into writing to protected memory.
> > 
> > I stand by that suggestion, because I don't see a sane way to handle things
> > like kexec() and reboot without having a _much_ more sophisticated policy
> > than would ever be acceptable in KVM.
> > 
> > I think that can be done without KVM having any awareness of CR pinning
> > whatsoever.  E.g. userspace just needs the ability to intercept CR writes
> > and inject #GPs.  Off the cuff, I suspect the uAPI could look very similar
> > to MSR filtering.  E.g. I bet userspace could enforce MSR pinning without
> > any new KVM uAPI at all.
> > 
> > [*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com
> 
> OK, I had concern about the control not directly coming from the guest,
> especially in the case of pKVM and confidential computing, but I get you

Hardware-based CoCo is completely out of scope, because KVM has zero visibility
into the guest (well, SNP technically allows trapping CR0/CR4, but KVM really
shouldn't intercept CR0/CR4 for SNP guests).

And more importantly, _KVM_ doesn't define any policies for CoCo VMs.  KVM might
help enforce policies that are defined by hardware/firmware, but KVM doesn't
define any of its own.

If pKVM on x86 comes along, then KVM will likely get in the business of defining
policy, but until that happens, KVM needs to stay firmly out of the picture.

> point.  It should indeed be quite similar to the MSR filtering on the
> userspace side, except that we need another interface for the guest to
> request such change (i.e. self-protection).
> 
> Would it be OK to keep this new KVM_HC_LOCK_CR_UPDATE hypercall but
> forward the request to userspace with a VM exit instead?  That would
> also enable userspace to get the request and directly configure the CR
> pinning with the same VM exit.

No?  Maybe?  I strongly suspect that full support will need a richer set of APIs
than a single hypercall.  E.g. to handle kexec(), suspend+resume, emulated SMM,
and so on and so forth.  And that's just for CR pinning.

And hypercalls are hampered by the fact that VMCALL/VMMCALL don't allow for
delegation or restriction, i.e. there's no way for the guest to communicate to
the hypervisor that a less privileged component is allowed to perform some
action, nor is there a way for the guest to say some chunk of CPL0 code *isn't*
allowed to request a transition.  Delegation and restriction all have to be
done out-of-band.

It'd probably be more annoying to set up initially, but I think a synthetic
device with an MMIO-based interface would be more powerful and flexible in the
long run.
Then userspace can evolve without needing to wait for KVM to catch up.
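
For illustration only (nothing below exists; the device, its register layout,
and every HEKI_MMIO_* name are invented), a guest-side sketch of what such a
synthetic MMIO interface might look like.  A write to a GPA that isn't backed by
a memslot exits to the VMM as KVM_EXIT_MMIO, so the policy request reaches host
userspace without any new KVM uAPI:

  /* Hypothetical register offsets for a synthetic "heki" policy device. */
  #define HEKI_MMIO_LOCK_CR	0x00	/* write: bitmask of CRs to pin */
  #define HEKI_MMIO_LOCK_MEM	0x08	/* write: GPA of a range descriptor */

  static void __iomem *heki_regs;	/* ioremap() of the device's MMIO BAR */

  /* Ask host userspace to pin the given control registers. */
  static void heki_lock_crs(u64 cr_mask)
  {
  	writeq(cr_mask, heki_regs + HEKI_MMIO_LOCK_CR);
  }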

Actually, potential bad/crazy idea.  Why does the _host_ need to define policy?
Linux already knows what assets it wants to (un)protect and when.  What's
missing is a way for the guest kernel to effectively deprivilege and
re-authenticate itself as needed.  We've been tossing around the idea of paired
VMs+vCPUs to support VTLs and SEV's VMPLs, what if we usurped/piggybacked those
ideas, with a bit of pKVM mixed in?

Borrowing VTL terminology, where VTL0 is the least privileged, userspace
launches the VM at VTL0.  At some point, the guest triggers the deprivileging
sequence and userspace creates VTL1.  Userspace also provides a way for VTL0 to
restrict access to its memo

Re: [RFC PATCH v3 3/5] KVM: x86: Add notifications for Heki policy configuration and violation

2024-05-03 Thread Sean Christopherson
On Fri, May 03, 2024, Mickaël Salaün wrote:
> Add an interface for user space to be notified about guests' Heki policy
> and related violations.
> 
> Extend the KVM_ENABLE_CAP IOCTL with KVM_CAP_HEKI_CONFIGURE and
> KVM_CAP_HEKI_DENIAL. Each one takes a bitmask as first argument that can
> contain KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4. The
> returned value is the bitmask of known Heki exit reasons, for now:
> KVM_HEKI_EXIT_REASON_CR0 and KVM_HEKI_EXIT_REASON_CR4.
> 
> If KVM_CAP_HEKI_CONFIGURE is set, a VM exit will be triggered for each
> KVM_HC_LOCK_CR_UPDATE hypercall, according to the requested control
> register. This enables enlightening the VMM about the guest's
> auto-restrictions.
> 
> If KVM_CAP_HEKI_DENIAL is set, a VM exit will be triggered for each
> pinned CR violation. This enables the VMM to react to a policy
> violation.
> 
> Cc: Borislav Petkov 
> Cc: Dave Hansen 
> Cc: H. Peter Anvin 
> Cc: Ingo Molnar 
> Cc: Kees Cook 
> Cc: Madhavan T. Venkataraman 
> Cc: Paolo Bonzini 
> Cc: Sean Christopherson 
> Cc: Thomas Gleixner 
> Cc: Vitaly Kuznetsov 
> Cc: Wanpeng Li 
> Signed-off-by: Mickaël Salaün 
> Link: https://lore.kernel.org/r/20240503131910.307630-4-...@digikod.net
> ---
> 
> Changes since v1:
> * New patch. Making user space aware of Heki properties was requested by
>   Sean Christopherson.

No, I suggested having userspace _control_ the pinning[*], not merely be
notified of pinning.

 : IMO, manipulation of protections, both for memory (this patch) and CPU state
 : (control registers in the next patch) should come from userspace.  I have no
 : objection to KVM providing plumbing if necessary, but I think userspace
 : needs to have full control over the actual state.
 : 
 : One of the things that caused Intel's control register pinning series to
 : stall out was how to handle edge cases like kexec() and reboot.  Deferring
 : to userspace means the kernel doesn't need to define policy, e.g. when to
 : unprotect memory, and avoids questions like "should userspace be able to
 : overwrite pinned control registers".
 : 
 : And like the confidential VM use case, keeping userspace in the loop is a big
 : benefit, e.g. the guest can't circumvent protections by coercing userspace
 : into writing to protected memory.

I stand by that suggestion, because I don't see a sane way to handle things like
kexec() and reboot without having a _much_ more sophisticated policy than would
ever be acceptable in KVM.

I think that can be done without KVM having any awareness of CR pinning
whatsoever.  E.g. userspace just needs the ability to intercept CR writes and
inject #GPs.  Off the cuff, I suspect the uAPI could look very similar to MSR
filtering.  E.g. I bet userspace could enforce MSR pinning without any new KVM
uAPI at all.

[*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com
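
To make the MSR analogy concrete, a rough (unverified) sketch using the existing
KVM_X86_SET_MSR_FILTER uAPI is below.  Denied accesses either get a #GP injected
by KVM directly, or, if userspace enabled KVM_CAP_X86_USER_SPACE_MSR with
KVM_MSR_EXIT_REASON_FILTER, exit to the VMM which can then refuse the write.
MSR_EFER is just an example target and error handling is omitted:

  struct kvm_msr_filter_range range = {
  	.flags  = KVM_MSR_FILTER_WRITE,		/* filter writes only */
  	.base   = MSR_EFER,			/* example MSR to "pin" */
  	.nmsrs  = 1,
  	.bitmap = (__u8 []){ 0x0 },		/* bit clear == access denied */
  };
  struct kvm_msr_filter filter = {
  	.flags     = KVM_MSR_FILTER_DEFAULT_ALLOW,
  	.ranges[0] = range,
  };

  ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter);

  /* In the run loop, if userspace opted in to filter exits: */
  if (run->exit_reason == KVM_EXIT_X86_WRMSR) {
  	run->msr.error = 1;	/* KVM injects #GP instead of doing the write */
  }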



Re: [RFC PATCH v3 0/5] Hypervisor-Enforced Kernel Integrity - CR pinning

2024-05-03 Thread Sean Christopherson
On Fri, May 03, 2024, Mickaël Salaün wrote:
> Hi,
> 
> This patch series implements control-register (CR) pinning for KVM and
> provides an hypervisor-agnostic API to protect guests.  It includes the
> guest interface, the host interface, and the KVM implementation.
> 
> It's not ready for mainline yet (see the current limitations), but we
> think the overall design and interfaces are good and we'd like to have
> some feedback on that.

...

> # Current limitations
> 
> This patch series doesn't handle VM reboot, kexec, nor hibernate yet.
> We'd like to leverage the related feature from the KVM CR-pinning patch
> series [3].  Help appreciated!

Until you have a story for those scenarios, I don't expect you'll get a lot of
valuable feedback, or much feedback at all.  They were the hot topic for KVM CR
pinning, and they'll likely be the hot topic now.



Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

2023-10-19 Thread Sean Christopherson
On Thu, Oct 19, 2023, David Woodhouse wrote:
> On Thu, 2023-10-19 at 08:40 -0700, Sean Christopherson wrote:
> > > If for some 'historical reasons' we can't revoke features we can always
> > > introduce a new PV feature bit saying that TSC is preferred.
> 
> Don't we already have one? It's the PVCLOCK_TSC_STABLE_BIT. Why would a
> guest ever use kvmclock if the PVCLOCK_TSC_STABLE_BIT is set?
>
> The *point* in the kvmclock is that the hypervisor can mess with the
> epoch/scaling to try to compensate for TSC brokenness as the host
> scales/sleeps/etc.
> 
> And the *problem* with the kvmclock is that it does just that, even
> when the host TSC hasn't done anything wrong and the kvmclock shouldn't
> have changed at all.
> 
> If the PVCLOCK_TSC_STABLE_BIT is set, a guest should just use the guest
> TSC directly without looking to the kvmclock for adjusting it.
> 
> No?

No :-)

PVCLOCK_TSC_STABLE_BIT doesn't provide the guarantees that are needed to use the
raw TSC directly.  It's close, but there is at least one situation where using
the TSC directly even when the TSC is stable is a bad idea: when hardware
doesn't support TSC scaling and the guest virtual TSC is running at a higher
frequency than the hardware TSC.  The guest doesn't have to worry about the TSC
going backwards, but using the TSC directly would cause the guest's time
calculations to be inaccurate.

And PVCLOCK_TSC_STABLE_BIT is also much more dynamic as it's tied to a given
generation/sequence.  E.g. if KVM stops using its masterclock for whatever
reason, then kvm_guest_time_update() will effectively clear
PVCLOCK_TSC_STABLE_BIT and the guest-side __pvclock_clocksource_read() will be
forced to do a bit of extra work to ensure the clock is monotonically
increasing.
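
For reference, that guest-side extra work boils down to something like the
sketch below (heavily simplified from the kernel's __pvclock_clocksource_read();
the version/seqcount handling is omitted):

  static atomic64_t last_value;	/* max clock value observed on any vCPU */

  static u64 pvclock_read(struct pvclock_vcpu_time_info *src)
  {
  	u64 last, ret = __pvclock_read_cycles(src, rdtsc_ordered());

  	/* Host guarantees cross-vCPU monotonicity; nothing more to do. */
  	if (src->flags & PVCLOCK_TSC_STABLE_BIT)
  		return ret;

  	/* Otherwise, never return a value below what any vCPU has seen. */
  	last = atomic64_read(&last_value);
  	do {
  		if (ret <= last)
  			return last;
  	} while (!atomic64_try_cmpxchg(&last_value, &last, ret));

  	return ret;
  }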



Re: [PATCH RFC 1/1] x86/paravirt: introduce param to disable pv sched_clock

2023-10-19 Thread Sean Christopherson
On Thu, Oct 19, 2023, Vitaly Kuznetsov wrote:
> Dongli Zhang  writes:
> 
> > As mentioned in the linux kernel development document, "sched_clock() is
> > used for scheduling and timestamping". While there is a default native
> > implementation, many paravirtualizations have their own implementations.
> >
> > About KVM, it uses kvm_sched_clock_read() and there is no way to only
> > disable KVM's sched_clock. The "no-kvmclock" may disable all
> > paravirtualized kvmclock features.

...

> > Please suggest and comment if other options are better:
> >
> > 1. Global param (this RFC patch).
> >
> > 2. The kvmclock specific param (e.g., "no-vmw-sched-clock" in vmware).
> >
> > Indeed I like the 2nd method.
> >
> > 3. Enforce native sched_clock only when TSC is invariant (hyper-v method).
> >
> > 4. Remove and cleanup pv sched_clock, and always use pv_sched_clock() for
> > all (suggested by Peter Zijlstra in [3]). Some paravirtualizations may
> > want to keep the pv sched_clock.
> 
> Normally, it should be up to the hypervisor to tell the guest which
> clock to use, i.e. if TSC is reliable or not. Let me put my question
> this way: if TSC on the particular host is good for everything, why
> does the hypervisor advertise 'kvmclock' to its guests?

I suspect there are two reasons.

  1. As is likely the case in our fleet, no one revisited the set of advertised
     PV features when defining the VM shapes for a new generation of hardware,
     or whoever did the reviews wasn't aware that advertising kvmclock is
     actually suboptimal.  All the PV clock stuff in KVM is quite labyrinthian,
     so it's not hard to imagine it getting overlooked.

  2. Legacy VMs.  If VMs have been running with a PV clock for years, forcing
     them to switch to a new clocksource is high-risk, low-reward.

> If for some 'historical reasons' we can't revoke features we can always
> introduce a new PV feature bit saying that TSC is preferred.
> 
> 1) Global param doesn't sound like a good idea to me: chances are that
> people will be setting it on their guest images to workaround problems
> on one hypervisor (or, rather, on one public cloud which is too lazy to
> fix their hypervisor) while simultaneously creating problems on another.
> 
> 2) KVM specific parameter can work, but as KVM's sched_clock is the same
> as kvmclock, I'm not convinced it actually makes sense to separate the
> two. Like if sched_clock is known to be bad but TSC is good, why do we
> need to use PV clock at all? Having a parameter for debugging purposes
> may be OK though...
> 
> 3) This is Hyper-V specific, you can see that it uses a dedicated PV bit
> (HV_ACCESS_TSC_INVARIANT) and not the architectural
> CPUID.8007H:EDX[8]. I'm not sure we can blindly trust the later on
> all hypervisors.
> 
> 4) Personally, I'm not sure that relying on 'TSC is crap' detection is
> 100% reliable. I can imagine cases when we can't detect that fact that
> while synchronized across CPUs and not going backwards, it is, for
> example, ticking with an unstable frequency and PV sched clock is
> supposed to give the right correction (all of them are rdtsc() based
> anyways, aren't they?).

Yeah, practically speaking, the only thing adding a knob to turn off using PV
clocks for sched_clock will accomplish is creating an even bigger matrix of
combinations that can cause problems, e.g. where guests end up using kvmclock
timekeeping but not scheduling.

The explanation above and the links below fail to capture _the_ key point:
Linux-as-a-guest already prioritizes the TSC over paravirt clocks as the
clocksource when the TSC is constant and nonstop (first spliced blob below).

What I suggested is that if the TSC is chosen over a PV clock as the
clocksource, then we have the kernel also override the sched_clock selection
(second spliced blob below).

That doesn't require the guest admin to opt-in, and doesn't create even more
combinations to support.  It also provides for a smoother transition for when
customers inevitably end up creating VMs on hosts that don't advertise kvmclock
(or any PV clock).
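
A rough sketch of that override (not actual upstream code; the exact location of
the check and the "TSC is good enough" condition below are assumptions):

  /* Assumption: mirror the clocksource-side "TSC is preferred" condition. */
  static bool prefer_native_sched_clock(void)
  {
  	return boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
  	       boot_cpu_has(X86_FEATURE_NONSTOP_TSC);
  }

  static void __init kvm_sched_clock_init(bool stable)
  {
  	/*
  	 * If the guest is going to use the raw TSC as its clocksource anyway,
  	 * leave sched_clock on native_sched_clock() as well.
  	 */
  	if (prefer_native_sched_clock())
  		return;

  	paravirt_set_sched_clock(kvm_sched_clock_read);
  }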

> > To introduce a param may be easier to backport to old kernel version.
> >
> > References:
> > [1] 
> > https://lore.kernel.org/all/20230926230649.67852-1-dongli.zh...@oracle.com/
> > [2] https://lore.kernel.org/all/20231018195638.1898375-1-sea...@google.com/
> > [3] 
> > https://lore.kernel.org/all/20231002211651.ga3...@noisy.programming.kicks-ass.net/

On Mon, Oct 2, 2023 at 11:18 AM Sean Christopherson  wrote:
> > Do we need to update the documentation to always suggest TSC when it is
> > constant, as I believe many users still prefer pv clock than tsc?
> >
> > T

Re: [PATCH RESEND v9 33/36] KVM: VMX: Add VMX_DO_FRED_EVENT_IRQOFF for IRQ/NMI handling

2023-08-01 Thread Sean Christopherson
On Tue, Aug 01, 2023, Peter Zijlstra wrote:
> On Tue, Aug 01, 2023 at 07:01:15PM +0000, Sean Christopherson wrote:
> > The spec I have from May 2022 says the NMI bit colocated with CS, not SS.
> > And the cover letter's suggestion to use a search engine to find the spec
> > ain't exactly helpful, that just gives me the same "May 2022 Revision 3.0"
> > spec.  So either y'all have a spec that I can't find, or this is wrong.
> 
> https://intel.com/sdm
> 
> is a useful shorthand I've recently been told about.

Hallelujah!

> On that page is also "Flexible Return and Event Delivery Specification", when
> clicked it will gift you a FRED v5.0 PDF.

Worked for me, too.  Thanks!



Re: [PATCH RESEND v9 33/36] KVM: VMX: Add VMX_DO_FRED_EVENT_IRQOFF for IRQ/NMI handling

2023-08-01 Thread Sean Christopherson
On Tue, Aug 01, 2023, Xin Li wrote:
> 
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index 07e927d4d099..5ee6a57b59a5 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -2,12 +2,14 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include "kvm-asm-offsets.h"
>  #include "run_flags.h"
> +#include "../../entry/calling.h"

Rather than do the low level PUSH_REGS and POP_REGS, I vote to have core code
expose a FRED-specific wrapper for invoking external_interrupt().  More below.

>  
>  #define WORD_SIZE (BITS_PER_LONG / 8)
>  
> @@ -31,6 +33,80 @@
>  #define VCPU_R15 __VCPU_REGS_R15 * WORD_SIZE
>  #endif
>  
> +#ifdef CONFIG_X86_FRED
> +.macro VMX_DO_FRED_EVENT_IRQOFF branch_insn branch_target nmi=0

objtool isn't happy.

arch/x86/kvm/vmx/vmenter.o: warning: objtool: vmx_do_fred_interrupt_irqoff+0x6c: return with modified stack frame
arch/x86/kvm/vmx/vmenter.o: warning: objtool: vmx_do_fred_nmi_irqoff+0x37: sibling call from callable instruction with modified stack frame

The "return with modified stack frame" goes away with my suggested changes, but
the sibling call remains for the NMI case due to the JMP instead of a call.

> + /*
> +  * Unconditionally create a stack frame, getting the correct RSP on the
> +  * stack (for x86-64) would take two instructions anyways, and RBP can
> +  * be used to restore RSP to make objtool happy (see below).
> +  */
> + push %_ASM_BP
> + mov %_ASM_SP, %_ASM_BP

The frame stuff is worth throwing in a macro, if only to avoid a copy+pasted
comment, which by the by, is wrong.  (a) it's ERETS, not IRET.  (b) the IRQ does
a vanilla RET, not ERETS.  E.g. like so:

.macro VMX_DO_EVENT_FRAME_BEGIN
	/*
	 * Unconditionally create a stack frame, getting the correct RSP on the
	 * stack (for x86-64) would take two instructions anyways, and RBP can
	 * be used to restore RSP to make objtool happy (see below).
	 */
	push %_ASM_BP
	mov %_ASM_SP, %_ASM_BP
.endm

.macro VMX_DO_EVENT_FRAME_END
	/*
	 * "Restore" RSP from RBP, even though {E,I}RET has already unwound RSP
	 * to the correct value *in most cases*.  KVM's IRQ handling with FRED
	 * doesn't do ERETS, and objtool doesn't know the callee will IRET/ERET
	 * and, without the explicit restore, thinks the stack is getting
	 * walloped.  Using an unwind hint is problematic due to x86-64's
	 * dynamic alignment.
	 */
	mov %_ASM_BP, %_ASM_SP
	pop %_ASM_BP
.endm

> +
> + /*
> +  * Don't check the FRED stack level, the call stack leading to this
> +  * helper is effectively constant and shallow (relatively speaking).
> +  *
> +  * Emulate the FRED-defined redzone and stack alignment.
> +  */
> + sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
> + and $FRED_STACK_FRAME_RSP_MASK, %rsp
> +
> + /*
> +  * A FRED stack frame has extra 16 bytes of information pushed at the
> +  * regular stack top compared to an IDT stack frame.

There is pretty much no chance that anyone remembers the layout of an IDT stack
frame off the top of their head.  I.e. saying "FRED has 16 bytes more" isn't all
that useful.  It also fails to capture the fact that FRED stuffs a hell of a lot
more information into those "common" 48 bytes.

It'll be hard/impossible to capture all of the overload info in a comment, but
showing the actual layout of the frame would be super helpful, e.g. something
like this:

/*
 * FRED stack frames are always 64 bytes:
 *
 * ----------------------------------
 * | Bytes  | Usage                 |
 * ----------------------------------
 * | 63:56  | Reserved              |
 * | 55:48  | Event Data            |
 * | 47:40  | SS + Event Info       |
 * | 39:32  | RSP                   |
 * | 31:24  | RFLAGS                |
 * | 23:16  | CS + Aux Info         |
 * |  15:8  | RIP                   |
 * |   7:0  | Error Code            |
 * ----------------------------------
 */
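
(Purely illustrative, not a suggestion for the patch: the same 64-byte frame as
a C struct, lowest address first.)

  struct fred_frame {
  	u64 error_code;		/* bytes  7:0  */
  	u64 rip;		/* bytes 15:8  */
  	u64 cs_aux;		/* bytes 23:16, CS + auxiliary info */
  	u64 rflags;		/* bytes 31:24 */
  	u64 rsp;		/* bytes 39:32 */
  	u64 ss_event_info;	/* bytes 47:40, SS + event info */
  	u64 event_data;		/* bytes 55:48 */
  	u64 reserved;		/* bytes 63:56 */
  };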

> +  */
> + push $0 /* Reserved by FRED, must be 0 */
> + push $0 /* FRED event data, 0 for NMI and external interrupts */
> +
> + shl $32, %rdi   /* FRED event type and vector */
> + .if \nmi
> + bts $FRED_SSX_NMI_BIT, %rdi /* Set the NMI bit */

The spec I have from May 2022 says the NMI bit colocated with CS, not SS.  And
the cover letter's suggestion to use a search engine to find the spec ain't
exactly helpful, that just gives me the same "May 2022 Revision 3.0" spec.  So
either y'all have a spec that I can't find, or this is wrong.

> + .endif
> + bts $FRED_SSX_64_BIT_MODE_BIT, %rdi /* Set the 64-bit mode */
> + or $__KERNEL_DS, %rdi
> + push %rdi
> + push %rbp
> + pushf
> +

Re: [PATCH v9 00/36] x86: enable FRED for x86-64

2023-07-31 Thread Sean Christopherson
On Mon, Jul 31, 2023, Xin3 Li wrote:
> > > This patch set enables the Intel flexible return and event delivery
> > > (FRED) architecture for x86-64.
> > 
> > ...
> > 
> > > --
> > > 2.34.1
> > 
> > What is this based on?
> 
> The tip tree master branch.
> 
> > FYI, you're using a version of git that will (mostly)
> > automatically generate the base, e.g. I do
> > 
> >   git format-patch --base=HEAD~$nr ...
> > 
> > in my scripts, where $nr is the number of patches I am sending.  My specific
> > approach requires HEAD~$nr to be a publicly visible object/commit, but that
> > should be the case the vast majority of the time anyways.
> 
> Are you talking about that you only got a subset of this patch set?

No, I'm saying I don't want to waste a bunch of time tracking down exactly which
commit a 36 patch series is based on.  E.g. I just refreshed tip/master and
still get:

Applying: x86/idtentry: Incorporate definitions/declarations of the FRED external interrupt handler type
error: sha1 information is lacking or useless (arch/x86/include/asm/idtentry.h).
error: could not build fake ancestor
Patch failed at 0024 x86/idtentry: Incorporate definitions/declarations of the FRED external interrupt handler type
hint: Use 'git am --show-current-patch=diff' to see the failed patch

> HPA told me he only got patches 0-25/36.
> 
> And I got several undeliverable email notifications, saying
> "
> The following message to  was undeliverable.
> The reason for the problem:
> 5.x.1 - Maximum number of delivery attempts exceeded. [Default] 450-'4.7.25 
> Client host rejected: cannot find your hostname, [134.134.136.31]'
> "
> 
> I guess there were some problems with the Intel mail system last night,
> probably I should resend this patch set later.

Yes, lore also appears to be missing patches.  I grabbed the mbox off of KVM's
patchwork instance.



Re: [PATCH v9 00/36] x86: enable FRED for x86-64

2023-07-31 Thread Sean Christopherson
On Sun, Jul 30, 2023, Xin Li wrote:
> This patch set enables the Intel flexible return and event delivery
> (FRED) architecture for x86-64.

...

> -- 
> 2.34.1

What is this based on?  FYI, you're using a version of git that will (mostly)
automatically generate the base, e.g. I do

  git format-patch --base=HEAD~$nr ...

in my scripts, where $nr is the number of patches I am sending.  My specific
approach requires HEAD~$nr to be a publicly visible object/commit, but that
should be the case the vast majority of the time anyways.



Re: [ANNOUNCE] KVM Microconference at LPC 2023

2023-06-01 Thread Sean Christopherson
On Thu, Jun 01, 2023, Mickaël Salaün wrote:
> Hi,
> 
> What is the status of this microconference proposal? We'd be happy to talk
> about Heki [1] and potentially other hypervisor supports.

Proposal submitted (deadline is/was today), now we wait :-)  IIUC, we should
find out rather quickly whether or not the KVM MC is a go.



Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-31 Thread Sean Christopherson
On Tue, May 30, 2023, Rick P Edgecombe wrote:
> On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:
> > > > Can the guest kernel ask the host VMM's emulated devices to DMA into
> > > > the protected data? It should go through the host userspace mappings I
> > > > think, which don't care about EPT permissions. Or did I miss where you
> > > > are protecting that another way? There are a lot of easy ways to ask
> > > > the host to write to guest memory that don't involve the EPT. You
> > > > probably need to protect the host userspace mappings, and also the
> > > > places in KVM that kmap a GPA provided by the guest.
> > > 
> > > Good point, I'll check this confused deputy attack. Extended KVM
> > > protections should indeed handle all ways to map guests' memory.  I'm
> > > wondering if current VMMs would gracefully handle such new restrictions
> > > though.
> > 
> > I guess the host could map arbitrary data to the guest, so that need to be
> > handled, but how could the VMM (not the host kernel) bypass/update EPT
> > initially used for the guest (and potentially later mapped to the host)?
> 
> Well traditionally both QEMU and KVM accessed guest memory via host
> mappings instead of the EPT.  So I'm wondering what is stopping the
> guest from passing a protected gfn when setting up the DMA, and QEMU
> being enticed to write to it? The emulator as well would use these host
> userspace mappings and not consult the EPT IIRC.
> 
> I think Sean was suggesting host userspace should be more involved in
> this process, so perhaps it could protect its own alias of the
> protected memory, for example mprotect() it as read-only.

Ya, though "suggesting" is really "demanding, unless someone provides super
strong justification for handling this directly in KVM".  It's basically the
same argument that led to Linux Security Modules: I'm all for KVM providing the
framework and plumbing, but I don't want KVM to get involved in defining policy,
threat models, etc.



Re: [patch] x86/smpboot: Disable parallel bootup if cc_vendor != NONE

2023-05-30 Thread Sean Christopherson
On Tue, May 30, 2023, Kirill A. Shutemov wrote:
> On Tue, May 30, 2023 at 06:00:46PM +0200, Thomas Gleixner wrote:
> > On Tue, May 30 2023 at 15:29, Kirill A. Shutemov wrote:
> > > On Tue, May 30, 2023 at 02:09:17PM +0200, Thomas Gleixner wrote:
> > >> The decision to allow parallel bringup of secondary CPUs checks
> > >> CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use
> > >> parallel bootup because accessing the local APIC is intercepted and 
> > >> raises
> > >> a #VC or #VE, which cannot be handled at that point.
> > >> 
> > >> The check works correctly, but only for AMD encrypted guests. TDX does 
> > >> not
> > >> set that flag.
> > >> 
> > >> Check for cc_vendor != CC_VENDOR_NONE instead. That might be overbroad, 
> > >> but
> > >> definitely works for both AMD and Intel.
> > >
> > > It boots fine with TDX, but I think it is wrong. cc_get_vendor() will
> > > report CC_VENDOR_AMD even on bare metal if SME is enabled. I don't think
> > > we want it.
> > 
> > Right. Did not think about that.
> > 
> > But the same way is CC_ATTR_GUEST_MEM_ENCRYPT overbroad for AMD. Only
> > SEV-ES traps RDMSR if I'm understanding that maze correctly.
> 
> I don't know difference between SEV flavours that well.
> 
> I see there's that on SEV-SNP access to x2APIC MSR range (MSR 0x800-0x8FF)
> is intercepted regardless if MSR_AMD64_SNP_ALT_INJ feature is present. But
> I'm not sure what the state on SEV or SEV-ES.

With SEV-ES, if the hypervisor intercepts an MSR access, the VM-Exit is instead
morphed to a #VC (except for EFER).  The guest needs to do an explicit VMGEXIT
(i.e. a hypercall) to explicitly request MSR emulation (this *can* be done in
the #VC handler, but the guest can also do VMGEXIT directly, e.g. in lieu of a
RDMSR).

With regular SEV, VM-Exits aren't reflected into the guest.  Register state
isn't encrypted so the hypervisor can emulate MSR accesses (and other
instructions) without needing an explicit hypercall from the guest.
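
Very roughly, the SEV-ES guest-side flow looks like the sketch below (helper
names approximate the kernel's #VC/GHCB code and are not exact):

  /*
   * #VC handler path for an intercepted RDMSR/WRMSR under SEV-ES: stuff the
   * relevant register state into the shared GHCB, then VMGEXIT so the
   * hypervisor can emulate the access.
   */
  static enum es_result vc_handle_msr(struct ghcb *ghcb, struct es_em_ctxt *ctxt,
  				      bool write)
  {
  	ghcb_set_rcx(ghcb, ctxt->regs->cx);		/* MSR index */
  	if (write) {
  		ghcb_set_rax(ghcb, ctxt->regs->ax);
  		ghcb_set_rdx(ghcb, ctxt->regs->dx);
  	}

  	/* VMGEXIT: hand the GHCB to the hypervisor for MSR emulation. */
  	return sev_es_ghcb_hv_call(ghcb, ctxt, SVM_EXIT_MSR, write, 0);
  }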



Re: [patch] x86/smpboot: Disable parallel bootup if cc_vendor != NONE

2023-05-30 Thread Sean Christopherson
On Tue, May 30, 2023, Thomas Gleixner wrote:
> On Tue, May 30 2023 at 15:29, Kirill A. Shutemov wrote:
> > On Tue, May 30, 2023 at 02:09:17PM +0200, Thomas Gleixner wrote:
> >> The decision to allow parallel bringup of secondary CPUs checks
> >> CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use
> >> parallel bootup because accessing the local APIC is intercepted and raises
> >> a #VC or #VE, which cannot be handled at that point.
> >> 
> >> The check works correctly, but only for AMD encrypted guests. TDX does not
> >> set that flag.
> >> 
> >> Check for cc_vendor != CC_VENDOR_NONE instead. That might be overbroad, but
> >> definitely works for both AMD and Intel.
> >
> > It boots fine with TDX, but I think it is wrong. cc_get_vendor() will
> > report CC_VENDOR_AMD even on bare metal if SME is enabled. I don't think
> > we want it.
> 
> Right. Did not think about that.
> 
> But the same way is CC_ATTR_GUEST_MEM_ENCRYPT overbroad for AMD. Only
> > SEV-ES traps RDMSR if I'm understanding that maze correctly.

Ya, regular SEV doesn't encrypt register state.



Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-25 Thread Sean Christopherson
On Thu, May 25, 2023, Rick P Edgecombe wrote:
> I wonder if it might be a good idea to POC the guest side before
> settling on the KVM interface. Then you can also look at the whole
> thing and judge how much usage it would get for the different options
> of restrictions.

As I said earlier[*], IMO the control plane logic needs to live in host
userspace.  I think any attempt to have KVM provide anything but the low level
plumbing will suffer the same fate as CR4 pinning and XO memory.  Iterating on
an imperfect solution to incrementally improve security is far, far easier to do
in userspace, and far more likely to get merged.

[*] https://lore.kernel.org/all/zfuyhpuhtmbyd...@google.com



Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Mickaël Salaün wrote:
> 
> On 05/05/2023 18:28, Sean Christopherson wrote:
> > I have no doubt that we'll need to solve performance and scaling issues with
> > the memory attributes implementation, e.g. to utilize xarray multi-range
> > support instead of storing information on a per-4KiB-page basis, but AFAICT,
> > the core idea is sound.  And a very big positive from a maintenance
> > perspective is that any optimizations, fixes, etc. for one use case (CoCo
> > vs. hardening) should also benefit the other use case.
> > 
> > [1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
> > [2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
> > [3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com
> 
> I agree, I used this mechanism because it was easier at first to rely on a
> previous work, but while I was working on the MBEC support, I realized that
> it's not the optimal way to do it.
> 
> I was thinking about using a new special EPT bit similar to
> EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you
> think?

On x86, SPTEs are even more ephemeral than memslots.  E.g. for historical
reasons, KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if
the guest is moving around BARs, using option ROMs, etc.

ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to
track attributes, but that works only because pKVM is more privileged than the
host kernel, and the shared vs. private memory attribute that pKVM cares about
is very, very restricted in how it can be used and changed.

I tried shoehorning private vs. shared metadata into x86's SPTEs in the past,
and it ended up being a constant battle with the kernel, e.g. page migration,
and with KVM itself, e.g. the above memslot mess.



Re: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Mickaël Salaün wrote:
> 
> On 05/05/2023 18:44, Sean Christopherson wrote:
> > On Fri, May 05, 2023, Mickaël Salaün wrote:
> > > Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
> > > set EPT permissions on a set of page ranges.
> > 
> > IMO, manipulation of protections, both for memory (this patch) and CPU state
> > (control registers in the next patch) should come from userspace.  I have no
> > objection to KVM providing plumbing if necessary, but I think userspace
> > needs to have full control over the actual state.
> 
> By user space, do you mean the host user space or the guest user space?

Host userspace, a.k.a. the VMM.  Definitely not guest userspace.

> About the guest user space, I see several issues to delegate this kind of
> control:
> - These are restrictions only relevant to the kernel.
> - The threat model is to protect against user space as early as possible.
> - It would be more complex for no obvious gain.
> 
> This patch series is an extension of the kernel self-protections mechanisms,
> and they are not configured by user space.
> 
> 
> > 
> > One of the things that caused Intel's control register pinning series to
> > stall out was how to handle edge cases like kexec() and reboot.  Deferring
> > to userspace means the kernel doesn't need to define policy, e.g. when to
> > unprotect memory, and avoids questions like "should userspace be able to
> > overwrite pinned control registers".
> 
> The idea is to authenticate every changes. For kexec, the VMM (or something
> else) would have to authenticate the new kernel. Do you have something else
> in mind that could legitimately require such memory or CR changes?

I think we're on the same page, the VMM (host userspace) would need to ack any
changes.

FWIW, SMM is another wart as entry to SMM clobbers CRs.  Now that CONFIG_KVM_SMM
is a thing, the easiest solution would be to disallow coexistence with SMM,
though that might not be an option for many use cases (IIUC, QEMU-based
deployments use SMM to implement secure boot).

> > And like the confidential VM use case, keeping userspace in the loop is a
> > big benefit, e.g. the guest can't circumvent protections by coercing
> > userspace into writing to protected memory.
> 
> I don't understand this part. Are you talking about the host user space? How
> the guest could circumvent protections?

Host userspace.  Guest configures a device buffer in write-protected memory,
gets a host (synthetic) device to write into the memory.



Re: [PATCH v1 4/9] KVM: x86: Add new hypercall to set EPT permissions

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Mickaël Salaün wrote:
> Add a new KVM_HC_LOCK_MEM_PAGE_RANGES hypercall that enables a guest to
> set EPT permissions on a set of page ranges.

IMO, manipulation of protections, both for memory (this patch) and CPU state
(control registers in the next patch) should come from userspace.  I have no
objection to KVM providing plumbing if necessary, but I think userspace needs
to have full control over the actual state.

One of the things that caused Intel's control register pinning series to stall
out was how to handle edge cases like kexec() and reboot.  Deferring to 
userspace
means the kernel doesn't need to define policy, e.g. when to unprotect memory,
and avoids questions like "should userspace be able to overwrite pinned control
registers".

And like the confidential VM use case, keeping userspace in the loop is a big
benefit, e.g. the guest can't circumvent protections by coercing userspace into
writing to protected memory.



Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Mickaël Salaün wrote:
> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> index eb186bc57f6a..a7fb4ff888e6 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -3,6 +3,7 @@
>  #define _ASM_X86_KVM_PAGE_TRACK_H
>  
>  enum kvm_page_track_mode {
> + KVM_PAGE_TRACK_PREWRITE,

Heh, just when I decide to finally kill off support for multiple modes[1] :-)

My assessment from that changelog still holds true for this case:

  Drop "support" for multiple page-track modes, as there is no evidence
  that array-based and refcounted metadata is the optimal solution for
  other modes, nor is there any evidence that other use cases, e.g. for
  access-tracking, will be a good fit for the page-track machinery in
  general.
  
  E.g. one potential use case of access-tracking would be to prevent guest
  access to poisoned memory (from the guest's perspective).  In that case,
  the number of poisoned pages is likely to be a very small percentage of
  the guest memory, and there is no need to reference count the number of
  access-tracking users, i.e. expanding gfn_track[] for a new mode would be
  grossly inefficient.  And for poisoned memory, host userspace would also
  likely want to trap accesses, e.g. to inject #MC into the guest, and that
  isn't currently supported by the page-track framework.
  
  A better alternative for that poisoned page use case is likely a
  variation of the proposed per-gfn attributes overlay (linked), which
  would allow efficiently tracking the sparse set of poisoned pages, and by
  default would exit to userspace on access.

Of particular relevance:

  - Using the page-track machinery is inefficient because the guest is likely
    going to write-protect a minority of its memory.  And this

      select KVM_EXTERNAL_WRITE_TRACKING if KVM

    is particularly nasty because simply enabling HEKI in the Kconfig will cause
    KVM to allocate rmaps and gfn tracking.

  - There's no need to reference count the protection, i.e. 15 of the 16 bits of
    gfn_track are dead weight.

  - As proposed, adding a second "mode" would double the cost of gfn tracking.

  - Tying the protections to the memslots will create an impossible-to-maintain
    ABI because the protections will be lost if the owning memslot is deleted
    and recreated.

  - The page-track framework provides incomplete protection and will lead to an
    ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by
    adding calls to kvm_page_track_prewrite(), but misses things like
    kvm_vcpu_map().

  - The scaling and maintenance issues will only get worse if/when someone tries
    to support dropping read and/or execute permissions, e.g. for execute-only.

  - The code is x86-only, and is likely to stay that way for the foreseeable
    future.

The proposed alternative is to piggyback the memory attributes implementation[2]
that is being added (if all goes according to plan) for confidential VMs.  This
use case (dropping permissions) came up not too long ago[3], which is why I have
a ready-made answer.

I have no doubt that we'll need to solve performance and scaling issues with the
memory attributes implementation, e.g. to utilize xarray multi-range support
instead of storing information on a per-4KiB-page basis, but AFAICT, the core
idea is sound.  And a very big positive from a maintenance perspective is that
any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
benefit the other use case.

[1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
[2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
[3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com
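
To make the "piggyback memory attributes" direction concrete, the userspace side
might eventually look something like the sketch below.  The ioctl and struct
follow the proposal in [2]; KVM_MEMORY_ATTRIBUTE_NO_WRITE is invented here
purely to show the shape of a write-protect attribute:

  struct kvm_memory_attributes attrs = {
  	.address    = gpa,				/* start of the range */
  	.size       = len,
  	.attributes = KVM_MEMORY_ATTRIBUTE_NO_WRITE,	/* hypothetical flag */
  };

  ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);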



Re: [patch 00/37] cpu/hotplug, x86: Reworked parallel CPU bringup

2023-04-20 Thread Sean Christopherson
On Thu, Apr 20, 2023, Thomas Gleixner wrote:
> On Thu, Apr 20 2023 at 10:23, Andrew Cooper wrote:
> > On 20/04/2023 9:32 am, Thomas Gleixner wrote:
> > > On Wed, Apr 19, 2023, Andrew Cooper wrote:
> > > > This was changed in x2APIC, which made the x2APIC_ID immutable.
>
> >> I'm pondering to simply deny parallel mode if x2APIC is not there.
> >
> > I'm not sure if that will help much.
> 
> Spoilsport.

LOL, well let me pile on then.  x2APIC IDs aren't immutable on AMD hardware.
The ID is read-only when the CPU is in x2APIC mode, but any changes made to the
ID while the CPU is in xAPIC mode survive the transition to x2APIC.  From the
APM:

  A value previously written by software to the 8-bit APIC_ID register (MMIO
  offset 30h) is converted by hardware into the appropriate format and
  reflected into the 32-bit x2APIC_ID register (MSR 802h).

FWIW, my observations from testing on bare metal are that the xAPIC ID is
effectively read-only (writes are dropped) on Intel CPUs as far back as Haswell,
while the above behavior described in the APM holds true on at least Rome and
Milan.

My guess is that Intel's uArch specific behavior of the xAPIC ID being read-only
was introduced when x2APIC came along, but I didn't test farther back than
Haswell.



Re: [PATCH v2] x86/hotplug: Do not put offline vCPUs in mwait idle state

2023-01-20 Thread Sean Christopherson
On Fri, Jan 20, 2023, Igor Mammedov wrote:
> On Fri, 20 Jan 2023 05:55:11 -0800
> "Srivatsa S. Bhat"  wrote:
> 
> > Hi Igor and Thomas,
> > 
> > Thank you for your review!
> > 
> > On 1/19/23 1:12 PM, Thomas Gleixner wrote:
> > > On Mon, Jan 16 2023 at 15:55, Igor Mammedov wrote:  
> > >> "Srivatsa S. Bhat"  wrote:  
> > >>> Fix this by preventing the use of mwait idle state in the vCPU offline
> > >>> play_dead() path for any hypervisor, even if mwait support is
> > >>> available.  
> > >>
> > >> if mwait is enabled, it's very likely guest to have cpuidle
> > >> enabled and using the same mwait as well. So exiting early from
> > >>  mwait_play_dead(), might just punt workflow down:
> > >>   native_play_dead()
> > >> ...
> > >> mwait_play_dead();
> > >> if (cpuidle_play_dead())   <- possible mwait here
> > >>   
> > >> hlt_play_dead(); 
> > >>
> > >> and it will end up in mwait again and only if that fails
> > >> it will go HLT route and maybe transition to VMM.  
> > > 
> > > Good point.
> > >   
> > >> Instead of workaround on guest side,
> > >> shouldn't hypervisor force VMEXIT on being uplugged vCPU when it's
> > >> actually hot-unplugging vCPU? (ex: QEMU kicks vCPU out from guest
> > >> context when it is removing vCPU, among other things)  
> > > 
> > > For a pure guest side CPU unplug operation:
> > > 
> > > guest$ echo 0 >/sys/devices/system/cpu/cpu$N/online
> > > 
> > > the hypervisor is not involved at all. The vCPU is not removed in that
> > > case.
> > >   
> > 
> > Agreed, and this is indeed the scenario I was targeting with this patch,
> > as opposed to vCPU removal from the host side. I'll add this clarification
> > to the commit message.

Forcing HLT doesn't solve anything, it's perfectly legal to passthrough HLT.  I
guarantee there are use cases that passthrough HLT but _not_ MONITOR/MWAIT, and
that passthrough all of them.

> commit message explicitly said:
> "which prevents the hypervisor from running other vCPUs or workloads on the
> corresponding pCPU."
> 
> and that implies unplug on hypervisor side as well.
> Why? That's because when hypervisor exposes mwait to guest, it has to 
> reserve/pin
> a pCPU for each of present vCPUs. And you can safely run other VMs/workloads
> on that pCPU only after it's not possible for it to be reused by VM where
> it was used originally.

Pinning isn't strictly required from a safety perspective.  The latency of
context switching may suffer due to wake times, but preempting a vCPU that is in
C1 (or deeper) won't cause functional problems.  Passing through an entire
socket (or whatever scope triggers extra fun) might be a different story, but
pinning isn't strictly required.

That said, I 100% agree that this is expected behavior and not a bug.  Letting
the guest execute MWAIT or HLT means the host won't have perfect visibility into
guest activity state.

Oversubscribing a pCPU and exposing MWAIT and/or HLT to vCPUs is generally not
done precisely because the guest will always appear busy without extra effort on
the host.  E.g. KVM requires an explicit opt-in from userspace to expose MWAIT
and/or HLT.
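
For reference, the opt-in is roughly the below (a sketch; IIRC the cap must be
enabled on the VM before any vCPUs are created):

  struct kvm_enable_cap cap = {
  	.cap     = KVM_CAP_X86_DISABLE_EXITS,
  	.args[0] = KVM_X86_DISABLE_EXITS_MWAIT | KVM_X86_DISABLE_EXITS_HLT,
  };

  ioctl(vm_fd, KVM_ENABLE_CAP, &cap);	/* guest MWAIT/HLT no longer exit */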

If someone really wants to efficiently oversubscribe pCPUs and passthrough
MWAIT, then their best option is probably to have a paravirt interface so that
the guest can tell the host it's offlining a vCPU.  Barring that, the host could
inspect the guest when preempting a vCPU to try and guesstimate how much work
the vCPU is actually doing in order to make better scheduling decisions.
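
Purely hypothetical sketch of such a paravirt interface (KVM_HC_VCPU_OFFLINE
does not exist; the name and ABI are invented here):

  /* Guest side: tell the host this vCPU is going away, then park it. */
  static void pv_play_dead(void)
  {
  	kvm_hypercall1(KVM_HC_VCPU_OFFLINE, smp_processor_id());	/* invented */

  	for (;;)
  		native_halt();	/* host can now deschedule/repurpose the pCPU */
  }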

> Now consider following worst (and most likely) case without unplug
> on hypervisor side:
> 
>  1. vm1mwait: pin pCPU2 to vCPU2
>  2. vm1mwait: guest$ echo 0 >/sys/devices/system/cpu/cpu2/online
> -> HLT -> VMEXIT
>  --
>  3. vm2mwait: pin pCPU2 to vCPUx and start VM
>  4. vm2mwait: guest OS onlines Vcpu and starts using it incl.
>going into idle=>mwait state
>  --
>  5. vm1mwait: it still thinks that vCPU is present it can rightfully do:
>guest$ echo 1 >/sys/devices/system/cpu/cpu2/online
>  --  
>  6.1 best case vm1mwait online fails after timeout
>  6.2 worse case: vm2mwait does VMEXIT on vCPUx around time-frame when
>  vm1mwait onlines vCPU2, the online may succeed and then vm2mwait's
>  vCPUx will be stuck (possibly indefinitely) until for some reason
>  VMEXIT happens on vm1mwait's vCPU2 _and_ host decides to schedule
>  vCPUx on pCPU2 which would make vm1mwait stuck on vCPU2.
> So either way it's expected behavior.
> 
> And if there is no intention to unplug vCPU on hypervisor side,
> then VMEXIT on play_dead is not really necessary (mwait is better
> then HLT), since hypervisor can't safely reuse pCPU elsewhere and
> VCPU goes into deep sleep within guest context.
> 
> PS:
> The only case where making HLT/VMEXIT on play_dead might work out,
> would be if new workload weren't pinned to the same pCPU nor
> used mwait (i.e. 

Re: [PATCH v7 0/2] KVM: x86/xen: update Xen CPUID Leaf 4

2023-01-19 Thread Sean Christopherson
On Fri, 06 Jan 2023 10:35:58 +, Paul Durrant wrote:
> Patch #2 was the original patch. It has been expended to a series in v6.
> 
> Paul Durrant (2):
>   KVM: x86/cpuid: generalize kvm_update_kvm_cpuid_base() and also
> capture limit
>   KVM: x86/xen: update Xen CPUID Leaf 4 (tsc info) sub-leaves, if
> present
> 
> [...]

Applied to kvm-x86 misc, thanks!

[1/2] KVM: x86/cpuid: generalize kvm_update_kvm_cpuid_base() and also capture limit
  https://github.com/kvm-x86/linux/commit/e3ac476711ca
[2/2] KVM: x86/xen: update Xen CPUID Leaf 4 (tsc info) sub-leaves, if present
  https://github.com/kvm-x86/linux/commit/509d19565173

--
https://github.com/kvm-x86/linux/tree/next
https://github.com/kvm-x86/linux/tree/fixes



Re: [PATCH 01/30] x86/crash,reboot: Avoid re-disabling VMX in all CPUs on crash/restart

2022-05-09 Thread Sean Christopherson
I find the shortlog to be very confusing, the bug has nothing to do with
disabling VMX and I distinctly remember wrapping VMXOFF with exception fixup to
prevent doom if VMX is already disabled :-).  The issue is really that
nmi_shootdown_cpus() doesn't play nice with being called twice.

On Wed, Apr 27, 2022, Guilherme G. Piccoli wrote:
> In the panic path we have a list of functions to be called, the panic
> notifiers - such callbacks perform various actions in the machine's
> last breath, and sometimes users want them to run before kdump. We
> have the parameter "crash_kexec_post_notifiers" for that. When such
> parameter is used, the function "crash_smp_send_stop()" is executed
> to poweroff all secondary CPUs through the NMI-shootdown mechanism;
> part of this process involves disabling virtualization features in
> all CPUs (except the main one).
> 
> Now, in the emergency restart procedure we have also a way of
> disabling VMX in all CPUs, using the same NMI-shootdown mechanism;
> what happens though is that in case we already NMI-disabled all CPUs,
> the emergency restart fails due to a second addition of the same items
> in the NMI list, as per the following log output:
> 
> sysrq: Trigger a crash
> Kernel panic - not syncing: sysrq triggered crash
> [...]
> Rebooting in 2 seconds..
> list_add double add: new=, prev=, next=.
> [ cut here ]
> kernel BUG at lib/list_debug.c:29!
> invalid opcode:  [#1] PREEMPT SMP PTI

Call stacks for the two callers would be very, very helpful.

> In order to reproduce the problem, users just need to set the kernel
> parameter "crash_kexec_post_notifiers" *without* kdump set in any
> system with the VMX feature present.
> 
> Since there is no benefit in re-disabling VMX in all CPUs in case
> it was already done, this patch prevents that by guarding the restart
> routine against doubly issuing NMIs unnecessarily. Notice we still
> need to disable VMX locally in the emergency restart.
> 
> Fixes: ed72736183c4 ("x86/reboot: Force all cpus to exit VMX root if VMX is 
> supported")
> Fixes: 0ee59413c967 ("x86/panic: replace smp_send_stop() with kdump friendly 
> version in panic path")
> Cc: David P. Reed 
> Cc: Hidehiro Kawai 
> Cc: Paolo Bonzini 
> Cc: Sean Christopherson 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  arch/x86/include/asm/cpu.h |  1 +
>  arch/x86/kernel/crash.c|  8 
>  arch/x86/kernel/reboot.c   | 14 --
>  3 files changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
> index 86e5e4e26fcb..b6a9062d387f 100644
> --- a/arch/x86/include/asm/cpu.h
> +++ b/arch/x86/include/asm/cpu.h
> @@ -36,6 +36,7 @@ extern int _debug_hotplug_cpu(int cpu, int action);
>  #endif
>  #endif
>  
> +extern bool crash_cpus_stopped;
>  int mwait_usable(const struct cpuinfo_x86 *);
>  
>  unsigned int x86_family(unsigned int sig);
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index e8326a8d1c5d..71dd1a990e8d 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -42,6 +42,8 @@
>  #include 
>  #include 
>  
> +bool crash_cpus_stopped;
> +
>  /* Used while preparing memory map entries for second kernel */
>  struct crash_memmap_data {
>   struct boot_params *params;
> @@ -108,9 +110,7 @@ void kdump_nmi_shootdown_cpus(void)
>  /* Override the weak function in kernel/panic.c */
>  void crash_smp_send_stop(void)
>  {
> - static int cpus_stopped;
> -
> - if (cpus_stopped)
> + if (crash_cpus_stopped)
>   return;
>  
>   if (smp_ops.crash_stop_other_cpus)
> @@ -118,7 +118,7 @@ void crash_smp_send_stop(void)
>   else
>   smp_send_stop();
>  
> - cpus_stopped = 1;
> + crash_cpus_stopped = true;

This feels like we're just adding more duct tape to the mess.  nmi_shootdown() is
still unsafe for more than one caller, and it takes a _lot_ of staring and
searching to understand that crash_smp_send_stop() is invoked iff
CONFIG_KEXEC_CORE=y, i.e. that it will call smp_ops.crash_stop_other_cpus() and
not just smp_send_stop().

Rather than share a flag between two relatively unrelated functions, what if we
instead disable virtualization in crash_nmi_callback() and then turn the reboot
call into a nop if an NMI shootdown has already occurred?  That will also add a
bit of documentation about multiple shootdowns not working.
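
Roughly what I have in mind, as an untested sketch (this only shows the shape,
not the attached patches; the guard variable is illustrative and the real code
would reuse the existing shootdown state):

        /* Disable virtualization in the shootdown NMI handler itself ... */
        static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
        {
                /* Both helpers tolerate virtualization already being off. */
                cpu_emergency_vmxoff();
                cpu_emergency_svm_disable();
                /* ... park the CPU as today ... */
                return NMI_HANDLED;
        }

        /* ... and make a second shootdown a nop instead of a double list_add. */
        void nmi_shootdown_cpus(nmi_shootdown_cb callback)
        {
                static bool shootdown_done;

                if (shootdown_done)
                        return;
                shootdown_done = true;
                /* existing NMI handler registration + broadcast */
        }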

And I believe there's also a lurking bug in native_machine_emergency_restart()
that can be fixed with cleanup.  SVM can also block INIT and so should be
disabled during an emergency reboot.

The attached patches are compile tested only.  If they 

[PATCH v4 15/17] KVM: arm64: Hide kvm_arm_pmu_available behind CONFIG_HW_PERF_EVENTS=y

2021-11-10 Thread Sean Christopherson
Move the definition of kvm_arm_pmu_available to pmu-emul.c and, out of
"necessity", hide it behind CONFIG_HW_PERF_EVENTS.  Provide a stub for
the key's wrapper, kvm_arm_support_pmu_v3().  Moving the key's definition
out of perf.c will allow a future commit to delete perf.c entirely.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/kernel/image-vars.h |  2 ++
 arch/arm64/kvm/perf.c  |  2 --
 arch/arm64/kvm/pmu-emul.c  |  2 ++
 include/kvm/arm_pmu.h  | 19 ---
 4 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index c96a9a0043bf..7eaf1f7c4168 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -102,7 +102,9 @@ KVM_NVHE_ALIAS(__stop___kvm_ex_table);
 KVM_NVHE_ALIAS(kvm_arm_hyp_percpu_base);
 
 /* PMU available static key */
+#ifdef CONFIG_HW_PERF_EVENTS
 KVM_NVHE_ALIAS(kvm_arm_pmu_available);
+#endif
 
 /* Position-independent library routines */
 KVM_NVHE_ALIAS_HYP(clear_page, __pi_clear_page);
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 374c496a3f1d..52cfab253c65 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -11,8 +11,6 @@
 
 #include 
 
-DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
-
 void kvm_perf_init(void)
 {
kvm_register_perf_callbacks(NULL);
diff --git a/arch/arm64/kvm/pmu-emul.c b/arch/arm64/kvm/pmu-emul.c
index a5e4bbf5e68f..3308ceefa129 100644
--- a/arch/arm64/kvm/pmu-emul.c
+++ b/arch/arm64/kvm/pmu-emul.c
@@ -14,6 +14,8 @@
 #include 
 #include 
 
+DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+
 static void kvm_pmu_create_perf_event(struct kvm_vcpu *vcpu, u64 select_idx);
 static void kvm_pmu_update_pmc_chained(struct kvm_vcpu *vcpu, u64 select_idx);
 static void kvm_pmu_stop_counter(struct kvm_vcpu *vcpu, struct kvm_pmc *pmc);
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 90f21898aad8..f9ed4c171d7b 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -13,13 +13,6 @@
 #define ARMV8_PMU_CYCLE_IDX(ARMV8_PMU_MAX_COUNTERS - 1)
 #define ARMV8_PMU_MAX_COUNTER_PAIRS((ARMV8_PMU_MAX_COUNTERS + 1) >> 1)
 
-DECLARE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
-
-static __always_inline bool kvm_arm_support_pmu_v3(void)
-{
-   return static_branch_likely(&kvm_arm_pmu_available);
-}
-
 #ifdef CONFIG_HW_PERF_EVENTS
 
 struct kvm_pmc {
@@ -36,6 +29,13 @@ struct kvm_pmu {
struct irq_work overflow_work;
 };
 
+DECLARE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+
+static __always_inline bool kvm_arm_support_pmu_v3(void)
+{
+   return static_branch_likely(&kvm_arm_pmu_available);
+}
+
 #define kvm_arm_pmu_irq_initialized(v) ((v)->arch.pmu.irq_num >= VGIC_NR_SGIS)
 u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u64 select_idx);
 void kvm_pmu_set_counter_value(struct kvm_vcpu *vcpu, u64 select_idx, u64 val);
@@ -65,6 +65,11 @@ int kvm_arm_pmu_v3_enable(struct kvm_vcpu *vcpu);
 struct kvm_pmu {
 };
 
+static inline bool kvm_arm_support_pmu_v3(void)
+{
+   return false;
+}
+
 #define kvm_arm_pmu_irq_initialized(v) (false)
 static inline u64 kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu,
u64 select_idx)
-- 
2.34.0.rc0.344.g81b53c2807-goog




[PATCH v4 13/17] KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c

2021-11-10 Thread Sean Christopherson
Now that all state needed for VMX's PT interrupt handler is exposed to
vmx.c (specifically the currently running vCPU), move the handler into
vmx.c where it belongs.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/vmx/vmx.c  | 22 +-
 arch/x86/kvm/x86.c  | 20 +---
 3 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ec16f645cb8c..621bedff0aa5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1515,7 +1515,7 @@ struct kvm_x86_init_ops {
int (*disabled_by_bios)(void);
int (*check_processor_compatibility)(void);
int (*hardware_setup)(void);
-   bool (*intel_pt_intr_in_guest)(void);
+   unsigned int (*handle_intel_pt_intr)(void);
 
struct kvm_x86_ops *runtime_ops;
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 36098eb9a7f9..7cb7f261f7dc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7708,6 +7708,20 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
 };
 
+static unsigned int vmx_handle_intel_pt_intr(void)
+{
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+   /* '0' on failure so that the !PT case can use a RET0 static call. */
+   if (!kvm_arch_pmi_in_guest(vcpu))
+   return 0;
+
+   kvm_make_request(KVM_REQ_PMI, vcpu);
+   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
+ (unsigned long *)&vcpu->arch.pmu.global_status);
+   return 1;
+}
+
 static __init void vmx_setup_user_return_msrs(void)
 {
 
@@ -7734,6 +7748,8 @@ static __init void vmx_setup_user_return_msrs(void)
kvm_add_user_return_msr(vmx_uret_msrs_list[i]);
 }
 
+static struct kvm_x86_init_ops vmx_init_ops __initdata;
+
 static __init int hardware_setup(void)
 {
unsigned long host_bndcfgs;
@@ -7892,6 +7908,10 @@ static __init int hardware_setup(void)
return -EINVAL;
if (!enable_ept || !cpu_has_vmx_intel_pt())
pt_mode = PT_MODE_SYSTEM;
+   if (pt_mode == PT_MODE_HOST_GUEST)
+   vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+   else
+   vmx_init_ops.handle_intel_pt_intr = NULL;
 
setup_default_sgx_lepubkeyhash();
 
@@ -7920,7 +7940,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
-   .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest,
+   .handle_intel_pt_intr = NULL,
 
.runtime_ops = _x86_ops,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bafd2e78ad04..a4d25d0587e6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8410,20 +8410,6 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static unsigned int kvm_handle_intel_pt_intr(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   /* '0' on failure so that the !PT case can use a RET0 static call. */
-   if (!kvm_arch_pmi_in_guest(vcpu))
-   return 0;
-
-   kvm_make_request(KVM_REQ_PMI, vcpu);
-   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
-   (unsigned long *)&vcpu->arch.pmu.global_status);
-   return 1;
-}
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -6,11 +11102,7 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
-   /* Temporary ugliness. */
-   if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_register_perf_callbacks(kvm_handle_intel_pt_intr);
-   else
-   kvm_register_perf_callbacks(NULL);
+   kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
-- 
2.34.0.rc0.344.g81b53c2807-goog




[PATCH v4 16/17] KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c

2021-11-10 Thread Sean Christopherson
Call KVM's (un)register perf callbacks helpers directly from arm.c and
delete perf.c

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  3 ---
 arch/arm64/kvm/Makefile   |  2 +-
 arch/arm64/kvm/arm.c  |  5 +++--
 arch/arm64/kvm/perf.c | 22 --
 4 files changed, 4 insertions(+), 28 deletions(-)
 delete mode 100644 arch/arm64/kvm/perf.c

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 72e2afe6e8e3..824040b174ab 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -675,9 +675,6 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned 
int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-void kvm_perf_init(void);
-void kvm_perf_teardown(void);
-
 /*
  * Returns true if a Performance Monitoring Interrupt (PMI), a.k.a. perf event,
  * arrived in guest context.  For arm64, any event that arrives while a vCPU is
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 989bb5dad2c8..0bcc378b7961 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_KVM) += hyp/
 
 kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
-arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
+arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 inject_fault.o va_layout.o handle_exit.o \
 guest.o debug.o reset.o sys_regs.o \
 vgic-sys-reg-v3.o fpsimd.o pmu.o \
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 93c952375f3b..8d18a64a72f1 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1776,7 +1776,8 @@ static int init_subsystems(void)
if (err)
goto out;
 
-   kvm_perf_init();
+   kvm_register_perf_callbacks(NULL);
+
kvm_sys_reg_table_init();
 
 out:
@@ -2164,7 +2165,7 @@ int kvm_arch_init(void *opaque)
 /* NOP: Compiling as a module not supported */
 void kvm_arch_exit(void)
 {
-   kvm_perf_teardown();
+   kvm_unregister_perf_callbacks();
 }
 
 static int __init early_kvm_mode_cfg(char *arg)
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
deleted file mode 100644
index 52cfab253c65..
--- a/arch/arm64/kvm/perf.c
+++ /dev/null
@@ -1,22 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Based on the x86 implementation.
- *
- * Copyright (C) 2012 ARM Ltd.
- * Author: Marc Zyngier 
- */
-
-#include 
-#include 
-
-#include 
-
-void kvm_perf_init(void)
-{
-   kvm_register_perf_callbacks(NULL);
-}
-
-void kvm_perf_teardown(void)
-{
-   kvm_unregister_perf_callbacks();
-}
-- 
2.34.0.rc0.344.g81b53c2807-goog




[PATCH v4 17/17] perf: Drop guest callback (un)register stubs

2021-11-10 Thread Sean Christopherson
Drop perf's stubs for (un)registering guest callbacks now that KVM
registration of callbacks is hidden behind GUEST_PERF_EVENTS=y.  The only
other user is x86 XEN_PV, and x86 unconditionally selects PERF_EVENTS.

No functional change intended.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 include/linux/perf_event.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0ac7d867ca0c..7b7525e9155f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1511,11 +1511,6 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs 
*regs, u64 addr)  { }
 static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
-static inline void perf_register_guest_info_callbacks
-(struct perf_guest_info_callbacks *cbs)
{ }
-static inline void perf_unregister_guest_info_callbacks
-(struct perf_guest_info_callbacks *cbs)
{ }
-
 static inline void perf_event_mmap(struct vm_area_struct *vma) { }
 
 typedef int (perf_ksymbol_get_name_f)(char *name, int name_len, void *data);
-- 
2.34.0.rc0.344.g81b53c2807-goog




[PATCH v4 14/17] KVM: arm64: Convert to the generic perf callbacks

2021-11-10 Thread Sean Christopherson
Drop arm64's version of the callbacks in favor of the callbacks provided
by generic KVM, which are semantically identical.

Reviewed-by: Marc Zyngier 
Signed-off-by: Sean Christopherson 
---
 arch/arm64/kvm/perf.c | 34 ++
 1 file changed, 2 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index dfa9bce8559e..374c496a3f1d 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,42 +13,12 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
-static unsigned int kvm_guest_state(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-   unsigned int state;
-
-   if (!vcpu)
-   return 0;
-
-   state = PERF_GUEST_ACTIVE;
-   if (!vcpu_mode_priv(vcpu))
-   state |= PERF_GUEST_USER;
-
-   return state;
-}
-
-static unsigned long kvm_get_guest_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return *vcpu_pc(vcpu);
-}
-
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .state  = kvm_guest_state,
-   .get_ip = kvm_get_guest_ip,
-};
-
 void kvm_perf_init(void)
 {
-   perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_register_perf_callbacks(NULL);
 }
 
 void kvm_perf_teardown(void)
 {
-   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_unregister_perf_callbacks();
 }
-- 
2.34.0.rc0.344.g81b53c2807-goog




[PATCH v4 12/17] KVM: Move x86's perf guest info callbacks to generic KVM

2021-11-10 Thread Sean Christopherson
Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Implement the necessary arm64 arch hooks now to avoid having to provide
stubs or a temporary #define (from x86) to avoid arm64 compilation errors
when CONFIG_GUEST_PERF_EVENTS=y.
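
The common side boils down to a KVM-owned callbacks struct plus a register
helper that takes the optional PT interrupt handler; paraphrased sketch of the
virt/kvm/kvm_main.c addition, not the literal hunk:

        static struct perf_guest_info_callbacks kvm_guest_cbs = {
                .state                  = kvm_guest_state,
                .get_ip                 = kvm_guest_get_ip,
                .handle_intel_pt_intr   = NULL,
        };

        void kvm_register_perf_callbacks(unsigned int (*pt_intr_handler)(void))
        {
                kvm_guest_cbs.handle_intel_pt_intr = pt_intr_handler;
                perf_register_guest_info_callbacks(&kvm_guest_cbs);
        }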

Acked-by: Marc Zyngier 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h | 10 ++
 arch/arm64/kvm/arm.c  |  5 +++
 arch/x86/include/asm/kvm_host.h   |  3 ++
 arch/x86/kvm/x86.c| 53 +++
 include/linux/kvm_host.h  | 10 ++
 virt/kvm/kvm_main.c   | 44 +
 6 files changed, 83 insertions(+), 42 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 5a76d9a76fd9..72e2afe6e8e3 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -678,6 +678,16 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t 
fault_ipa);
 void kvm_perf_init(void);
 void kvm_perf_teardown(void);
 
+/*
+ * Returns true if a Performance Monitoring Interrupt (PMI), a.k.a. perf event,
+ * arrived in guest context.  For arm64, any event that arrives while a vCPU is
+ * loaded is considered to be "in guest".
+ */
+static inline bool kvm_arch_pmi_in_guest(struct kvm_vcpu *vcpu)
+{
+   return IS_ENABLED(CONFIG_GUEST_PERF_EVENTS) && !!vcpu;
+}
+
 long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
 gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
 void kvm_update_stolen_time(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index f5490afe1ebf..93c952375f3b 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -496,6 +496,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
return vcpu_mode_priv(vcpu);
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+   return *vcpu_pc(vcpu);
+}
+
 /* Just ensure a guest exit from a particular CPU */
 static void exit_vm_noop(void *info)
 {
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 812c08e797fe..ec16f645cb8c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1565,6 +1565,9 @@ static inline int kvm_arch_flush_remote_tlb(struct kvm 
*kvm)
return -ENOTSUPP;
 }
 
+#define kvm_arch_pmi_in_guest(vcpu) \
+   ((vcpu) && (vcpu)->arch.handling_intr_from_guest)
+
 int kvm_mmu_module_init(void);
 void kvm_mmu_module_exit(void);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e9e1a4bb1d00..bafd2e78ad04 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8410,43 +8410,12 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static inline bool kvm_pmi_in_guest(struct kvm_vcpu *vcpu)
-{
-   return vcpu && vcpu->arch.handling_intr_from_guest;
-}
-
-static unsigned int kvm_guest_state(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-   unsigned int state;
-
-   if (!kvm_pmi_in_guest(vcpu))
-   return 0;
-
-   state = PERF_GUEST_ACTIVE;
-   if (static_call(kvm_x86_get_cpl)(vcpu))
-   state |= PERF_GUEST_USER;
-
-   return state;
-}
-
-static unsigned long kvm_guest_get_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   /* Retrieving the IP must be guarded by a call to kvm_guest_state(). */
-   if (WARN_ON_ONCE(!kvm_pmi_in_guest(vcpu)))
-   return 0;
-
-   return kvm_rip_read(vcpu);
-}
-
 static unsigned int kvm_handle_intel_pt_intr(void)
 {
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
/* '0' on failure so that the !PT case can use a RET0 static call. */
-   if (!kvm_pmi_in_guest(vcpu))
+   if (!kvm_arch_pmi_in_guest(vcpu))
return 0;
 
kvm_make_request(KVM_REQ_PMI, vcpu);
@@ -8455,12 +8424,6 @@ static unsigned int kvm_handle_intel_pt_intr(void)
return 1;
 }
 
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .state  = kvm_guest_state,
-   .get_ip = kvm_guest_get_ip,
-   .handle_intel_pt_intr   = NULL,
-};
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11153,9 +6,11 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   /* Temporary ugliness. */
if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
-   perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_register_perf_callbacks(kvm_handle_intel_pt_intr);
+   

[PATCH v4 11/17] KVM: x86: More precisely identify NMI from guest when handling PMI

2021-11-10 Thread Sean Christopherson
Differentiate between IRQ and NMI for KVM's PMC overflow callback, which
was originally invoked in response to an NMI that arrived while the guest
was running, but was inadvertantly changed to fire on IRQs as well when
support for perf without PMU/NMI was added to KVM.  In practice, this
should be a nop as the PMC overflow callback shouldn't be reached, but
it's a cheap and easy fix that also better documents the situation.

Note, this also doesn't completely prevent false positives if perf
somehow ends up calling into KVM, e.g. an NMI can arrive in host after
KVM sets its flag.

Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/svm/svm.c |  2 +-
 arch/x86/kvm/vmx/vmx.c |  4 +++-
 arch/x86/kvm/x86.c |  2 +-
 arch/x86/kvm/x86.h | 13 ++---
 4 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b36ca4e476c2..df6a3e0bdcde 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3936,7 +3936,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu 
*vcpu)
}
 
if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
 
kvm_load_host_xsave_state(vcpu);
stgi();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0927d07b2efb..36098eb9a7f9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6371,7 +6371,9 @@ void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
 static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
unsigned long entry)
 {
-   kvm_before_interrupt(vcpu);
+   bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
+
+   kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : 
KVM_HANDLING_IRQ);
vmx_do_interrupt_nmi_irqoff(entry);
kvm_after_interrupt(vcpu);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c8ef49385c99..e9e1a4bb1d00 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9837,7 +9837,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 * interrupts on processors that implement an interrupt shadow, the
 * stat.exits increment will do nicely.
 */
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
local_irq_enable();
++vcpu->stat.exits;
local_irq_disable();
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d070043fd2e8..f8d2c58feadc 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -385,9 +385,16 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
 }
 
-static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
+enum kvm_intr_type {
+   /* Values are arbitrary, but must be non-zero. */
+   KVM_HANDLING_IRQ = 1,
+   KVM_HANDLING_NMI,
+};
+
+static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
+   enum kvm_intr_type intr)
 {
-   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 1);
+   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, (u8)intr);
 }
 
 static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
@@ -397,7 +404,7 @@ static inline void kvm_after_interrupt(struct kvm_vcpu 
*vcpu)
 
 static inline bool kvm_handling_nmi_from_guest(struct kvm_vcpu *vcpu)
 {
-   return !!vcpu->arch.handling_intr_from_guest;
+   return vcpu->arch.handling_intr_from_guest == KVM_HANDLING_NMI;
 }
 
 static inline bool kvm_pat_valid(u64 data)
-- 
2.34.0.rc0.344.g81b53c2807-goog




[PATCH v4 10/17] KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable

2021-11-10 Thread Sean Christopherson
Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest'
variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu.
kvm_before/after_interrupt() must be called while the vCPU is loaded,
(which protects against preemption), thus kvm_running_vcpu is guaranteed
to be non-NULL when handling_intr_from_guest is non-zero.

Switching to kvm_get_running_vcpu() will allow moving KVM's perf
callbacks to generic code, and the new flag will be used in a future
patch to more precisely identify the "NMI from guest" case.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  3 +--
 arch/x86/kvm/pmu.c  |  2 +-
 arch/x86/kvm/x86.c  | 21 -
 arch/x86/kvm/x86.h  | 10 ++
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 112ffb32..812c08e797fe 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -773,6 +773,7 @@ struct kvm_vcpu_arch {
unsigned nmi_pending; /* NMI queued after currently running handler */
bool nmi_injected;/* Trying to inject an NMI this entry */
bool smi_pending;/* SMI queued after currently running handler */
+   u8 handling_intr_from_guest;
 
struct kvm_mtrr mtrr_state;
u64 pat;
@@ -1893,8 +1894,6 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
 int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
 
-unsigned int kvm_guest_state(void);
-
 void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 u32 size);
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5b68d4188de0..eef48258e50f 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -87,7 +87,7 @@ static void kvm_perf_overflow_intr(struct perf_event 
*perf_event,
 * woken up. So we should wake it, but this is impossible from
 * NMI context. Do it from irq work instead.
 */
-   if (!kvm_guest_state())
+   if (!kvm_handling_nmi_from_guest(pmc->vcpu))
irq_work_queue(&pmc_to_pmu(pmc)->irq_work);
else
kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ceb09d78277e..c8ef49385c99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8410,15 +8410,17 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);
+static inline bool kvm_pmi_in_guest(struct kvm_vcpu *vcpu)
+{
+   return vcpu && vcpu->arch.handling_intr_from_guest;
+}
 
-unsigned int kvm_guest_state(void)
+static unsigned int kvm_guest_state(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
unsigned int state;
 
-   if (!vcpu)
+   if (!kvm_pmi_in_guest(vcpu))
return 0;
 
state = PERF_GUEST_ACTIVE;
@@ -8430,9 +8432,10 @@ unsigned int kvm_guest_state(void)
 
 static unsigned long kvm_guest_get_ip(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   if (WARN_ON_ONCE(!vcpu))
+   /* Retrieving the IP must be guarded by a call to kvm_guest_state(). */
+   if (WARN_ON_ONCE(!kvm_pmi_in_guest(vcpu)))
return 0;
 
return kvm_rip_read(vcpu);
@@ -8440,10 +8443,10 @@ static unsigned long kvm_guest_get_ip(void)
 
 static unsigned int kvm_handle_intel_pt_intr(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
/* '0' on failure so that the !PT case can use a RET0 static call. */
-   if (!vcpu)
+   if (!kvm_pmi_in_guest(vcpu))
return 0;
 
kvm_make_request(KVM_REQ_PMI, vcpu);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index ea264c4502e4..d070043fd2e8 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -385,18 +385,20 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
 }
 
-DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-
 static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
 {
-   __this_cpu_write(current_vcpu, vcpu);
+   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 1);
 }
 
 static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
 {
-   __this_cpu_write(current_vcpu, NULL);
+   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 0);
 }
 
+static inline bool kvm_handling_nmi_from_guest(struct k

[PATCH v4 09/17] perf/core: Use static_call to optimize perf_guest_info_callbacks

2021-11-10 Thread Sean Christopherson
Use static_call to optimize perf's guest callbacks on arm64 and x86,
which are now the only architectures that define the callbacks.  Use
DEFINE_STATIC_CALL_RET0 as the default/NULL for all guest callbacks, as
the callback semantics are that a return value '0' means "not in guest".

static_call obviously avoids the overhead of CONFIG_RETPOLINE=y, but is
also advantageous versus other solutions, e.g. per-cpu callbacks, in that
a per-cpu memory load is not needed to detect the !guest case.

Based on code from Peter and Like.

Suggested-by: Peter Zijlstra (Intel) 
Cc: Like Xu 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 include/linux/perf_event.h | 34 --
 kernel/events/core.c   | 15 +++
 2 files changed, 23 insertions(+), 26 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ea47ef616ee0..0ac7d867ca0c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1244,40 +1244,22 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
 
 #ifdef CONFIG_GUEST_PERF_EVENTS
 extern struct perf_guest_info_callbacks __rcu *perf_guest_cbs;
-static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
-{
-   /*
-* Callbacks are RCU-protected and must be READ_ONCE to avoid reloading
-* the callbacks between a !NULL check and dereferences, to ensure
-* pending stores/changes to the callback pointers are visible before a
-* non-NULL perf_guest_cbs is visible to readers, and to prevent a
-* module from unloading callbacks while readers are active.
-*/
-   return rcu_dereference(perf_guest_cbs);
-}
+
+DECLARE_STATIC_CALL(__perf_guest_state, *perf_guest_cbs->state);
+DECLARE_STATIC_CALL(__perf_guest_get_ip, *perf_guest_cbs->get_ip);
+DECLARE_STATIC_CALL(__perf_guest_handle_intel_pt_intr, 
*perf_guest_cbs->handle_intel_pt_intr);
+
 static inline unsigned int perf_guest_state(void)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   return guest_cbs ? guest_cbs->state() : 0;
+   return static_call(__perf_guest_state)();
 }
 static inline unsigned long perf_guest_get_ip(void)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   /*
-* Arbitrarily return '0' in the unlikely scenario that the callbacks
-* are unregistered between checking guest state and getting the IP.
-*/
-   return guest_cbs ? guest_cbs->get_ip() : 0;
+   return static_call(__perf_guest_get_ip)();
 }
 static inline unsigned int perf_guest_handle_intel_pt_intr(void)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->handle_intel_pt_intr)
-   return guest_cbs->handle_intel_pt_intr();
-   return 0;
+   return static_call(__perf_guest_handle_intel_pt_intr)();
 }
 extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
 extern void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 1c8d341ecc77..b4fd928e4ff8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6524,12 +6524,23 @@ static void perf_pending_event(struct irq_work *entry)
 #ifdef CONFIG_GUEST_PERF_EVENTS
 struct perf_guest_info_callbacks __rcu *perf_guest_cbs;
 
+DEFINE_STATIC_CALL_RET0(__perf_guest_state, *perf_guest_cbs->state);
+DEFINE_STATIC_CALL_RET0(__perf_guest_get_ip, *perf_guest_cbs->get_ip);
+DEFINE_STATIC_CALL_RET0(__perf_guest_handle_intel_pt_intr, 
*perf_guest_cbs->handle_intel_pt_intr);
+
 void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
if (WARN_ON_ONCE(rcu_access_pointer(perf_guest_cbs)))
return;
 
rcu_assign_pointer(perf_guest_cbs, cbs);
+   static_call_update(__perf_guest_state, cbs->state);
+   static_call_update(__perf_guest_get_ip, cbs->get_ip);
+
+   /* Implementing ->handle_intel_pt_intr is optional. */
+   if (cbs->handle_intel_pt_intr)
+   static_call_update(__perf_guest_handle_intel_pt_intr,
+  cbs->handle_intel_pt_intr);
 }
 EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
 
@@ -6539,6 +6550,10 @@ void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs)
return;
 
rcu_assign_pointer(perf_guest_cbs, NULL);
+   static_call_update(__perf_guest_state, (void *)&__static_call_return0);
+   static_call_update(__perf_guest_get_ip, (void *)&__static_call_return0);
+   static_call_update(__perf_guest_handle_intel_pt_intr,
+  (void *)&__static_call_return0);
synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
-- 
2.34.0.rc0.344.g81b53c2807-goog




[PATCH v4 08/17] perf: Force architectures to opt-in to guest callbacks

2021-11-10 Thread Sean Christopherson
Introduce GUEST_PERF_EVENTS and require architectures to select it to
allow registering and using guest callbacks in perf.  This will hopefully
make it more difficult for new architectures to add useless "support" for
guest callbacks, e.g. via copy+paste.

Stubbing out the helpers has the happy bonus of avoiding a load of
perf_guest_cbs when GUEST_PERF_EVENTS=n on arm64/x86.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/arm64/kvm/Kconfig | 1 +
 arch/x86/kvm/Kconfig   | 1 +
 arch/x86/xen/Kconfig   | 1 +
 include/linux/perf_event.h | 6 ++
 init/Kconfig   | 4 
 kernel/events/core.c   | 2 ++
 6 files changed, 15 insertions(+)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 8ffcbe29395e..e9761d84f982 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -39,6 +39,7 @@ menuconfig KVM
select HAVE_KVM_IRQ_BYPASS
select HAVE_KVM_VCPU_RUN_PID_CHANGE
select SCHED_INFO
+   select GUEST_PERF_EVENTS if PERF_EVENTS
help
  Support hosting virtualized guest machines.
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 619186138176..47bdbe705a76 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -36,6 +36,7 @@ config KVM
select KVM_MMIO
select SCHED_INFO
select PERF_EVENTS
+   select GUEST_PERF_EVENTS
select HAVE_KVM_MSI
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_NO_POLL
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index 6bcd3d8ca6ac..85246dd9faa1 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -23,6 +23,7 @@ config XEN_PV
select PARAVIRT_XXL
select XEN_HAVE_PVMMU
select XEN_HAVE_VPMU
+   select GUEST_PERF_EVENTS
help
  Support running as a Xen PV guest.
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 346d5aff5804..ea47ef616ee0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1242,6 +1242,7 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
 enum perf_bpf_event_type type,
 u16 flags);
 
+#ifdef CONFIG_GUEST_PERF_EVENTS
 extern struct perf_guest_info_callbacks __rcu *perf_guest_cbs;
 static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
 {
@@ -1280,6 +1281,11 @@ static inline unsigned int 
perf_guest_handle_intel_pt_intr(void)
 }
 extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
 extern void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
+#else
+static inline unsigned int perf_guest_state(void)   { return 0; }
+static inline unsigned long perf_guest_get_ip(void) { return 0; }
+static inline unsigned int perf_guest_handle_intel_pt_intr(void) { return 0; }
+#endif /* CONFIG_GUEST_PERF_EVENTS */
 
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
diff --git a/init/Kconfig b/init/Kconfig
index 21b1f4870c80..6bc5c56d669b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1799,6 +1799,10 @@ config HAVE_PERF_EVENTS
help
  See tools/perf/design.txt for details.
 
+config GUEST_PERF_EVENTS
+   bool
+   depends on HAVE_PERF_EVENTS
+
 config PERF_USE_VMALLOC
bool
help
diff --git a/kernel/events/core.c b/kernel/events/core.c
index eb6b9cfd0054..1c8d341ecc77 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6521,6 +6521,7 @@ static void perf_pending_event(struct irq_work *entry)
perf_swevent_put_recursion_context(rctx);
 }
 
+#ifdef CONFIG_GUEST_PERF_EVENTS
 struct perf_guest_info_callbacks __rcu *perf_guest_cbs;
 
 void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
@@ -6541,6 +6542,7 @@ void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs)
synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
+#endif
 
 static void
 perf_output_sample_regs(struct perf_output_handle *handle,
-- 
2.34.0.rc0.344.g81b53c2807-goog




[PATCH v4 07/17] perf: Add wrappers for invoking guest callbacks

2021-11-10 Thread Sean Christopherson
Add helpers for the guest callbacks to prepare for burying the callbacks
behind a Kconfig (it's a lot easier to provide a few stubs than to #ifdef
piles of code), and also to prepare for converting the callbacks to
static_call().  perf_instruction_pointer() in particular will have subtle
semantics with static_call(), as the "no callbacks" case will return 0 if
the callbacks are unregistered between querying guest state and getting
the IP.  Implement the change now to avoid a functional change when adding
static_call() support, and because the new helper needs to return
_something_ in this case.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/arm64/kernel/perf_callchain.c | 16 +---
 arch/x86/events/core.c | 15 +--
 arch/x86/events/intel/core.c   |  5 +
 include/linux/perf_event.h | 24 
 4 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/kernel/perf_callchain.c 
b/arch/arm64/kernel/perf_callchain.c
index 274dc3e11b6d..db04a55cee7e 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -102,9 +102,7 @@ compat_user_backtrace(struct compat_frame_tail __user *tail,
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->state()) {
+   if (perf_guest_state()) {
/* We don't support guest os callchain now */
return;
}
@@ -149,10 +147,9 @@ static bool callchain_trace(void *data, unsigned long pc)
 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
   struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe frame;
 
-   if (guest_cbs && guest_cbs->state()) {
+   if (perf_guest_state()) {
/* We don't support guest os callchain now */
return;
}
@@ -163,18 +160,15 @@ void perf_callchain_kernel(struct 
perf_callchain_entry_ctx *entry,
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->state())
-   return guest_cbs->get_ip();
+   if (perf_guest_state())
+   return perf_guest_get_ip();
 
return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-   unsigned int guest_state = guest_cbs ? guest_cbs->state() : 0;
+   unsigned int guest_state = perf_guest_state();
int misc = 0;
 
if (guest_state) {
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index e29312a1003a..620347398027 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2768,11 +2768,10 @@ static bool perf_hw_regs(struct pt_regs *regs)
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct unwind_state state;
unsigned long addr;
 
-   if (guest_cbs && guest_cbs->state()) {
+   if (perf_guest_state()) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2872,11 +2871,10 @@ perf_callchain_user32(struct pt_regs *regs, struct 
perf_callchain_entry_ctx *ent
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stack_frame frame;
const struct stack_frame __user *fp;
 
-   if (guest_cbs && guest_cbs->state()) {
+   if (perf_guest_state()) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2953,18 +2951,15 @@ static unsigned long code_segment_base(struct pt_regs 
*regs)
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->state())
-   return guest_cbs->get_ip();
+   if (perf_guest_state())
+   return perf_guest_get_ip();
 
return regs->ip + code_segment_base(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-   unsigned int guest_state = guest_cbs ? guest_cbs->state() : 0;
+   unsigned int guest_state = perf_guest_state();
int misc = 0;
 
if (guest_state) {
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 24adbd6282d4

[PATCH v4 05/17] perf: Drop dead and useless guest "support" from arm, csky, nds32 and riscv

2021-11-10 Thread Sean Christopherson
Drop "support" for guest callbacks from architectures that don't implement
the guest callbacks.  Future patches will convert the callbacks to
static_call; rather than churn a bunch of arch code (that was presumably
copy+pasted from x86), remove it wholesale as it's useless and at best
wasting cycles.

A future patch will also add a Kconfig to force architectures to opt into
the callbacks to make it more difficult for useless "support" to sneak in in
the future.

No functional change intended.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/arm/kernel/perf_callchain.c   | 33 -
 arch/csky/kernel/perf_callchain.c  | 12 ---
 arch/nds32/kernel/perf_event_cpu.c | 34 --
 arch/riscv/kernel/perf_callchain.c | 13 
 4 files changed, 8 insertions(+), 84 deletions(-)

diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 1626dfc6f6ce..bc6b246ab55e 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -62,14 +62,8 @@ user_backtrace(struct frame_tail __user *tail,
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct frame_tail __user *tail;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   /* We don't support guest os callchain now */
-   return;
-   }
-
perf_callchain_store(entry, regs->ARM_pc);
 
if (!current->mm)
@@ -99,44 +93,25 @@ callchain_trace(struct stackframe *fr,
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe fr;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   /* We don't support guest os callchain now */
-   return;
-   }
-
arm_get_current_stackframe(regs, &fr);
walk_stackframe(&fr, callchain_trace, entry);
 }
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->is_in_guest())
-   return guest_cbs->get_guest_ip();
-
return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
int misc = 0;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   if (guest_cbs->is_user_mode())
-   misc |= PERF_RECORD_MISC_GUEST_USER;
-   else
-   misc |= PERF_RECORD_MISC_GUEST_KERNEL;
-   } else {
-   if (user_mode(regs))
-   misc |= PERF_RECORD_MISC_USER;
-   else
-   misc |= PERF_RECORD_MISC_KERNEL;
-   }
+   if (user_mode(regs))
+   misc |= PERF_RECORD_MISC_USER;
+   else
+   misc |= PERF_RECORD_MISC_KERNEL;
 
return misc;
 }
diff --git a/arch/csky/kernel/perf_callchain.c 
b/arch/csky/kernel/perf_callchain.c
index 35318a635a5f..92057de08f4f 100644
--- a/arch/csky/kernel/perf_callchain.c
+++ b/arch/csky/kernel/perf_callchain.c
@@ -86,13 +86,8 @@ static unsigned long user_backtrace(struct 
perf_callchain_entry_ctx *entry,
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
unsigned long fp = 0;
 
-   /* C-SKY does not support virtualization. */
-   if (guest_cbs && guest_cbs->is_in_guest())
-   return;
-
fp = regs->regs[4];
perf_callchain_store(entry, regs->pc);
 
@@ -111,15 +106,8 @@ void perf_callchain_user(struct perf_callchain_entry_ctx 
*entry,
 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
   struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe fr;
 
-   /* C-SKY does not support virtualization. */
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   pr_warn("C-SKY does not support perf in guest mode!");
-   return;
-   }
-
fr.fp = regs->regs[4];
fr.lr = regs->lr;
walk_stackframe(, entry);
diff --git a/arch/nds32/kernel/perf_event_cpu.c 
b/arch/nds32/kernel/perf_event_cpu.c
index f38791960781..a78a879e7ef1 100644
--- a/arch/nds32/kernel/perf_event_cpu.c
+++ b/arch/nds32/kernel/perf_event_cpu.c
@@ -1363,7 +1363,6 @@ void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry,
struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *gu

[PATCH v4 06/17] perf/core: Rework guest callbacks to prepare for static_call support

2021-11-10 Thread Sean Christopherson
From: Like Xu 

To prepare for using static_calls to optimize perf's guest callbacks,
replace ->is_in_guest and ->is_user_mode with a new multiplexed hook
->state, tweak ->handle_intel_pt_intr to play nice with being called when
there is no active guest, and drop "guest" from ->get_guest_ip.

Return '0' from ->state and ->handle_intel_pt_intr to indicate "not in
guest" so that DEFINE_STATIC_CALL_RET0 can be used to define the static
calls, i.e. no callback == !guest.

Suggested-by: Peter Zijlstra (Intel) 
Originally-by: Peter Zijlstra (Intel) 
Signed-off-by: Like Xu 
Signed-off-by: Zhu Lingshan 
[sean: extracted from static_call patch, fixed get_ip() bug, wrote changelog]
Signed-off-by: Sean Christopherson 
Reviewed-by: Boris Ostrovsky 
Reviewed-by: Paolo Bonzini 
---
 arch/arm64/kernel/perf_callchain.c | 13 +-
 arch/arm64/kvm/perf.c  | 35 +++---
 arch/x86/events/core.c | 13 +-
 arch/x86/events/intel/core.c   |  5 +---
 arch/x86/include/asm/kvm_host.h|  2 +-
 arch/x86/kvm/pmu.c |  2 +-
 arch/x86/kvm/x86.c | 40 --
 arch/x86/xen/pmu.c | 32 ++--
 include/linux/perf_event.h | 10 +---
 9 files changed, 73 insertions(+), 79 deletions(-)

diff --git a/arch/arm64/kernel/perf_callchain.c 
b/arch/arm64/kernel/perf_callchain.c
index 86d9f2013172..274dc3e11b6d 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -104,7 +104,7 @@ void perf_callchain_user(struct perf_callchain_entry_ctx 
*entry,
 {
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -152,7 +152,7 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry,
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe frame;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -165,8 +165,8 @@ unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
 
-   if (guest_cbs && guest_cbs->is_in_guest())
-   return guest_cbs->get_guest_ip();
+   if (guest_cbs && guest_cbs->state())
+   return guest_cbs->get_ip();
 
return instruction_pointer(regs);
 }
@@ -174,10 +174,11 @@ unsigned long perf_instruction_pointer(struct pt_regs 
*regs)
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
+   unsigned int guest_state = guest_cbs ? guest_cbs->state() : 0;
int misc = 0;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   if (guest_cbs->is_user_mode())
+   if (guest_state) {
+   if (guest_state & PERF_GUEST_USER)
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index a0d660cf889e..dfa9bce8559e 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,39 +13,34 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
-static int kvm_is_in_guest(void)
+static unsigned int kvm_guest_state(void)
 {
-return kvm_get_running_vcpu() != NULL;
-}
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+   unsigned int state;
 
-static int kvm_is_user_mode(void)
-{
-   struct kvm_vcpu *vcpu;
-
-   vcpu = kvm_get_running_vcpu();
+   if (!vcpu)
+   return 0;
 
-   if (vcpu)
-   return !vcpu_mode_priv(vcpu);
+   state = PERF_GUEST_ACTIVE;
+   if (!vcpu_mode_priv(vcpu))
+   state |= PERF_GUEST_USER;
 
-   return 0;
+   return state;
 }
 
 static unsigned long kvm_get_guest_ip(void)
 {
-   struct kvm_vcpu *vcpu;
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   vcpu = kvm_get_running_vcpu();
+   if (WARN_ON_ONCE(!vcpu))
+   return 0;
 
-   if (vcpu)
-   return *vcpu_pc(vcpu);
-
-   return 0;
+   return *vcpu_pc(vcpu);
 }
 
 static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .is_in_guest= kvm_is_in_guest,
-   .is_user_mode   = kvm_is_user_mode,
-   .get_guest_ip   = kvm_get_guest_ip,
+   .state  = kvm_guest_state,
+   .get_ip = kvm_get_guest_ip,
 };
 
 void kvm_perf_init(void)
diff --git a/arch/x86/events/core.c b/arch/

[PATCH v4 04/17] perf: Stop pretending that perf can handle multiple guest callbacks

2021-11-10 Thread Sean Christopherson
Drop the 'int' return value from the perf (un)register callbacks helpers
and stop pretending perf can support multiple callbacks.  The 'int'
returns are not future proofing anything as none of the callers take
action on an error.  It's also not obvious that there will ever be
co-tenant hypervisors, and if there are, that allowing multiple callbacks
to be registered is desirable or even correct.

Opportunistically rename callbacks=>cbs in the affected declarations to
match their definitions.

No functional change intended.

Reviewed-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  4 ++--
 arch/arm64/kvm/perf.c |  8 
 include/linux/perf_event.h| 12 ++--
 kernel/events/core.c  | 15 ---
 4 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 4be8486042a7..5a76d9a76fd9 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -675,8 +675,8 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned 
int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-int kvm_perf_init(void);
-int kvm_perf_teardown(void);
+void kvm_perf_init(void);
+void kvm_perf_teardown(void);
 
 long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
 gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index c84fe24b2ea1..a0d660cf889e 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -48,12 +48,12 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.get_guest_ip   = kvm_get_guest_ip,
 };
 
-int kvm_perf_init(void)
+void kvm_perf_init(void)
 {
-   return perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   perf_register_guest_info_callbacks(&kvm_guest_cbs);
 }
 
-int kvm_perf_teardown(void)
+void kvm_perf_teardown(void)
 {
-   return perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 }
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 318c489b735b..98c204488496 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1252,8 +1252,8 @@ static inline struct perf_guest_info_callbacks 
*perf_get_guest_cbs(void)
 */
return rcu_dereference(perf_guest_cbs);
 }
-extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks 
*callbacks);
-extern int perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *callbacks);
+extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
+extern void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
 
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
@@ -1497,10 +1497,10 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs 
*regs, u64 addr) { }
 static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
-static inline int perf_register_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { 
return 0; }
-static inline int perf_unregister_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { 
return 0; }
+static inline void perf_register_guest_info_callbacks
+(struct perf_guest_info_callbacks *cbs)
{ }
+static inline void perf_unregister_guest_info_callbacks
+(struct perf_guest_info_callbacks *cbs)
{ }
 
 static inline void perf_event_mmap(struct vm_area_struct *vma) { }
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0cc775f702f8..eb6b9cfd0054 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6521,31 +6521,24 @@ static void perf_pending_event(struct irq_work *entry)
perf_swevent_put_recursion_context(rctx);
 }
 
-/*
- * We assume there is only KVM supporting the callbacks.
- * Later on, we might change it to a list if there is
- * another virtualization implementation supporting the callbacks.
- */
 struct perf_guest_info_callbacks __rcu *perf_guest_cbs;
 
-int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
if (WARN_ON_ONCE(rcu_access_pointer(perf_guest_cbs)))
-   return -EBUSY;
+   return;
 
rcu_assign_pointer(perf_guest_cbs, cbs);
-   return 0;
 }
 EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
 
-int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+void perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks 
*cbs)
 {
if (WARN_ON_ONCE(rcu_access_pointer(perf_guest_cbs) != 

[PATCH v4 01/17] perf: Protect perf_guest_cbs with RCU

2021-11-10 Thread Sean Christopherson
Protect perf_guest_cbs with RCU to fix multiple possible errors.  Luckily,
all paths that read perf_guest_cbs already require RCU protection, e.g. to
protect the callback chains, so only the direct perf_guest_cbs touchpoints
need to be modified.

Bug #1 is a simple lack of WRITE_ONCE/READ_ONCE behavior to ensure
perf_guest_cbs isn't reloaded between a !NULL check and a dereference.
Fixed via the READ_ONCE() in rcu_dereference().
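
For illustration, the pre-patch pattern that is vulnerable to the reload
(minimal snippet, not taken verbatim from the diff):

        if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
                /* perf_guest_cbs may be reloaded here and observed as NULL. */
                return perf_guest_cbs->get_guest_ip();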

Bug #2 is that on weakly-ordered architectures, updates to the callbacks
themselves are not guaranteed to be visible before the pointer is made
visible to readers.  Fixed by the smp_store_release() in
rcu_assign_pointer() when the new pointer is non-NULL.

Bug #3 is that, because the callbacks are global, it's possible for
readers to run in parallel with an unregister, and thus a module
implementing the callbacks can be unloaded while readers are in flight,
resulting in a use-after-free.  Fixed by a synchronize_rcu() call when
unregistering callbacks.
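
I.e., in sketch form, the racy reader pattern versus the fixed pattern
(perf_get_guest_cbs() being the new helper that hides the rcu_dereference();
the perf paths that get here already run with preemption disabled, which is
what pairs with the synchronize_rcu() on unregistration):

	/*
	 * Bug #1: the compiler is free to reload the global between the
	 * !NULL check and the dereference:
	 *
	 *	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
	 *		return perf_guest_cbs->get_guest_ip();  (may now be NULL)
	 *
	 * Fixed: snapshot the pointer exactly once, use only the snapshot.
	 */
	struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();

	if (guest_cbs && guest_cbs->is_in_guest())
		return guest_cbs->get_guest_ip();

	return instruction_pointer(regs);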

Bug #1 escaped notice because it's extremely unlikely a compiler will
reload perf_guest_cbs in this sequence.  perf_guest_cbs does get reloaded
for future derefs, e.g. for ->is_user_mode(), but the ->is_in_guest()
guard all but guarantees the consumer will win the race, e.g. to nullify
perf_guest_cbs, KVM has to completely exit the guest and tear down
all VMs before KVM starts its module unload / unregister sequence.  This
also makes it all but impossible to encounter bug #3.

Bug #2 has not been a problem because all architectures that register
callbacks are strongly ordered and/or have a static set of callbacks.

But with help, unloading kvm_intel can trigger bug #1 e.g. wrapping
perf_guest_cbs with READ_ONCE in perf_misc_flags() while spamming
kvm_intel module load/unload leads to:

  BUG: kernel NULL pointer dereference, address: 
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 0 P4D 0
  Oops:  [#1] PREEMPT SMP
  CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:perf_misc_flags+0x1c/0x70
  Call Trace:
   perf_prepare_sample+0x53/0x6b0
   perf_event_output_forward+0x67/0x160
   __perf_event_overflow+0x52/0xf0
   handle_pmi_common+0x207/0x300
   intel_pmu_handle_irq+0xcf/0x410
   perf_event_nmi_handler+0x28/0x50
   nmi_handle+0xc7/0x260
   default_do_nmi+0x6b/0x170
   exc_nmi+0x103/0x130
   asm_exc_nmi+0x76/0xbf

Fixes: 39447b386c84 ("perf: Enhance perf to allow for guest statistic 
collection from host")
Cc: sta...@vger.kernel.org
Signed-off-by: Sean Christopherson 
---
 arch/arm/kernel/perf_callchain.c   | 17 +++--
 arch/arm64/kernel/perf_callchain.c | 18 --
 arch/csky/kernel/perf_callchain.c  |  6 --
 arch/nds32/kernel/perf_event_cpu.c | 17 +++--
 arch/riscv/kernel/perf_callchain.c |  7 +--
 arch/x86/events/core.c | 17 +++--
 arch/x86/events/intel/core.c   |  9 ++---
 include/linux/perf_event.h | 13 -
 kernel/events/core.c   | 13 ++---
 9 files changed, 82 insertions(+), 35 deletions(-)

diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 3b69a76d341e..1626dfc6f6ce 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -62,9 +62,10 @@ user_backtrace(struct frame_tail __user *tail,
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct frame_tail __user *tail;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* We don't support guest os callchain now */
return;
}
@@ -98,9 +99,10 @@ callchain_trace(struct stackframe *fr,
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe fr;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* We don't support guest os callchain now */
return;
}
@@ -111,18 +113,21 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-   return perf_guest_cbs->get_guest_ip();
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
+
+   if (guest_cbs && guest_cbs->is_in_guest())
+   return guest_cbs->get_guest_ip();
 
return instruction_pointer(regs);
 }
 
 unsigned long perf

[PATCH v4 00/17] perf: KVM: Fix, optimize, and clean up callbacks

2021-11-10 Thread Sean Christopherson
This is a combination of ~2 series to fix bugs in the perf+KVM callbacks,
optimize the callbacks by employing static_call, and do a variety of
cleanup in both perf and KVM.

For the non-perf patches, I think everything except patch 13 (Paolo) and
patches 15 and 16 (Marc) has the appropriate acks.

Patch 1 fixes a set of mostly-theoretical bugs by protecting the guest
callbacks pointer with RCU.

Patches 2 and 3 fix an Intel PT handling bug where KVM incorrectly
eats PT interrupts when PT is supposed to be owned entirely by the host.

Patches 4-9 clean up perf's callback infrastructure and switch to
static_call for arm64 and x86 (the only survivors).

Patches 10-17 clean up related KVM code and unify the arm64/x86 callbacks.

Based on Linus' tree, commit cb690f5238d7 ("Merge tag 'for-5.16/drivers...).

v4:
  - Rebase.
  - Collect acks and reviews.
  - Fully protect perf_guest_cbs with RCU. [Paolo].
  - Add patch to hide arm64's kvm_arm_pmu_available behind
CONFIG_HW_PERF_EVENTS=y.

v3:
  - https://lore.kernel.org/all/20210922000533.713300-1-sea...@google.com/
  - Add wrappers for guest callbacks so that stubs can be provided when
GUEST_PERF_EVENTS=n.
  - s/HAVE_GUEST_PERF_EVENTS/GUEST_PERF_EVENTS and select it from KVM
and XEN_PV instead of from top-level arm64/x86. [Paolo]
  - Drop an unnecessary synchronize_rcu() when registering callbacks. [Peter]
  - Retain a WARN_ON_ONCE() when unregistering callbacks if the caller
didn't provide the correct pointer. [Peter]
  - Rework the static_call patch to move it all to common perf.
  - Add a patch to drop the (un)register stubs, made possible after
having KVM+XEN_PV select GUEST_PERF_EVENTS.
  - Split dropping guest callback "support" for arm, csky, etc... to a
separate patch, to make introducing GUEST_PERF_EVENTS cleaner.
  
v2 (relative to static_call v10):
  - Split the patch into the semantic change (multiplexed ->state) and
introduction of static_call.
  - Don't use '0' for "not a guest RIP".
  - Handle unregister path.
  - Drop changes for architectures that can be culled entirely.

v2 (relative to v1):
  - https://lkml.kernel.org/r/20210828003558.713983-6-sea...@google.com
  - Drop per-cpu approach. [Peter]
  - Fix mostly-theoretical reload and use-after-free with READ_ONCE(),
WRITE_ONCE(), and synchronize_rcu(). [Peter]
  - Avoid new exports like the plague. [Peter]

v1:
  - https://lkml.kernel.org/r/20210827005718.585190-1-sea...@google.com

v10 static_call:
  - https://lkml.kernel.org/r/20210806133802.3528-2-lingshan@intel.com

Like Xu (1):
  perf/core: Rework guest callbacks to prepare for static_call support

Sean Christopherson (16):
  perf: Protect perf_guest_cbs with RCU
  KVM: x86: Register perf callbacks after calling vendor's
hardware_setup()
  KVM: x86: Register Processor Trace interrupt hook iff PT enabled in
guest
  perf: Stop pretending that perf can handle multiple guest callbacks
  perf: Drop dead and useless guest "support" from arm, csky, nds32 and
riscv
  perf: Add wrappers for invoking guest callbacks
  perf: Force architectures to opt-in to guest callbacks
  perf/core: Use static_call to optimize perf_guest_info_callbacks
  KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu
variable
  KVM: x86: More precisely identify NMI from guest when handling PMI
  KVM: Move x86's perf guest info callbacks to generic KVM
  KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
  KVM: arm64: Convert to the generic perf callbacks
  KVM: arm64: Hide kvm_arm_pmu_available behind CONFIG_HW_PERF_EVENTS=y
  KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c
  perf: Drop guest callback (un)register stubs

 arch/arm/kernel/perf_callchain.c   | 28 ++
 arch/arm64/include/asm/kvm_host.h  | 11 +-
 arch/arm64/kernel/image-vars.h |  2 +
 arch/arm64/kernel/perf_callchain.c | 13 ---
 arch/arm64/kvm/Kconfig |  1 +
 arch/arm64/kvm/Makefile|  2 +-
 arch/arm64/kvm/arm.c   | 10 -
 arch/arm64/kvm/perf.c  | 59 --
 arch/arm64/kvm/pmu-emul.c  |  2 +
 arch/csky/kernel/perf_callchain.c  | 10 -
 arch/nds32/kernel/perf_event_cpu.c | 29 ++-
 arch/riscv/kernel/perf_callchain.c | 10 -
 arch/x86/events/core.c | 13 ---
 arch/x86/events/intel/core.c   |  5 +--
 arch/x86/include/asm/kvm_host.h|  7 +++-
 arch/x86/kvm/Kconfig   |  1 +
 arch/x86/kvm/pmu.c |  2 +-
 arch/x86/kvm/svm/svm.c |  2 +-
 arch/x86/kvm/vmx/vmx.c | 25 -
 arch/x86/kvm/x86.c | 58 +
 arch/x86/kvm/x86.h | 17 +++--
 arch/x86/xen/Kconfig   |  1 +
 arch/x86/xen/pmu.c | 32 +++-
 include/kvm/arm_pmu.h  | 19 ++
 include/linux/kvm_host.h   | 10 +
 i

[PATCH v4 03/17] KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest

2021-11-10 Thread Sean Christopherson
Override the Processor Trace (PT) interrupt handler for guest mode if and
only if PT is configured for host+guest mode, i.e. is being used
independently by both host and guest.  If PT is configured for system
mode, the host fully controls PT and must handle all events.

Fixes: 8479e04e7d6b ("KVM: x86: Inject PMI for KVM guest")
Cc: sta...@vger.kernel.org
Cc: Like Xu 
Reported-by: Alexander Shishkin 
Reported-by: Artem Kashkanov 
Acked-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx/vmx.c  | 1 +
 arch/x86/kvm/x86.c  | 5 -
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2acf37cc1991..bf0a9ce53750 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1514,6 +1514,7 @@ struct kvm_x86_init_ops {
int (*disabled_by_bios)(void);
int (*check_processor_compatibility)(void);
int (*hardware_setup)(void);
+   bool (*intel_pt_intr_in_guest)(void);
 
struct kvm_x86_ops *runtime_ops;
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 76861b66bbcf..0927d07b2efb 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7918,6 +7918,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
+   .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest,
 
.runtime_ops = &vmx_x86_ops,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 021b2c1ac9f0..021d3f5364b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8451,7 +8451,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.is_in_guest= kvm_is_in_guest,
.is_user_mode   = kvm_is_user_mode,
.get_guest_ip   = kvm_get_guest_ip,
-   .handle_intel_pt_intr   = kvm_handle_intel_pt_intr,
+   .handle_intel_pt_intr   = NULL,
 };
 
 #ifdef CONFIG_X86_64
@@ -11146,6 +11146,8 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
+   kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
perf_register_guest_info_callbacks(&kvm_guest_cbs);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
@@ -11176,6 +11178,7 @@ int kvm_arch_hardware_setup(void *opaque)
 void kvm_arch_hardware_unsetup(void)
 {
perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_guest_cbs.handle_intel_pt_intr = NULL;
 
static_call(kvm_x86_hardware_unsetup)();
 }
-- 
2.34.0.rc0.344.g81b53c2807-goog




[PATCH v4 02/17] KVM: x86: Register perf callbacks after calling vendor's hardware_setup()

2021-11-10 Thread Sean Christopherson
Wait to register perf callbacks until after doing vendor hardware setup.
VMX's hardware_setup() configures Intel Processor Trace (PT) mode, and a
future fix to register the Intel PT guest interrupt hook if and only if
Intel PT is exposed to the guest will consume the configured PT mode.

Delaying registration to hardware setup is effectively a nop as KVM's perf
hooks all pivot on the per-CPU current_vcpu, which is non-NULL only when
KVM is handling an IRQ/NMI in a VM-Exit path.  I.e. current_vcpu will be
NULL throughout both kvm_arch_init() and kvm_arch_hardware_setup().

Cc: Alexander Shishkin 
Cc: Artem Kashkanov 
Cc: sta...@vger.kernel.org
Acked-by: Paolo Bonzini 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c1c4e2b05a63..021b2c1ac9f0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8567,8 +8567,6 @@ int kvm_arch_init(void *opaque)
 
kvm_timer_init();
 
-   perf_register_guest_info_callbacks(&kvm_guest_cbs);
-
if (boot_cpu_has(X86_FEATURE_XSAVE)) {
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
@@ -8600,7 +8598,6 @@ void kvm_arch_exit(void)
clear_hv_tscchange_cb();
 #endif
kvm_lapic_exit();
-   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
@@ -11149,6 +11146,8 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
 
@@ -11176,6 +11175,8 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
+   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
static_call(kvm_x86_hardware_unsetup)();
 }
 
-- 
2.34.0.rc0.344.g81b53c2807-goog




Re: [PATCH v3 01/16] perf: Ensure perf_guest_cbs aren't reloaded between !NULL check and deref

2021-11-10 Thread Sean Christopherson
On Wed, Nov 10, 2021, Paolo Bonzini wrote:
> On 11/4/21 15:18, Sean Christopherson wrote:
> > If I'm interpreting Paolo's suggestion
> > correctly, he's pointing out that outstanding stores to the function
> > pointers in
> > @cbs need to complete before assigning a non-NULL pointer to perf_guest_cbs,
> > otherwise a perf event handler may see a valid pointer with half-baked 
> > callbacks.
> > 
> > I think smp_store_release() with a comment would be appropriate, assuming my
> > above interpretation is correct.
> > 
> 
> Yes, exactly.  It should even be rcu_assign_pointer(), matching the
> synchronize_rcu()

And perf_guest_cbs should be tagged __rcu and accessed accordingly.  Which is
effectively what this version (poorly) implemented with a homebrewed mix of
{READ,WRITE}_ONCE, lockdep(), and synchronize_rcu().

> in patch 1 (and the change can be done in patch 1, too).

Ya, the change needs to go into patch 1.



Re: [PATCH v3 08/16] perf: Force architectures to opt-in to guest callbacks

2021-11-09 Thread Sean Christopherson
On Wed, Sep 22, 2021, Sean Christopherson wrote:
> On Wed, Sep 22, 2021, Paolo Bonzini wrote:
> > On 22/09/21 02:05, Sean Christopherson wrote:
> > > @@ -1273,6 +1274,11 @@ static inline unsigned int 
> > > perf_guest_handle_intel_pt_intr(void)
> > >   }
> > >   extern void perf_register_guest_info_callbacks(struct 
> > > perf_guest_info_callbacks *cbs);
> > >   extern void perf_unregister_guest_info_callbacks(struct 
> > > perf_guest_info_callbacks *cbs);
> > > +#else
> > > +static inline unsigned int perf_guest_state(void) { 
> > > return 0; }
> > > +static inline unsigned long perf_guest_get_ip(void)   { 
> > > return 0; }
> > > +static inline unsigned int perf_guest_handle_intel_pt_intr(void) { 
> > > return 0; }
> > > +#endif /* CONFIG_GUEST_PERF_EVENTS */
> > 
> > Reviewed-by: Paolo Bonzini 
> > 
> > Having perf_guest_handle_intel_pt_intr in generic code is a bit off.  Of
> > course it has to be in the struct, but the wrapper might be placed in
> > arch/x86/include/asm/perf_event.h as well (applies to patch 7 as well).
> 
> Yeah, I went with this option purely to keep everything bundled together.  I 
> have
> no strong opinion.

Scratch that, I do have an opinion.  perf_guest_handle_intel_pt_intr() is in
common code because the callbacks themselves and perf_get_guest_cbs() are 
defined
in linux/perf_event.h, _after_ asm/perf_event.h is included.

arch/x86/include/asm/perf_event.h is quite bereft of includes, so there's no
obvious landing spot for those two things, and adding a new header seems like
overkill.



Re: [PATCH v3 15/16] KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c / pmu.c

2021-11-09 Thread Sean Christopherson
On Mon, Oct 11, 2021, Marc Zyngier wrote:
> On Wed, 22 Sep 2021 01:05:32 +0100,
> Sean Christopherson  wrote:
> > diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
> > index 864b9997efb2..42270676498d 100644
> > --- a/include/kvm/arm_pmu.h
> > +++ b/include/kvm/arm_pmu.h
> > @@ -14,6 +14,7 @@
> >  #define ARMV8_PMU_MAX_COUNTER_PAIRS((ARMV8_PMU_MAX_COUNTERS + 1) 
> > >> 1)
> >  
> >  DECLARE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
> > +void kvm_pmu_init(void);
> >  
> >  static __always_inline bool kvm_arm_support_pmu_v3(void)
> >  {
> 
> Note that this patch is now conflicting with e840f42a4992 ("KVM:
> arm64: Fix PMU probe ordering"), which was merged in -rc4. Moving the
> static key definition to arch/arm64/kvm/pmu-emul.c and getting rid of
> kvm_pmu_init() altogether should be enough to resolve it.

Defining kvm_arm_pmu_available in pmu-emul.c doesn't work as-is because 
pmu-emul.c
depends on CONFIG_HW_PERF_EVENTS=y.  Since pmu-emul.c is the only path that 
enables
the key, my plan is to add a prep patch to bury kvm_arm_pmu_available behind the
existing #ifdef CONFIG_HW_PERF_EVENTS in arm_pmu.h and add a stub
for kvm_arm_support_pmu_v3().  The only ugly part is that the KVM_NVHE_ALIAS() 
also
gains an #ifdef, but that doesn't seem too bad.
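
Something like the below (untested), assuming kvm_arm_support_pmu_v3() keeps
its current static_branch_likely() body:

/* include/kvm/arm_pmu.h */
#ifdef CONFIG_HW_PERF_EVENTS

DECLARE_STATIC_KEY_FALSE(kvm_arm_pmu_available);

static __always_inline bool kvm_arm_support_pmu_v3(void)
{
	return static_branch_likely(&kvm_arm_pmu_available);
}

#else

static inline bool kvm_arm_support_pmu_v3(void)
{
	return false;
}

#endif /* CONFIG_HW_PERF_EVENTS */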



Re: [PATCH v3 01/16] perf: Ensure perf_guest_cbs aren't reloaded between !NULL check and deref

2021-11-04 Thread Sean Christopherson
On Thu, Nov 04, 2021, Like Xu wrote:
> On 22/9/2021 8:05 am, Sean Christopherson wrote:
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 464917096e73..80ff050a7b55 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -6491,14 +6491,21 @@ struct perf_guest_info_callbacks *perf_guest_cbs;
> >   int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks 
> > *cbs)
> >   {
> > -   perf_guest_cbs = cbs;
> > +   if (WARN_ON_ONCE(perf_guest_cbs))
> > +   return -EBUSY;
> > +
> > +   WRITE_ONCE(perf_guest_cbs, cbs);
> 
> So per Paolo's comment [1], does it help to use
>   smp_store_release(perf_guest_cbs, cbs)
> or
>   rcu_assign_pointer(perf_guest_cbs, cbs)
> here?

Heh, if by "help" you mean "required to prevent bad things on weakly ordered
architectures", then yes, it helps :-)  If I'm interpeting Paolo's suggestion
correctly, he's pointing out that oustanding stores to the function pointers in
@cbs need to complete before assigning a non-NULL pointer to perf_guest_cbs,
otherwise a perf event handler may see a valid pointer with half-baked 
callbacks.

I think smp_store_release() with a comment would be appropriate, assuming my
above interpretation is correct.
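
E.g. something along these lines (untested, and with perf_guest_cbs gaining
the __rcu annotation):

int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
	if (WARN_ON_ONCE(rcu_access_pointer(perf_guest_cbs)))
		return -EBUSY;

	/*
	 * Ensure all stores to @cbs' function pointers are visible before
	 * publishing the pointer; rcu_assign_pointer() provides the release
	 * ordering and pairs with the readers' rcu_dereference().
	 */
	rcu_assign_pointer(perf_guest_cbs, cbs);
	return 0;
}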

> [1] 
> https://lore.kernel.org/kvm/37afc465-c12f-01b9-f3b6-c2573e112...@redhat.com/
> 
> > return 0;
> >   }
> >   EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
> >   int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks 
> > *cbs)
> >   {
> > -   perf_guest_cbs = NULL;
> > +   if (WARN_ON_ONCE(perf_guest_cbs != cbs))
> > +   return -EINVAL;
> > +
> > +   WRITE_ONCE(perf_guest_cbs, NULL);
> > +   synchronize_rcu();
> > return 0;
> >   }
> >   EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
> > 



Re: [PATCH v3 12/16] KVM: Move x86's perf guest info callbacks to generic KVM

2021-10-11 Thread Sean Christopherson
On Mon, Oct 11, 2021, Marc Zyngier wrote:
> On Wed, 22 Sep 2021 01:05:29 +0100, Sean Christopherson  
> wrote:
> > diff --git a/arch/arm64/include/asm/kvm_host.h 
> > b/arch/arm64/include/asm/kvm_host.h
> > index ed940aec89e0..828b6eaa2c56 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -673,6 +673,14 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t 
> > fault_ipa);
> >  void kvm_perf_init(void);
> >  void kvm_perf_teardown(void);
> >  
> > +#ifdef CONFIG_GUEST_PERF_EVENTS
> > +static inline bool kvm_arch_pmi_in_guest(struct kvm_vcpu *vcpu)
> 
> Pardon my x86 ignorance, what is PMI? PMU Interrupt?

Ya, Performance Monitoring Interrupt.  I didn't realize the term wasn't common
perf terminology.  Maybe kvm_arch_perf_events_in_guest() to be less x86-centric?

> > +{
> > +   /* Any callback while a vCPU is loaded is considered to be in guest. */
> > +   return !!vcpu;
> > +}
> > +#endif
> 
> Do you really need this #ifdef?

Nope, should compile fine without it, though simply dropping the #ifdef would
make the semantics of the function wrong, even if nothing consumes it.  Tweak it
to use IS_ENABLED()?

return IS_ENABLED(CONFIG_GUEST_PERF_EVENTS) && !!vcpu;



Re: [PATCH v3 08/16] perf: Force architectures to opt-in to guest callbacks

2021-09-22 Thread Sean Christopherson
On Wed, Sep 22, 2021, Paolo Bonzini wrote:
> On 22/09/21 02:05, Sean Christopherson wrote:
> > @@ -1273,6 +1274,11 @@ static inline unsigned int 
> > perf_guest_handle_intel_pt_intr(void)
> >   }
> >   extern void perf_register_guest_info_callbacks(struct 
> > perf_guest_info_callbacks *cbs);
> >   extern void perf_unregister_guest_info_callbacks(struct 
> > perf_guest_info_callbacks *cbs);
> > +#else
> > +static inline unsigned int perf_guest_state(void)   { return 0; }
> > +static inline unsigned long perf_guest_get_ip(void) { 
> > return 0; }
> > +static inline unsigned int perf_guest_handle_intel_pt_intr(void) { return 
> > 0; }
> > +#endif /* CONFIG_GUEST_PERF_EVENTS */
> 
> Reviewed-by: Paolo Bonzini 
> 
> Having perf_guest_handle_intel_pt_intr in generic code is a bit off.  Of
> course it has to be in the struct, but the wrapper might be placed in
> arch/x86/include/asm/perf_event.h as well (applies to patch 7 as well).

Yeah, I went with this option purely to keep everything bundled together.  I 
have
no strong opinion.



[PATCH v3 16/16] perf: Drop guest callback (un)register stubs

2021-09-21 Thread Sean Christopherson
Drop perf's stubs for (un)registering guest callbacks now that KVM
registration of callbacks is hidden behind GUEST_PERF_EVENTS=y.  The only
other user is x86 XEN_PV, and x86 unconditionally selects PERF_EVENTS.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 include/linux/perf_event.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d582dfeb4e20..20327d1046bb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1505,11 +1505,6 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs 
*regs, u64 addr)  { }
 static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
-static inline void perf_register_guest_info_callbacks
-(struct perf_guest_info_callbacks *cbs)
{ }
-static inline void perf_unregister_guest_info_callbacks
-(struct perf_guest_info_callbacks *cbs)
{ }
-
 static inline void perf_event_mmap(struct vm_area_struct *vma) { }
 
 typedef int (perf_ksymbol_get_name_f)(char *name, int name_len, void *data);
-- 
2.33.0.464.g1972c5931b-goog




[PATCH v3 14/16] KVM: arm64: Convert to the generic perf callbacks

2021-09-21 Thread Sean Christopherson
Drop arm64's version of the callbacks in favor of the callbacks provided
by generic KVM, which are semantically identical.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/kvm/perf.c | 34 ++
 1 file changed, 2 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 3e99ac4ab2d6..0b902e0d5b5d 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,45 +13,15 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
-static unsigned int kvm_guest_state(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-   unsigned int state;
-
-   if (!vcpu)
-   return 0;
-
-   state = PERF_GUEST_ACTIVE;
-   if (!vcpu_mode_priv(vcpu))
-   state |= PERF_GUEST_USER;
-
-   return state;
-}
-
-static unsigned long kvm_get_guest_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return *vcpu_pc(vcpu);
-}
-
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .state  = kvm_guest_state,
-   .get_ip = kvm_get_guest_ip,
-};
-
 void kvm_perf_init(void)
 {
if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
static_branch_enable(&kvm_arm_pmu_available);
 
-   perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_register_perf_callbacks(NULL);
 }
 
 void kvm_perf_teardown(void)
 {
-   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_unregister_perf_callbacks();
 }
-- 
2.33.0.464.g1972c5931b-goog




[PATCH v3 12/16] KVM: Move x86's perf guest info callbacks to generic KVM

2021-09-21 Thread Sean Christopherson
Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Implement the necessary arm64 arch hooks now to avoid having to provide
stubs or a temporary #define (from x86) to avoid arm64 compilation errors
when CONFIG_GUEST_PERF_EVENTS=y.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  8 +
 arch/arm64/kvm/arm.c  |  5 +++
 arch/x86/include/asm/kvm_host.h   |  3 ++
 arch/x86/kvm/x86.c| 53 +++
 include/linux/kvm_host.h  | 10 ++
 virt/kvm/kvm_main.c   | 44 +
 6 files changed, 81 insertions(+), 42 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index ed940aec89e0..828b6eaa2c56 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -673,6 +673,14 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t 
fault_ipa);
 void kvm_perf_init(void);
 void kvm_perf_teardown(void);
 
+#ifdef CONFIG_GUEST_PERF_EVENTS
+static inline bool kvm_arch_pmi_in_guest(struct kvm_vcpu *vcpu)
+{
+   /* Any callback while a vCPU is loaded is considered to be in guest. */
+   return !!vcpu;
+}
+#endif
+
 long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
 gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
 void kvm_update_stolen_time(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e9a2b8f27792..2b542fdc237e 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -500,6 +500,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
return vcpu_mode_priv(vcpu);
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+   return *vcpu_pc(vcpu);
+}
+
 /* Just ensure a guest exit from a particular CPU */
 static void exit_vm_noop(void *info)
 {
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2d86a2dfc775..6efe4e03a6d2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1543,6 +1543,9 @@ static inline int kvm_arch_flush_remote_tlb(struct kvm 
*kvm)
return -ENOTSUPP;
 }
 
+#define kvm_arch_pmi_in_guest(vcpu) \
+   ((vcpu) && (vcpu)->arch.handling_intr_from_guest)
+
 int kvm_mmu_module_init(void);
 void kvm_mmu_module_exit(void);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 412646b973bb..1bea616402e6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,43 +8264,12 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static inline bool kvm_pmi_in_guest(struct kvm_vcpu *vcpu)
-{
-   return vcpu && vcpu->arch.handling_intr_from_guest;
-}
-
-static unsigned int kvm_guest_state(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-   unsigned int state;
-
-   if (!kvm_pmi_in_guest(vcpu))
-   return 0;
-
-   state = PERF_GUEST_ACTIVE;
-   if (static_call(kvm_x86_get_cpl)(vcpu))
-   state |= PERF_GUEST_USER;
-
-   return state;
-}
-
-static unsigned long kvm_guest_get_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   /* Retrieving the IP must be guarded by a call to kvm_guest_state(). */
-   if (WARN_ON_ONCE(!kvm_pmi_in_guest(vcpu)))
-   return 0;
-
-   return kvm_rip_read(vcpu);
-}
-
 static unsigned int kvm_handle_intel_pt_intr(void)
 {
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
/* '0' on failure so that the !PT case can use a RET0 static call. */
-   if (!kvm_pmi_in_guest(vcpu))
+   if (!kvm_arch_pmi_in_guest(vcpu))
return 0;
 
kvm_make_request(KVM_REQ_PMI, vcpu);
@@ -8309,12 +8278,6 @@ static unsigned int kvm_handle_intel_pt_intr(void)
return 1;
 }
 
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .state  = kvm_guest_state,
-   .get_ip = kvm_guest_get_ip,
-   .handle_intel_pt_intr   = NULL,
-};
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11068,9 +11031,11 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   /* Temporary ugliness. */
if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
-   perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_register_perf_callbacks(kvm_handle_intel_pt_intr);
+   else
+   kvm_register_perf_callbacks(NULL);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
@@ -11099,8 +11064,7 @@ int kvm_arch_hardware_setup(void *op

[PATCH v3 11/16] KVM: x86: More precisely identify NMI from guest when handling PMI

2021-09-21 Thread Sean Christopherson
Differentiate between IRQ and NMI for KVM's PMC overflow callback, which
was originally invoked in response to an NMI that arrived while the guest
was running, but was inadvertently changed to fire on IRQs as well when
support for perf without PMU/NMI was added to KVM.  In practice, this
should be a nop as the PMC overflow callback shouldn't be reached, but
it's a cheap and easy fix that also better documents the situation.

Note, this also doesn't completely prevent false positives if perf
somehow ends up calling into KVM, e.g. an NMI can arrive in host after
KVM sets its flag.

Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/svm/svm.c |  2 +-
 arch/x86/kvm/vmx/vmx.c |  4 +++-
 arch/x86/kvm/x86.c |  2 +-
 arch/x86/kvm/x86.h | 13 ++---
 4 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1a70e11f0487..0a0c01744b63 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3843,7 +3843,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu 
*vcpu)
}
 
if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
 
kvm_load_host_xsave_state(vcpu);
stgi();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f19d72136f77..61a4f5ff2acd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6344,7 +6344,9 @@ void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
 static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
unsigned long entry)
 {
-   kvm_before_interrupt(vcpu);
+   bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
+
+   kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : 
KVM_HANDLING_IRQ);
vmx_do_interrupt_nmi_irqoff(entry);
kvm_after_interrupt(vcpu);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 24a6faa07442..412646b973bb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9676,7 +9676,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 * interrupts on processors that implement an interrupt shadow, the
 * stat.exits increment will do nicely.
 */
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
local_irq_enable();
++vcpu->stat.exits;
local_irq_disable();
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a9c107e7c907..9b26f9b09d2a 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -387,9 +387,16 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
 }
 
-static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
+enum kvm_intr_type {
+   /* Values are arbitrary, but must be non-zero. */
+   KVM_HANDLING_IRQ = 1,
+   KVM_HANDLING_NMI,
+};
+
+static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
+   enum kvm_intr_type intr)
 {
-   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 1);
+   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, (u8)intr);
 }
 
 static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
@@ -399,7 +406,7 @@ static inline void kvm_after_interrupt(struct kvm_vcpu 
*vcpu)
 
 static inline bool kvm_handling_nmi_from_guest(struct kvm_vcpu *vcpu)
 {
-   return !!vcpu->arch.handling_intr_from_guest;
+   return vcpu->arch.handling_intr_from_guest == KVM_HANDLING_NMI;
 }
 
 static inline bool kvm_pat_valid(u64 data)
-- 
2.33.0.464.g1972c5931b-goog




[PATCH v3 15/16] KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c / pmu.c

2021-09-21 Thread Sean Christopherson
Call KVM's (un)register perf callbacks helpers directly from arm.c, and
move the PMU bits into pmu.c and rename the related helper accordingly.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  3 ---
 arch/arm64/kvm/Makefile   |  2 +-
 arch/arm64/kvm/arm.c  |  6 --
 arch/arm64/kvm/perf.c | 27 ---
 arch/arm64/kvm/pmu.c  |  8 
 include/kvm/arm_pmu.h |  1 +
 6 files changed, 14 insertions(+), 33 deletions(-)
 delete mode 100644 arch/arm64/kvm/perf.c

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 828b6eaa2c56..f141ac65f4f1 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,9 +670,6 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned 
int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-void kvm_perf_init(void);
-void kvm_perf_teardown(void);
-
 #ifdef CONFIG_GUEST_PERF_EVENTS
 static inline bool kvm_arch_pmi_in_guest(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 989bb5dad2c8..0bcc378b7961 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_KVM) += hyp/
 
 kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
-arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
+arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 inject_fault.o va_layout.o handle_exit.o \
 guest.o debug.o reset.o sys_regs.o \
 vgic-sys-reg-v3.o fpsimd.o pmu.o \
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2b542fdc237e..48f89d80f464 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1744,7 +1744,9 @@ static int init_subsystems(void)
if (err)
goto out;
 
-   kvm_perf_init();
+   kvm_pmu_init();
+   kvm_register_perf_callbacks(NULL);
+
kvm_sys_reg_table_init();
 
 out:
@@ -2160,7 +2162,7 @@ int kvm_arch_init(void *opaque)
 /* NOP: Compiling as a module not supported */
 void kvm_arch_exit(void)
 {
-   kvm_perf_teardown();
+   kvm_unregister_perf_callbacks();
 }
 
 static int __init early_kvm_mode_cfg(char *arg)
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
deleted file mode 100644
index 0b902e0d5b5d..
--- a/arch/arm64/kvm/perf.c
+++ /dev/null
@@ -1,27 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Based on the x86 implementation.
- *
- * Copyright (C) 2012 ARM Ltd.
- * Author: Marc Zyngier 
- */
-
-#include 
-#include 
-
-#include 
-
-DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
-
-void kvm_perf_init(void)
-{
-   if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
-   static_branch_enable(&kvm_arm_pmu_available);
-
-   kvm_register_perf_callbacks(NULL);
-}
-
-void kvm_perf_teardown(void)
-{
-   kvm_unregister_perf_callbacks();
-}
diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 03a6c1f4a09a..d98b57a17043 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -7,6 +7,14 @@
 #include 
 #include 
 
+DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+
+void kvm_pmu_init(void)
+{
+   if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
+   static_branch_enable(&kvm_arm_pmu_available);
+}
+
 /*
  * Given the perf event attributes and system type, determine
  * if we are going to need to switch counters at guest entry/exit.
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 864b9997efb2..42270676498d 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -14,6 +14,7 @@
 #define ARMV8_PMU_MAX_COUNTER_PAIRS((ARMV8_PMU_MAX_COUNTERS + 1) >> 1)
 
 DECLARE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+void kvm_pmu_init(void);
 
 static __always_inline bool kvm_arm_support_pmu_v3(void)
 {
-- 
2.33.0.464.g1972c5931b-goog




[PATCH v3 13/16] KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c

2021-09-21 Thread Sean Christopherson
Now that all state needed for VMX's PT interrupt handler is exposed to
vmx.c (specifically the currently running vCPU), move the handler into
vmx.c where it belongs.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/vmx/vmx.c  | 22 +-
 arch/x86/kvm/x86.c  | 20 +---
 3 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6efe4e03a6d2..d40814b57ae8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1495,7 +1495,7 @@ struct kvm_x86_init_ops {
int (*disabled_by_bios)(void);
int (*check_processor_compatibility)(void);
int (*hardware_setup)(void);
-   bool (*intel_pt_intr_in_guest)(void);
+   unsigned int (*handle_intel_pt_intr)(void);
 
struct kvm_x86_ops *runtime_ops;
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 61a4f5ff2acd..33f92febe3ce 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7687,6 +7687,20 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
 };
 
+static unsigned int vmx_handle_intel_pt_intr(void)
+{
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+   /* '0' on failure so that the !PT case can use a RET0 static call. */
+   if (!kvm_arch_pmi_in_guest(vcpu))
+   return 0;
+
+   kvm_make_request(KVM_REQ_PMI, vcpu);
+   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
+ (unsigned long *)&vcpu->arch.pmu.global_status);
+   return 1;
+}
+
 static __init void vmx_setup_user_return_msrs(void)
 {
 
@@ -7713,6 +7727,8 @@ static __init void vmx_setup_user_return_msrs(void)
kvm_add_user_return_msr(vmx_uret_msrs_list[i]);
 }
 
+static struct kvm_x86_init_ops vmx_init_ops __initdata;
+
 static __init int hardware_setup(void)
 {
unsigned long host_bndcfgs;
@@ -7873,6 +7889,10 @@ static __init int hardware_setup(void)
return -EINVAL;
if (!enable_ept || !cpu_has_vmx_intel_pt())
pt_mode = PT_MODE_SYSTEM;
+   if (pt_mode == PT_MODE_HOST_GUEST)
+   vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+   else
+   vmx_init_ops.handle_intel_pt_intr = NULL;
 
setup_default_sgx_lepubkeyhash();
 
@@ -7898,7 +7918,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
-   .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest,
+   .handle_intel_pt_intr = NULL,
 
.runtime_ops = &vmx_x86_ops,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1bea616402e6..b79b2d29260d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,20 +8264,6 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static unsigned int kvm_handle_intel_pt_intr(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   /* '0' on failure so that the !PT case can use a RET0 static call. */
-   if (!kvm_arch_pmi_in_guest(vcpu))
-   return 0;
-
-   kvm_make_request(KVM_REQ_PMI, vcpu);
-   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
-   (unsigned long *)&vcpu->arch.pmu.global_status);
-   return 1;
-}
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11031,11 +11017,7 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
-   /* Temporary ugliness. */
-   if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_register_perf_callbacks(kvm_handle_intel_pt_intr);
-   else
-   kvm_register_perf_callbacks(NULL);
+   kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
-- 
2.33.0.464.g1972c5931b-goog




[PATCH v3 10/16] KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable

2021-09-21 Thread Sean Christopherson
Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest'
variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu.
kvm_before/after_interrupt() must be called while the vCPU is loaded
(which protects against preemption), thus kvm_running_vcpu is guaranteed
to be non-NULL when handling_intr_from_guest is non-zero.

Switching to kvm_get_running_vcpu() allows moving KVM's perf
callbacks to generic code, and the new flag will be used in a future
patch to more precisely identify the "NMI from guest" case.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  3 +--
 arch/x86/kvm/pmu.c  |  2 +-
 arch/x86/kvm/x86.c  | 21 -
 arch/x86/kvm/x86.h  | 10 ++
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1080166fc0cf..2d86a2dfc775 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -763,6 +763,7 @@ struct kvm_vcpu_arch {
unsigned nmi_pending; /* NMI queued after currently running handler */
bool nmi_injected;/* Trying to inject an NMI this entry */
bool smi_pending;/* SMI queued after currently running handler */
+   u8 handling_intr_from_guest;
 
struct kvm_mtrr mtrr_state;
u64 pat;
@@ -1874,8 +1875,6 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
 int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
 
-unsigned int kvm_guest_state(void);
-
 void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 u32 size);
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5b68d4188de0..eef48258e50f 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -87,7 +87,7 @@ static void kvm_perf_overflow_intr(struct perf_event 
*perf_event,
 * woken up. So we should wake it, but this is impossible from
 * NMI context. Do it from irq work instead.
 */
-   if (!kvm_guest_state())
+   if (!kvm_handling_nmi_from_guest(pmc->vcpu))
irq_work_queue(&pmc_to_pmu(pmc)->irq_work);
else
kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6cc66466f301..24a6faa07442 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,15 +8264,17 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);
+static inline bool kvm_pmi_in_guest(struct kvm_vcpu *vcpu)
+{
+   return vcpu && vcpu->arch.handling_intr_from_guest;
+}
 
-unsigned int kvm_guest_state(void)
+static unsigned int kvm_guest_state(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
unsigned int state;
 
-   if (!vcpu)
+   if (!kvm_pmi_in_guest(vcpu))
return 0;
 
state = PERF_GUEST_ACTIVE;
@@ -8284,9 +8286,10 @@ unsigned int kvm_guest_state(void)
 
 static unsigned long kvm_guest_get_ip(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   if (WARN_ON_ONCE(!vcpu))
+   /* Retrieving the IP must be guarded by a call to kvm_guest_state(). */
+   if (WARN_ON_ONCE(!kvm_pmi_in_guest(vcpu)))
return 0;
 
return kvm_rip_read(vcpu);
@@ -8294,10 +8297,10 @@ static unsigned long kvm_guest_get_ip(void)
 
 static unsigned int kvm_handle_intel_pt_intr(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
/* '0' on failure so that the !PT case can use a RET0 static call. */
-   if (!vcpu)
+   if (!kvm_pmi_in_guest(vcpu))
return 0;
 
kvm_make_request(KVM_REQ_PMI, vcpu);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 7d66d63dc55a..a9c107e7c907 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -387,18 +387,20 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
 }
 
-DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-
 static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
 {
-   __this_cpu_write(current_vcpu, vcpu);
+   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 1);
 }
 
 static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
 {
-   __this_cpu_write(current_vcpu, NULL);
+   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 0);
 }
 
+static inline bool kvm_handling_nmi_from_guest(struct kvm_vcpu *vcpu)
+{
+   retur

[PATCH v3 08/16] perf: Force architectures to opt-in to guest callbacks

2021-09-21 Thread Sean Christopherson
Introduce GUEST_PERF_EVENTS and require architectures to select it to
allow registering and using guest callbacks in perf.  This will hopefully
make it more difficult for new architectures to add useless "support" for
guest callbacks, e.g. via copy+paste.

Stubbing out the helpers has the happy bonus of avoiding a load of
perf_guest_cbs when GUEST_PERF_EVENTS=n on arm64/x86.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/kvm/Kconfig | 1 +
 arch/x86/kvm/Kconfig   | 1 +
 arch/x86/xen/Kconfig   | 1 +
 include/linux/perf_event.h | 6 ++
 init/Kconfig   | 4 
 kernel/events/core.c   | 2 ++
 6 files changed, 15 insertions(+)

diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index a4eba0908bfa..f2121404c7c6 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -37,6 +37,7 @@ menuconfig KVM
select HAVE_KVM_IRQ_BYPASS
select HAVE_KVM_VCPU_RUN_PID_CHANGE
select SCHED_INFO
+   select GUEST_PERF_EVENTS if PERF_EVENTS
help
  Support hosting virtualized guest machines.
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ac69894eab88..699bf786fbce 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -36,6 +36,7 @@ config KVM
select KVM_MMIO
select SCHED_INFO
select PERF_EVENTS
+   select GUEST_PERF_EVENTS
select HAVE_KVM_MSI
select HAVE_KVM_CPU_RELAX_INTERCEPT
select HAVE_KVM_NO_POLL
diff --git a/arch/x86/xen/Kconfig b/arch/x86/xen/Kconfig
index afc1da68b06d..d07595a9552d 100644
--- a/arch/x86/xen/Kconfig
+++ b/arch/x86/xen/Kconfig
@@ -23,6 +23,7 @@ config XEN_PV
select PARAVIRT_XXL
select XEN_HAVE_PVMMU
select XEN_HAVE_VPMU
+   select GUEST_PERF_EVENTS
help
  Support running as a Xen PV guest.
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c0a6eaf55fb1..eefa197d5354 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1238,6 +1238,7 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
 enum perf_bpf_event_type type,
 u16 flags);
 
+#ifdef CONFIG_GUEST_PERF_EVENTS
 extern struct perf_guest_info_callbacks *perf_guest_cbs;
 static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
 {
@@ -1273,6 +1274,11 @@ static inline unsigned int 
perf_guest_handle_intel_pt_intr(void)
 }
 extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
 extern void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
+#else
+static inline unsigned int perf_guest_state(void)   { return 0; }
+static inline unsigned long perf_guest_get_ip(void) { return 0; }
+static inline unsigned int perf_guest_handle_intel_pt_intr(void) { return 0; }
+#endif /* CONFIG_GUEST_PERF_EVENTS */
 
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
diff --git a/init/Kconfig b/init/Kconfig
index 55f9f7738ebb..acc7e8ba4563 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1776,6 +1776,10 @@ config HAVE_PERF_EVENTS
help
  See tools/perf/design.txt for details.
 
+config GUEST_PERF_EVENTS
+   bool
+   depends on HAVE_PERF_EVENTS
+
 config PERF_USE_VMALLOC
bool
help
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2e3dc9fbd5d9..c6ec05809f54 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6482,6 +6482,7 @@ static void perf_pending_event(struct irq_work *entry)
perf_swevent_put_recursion_context(rctx);
 }
 
+#ifdef CONFIG_GUEST_PERF_EVENTS
 struct perf_guest_info_callbacks *perf_guest_cbs;
 
 void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
@@ -6502,6 +6503,7 @@ void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs)
synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
+#endif
 
 static void
 perf_output_sample_regs(struct perf_output_handle *handle,
-- 
2.33.0.464.g1972c5931b-goog




[PATCH v3 09/16] perf/core: Use static_call to optimize perf_guest_info_callbacks

2021-09-21 Thread Sean Christopherson
Use static_call to optimize perf's guest callbacks on arm64 and x86,
which are now the only architectures that define the callbacks.  Use
DEFINE_STATIC_CALL_RET0 as the default/NULL for all guest callbacks, as
the callback semantics are that a return value '0' means "not in guest".

static_call obviously avoids the overhead of CONFIG_RETPOLINE=y, but is
also advantageous versus other solutions, e.g. per-cpu callbacks, in that
a per-cpu memory load is not needed to detect the !guest case.

Based on code from Peter and Like.

Suggested-by: Peter Zijlstra (Intel) 
Cc: Like Xu 
Signed-off-by: Sean Christopherson 
---
 include/linux/perf_event.h | 28 ++--
 kernel/events/core.c   | 15 +++
 2 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index eefa197d5354..d582dfeb4e20 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1240,37 +1240,21 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
 
 #ifdef CONFIG_GUEST_PERF_EVENTS
 extern struct perf_guest_info_callbacks *perf_guest_cbs;
-static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
-{
-   /* Reg/unreg perf_guest_cbs waits for readers via synchronize_rcu(). */
-   lockdep_assert_preemption_disabled();
+DECLARE_STATIC_CALL(__perf_guest_state, *perf_guest_cbs->state);
+DECLARE_STATIC_CALL(__perf_guest_get_ip, *perf_guest_cbs->get_ip);
+DECLARE_STATIC_CALL(__perf_guest_handle_intel_pt_intr, 
*perf_guest_cbs->handle_intel_pt_intr);
 
-   /* Prevent reloading between a !NULL check and dereferences. */
-   return READ_ONCE(perf_guest_cbs);
-}
 static inline unsigned int perf_guest_state(void)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   return guest_cbs ? guest_cbs->state() : 0;
+   return static_call(__perf_guest_state)();
 }
 static inline unsigned long perf_guest_get_ip(void)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   /*
-* Arbitrarily return '0' in the unlikely scenario that the callbacks
-* are unregistered between checking guest state and getting the IP.
-*/
-   return guest_cbs ? guest_cbs->get_ip() : 0;
+   return static_call(__perf_guest_get_ip)();
 }
 static inline unsigned int perf_guest_handle_intel_pt_intr(void)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->handle_intel_pt_intr)
-   return guest_cbs->handle_intel_pt_intr();
-   return 0;
+   return static_call(__perf_guest_handle_intel_pt_intr)();
 }
 extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
 extern void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c6ec05809f54..79c8ee1778a4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6485,12 +6485,23 @@ static void perf_pending_event(struct irq_work *entry)
 #ifdef CONFIG_GUEST_PERF_EVENTS
 struct perf_guest_info_callbacks *perf_guest_cbs;
 
+DEFINE_STATIC_CALL_RET0(__perf_guest_state, *perf_guest_cbs->state);
+DEFINE_STATIC_CALL_RET0(__perf_guest_get_ip, *perf_guest_cbs->get_ip);
+DEFINE_STATIC_CALL_RET0(__perf_guest_handle_intel_pt_intr, 
*perf_guest_cbs->handle_intel_pt_intr);
+
 void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
if (WARN_ON_ONCE(perf_guest_cbs))
return;
 
WRITE_ONCE(perf_guest_cbs, cbs);
+   static_call_update(__perf_guest_state, cbs->state);
+   static_call_update(__perf_guest_get_ip, cbs->get_ip);
+
+   /* Implementing ->handle_intel_pt_intr is optional. */
+   if (cbs->handle_intel_pt_intr)
+   static_call_update(__perf_guest_handle_intel_pt_intr,
+  cbs->handle_intel_pt_intr);
 }
 EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
 
@@ -6500,6 +6511,10 @@ void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs)
return;
 
WRITE_ONCE(perf_guest_cbs, NULL);
+   static_call_update(__perf_guest_state, (void *)&__static_call_return0);
+   static_call_update(__perf_guest_get_ip, (void *)&__static_call_return0);
+   static_call_update(__perf_guest_handle_intel_pt_intr,
+  (void *)&__static_call_return0);
synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
-- 
2.33.0.464.g1972c5931b-goog




[PATCH v3 07/16] perf: Add wrappers for invoking guest callbacks

2021-09-21 Thread Sean Christopherson
Add helpers for the guest callbacks to prepare for burying the callbacks
behind a Kconfig (it's a lot easier to provide a few stubs than to #ifdef
piles of code), and also to prepare for converting the callbacks to
static_call().  perf_instruction_pointer() in particular will have subtle
semantics with static_call(), as the "no callbacks" case will return 0 if
the callbacks are unregistered between querying guest state and getting
the IP.  Implement the change now to avoid a functional change when adding
static_call() support, and because the new helper needs to return
_something_ in this case.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/kernel/perf_callchain.c | 16 +---
 arch/x86/events/core.c | 15 +--
 arch/x86/events/intel/core.c   |  5 +
 include/linux/perf_event.h | 24 
 4 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/arch/arm64/kernel/perf_callchain.c 
b/arch/arm64/kernel/perf_callchain.c
index 274dc3e11b6d..db04a55cee7e 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -102,9 +102,7 @@ compat_user_backtrace(struct compat_frame_tail __user *tail,
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->state()) {
+   if (perf_guest_state()) {
/* We don't support guest os callchain now */
return;
}
@@ -149,10 +147,9 @@ static bool callchain_trace(void *data, unsigned long pc)
 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
   struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe frame;
 
-   if (guest_cbs && guest_cbs->state()) {
+   if (perf_guest_state()) {
/* We don't support guest os callchain now */
return;
}
@@ -163,18 +160,15 @@ void perf_callchain_kernel(struct 
perf_callchain_entry_ctx *entry,
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->state())
-   return guest_cbs->get_ip();
+   if (perf_guest_state())
+   return perf_guest_get_ip();
 
return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-   unsigned int guest_state = guest_cbs ? guest_cbs->state() : 0;
+   unsigned int guest_state = perf_guest_state();
int misc = 0;
 
if (guest_state) {
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3a7630fdd340..d20e4f8d1aef 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2761,11 +2761,10 @@ static bool perf_hw_regs(struct pt_regs *regs)
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct unwind_state state;
unsigned long addr;
 
-   if (guest_cbs && guest_cbs->state()) {
+   if (perf_guest_state()) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2865,11 +2864,10 @@ perf_callchain_user32(struct pt_regs *regs, struct 
perf_callchain_entry_ctx *ent
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stack_frame frame;
const struct stack_frame __user *fp;
 
-   if (guest_cbs && guest_cbs->state()) {
+   if (perf_guest_state()) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2946,18 +2944,15 @@ static unsigned long code_segment_base(struct pt_regs 
*regs)
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->state())
-   return guest_cbs->get_ip();
+   if (perf_guest_state())
+   return perf_guest_get_ip();
 
return regs->ip + code_segment_base(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-   unsigned int guest_state = guest_cbs ? guest_cbs->state() : 0;
+   unsigned int guest_state = perf_guest_state();
int misc = 0;
 
if (guest_state) {
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 524ad1f747bd..f5b02017ba16 100644
--- a/arch/

[PATCH v3 06/16] perf/core: Rework guest callbacks to prepare for static_call support

2021-09-21 Thread Sean Christopherson
From: Like Xu 

To prepare for using static_calls to optimize perf's guest callbacks,
replace ->is_in_guest and ->is_user_mode with a new multiplexed hook
->state, tweak ->handle_intel_pt_intr to play nice with being called when
there is no active guest, and drop "guest" from ->get_guest_ip.

Return '0' from ->state and ->handle_intel_pt_intr to indicate "not in
guest" so that DEFINE_STATIC_CALL_RET0 can be used to define the static
calls, i.e. no callback == !guest.
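
For reference, the multiplexed ->state() return value is a small bitmask
along these lines (flag values shown here are assumed for illustration;
the real definitions live in the include/linux/perf_event.h hunk):

#define PERF_GUEST_ACTIVE	0x01	/* a guest is currently running  */
#define PERF_GUEST_USER		0x02	/* ...and it is in user mode     */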

Suggested-by: Peter Zijlstra (Intel) 
Originally-by: Peter Zijlstra (Intel) 
Signed-off-by: Like Xu 
Signed-off-by: Zhu Lingshan 
[sean: extracted from static_call patch, fixed get_ip() bug, wrote changelog]
Signed-off-by: Sean Christopherson 
---
 arch/arm64/kernel/perf_callchain.c | 13 +-
 arch/arm64/kvm/perf.c  | 35 +++---
 arch/x86/events/core.c | 13 +-
 arch/x86/events/intel/core.c   |  5 +---
 arch/x86/include/asm/kvm_host.h|  2 +-
 arch/x86/kvm/pmu.c |  2 +-
 arch/x86/kvm/x86.c | 40 --
 arch/x86/xen/pmu.c | 32 ++--
 include/linux/perf_event.h | 10 +---
 kernel/events/core.c   |  1 +
 10 files changed, 74 insertions(+), 79 deletions(-)

diff --git a/arch/arm64/kernel/perf_callchain.c 
b/arch/arm64/kernel/perf_callchain.c
index 86d9f2013172..274dc3e11b6d 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -104,7 +104,7 @@ void perf_callchain_user(struct perf_callchain_entry_ctx 
*entry,
 {
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -152,7 +152,7 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry,
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe frame;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -165,8 +165,8 @@ unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
 
-   if (guest_cbs && guest_cbs->is_in_guest())
-   return guest_cbs->get_guest_ip();
+   if (guest_cbs && guest_cbs->state())
+   return guest_cbs->get_ip();
 
return instruction_pointer(regs);
 }
@@ -174,10 +174,11 @@ unsigned long perf_instruction_pointer(struct pt_regs 
*regs)
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
+   unsigned int guest_state = guest_cbs ? guest_cbs->state() : 0;
int misc = 0;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   if (guest_cbs->is_user_mode())
+   if (guest_state) {
+   if (guest_state & PERF_GUEST_USER)
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index c37c0cf1bfe9..3e99ac4ab2d6 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,39 +13,34 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
-static int kvm_is_in_guest(void)
+static unsigned int kvm_guest_state(void)
 {
-return kvm_get_running_vcpu() != NULL;
-}
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+   unsigned int state;
 
-static int kvm_is_user_mode(void)
-{
-   struct kvm_vcpu *vcpu;
-
-   vcpu = kvm_get_running_vcpu();
+   if (!vcpu)
+   return 0;
 
-   if (vcpu)
-   return !vcpu_mode_priv(vcpu);
+   state = PERF_GUEST_ACTIVE;
+   if (!vcpu_mode_priv(vcpu))
+   state |= PERF_GUEST_USER;
 
-   return 0;
+   return state;
 }
 
 static unsigned long kvm_get_guest_ip(void)
 {
-   struct kvm_vcpu *vcpu;
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   vcpu = kvm_get_running_vcpu();
+   if (WARN_ON_ONCE(!vcpu))
+   return 0;
 
-   if (vcpu)
-   return *vcpu_pc(vcpu);
-
-   return 0;
+   return *vcpu_pc(vcpu);
 }
 
 static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .is_in_guest= kvm_is_in_guest,
-   .is_user_mode   = kvm_is_user_mode,
-   .get_guest_ip   = kvm_get_guest_ip,
+   .state  = kvm_guest_state,
+   .get_ip = kvm_get_guest_ip,
 };
 
 void kvm_perf_init(void)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f

[PATCH v3 05/16] perf: Drop dead and useless guest "support" from arm, csky, nds32 and riscv

2021-09-21 Thread Sean Christopherson
Drop "support" for guest callbacks from architctures that don't implement
the guest callbacks.  Future patches will convert the callbacks to
static_call; rather than churn a bunch of arch code (that was presumably
copy+pasted from x86), remove it wholesale as it's useless and at best
wasting cycles.

A future patch will also add a Kconfig to force architectures to opt into
the callbacks to make it more difficult for useless "support" to sneak in in
the future.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/arm/kernel/perf_callchain.c   | 33 -
 arch/csky/kernel/perf_callchain.c  | 12 ---
 arch/nds32/kernel/perf_event_cpu.c | 34 --
 arch/riscv/kernel/perf_callchain.c | 13 
 4 files changed, 8 insertions(+), 84 deletions(-)

diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 1626dfc6f6ce..bc6b246ab55e 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -62,14 +62,8 @@ user_backtrace(struct frame_tail __user *tail,
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct frame_tail __user *tail;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   /* We don't support guest os callchain now */
-   return;
-   }
-
perf_callchain_store(entry, regs->ARM_pc);
 
if (!current->mm)
@@ -99,44 +93,25 @@ callchain_trace(struct stackframe *fr,
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe fr;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   /* We don't support guest os callchain now */
-   return;
-   }
-
	arm_get_current_stackframe(regs, &fr);
	walk_stackframe(&fr, callchain_trace, entry);
 }
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->is_in_guest())
-   return guest_cbs->get_guest_ip();
-
return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
int misc = 0;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   if (guest_cbs->is_user_mode())
-   misc |= PERF_RECORD_MISC_GUEST_USER;
-   else
-   misc |= PERF_RECORD_MISC_GUEST_KERNEL;
-   } else {
-   if (user_mode(regs))
-   misc |= PERF_RECORD_MISC_USER;
-   else
-   misc |= PERF_RECORD_MISC_KERNEL;
-   }
+   if (user_mode(regs))
+   misc |= PERF_RECORD_MISC_USER;
+   else
+   misc |= PERF_RECORD_MISC_KERNEL;
 
return misc;
 }
diff --git a/arch/csky/kernel/perf_callchain.c 
b/arch/csky/kernel/perf_callchain.c
index 35318a635a5f..92057de08f4f 100644
--- a/arch/csky/kernel/perf_callchain.c
+++ b/arch/csky/kernel/perf_callchain.c
@@ -86,13 +86,8 @@ static unsigned long user_backtrace(struct 
perf_callchain_entry_ctx *entry,
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
unsigned long fp = 0;
 
-   /* C-SKY does not support virtualization. */
-   if (guest_cbs && guest_cbs->is_in_guest())
-   return;
-
fp = regs->regs[4];
perf_callchain_store(entry, regs->pc);
 
@@ -111,15 +106,8 @@ void perf_callchain_user(struct perf_callchain_entry_ctx 
*entry,
 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
   struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe fr;
 
-   /* C-SKY does not support virtualization. */
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   pr_warn("C-SKY does not support perf in guest mode!");
-   return;
-   }
-
fr.fp = regs->regs[4];
fr.lr = regs->lr;
	walk_stackframe(&fr, entry);
diff --git a/arch/nds32/kernel/perf_event_cpu.c 
b/arch/nds32/kernel/perf_event_cpu.c
index f38791960781..a78a879e7ef1 100644
--- a/arch/nds32/kernel/perf_event_cpu.c
+++ b/arch/nds32/kernel/perf_event_cpu.c
@@ -1363,7 +1363,6 @@ void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry,
struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_gu

[PATCH v3 04/16] perf: Stop pretending that perf can handle multiple guest callbacks

2021-09-21 Thread Sean Christopherson
Drop the 'int' return value from the perf (un)register callback helpers
and stop pretending perf can support multiple callbacks.  The 'int'
returns are not future proofing anything as none of the callers take
action on an error.  It's also not obvious that there will ever be
co-tenant hypervisors, and if there are, that allowing multiple callbacks
to be registered is desirable or even correct.

Opportunistically rename callbacks=>cbs in the affected declarations to
match their definitions.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  4 ++--
 arch/arm64/kvm/perf.c |  8 
 include/linux/perf_event.h| 12 ++--
 kernel/events/core.c  | 16 
 4 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 41911585ae0c..ed940aec89e0 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,8 +670,8 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned 
int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-int kvm_perf_init(void);
-int kvm_perf_teardown(void);
+void kvm_perf_init(void);
+void kvm_perf_teardown(void);
 
 long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
 gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 151c31fb9860..c37c0cf1bfe9 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -48,15 +48,15 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.get_guest_ip   = kvm_get_guest_ip,
 };
 
-int kvm_perf_init(void)
+void kvm_perf_init(void)
 {
if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
		static_branch_enable(&kvm_arm_pmu_available);
 
-	return perf_register_guest_info_callbacks(&kvm_guest_cbs);
+	perf_register_guest_info_callbacks(&kvm_guest_cbs);
 }
 
-int kvm_perf_teardown(void)
+void kvm_perf_teardown(void)
 {
-	return perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 }
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6b0405e578c1..317d4658afe9 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1245,8 +1245,8 @@ static inline struct perf_guest_info_callbacks 
*perf_get_guest_cbs(void)
/* Prevent reloading between a !NULL check and dereferences. */
return READ_ONCE(perf_guest_cbs);
 }
-extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks 
*callbacks);
-extern int perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *callbacks);
+extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
+extern void perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
 
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
@@ -1489,10 +1489,10 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs 
*regs, u64 addr) { }
 static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
-static inline int perf_register_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { 
return 0; }
-static inline int perf_unregister_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { 
return 0; }
+static inline void perf_register_guest_info_callbacks
+(struct perf_guest_info_callbacks *cbs)
{ }
+static inline void perf_unregister_guest_info_callbacks
+(struct perf_guest_info_callbacks *cbs)
{ }
 
 static inline void perf_event_mmap(struct vm_area_struct *vma) { }
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 80ff050a7b55..d90a43572400 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6482,31 +6482,23 @@ static void perf_pending_event(struct irq_work *entry)
perf_swevent_put_recursion_context(rctx);
 }
 
-/*
- * We assume there is only KVM supporting the callbacks.
- * Later on, we might change it to a list if there is
- * another virtualization implementation supporting the callbacks.
- */
 struct perf_guest_info_callbacks *perf_guest_cbs;
-
-int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
if (WARN_ON_ONCE(perf_guest_cbs))
-   return -EBUSY;
+   return;
 
WRITE_ONCE(perf_guest_cbs, cbs);
-   return 0;
 }
 EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
 
-int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+void perf_unregister_guest_inf

[PATCH v3 03/16] KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest

2021-09-21 Thread Sean Christopherson
Override the Processor Trace (PT) interrupt handler for guest mode if and
only if PT is configured for host+guest mode, i.e. is being used
independently by both host and guest.  If PT is configured for system
mode, the host fully controls PT and must handle all events.

Fixes: 8479e04e7d6b ("KVM: x86: Inject PMI for KVM guest")
Cc: sta...@vger.kernel.org
Cc: Like Xu 
Reported-by: Alexander Shishkin 
Reported-by: Artem Kashkanov 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx/vmx.c  | 1 +
 arch/x86/kvm/x86.c  | 5 -
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09b256db394a..1ea4943a73d7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1494,6 +1494,7 @@ struct kvm_x86_init_ops {
int (*disabled_by_bios)(void);
int (*check_processor_compatibility)(void);
int (*hardware_setup)(void);
+   bool (*intel_pt_intr_in_guest)(void);
 
struct kvm_x86_ops *runtime_ops;
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fada1055f325..f19d72136f77 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7896,6 +7896,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
+   .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest,
 
	.runtime_ops = &vmx_x86_ops,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb6015f97f9e..ffc6c2d73508 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8305,7 +8305,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.is_in_guest= kvm_is_in_guest,
.is_user_mode   = kvm_is_user_mode,
.get_guest_ip   = kvm_get_guest_ip,
-   .handle_intel_pt_intr   = kvm_handle_intel_pt_intr,
+   .handle_intel_pt_intr   = NULL,
 };
 
 #ifdef CONFIG_X86_64
@@ -11061,6 +11061,8 @@ int kvm_arch_hardware_setup(void *opaque)
	memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
+   kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
	perf_register_guest_info_callbacks(&kvm_guest_cbs);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
@@ -11091,6 +11093,7 @@ int kvm_arch_hardware_setup(void *opaque)
 void kvm_arch_hardware_unsetup(void)
 {
	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_guest_cbs.handle_intel_pt_intr = NULL;
 
static_call(kvm_x86_hardware_unsetup)();
 }
-- 
2.33.0.464.g1972c5931b-goog




[PATCH v3 02/16] KVM: x86: Register perf callbacks after calling vendor's hardware_setup()

2021-09-21 Thread Sean Christopherson
Wait to register perf callbacks until after doing vendor hardware setup.
VMX's hardware_setup() configures Intel Processor Trace (PT) mode, and a
future fix to register the Intel PT guest interrupt hook if and only if
Intel PT is exposed to the guest will consume the configured PT mode.

Delaying registration to hardware setup is effectively a nop as KVM's perf
hooks all pivot on the per-CPU current_vcpu, which is non-NULL only when
KVM is handling an IRQ/NMI in a VM-Exit path.  I.e. current_vcpu will be
NULL throughout both kvm_arch_init() and kvm_arch_hardware_setup().

Cc: Alexander Shishkin 
Cc: Artem Kashkanov 
Cc: sta...@vger.kernel.org
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 86539c1686fa..fb6015f97f9e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8426,8 +8426,6 @@ int kvm_arch_init(void *opaque)
 
kvm_timer_init();
 
-	perf_register_guest_info_callbacks(&kvm_guest_cbs);
-
if (boot_cpu_has(X86_FEATURE_XSAVE)) {
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
@@ -8461,7 +8459,6 @@ void kvm_arch_exit(void)
clear_hv_tscchange_cb();
 #endif
kvm_lapic_exit();
-	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
		cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
@@ -11064,6 +11061,8 @@ int kvm_arch_hardware_setup(void *opaque)
	memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+	perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
 
@@ -11091,6 +11090,8 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
+	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
static_call(kvm_x86_hardware_unsetup)();
 }
 
-- 
2.33.0.464.g1972c5931b-goog




[PATCH v3 00/16] perf: KVM: Fix, optimize, and clean up callbacks

2021-09-21 Thread Sean Christopherson
Peter, I left the Intel PT mess as-is.  Having to pass a NULL pointer
from KVM arm64 seemed to be a lesser evil than more exports and multiple
registration paths.

This is a combination of ~2 series to fix bugs in the perf+KVM callbacks,
optimize the callbacks by employing static_call, and do a variety of
cleanup in both perf and KVM.

Patch 1 fixes a mostly-theoretical bug where perf can deref a NULL
pointer if KVM unregisters its callbacks while they're being accessed.
In practice, compilers tend to avoid problematic reloads of the pointer
and the PMI handler doesn't lose the race against module unloading,
i.e. doesn't hit a use-after-free.

Patches 2 and 3 fix an Intel PT handling bug where KVM incorrectly
eats PT interrupts when PT is supposed to be owned entirely by the host.

Patches 4-9 clean up perf's callback infrastructure and switch to
static_call for arm64 and x86 (the only survivors).

Patches 10-16 clean up related KVM code and unify the arm64/x86 callbacks.

Based on "git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue", commit
680c7e3be6a3 ("KVM: x86: Exit to userspace ...").

v3:
  - Add wrappers for guest callbacks so that stubs can be provided when
GUEST_PERF_EVENTS=n.
  - s/HAVE_GUEST_PERF_EVENTS/GUEST_PERF_EVENTS and select it from KVM
and XEN_PV instead of from top-level arm64/x86. [Paolo]
  - Drop an unnecessary synchronize_rcu() when registering callbacks. [Peter]
  - Retain a WARN_ON_ONCE() when unregistering callbacks if the caller
didn't provide the correct pointer. [Peter]
  - Rework the static_call patch to move it all to common perf.
  - Add a patch to drop the (un)register stubs, made possible after
having KVM+XEN_PV select GUEST_PERF_EVENTS.
  - Split dropping guest callback "support" for arm, csky, etc... to a
separate patch, to make introducing GUEST_PERF_EVENTS cleaner.
  
v2 (relative to static_call v10):
  - Split the patch into the semantic change (multiplexed ->state) and
introduction of static_call.
  - Don't use '0' for "not a guest RIP".
  - Handle unregister path.
  - Drop changes for architectures that can be culled entirely.

v2 (relative to v1):
  - https://lkml.kernel.org/r/20210828003558.713983-6-sea...@google.com
  - Drop per-cpu approach. [Peter]
  - Fix mostly-theoretical reload and use-after-free with READ_ONCE(),
WRITE_ONCE(), and synchronize_rcu(). [Peter]
  - Avoid new exports like the plague. [Peter]

v1:
  - https://lkml.kernel.org/r/20210827005718.585190-1-sea...@google.com

v10 static_call:
  - https://lkml.kernel.org/r/20210806133802.3528-2-lingshan@intel.com


Like Xu (1):
  perf/core: Rework guest callbacks to prepare for static_call support

Sean Christopherson (15):
  perf: Ensure perf_guest_cbs aren't reloaded between !NULL check and
deref
  KVM: x86: Register perf callbacks after calling vendor's
hardware_setup()
  KVM: x86: Register Processor Trace interrupt hook iff PT enabled in
guest
  perf: Stop pretending that perf can handle multiple guest callbacks
  perf: Drop dead and useless guest "support" from arm, csky, nds32 and
riscv
  perf: Add wrappers for invoking guest callbacks
  perf: Force architectures to opt-in to guest callbacks
  perf/core: Use static_call to optimize perf_guest_info_callbacks
  KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu
variable
  KVM: x86: More precisely identify NMI from guest when handling PMI
  KVM: Move x86's perf guest info callbacks to generic KVM
  KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
  KVM: arm64: Convert to the generic perf callbacks
  KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c /
pmu.c
  perf: Drop guest callback (un)register stubs

 arch/arm/kernel/perf_callchain.c   | 28 ++
 arch/arm64/include/asm/kvm_host.h  |  9 -
 arch/arm64/kernel/perf_callchain.c | 13 ---
 arch/arm64/kvm/Kconfig |  1 +
 arch/arm64/kvm/Makefile|  2 +-
 arch/arm64/kvm/arm.c   | 11 +-
 arch/arm64/kvm/perf.c  | 62 --
 arch/arm64/kvm/pmu.c   |  8 
 arch/csky/kernel/perf_callchain.c  | 10 -
 arch/nds32/kernel/perf_event_cpu.c | 29 ++
 arch/riscv/kernel/perf_callchain.c | 10 -
 arch/x86/events/core.c | 13 ---
 arch/x86/events/intel/core.c   |  5 +--
 arch/x86/include/asm/kvm_host.h|  7 +++-
 arch/x86/kvm/Kconfig   |  1 +
 arch/x86/kvm/pmu.c |  2 +-
 arch/x86/kvm/svm/svm.c |  2 +-
 arch/x86/kvm/vmx/vmx.c | 25 +++-
 arch/x86/kvm/x86.c | 58 +---
 arch/x86/kvm/x86.h | 17 ++--
 arch/x86/xen/Kconfig   |  1 +
 arch/x86/xen/pmu.c | 32 +++
 include/kvm/arm_pmu.h  |  1 +
 include/linux/kvm_host.h   | 10 +
 include/linux/perf_ev

[PATCH v3 01/16] perf: Ensure perf_guest_cbs aren't reloaded between !NULL check and deref

2021-09-21 Thread Sean Christopherson
Protect perf_guest_cbs with READ_ONCE/WRITE_ONCE to ensure it's not
reloaded between a !NULL check and a dereference, and wait for all
readers via synchronize_rcu() to prevent use-after-free, e.g. if the
callbacks are being unregistered during module unload.  Because the
callbacks are global, it's possible for readers to run in parallel with
an unregister operation.

The bug has escaped notice because all dereferences of perf_guest_cbs
follow the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern,
and it's extremely unlikely a compiler will reload perf_guest_cbs in this
sequence.  Compilers do reload perf_guest_cbs for future derefs, e.g. for
->is_user_mode(), but the ->is_in_guest() guard all but guarantees the
PMI handler will win the race, e.g. to nullify perf_guest_cbs, KVM has to
completely exit the guest and tear down all VMs before KVM starts its
module unload / unregister sequence.
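
The readers converted below fetch the pointer exactly once, via a helper of
roughly this shape (a sketch of the new perf_get_guest_cbs() accessor):

static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
{
	/*
	 * A single load, so the pointer can't be reloaded between the
	 * !NULL check and the dereferences that follow it.
	 */
	return READ_ONCE(perf_guest_cbs);
}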

But with help, unloading kvm_intel can trigger a NULL pointer dereference,
e.g. wrapping perf_guest_cbs with READ_ONCE in perf_misc_flags() while
spamming kvm_intel module load/unload leads to:

  BUG: kernel NULL pointer dereference, address: 
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 0 P4D 0
  Oops:  [#1] PREEMPT SMP
  CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:perf_misc_flags+0x1c/0x70
  Call Trace:
   perf_prepare_sample+0x53/0x6b0
   perf_event_output_forward+0x67/0x160
   __perf_event_overflow+0x52/0xf0
   handle_pmi_common+0x207/0x300
   intel_pmu_handle_irq+0xcf/0x410
   perf_event_nmi_handler+0x28/0x50
   nmi_handle+0xc7/0x260
   default_do_nmi+0x6b/0x170
   exc_nmi+0x103/0x130
   asm_exc_nmi+0x76/0xbf

Fixes: 39447b386c84 ("perf: Enhance perf to allow for guest statistic 
collection from host")
Cc: sta...@vger.kernel.org
Signed-off-by: Sean Christopherson 
---
 arch/arm/kernel/perf_callchain.c   | 17 +++--
 arch/arm64/kernel/perf_callchain.c | 18 --
 arch/csky/kernel/perf_callchain.c  |  6 --
 arch/nds32/kernel/perf_event_cpu.c | 17 +++--
 arch/riscv/kernel/perf_callchain.c |  7 +--
 arch/x86/events/core.c | 17 +++--
 arch/x86/events/intel/core.c   |  9 ++---
 include/linux/perf_event.h |  8 
 kernel/events/core.c   | 11 +--
 9 files changed, 77 insertions(+), 33 deletions(-)

diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 3b69a76d341e..1626dfc6f6ce 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -62,9 +62,10 @@ user_backtrace(struct frame_tail __user *tail,
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct frame_tail __user *tail;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* We don't support guest os callchain now */
return;
}
@@ -98,9 +99,10 @@ callchain_trace(struct stackframe *fr,
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe fr;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* We don't support guest os callchain now */
return;
}
@@ -111,18 +113,21 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-   return perf_guest_cbs->get_guest_ip();
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
+
+   if (guest_cbs && guest_cbs->is_in_guest())
+   return guest_cbs->get_guest_ip();
 
return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
int misc = 0;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-   if (perf_guest_cbs->is_user_mode())
+   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs->is_user_mode())
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/arm64/kernel/perf_callchain.c 
b/arch/arm64/kernel/perf_callchain.c
index 4a72c2727309..86d9f2013172 100644
--- a/arch/arm64/k

Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks

2021-09-21 Thread Sean Christopherson
On Wed, Sep 15, 2021, Zhu, Lingshan wrote:
> 
> 
> On 8/27/2021 3:59 AM, Sean Christopherson wrote:
> > TL;DR: Please don't merge this patch, it's broken and is also built on a 
> > shoddy
> > foundation that I would like to fix.
> Hi Sean,Peter, Paolo
> 
> I will send out an V11 which drops this patch since it's buggy, and Sean is
> working on fix this.
> Does this sound good?

Works for me, thanks!



Re: [PATCH v2 05/13] perf: Force architectures to opt-in to guest callbacks

2021-09-21 Thread Sean Christopherson
On Tue, Sep 21, 2021, Paolo Bonzini wrote:
> On 28/08/21 21:47, Peter Zijlstra wrote:
> > > +config HAVE_GUEST_PERF_EVENTS
> > > + bool
> > depends on HAVE_KVM
> 
> It won't really do anything, since Kconfig does not detects conflicts
> between select' and 'depends on' clauses.

It does throw a WARN, though the build doesn't fail.

WARNING: unmet direct dependencies detected for HAVE_GUEST_PERF_EVENTS
  Depends on [n]: HAVE_KVM [=n] && HAVE_PERF_EVENTS [=y]
  Selected by [y]:
  - ARM64 [=y]

WARNING: unmet direct dependencies detected for HAVE_GUEST_PERF_EVENTS
  Depends on [n]: HAVE_KVM [=n] && HAVE_PERF_EVENTS [=y]
  Selected by [y]:
  - ARM64 [=y]

WARNING: unmet direct dependencies detected for HAVE_GUEST_PERF_EVENTS
  Depends on [n]: HAVE_KVM [=n] && HAVE_PERF_EVENTS [=y]
  Selected by [y]:
  - ARM64 [=y]

> Rather, should the symbol be selected by KVM, instead of ARM64 and X86?

By KVM, you mean KVM in arm64 and x86, correct?  Because HAVE_GUEST_PERF_EVENTS
should not be selected for s390, PPC, or MIPS.

Oh, and Xen also uses the callbacks on x86, which means the HAVE_KVM part is
arguably wrong, even though it's guaranteed to be true for the XEN_PV case.
I'll drop that dependency and send out a separate series to clean up the arm64
side of HAVE_KVM.

The reason I didn't bury HAVE_GUEST_PERF_EVENTS under KVM (and XEN_PV) is that
there are a number of references to the callbacks throughout perf and I didn't want
to create #ifdef hell.

But I think I figured out a not-awful solution.  If there are wrappers+stubs for
the guest callback users, then the new Kconfig can be selected on-demand instead
of unconditionally by arm64 and x86.  That has the added bonus of eliminating
the relevant code paths for !KVM (and !XEN_PV on x86), with or without 
static_call.
It also obviates the need for __KVM_WANT_GUEST_PERF_EVENTS or whatever I called
that thing.
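
Concretely, something along these lines (rough sketch, names assumed):

#ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
unsigned int  perf_guest_state(void);
unsigned long perf_guest_get_ip(void);
#else
/* Stubs: the guest paths compile away entirely when the Kconfig is off. */
static inline unsigned int  perf_guest_state(void)  { return 0; }
static inline unsigned long perf_guest_get_ip(void) { return 0; }
#endif

That way only KVM (and XEN_PV on x86) need to select the Kconfig, and
everyone else gets the stubs for free.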

It more or less requires defining the static calls in generic perf, but I think
that actually ends up being good thing as it consolidates more code without
introducing more #ifdefs.  The diffstats for the static_call() conversions are
also quite nice.

 include/linux/perf_event.h | 28 ++--
 kernel/events/core.c   | 15 +++
 2 files changed, 21 insertions(+), 22 deletions(-)

I'll try to get a new version out today or tomorrow.



Re: [PATCH v2 00/13] perf: KVM: Fix, optimize, and clean up callbacks

2021-09-17 Thread Sean Christopherson
On Fri, Sep 17, 2021, Peter Zijlstra wrote:
> On Thu, Sep 16, 2021 at 09:37:43PM +0000, Sean Christopherson wrote:
> So I don't mind exporting __static_call_return0, but exporting a raw
> static_call is much like exporting a function pointer :/

Ya, that part is quite gross.

> > The unregister path would also need its own synchronize_rcu().  In general, 
> > I
> > don't love duplicating the logic, but it's not the end of the world.
> > 
> > Either way works for me.  Paolo or Peter, do either of you have a 
> > preference?
> 
> Can we de-feature kvm as a module and only have this PT functionality
> when built-in? :-)

I agree that many of the for-KVM exports are ugly, especially several of the
perf exports, but I will fight tooth and nail to keep KVM-as-a-module.  It is
invaluable for development and testing, and in the not-too-distant future there
is KVM-maintenance related functionality that we'd like to implement that relies
on KVM being a module.

I would be more than happy to help explore approaches that reduce the for-KVM
exports, but I am strongly opposed to defeaturing KVM-as-a-module.  I have a few
nascent ideas for eliminating a handful of random exports, but no clever ideas
for eliminating perf's for-KVM exports.



Re: [PATCH v2 05/13] perf: Force architectures to opt-in to guest callbacks

2021-09-16 Thread Sean Christopherson
On Sat, Aug 28, 2021, Peter Zijlstra wrote:
> On Fri, Aug 27, 2021 at 05:35:50PM -0700, Sean Christopherson wrote:
> > diff --git a/init/Kconfig b/init/Kconfig
> > index 55f9f7738ebb..9ef51ae53977 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1776,6 +1776,9 @@ config HAVE_PERF_EVENTS
> > help
> >   See tools/perf/design.txt for details.
> >  
> > +config HAVE_GUEST_PERF_EVENTS
> > +   bool
>   depends on HAVE_KVM
> 
> ?

Ah, nice!  We can go even further to:

depends on HAVE_PERF_EVENTS && HAVE_KVM

though I'm pretty sure all architectures that select HAVE_KVM also select
HAVE_PERF_EVENTS.

Huh.  arm64 doesn't select HAVE_KVM even though it selects almost literally 
every
other HAVE_KVM_* config.  arm64 has some other weirdness with CONFIG_KVM, I'll 
add
a patch or two to fix that stuff and amend this patch as above.

Thanks again!



Re: [PATCH v2 01/13] perf: Ensure perf_guest_cbs aren't reloaded between !NULL check and deref

2021-09-16 Thread Sean Christopherson
On Sat, Aug 28, 2021, Peter Zijlstra wrote:
> On Fri, Aug 27, 2021 at 05:35:46PM -0700, Sean Christopherson wrote:
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 464917096e73..2126f6327321 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -6491,14 +6491,19 @@ struct perf_guest_info_callbacks *perf_guest_cbs;
> >  
> >  int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks 
> > *cbs)
> >  {
> > -   perf_guest_cbs = cbs;
> > +   if (WARN_ON_ONCE(perf_guest_cbs))
> > +   return -EBUSY;
> > +
> > +   WRITE_ONCE(perf_guest_cbs, cbs);
> > +   synchronize_rcu();
> 
> You're waiting for all NULL users to go away? :-) IOW, we can do without
> this synchronize_rcu() call.

Doh, right.  I was thinking KVM needed to wait for in-progress NMI to exit to
ensure guest PT interrupts are handled correctly, but obviously the NMI handler
needs to exit for that CPU to get into a guest...

> > return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
> >  
> >  int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks 
> > *cbs)
> >  {
> > -   perf_guest_cbs = NULL;
> 
>   if (WARN_ON_ONCE(perf_guest_cbs != cbs))
>   return -EBUSY;
> 
> ?

Works for me.  I guess I'm more optimistic about people not being morons :-)



Re: [PATCH v2 00/13] perf: KVM: Fix, optimize, and clean up callbacks

2021-09-16 Thread Sean Christopherson
On Sat, Aug 28, 2021, Peter Zijlstra wrote:
> On Fri, Aug 27, 2021 at 05:35:45PM -0700, Sean Christopherson wrote:
> > Like Xu (2):
> >   perf/core: Rework guest callbacks to prepare for static_call support
> >   perf/core: Use static_call to optimize perf_guest_info_callbacks
> > 
> > Sean Christopherson (11):
> >   perf: Ensure perf_guest_cbs aren't reloaded between !NULL check and
> > deref
> >   KVM: x86: Register perf callbacks after calling vendor's
> > hardware_setup()
> >   KVM: x86: Register Processor Trace interrupt hook iff PT enabled in
> > guest
> >   perf: Stop pretending that perf can handle multiple guest callbacks
> >   perf: Force architectures to opt-in to guest callbacks
> >   KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu
> > variable
> >   KVM: x86: More precisely identify NMI from guest when handling PMI
> >   KVM: Move x86's perf guest info callbacks to generic KVM
> >   KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
> >   KVM: arm64: Convert to the generic perf callbacks
> >   KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c /
> > pmu.c

Argh, sorry, I somehow managed to miss all of your replies.  I'll get back to
this series next week.  Thanks for the quick response!

> Lets keep the whole intel_pt crud inside x86...

In theory, I like the idea of burying intel_pt inside x86 (and even in 
Intel+VMX code
for the most part), but the actual implementation is a bit gross.  Because of 
the
whole "KVM can be a module" thing, either the static call and 
__static_call_return0
would need to be exported, or a new register/unregister pair would have to be 
exported.

The unregister path would also need its own synchronize_rcu().  In general, I
don't love duplicating the logic, but it's not the end of the world.

Either way works for me.  Paolo or Peter, do either of you have a preference?

> ---
> Index: linux-2.6/arch/x86/events/core.c
> ===
> --- linux-2.6.orig/arch/x86/events/core.c
> +++ linux-2.6/arch/x86/events/core.c
> @@ -92,7 +92,7 @@ DEFINE_STATIC_CALL_RET0(x86_pmu_guest_ge
>  
>  DEFINE_STATIC_CALL_RET0(x86_guest_state, *(perf_guest_cbs->state));
>  DEFINE_STATIC_CALL_RET0(x86_guest_get_ip, *(perf_guest_cbs->get_ip));
> -DEFINE_STATIC_CALL_RET0(x86_guest_handle_intel_pt_intr, 
> *(perf_guest_cbs->handle_intel_pt_intr));
> +DEFINE_STATIC_CALL_RET0(x86_guest_handle_intel_pt_intr, unsigned int 
> (*)(void));

FWIW, the param needs to be a raw function, not a function pointer. 



[PATCH v2 13/13] KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c / pmu.c

2021-08-27 Thread Sean Christopherson
Call KVM's (un)register perf callback helpers directly from arm.c, move
the PMU bits into pmu.c, and rename the related helper accordingly.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  3 ---
 arch/arm64/kvm/Makefile   |  2 +-
 arch/arm64/kvm/arm.c  |  6 --
 arch/arm64/kvm/perf.c | 27 ---
 arch/arm64/kvm/pmu.c  |  8 
 include/kvm/arm_pmu.h |  1 +
 6 files changed, 14 insertions(+), 33 deletions(-)
 delete mode 100644 arch/arm64/kvm/perf.c

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 73dc402ded1f..d549b58120bc 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,9 +670,6 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned 
int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-void kvm_perf_init(void);
-void kvm_perf_teardown(void);
-
 #ifdef CONFIG_PERF_EVENTS
 #define __KVM_WANT_PERF_CALLBACKS
 static inline bool kvm_arch_pmi_in_guest(struct kvm_vcpu *vcpu)
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 989bb5dad2c8..0bcc378b7961 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_KVM) += hyp/
 
 kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
-arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
+arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 inject_fault.o va_layout.o handle_exit.o \
 guest.o debug.o reset.o sys_regs.o \
 vgic-sys-reg-v3.o fpsimd.o pmu.o \
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 2b542fdc237e..48f89d80f464 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1744,7 +1744,9 @@ static int init_subsystems(void)
if (err)
goto out;
 
-   kvm_perf_init();
+   kvm_pmu_init();
+   kvm_register_perf_callbacks(NULL);
+
kvm_sys_reg_table_init();
 
 out:
@@ -2160,7 +2162,7 @@ int kvm_arch_init(void *opaque)
 /* NOP: Compiling as a module not supported */
 void kvm_arch_exit(void)
 {
-   kvm_perf_teardown();
+   kvm_unregister_perf_callbacks();
 }
 
 static int __init early_kvm_mode_cfg(char *arg)
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
deleted file mode 100644
index 0b902e0d5b5d..
--- a/arch/arm64/kvm/perf.c
+++ /dev/null
@@ -1,27 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Based on the x86 implementation.
- *
- * Copyright (C) 2012 ARM Ltd.
- * Author: Marc Zyngier 
- */
-
-#include 
-#include 
-
-#include 
-
-DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
-
-void kvm_perf_init(void)
-{
-   if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
-   static_branch_enable(_arm_pmu_available);
-
-   kvm_register_perf_callbacks(NULL);
-}
-
-void kvm_perf_teardown(void)
-{
-   kvm_unregister_perf_callbacks();
-}
diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 03a6c1f4a09a..d98b57a17043 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -7,6 +7,14 @@
 #include 
 #include 
 
+DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+
+void kvm_pmu_init(void)
+{
+   if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
+		static_branch_enable(&kvm_arm_pmu_available);
+}
+
 /*
  * Given the perf event attributes and system type, determine
  * if we are going to need to switch counters at guest entry/exit.
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 864b9997efb2..42270676498d 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -14,6 +14,7 @@
 #define ARMV8_PMU_MAX_COUNTER_PAIRS((ARMV8_PMU_MAX_COUNTERS + 1) >> 1)
 
 DECLARE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+void kvm_pmu_init(void);
 
 static __always_inline bool kvm_arm_support_pmu_v3(void)
 {
-- 
2.33.0.259.gc128427fd7-goog




[PATCH v2 11/13] KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c

2021-08-27 Thread Sean Christopherson
Now that all state needed for VMX's PT interrupt handler is exposed to
vmx.c (specifically the currently running vCPU), move the handler into
vmx.c where it belongs.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/vmx/vmx.c  | 22 +-
 arch/x86/kvm/x86.c  | 20 +---
 include/linux/kvm_host.h|  2 --
 4 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a98c7907110c..7a3d1dcfef39 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1495,7 +1495,7 @@ struct kvm_x86_init_ops {
int (*disabled_by_bios)(void);
int (*check_processor_compatibility)(void);
int (*hardware_setup)(void);
-   bool (*intel_pt_intr_in_guest)(void);
+   unsigned int (*handle_intel_pt_intr)(void);
 
struct kvm_x86_ops *runtime_ops;
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 61a4f5ff2acd..33f92febe3ce 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7687,6 +7687,20 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
 };
 
+static unsigned int vmx_handle_intel_pt_intr(void)
+{
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+   /* '0' on failure so that the !PT case can use a RET0 static call. */
+   if (!kvm_arch_pmi_in_guest(vcpu))
+   return 0;
+
+   kvm_make_request(KVM_REQ_PMI, vcpu);
+   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
+		  (unsigned long *)&vcpu->arch.pmu.global_status);
+   return 1;
+}
+
 static __init void vmx_setup_user_return_msrs(void)
 {
 
@@ -7713,6 +7727,8 @@ static __init void vmx_setup_user_return_msrs(void)
kvm_add_user_return_msr(vmx_uret_msrs_list[i]);
 }
 
+static struct kvm_x86_init_ops vmx_init_ops __initdata;
+
 static __init int hardware_setup(void)
 {
unsigned long host_bndcfgs;
@@ -7873,6 +7889,10 @@ static __init int hardware_setup(void)
return -EINVAL;
if (!enable_ept || !cpu_has_vmx_intel_pt())
pt_mode = PT_MODE_SYSTEM;
+   if (pt_mode == PT_MODE_HOST_GUEST)
+   vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+   else
+   vmx_init_ops.handle_intel_pt_intr = NULL;
 
setup_default_sgx_lepubkeyhash();
 
@@ -7898,7 +7918,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
-   .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest,
+   .handle_intel_pt_intr = NULL,
 
	.runtime_ops = &vmx_x86_ops,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1bea616402e6..b79b2d29260d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,20 +8264,6 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static unsigned int kvm_handle_intel_pt_intr(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   /* '0' on failure so that the !PT case can use a RET0 static call. */
-   if (!kvm_arch_pmi_in_guest(vcpu))
-   return 0;
-
-   kvm_make_request(KVM_REQ_PMI, vcpu);
-   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
-		(unsigned long *)&vcpu->arch.pmu.global_status);
-   return 1;
-}
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11031,11 +11017,7 @@ int kvm_arch_hardware_setup(void *opaque)
	memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
-   /* Temporary ugliness. */
-   if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_register_perf_callbacks(kvm_handle_intel_pt_intr);
-   else
-   kvm_register_perf_callbacks(NULL);
+   kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 34d99034852f..b9235c3ac6af 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1164,8 +1164,6 @@ static inline bool kvm_arch_intc_initialized(struct kvm 
*kvm)
 #endif
 
 #ifdef __KVM_WANT_PERF_CALLBACKS
-
-void kvm_set_intel_pt_intr_handler(unsigned int (*handler)(void));
 unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
 
 void kvm_register_perf_callbacks(unsigned int (*pt_intr_handler)(void));
-- 
2.33.0.259.gc128427fd7-goog




[PATCH v2 12/13] KVM: arm64: Convert to the generic perf callbacks

2021-08-27 Thread Sean Christopherson
Drop arm64's version of the callbacks in favor of the callbacks provided
by generic KVM, which are semantically identical.  Implement the "get ip"
hook as needed.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h | 12 +++
 arch/arm64/kvm/arm.c  |  5 +
 arch/arm64/kvm/perf.c | 34 ++-
 3 files changed, 19 insertions(+), 32 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index ed940aec89e0..73dc402ded1f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -673,6 +673,18 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t 
fault_ipa);
 void kvm_perf_init(void);
 void kvm_perf_teardown(void);
 
+#ifdef CONFIG_PERF_EVENTS
+#define __KVM_WANT_PERF_CALLBACKS
+static inline bool kvm_arch_pmi_in_guest(struct kvm_vcpu *vcpu)
+{
+   /* Any callback while a vCPU is loaded is considered to be in guest. */
+   return !!vcpu;
+}
+#else
+static inline void kvm_register_perf_callbacks(void) {}
+static inline void kvm_unregister_perf_callbacks(void) {}
+#endif
+
 long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
 gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
 void kvm_update_stolen_time(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e9a2b8f27792..2b542fdc237e 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -500,6 +500,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
return vcpu_mode_priv(vcpu);
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+   return *vcpu_pc(vcpu);
+}
+
 /* Just ensure a guest exit from a particular CPU */
 static void exit_vm_noop(void *info)
 {
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 893de1a51fea..0b902e0d5b5d 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,45 +13,15 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
-static unsigned int kvm_guest_state(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-   unsigned int state;
-
-   if (!vcpu)
-   return 0;
-
-   state = PERF_GUEST_ACTIVE;
-   if (!vcpu_mode_priv(vcpu))
-   state |= PERF_GUEST_USER;
-
-   return state;
-}
-
-static unsigned long kvm_get_guest_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return *vcpu_pc(vcpu);
-}
-
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .state  = kvm_guest_state,
-   .get_ip = kvm_get_guest_ip,
-};
-
 void kvm_perf_init(void)
 {
if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
		static_branch_enable(&kvm_arm_pmu_available);
 
-	perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_register_perf_callbacks(NULL);
 }
 
 void kvm_perf_teardown(void)
 {
-   perf_unregister_guest_info_callbacks();
+   kvm_unregister_perf_callbacks();
 }
-- 
2.33.0.259.gc128427fd7-goog




[PATCH v2 10/13] KVM: Move x86's perf guest info callbacks to generic KVM

2021-08-27 Thread Sean Christopherson
Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  4 +++
 arch/x86/kvm/x86.c  | 53 +++--
 include/linux/kvm_host.h| 12 
 virt/kvm/kvm_main.c | 40 +
 4 files changed, 67 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2d86a2dfc775..a98c7907110c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1543,6 +1543,10 @@ static inline int kvm_arch_flush_remote_tlb(struct kvm 
*kvm)
return -ENOTSUPP;
 }
 
+#define __KVM_WANT_PERF_CALLBACKS
+#define kvm_arch_pmi_in_guest(vcpu) \
+   ((vcpu) && (vcpu)->arch.handling_intr_from_guest)
+
 int kvm_mmu_module_init(void);
 void kvm_mmu_module_exit(void);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1427ac1fc1f2..1bea616402e6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,43 +8264,12 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static inline bool kvm_pmi_in_guest(struct kvm_vcpu *vcpu)
-{
-   return vcpu && vcpu->arch.handling_intr_from_guest;
-}
-
-static unsigned int kvm_guest_state(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-   unsigned int state;
-
-   if (!kvm_pmi_in_guest(vcpu))
-   return 0;
-
-   state = PERF_GUEST_ACTIVE;
-   if (static_call(kvm_x86_get_cpl)(vcpu))
-   state |= PERF_GUEST_USER;
-
-   return state;
-}
-
-static unsigned long kvm_guest_get_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   /* Retrieving the IP must be guarded by a call to kvm_guest_state(). */
-   if (WARN_ON_ONCE(!kvm_pmi_in_guest(vcpu)))
-   return 0;
-
-   return kvm_rip_read(vcpu);
-}
-
 static unsigned int kvm_handle_intel_pt_intr(void)
 {
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
/* '0' on failure so that the !PT case can use a RET0 static call. */
-   if (!kvm_pmi_in_guest(vcpu))
+   if (!kvm_arch_pmi_in_guest(vcpu))
return 0;
 
kvm_make_request(KVM_REQ_PMI, vcpu);
@@ -8309,12 +8278,6 @@ static unsigned int kvm_handle_intel_pt_intr(void)
return 1;
 }
 
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .state  = kvm_guest_state,
-   .get_ip = kvm_guest_get_ip,
-   .handle_intel_pt_intr   = NULL,
-};
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11068,9 +11031,11 @@ int kvm_arch_hardware_setup(void *opaque)
	memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   /* Temporary ugliness. */
if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
-	perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_register_perf_callbacks(kvm_handle_intel_pt_intr);
+   else
+   kvm_register_perf_callbacks(NULL);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
@@ -11099,8 +11064,7 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
-   perf_unregister_guest_info_callbacks();
-   kvm_guest_cbs.handle_intel_pt_intr = NULL;
+   kvm_unregister_perf_callbacks();
 
static_call(kvm_x86_hardware_unsetup)();
 }
@@ -11727,6 +11691,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
return vcpu->arch.preempted_in_kernel;
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+   return kvm_rip_read(vcpu);
+}
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e4d712e9f760..34d99034852f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1163,6 +1163,18 @@ static inline bool kvm_arch_intc_initialized(struct kvm 
*kvm)
 }
 #endif
 
+#ifdef __KVM_WANT_PERF_CALLBACKS
+
+void kvm_set_intel_pt_intr_handler(unsigned int (*handler)(void));
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
+
+void kvm_register_perf_callbacks(unsigned int (*pt_intr_handler)(void));
+static inline void kvm_unregister_perf_callbacks(void)
+{
+   perf_unregister_guest_info_callbacks();
+}
+#endif
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type);
 void kvm_arch_destroy_vm(struct kvm *kvm);
 void kvm_arch_sync_events(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/k

[PATCH v2 09/13] KVM: x86: More precisely identify NMI from guest when handling PMI

2021-08-27 Thread Sean Christopherson
Differentiate between IRQ and NMI for KVM's PMC overflow callback, which
was originally invoked in response to an NMI that arrived while the guest
was running, but was inadvertently changed to fire on IRQs as well when
support for perf without PMU/NMI was added to KVM.  In practice, this
should be a nop as the PMC overflow callback shouldn't be reached, but
it's a cheap and easy fix that also better documents the situation.

Note, this also doesn't completely prevent false positives if perf
somehow ends up calling into KVM, e.g. an NMI can arrive in host after
KVM sets its flag.

Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/svm/svm.c |  2 +-
 arch/x86/kvm/vmx/vmx.c |  4 +++-
 arch/x86/kvm/x86.c |  2 +-
 arch/x86/kvm/x86.h | 13 ++---
 4 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1a70e11f0487..0a0c01744b63 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3843,7 +3843,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu 
*vcpu)
}
 
if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, KVM_HANDLING_NMI);
 
kvm_load_host_xsave_state(vcpu);
stgi();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f19d72136f77..61a4f5ff2acd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6344,7 +6344,9 @@ void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
 static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
unsigned long entry)
 {
-   kvm_before_interrupt(vcpu);
+   bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
+
+   kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : 
KVM_HANDLING_IRQ);
vmx_do_interrupt_nmi_irqoff(entry);
kvm_after_interrupt(vcpu);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6df300c55461..1427ac1fc1f2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9676,7 +9676,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 * interrupts on processors that implement an interrupt shadow, the
 * stat.exits increment will do nicely.
 */
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, KVM_HANDLING_IRQ);
local_irq_enable();
++vcpu->stat.exits;
local_irq_disable();
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index a9c107e7c907..9b26f9b09d2a 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -387,9 +387,16 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
 }
 
-static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
+enum kvm_intr_type {
+   /* Values are arbitrary, but must be non-zero. */
+   KVM_HANDLING_IRQ = 1,
+   KVM_HANDLING_NMI,
+};
+
+static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu,
+   enum kvm_intr_type intr)
 {
-   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 1);
+   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, (u8)intr);
 }
 
 static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
@@ -399,7 +406,7 @@ static inline void kvm_after_interrupt(struct kvm_vcpu 
*vcpu)
 
 static inline bool kvm_handling_nmi_from_guest(struct kvm_vcpu *vcpu)
 {
-   return !!vcpu->arch.handling_intr_from_guest;
+   return vcpu->arch.handling_intr_from_guest == KVM_HANDLING_NMI;
 }
 
 static inline bool kvm_pat_valid(u64 data)
-- 
2.33.0.259.gc128427fd7-goog
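
For clarity, a minimal sketch (mirroring the pmu.c hunk in patch 08/13 of this
series, not new code introduced by this patch) of how the more precise flag is
consumed by the PMC overflow path:

/* Sketch only: request a PMI directly only from the NMI that hit in guest. */
static void kvm_perf_overflow_sketch(struct kvm_pmc *pmc)
{
	if (kvm_handling_nmi_from_guest(pmc->vcpu))
		kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
	else
		irq_work_queue(&pmc_to_pmu(pmc)->irq_work);
}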




[PATCH v2 08/13] KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu variable

2021-08-27 Thread Sean Christopherson
Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest'
variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu.
kvm_before/after_interrupt() must be called while the vCPU is loaded
(which protects against preemption), thus kvm_running_vcpu is guaranteed
to be non-NULL when handling_intr_from_guest is non-zero.

Switching to kvm_get_running_vcpu() will allow moving KVM's perf
callbacks to generic code, and the new flag will be used in a future
patch to more precisely identify the "NMI from guest" case.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  3 +--
 arch/x86/kvm/pmu.c  |  2 +-
 arch/x86/kvm/x86.c  | 21 -
 arch/x86/kvm/x86.h  | 10 ++
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1080166fc0cf..2d86a2dfc775 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -763,6 +763,7 @@ struct kvm_vcpu_arch {
unsigned nmi_pending; /* NMI queued after currently running handler */
bool nmi_injected;/* Trying to inject an NMI this entry */
bool smi_pending;/* SMI queued after currently running handler */
+   u8 handling_intr_from_guest;
 
struct kvm_mtrr mtrr_state;
u64 pat;
@@ -1874,8 +1875,6 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
 int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
 
-unsigned int kvm_guest_state(void);
-
 void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 u32 size);
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 5b68d4188de0..eef48258e50f 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -87,7 +87,7 @@ static void kvm_perf_overflow_intr(struct perf_event 
*perf_event,
 * woken up. So we should wake it, but this is impossible from
 * NMI context. Do it from irq work instead.
 */
-   if (!kvm_guest_state())
+   if (!kvm_handling_nmi_from_guest(pmc->vcpu))
irq_work_queue(&pmc_to_pmu(pmc)->irq_work);
else
kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2a4d085aa4f..6df300c55461 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,15 +8264,17 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);
+static inline bool kvm_pmi_in_guest(struct kvm_vcpu *vcpu)
+{
+   return vcpu && vcpu->arch.handling_intr_from_guest;
+}
 
-unsigned int kvm_guest_state(void)
+static unsigned int kvm_guest_state(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
unsigned int state;
 
-   if (!vcpu)
+   if (!kvm_pmi_in_guest(vcpu))
return 0;
 
state = PERF_GUEST_ACTIVE;
@@ -8284,9 +8286,10 @@ unsigned int kvm_guest_state(void)
 
 static unsigned long kvm_guest_get_ip(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   if (WARN_ON_ONCE(!vcpu))
+   /* Retrieving the IP must be guarded by a call to kvm_guest_state(). */
+   if (WARN_ON_ONCE(!kvm_pmi_in_guest(vcpu)))
return 0;
 
return kvm_rip_read(vcpu);
@@ -8294,10 +8297,10 @@ static unsigned long kvm_guest_get_ip(void)
 
 static unsigned int kvm_handle_intel_pt_intr(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
/* '0' on failure so that the !PT case can use a RET0 static call. */
-   if (!vcpu)
+   if (!kvm_pmi_in_guest(vcpu))
return 0;
 
kvm_make_request(KVM_REQ_PMI, vcpu);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 7d66d63dc55a..a9c107e7c907 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -387,18 +387,20 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
 }
 
-DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-
 static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
 {
-   __this_cpu_write(current_vcpu, vcpu);
+   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 1);
 }
 
 static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
 {
-   __this_cpu_write(current_vcpu, NULL);
+   WRITE_ONCE(vcpu->arch.handling_intr_from_guest, 0);
 }
 
+static inline bool kvm_handling_nmi_from_guest(struct kvm_vcpu *vcpu)
+{
+   retur

[PATCH v2 07/13] perf/core: Use static_call to optimize perf_guest_info_callbacks

2021-08-27 Thread Sean Christopherson
From: Like Xu 

Use static_call to optimize perf's guest callbacks on arm64 and x86,
which are now the only architectures that define the callbacks.  Use
DEFINE_STATIC_CALL_RET0 as the default/NULL for all guest callbacks, as
the callback semantics are that a return value '0' means "not in guest".

static_call obviously avoids the overhead of CONFIG_RETPOLINE=y, but is
also advantageous versus other solutions, e.g. per-cpu callbacks, in that
a per-cpu memory load is not needed to detect the !guest case.

Suggested-by: Peter Zijlstra (Intel) 
Originally-by: Peter Zijlstra (Intel) 
Signed-off-by: Like Xu 
Signed-off-by: Zhu Lingshan 
[sean: split out patch, drop __weak, tweak updaters, rewrite changelog]
Signed-off-by: Sean Christopherson 
---
 arch/arm64/kernel/perf_callchain.c | 31 +++-
 arch/x86/events/core.c | 38 ++
 arch/x86/events/intel/core.c   |  7 +++---
 include/linux/perf_event.h |  9 +--
 kernel/events/core.c   |  2 ++
 5 files changed, 54 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/kernel/perf_callchain.c 
b/arch/arm64/kernel/perf_callchain.c
index 274dc3e11b6d..18cf6e608778 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2015 ARM Limited
  */
 #include 
+#include 
 #include 
 
 #include 
@@ -99,12 +100,24 @@ compat_user_backtrace(struct compat_frame_tail __user 
*tail,
 }
 #endif /* CONFIG_COMPAT */
 
+DEFINE_STATIC_CALL_RET0(arm64_guest_state, *(perf_guest_cbs->state));
+DEFINE_STATIC_CALL_RET0(arm64_guest_get_ip, *(perf_guest_cbs->get_ip));
+
+void arch_perf_update_guest_cbs(struct perf_guest_info_callbacks *guest_cbs)
+{
+   if (guest_cbs) {
+   static_call_update(arm64_guest_state, guest_cbs->state);
+   static_call_update(arm64_guest_get_ip, guest_cbs->get_ip);
+   } else {
+   static_call_update(arm64_guest_state, (void 
*)&__static_call_return0);
+   static_call_update(arm64_guest_get_ip, (void 
*)&__static_call_return0);
+   }
+}
+
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->state()) {
+   if (static_call(arm64_guest_state)()) {
/* We don't support guest os callchain now */
return;
}
@@ -149,10 +162,9 @@ static bool callchain_trace(void *data, unsigned long pc)
 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
   struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe frame;
 
-   if (guest_cbs && guest_cbs->state()) {
+   if (static_call(arm64_guest_state)()) {
/* We don't support guest os callchain now */
return;
}
@@ -163,18 +175,15 @@ void perf_callchain_kernel(struct 
perf_callchain_entry_ctx *entry,
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->state())
-   return guest_cbs->get_ip();
+   if (static_call(arm64_guest_state)())
+   return static_call(arm64_guest_get_ip)();
 
return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-   unsigned int guest_state = guest_cbs ? guest_cbs->state() : 0;
+   unsigned int guest_state = static_call(arm64_guest_state)();
int misc = 0;
 
if (guest_state) {
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 3a7630fdd340..508a677edd8c 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -90,6 +90,29 @@ DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_aliases, 
*x86_pmu.pebs_aliases);
  */
 DEFINE_STATIC_CALL_RET0(x86_pmu_guest_get_msrs, *x86_pmu.guest_get_msrs);
 
+DEFINE_STATIC_CALL_RET0(x86_guest_state, *(perf_guest_cbs->state));
+DEFINE_STATIC_CALL_RET0(x86_guest_get_ip, *(perf_guest_cbs->get_ip));
+DEFINE_STATIC_CALL_RET0(x86_guest_handle_intel_pt_intr, 
*(perf_guest_cbs->handle_intel_pt_intr));
+
+void arch_perf_update_guest_cbs(struct perf_guest_info_callbacks *guest_cbs)
+{
+   if (guest_cbs) {
+   static_call_update(x86_guest_state, guest_cbs->state);
+   static_call_update(x86_guest_get_ip, guest_cbs->get_ip);
+   } else {
+   static_call_update(x86_guest_state, (void 
*)&__static_call_return0);
+   static_call_update(x86_guest_get_ip, (void 
*)&__static_call_return0);
+   }
+
+   /* Implementing ->handle_intel_pt_intr is optional. 

[PATCH v2 06/13] perf/core: Rework guest callbacks to prepare for static_call support

2021-08-27 Thread Sean Christopherson
From: Like Xu 

To prepare for using static_calls to optimize perf's guest callbacks,
replace ->is_in_guest and ->is_user_mode with a new multiplexed hook
->state, tweak ->handle_intel_pt_intr to play nice with being called when
there is no active guest, and drop "guest" from ->get_guest_ip().

Return '0' from ->state and ->handle_intel_pt_intr to indicate "not in
guest" so that DEFINE_STATIC_CALL_RET0 can be used to define the static
calls, i.e. no callback == !guest.

Suggested-by: Peter Zijlstra (Intel) 
Originally-by: Peter Zijlstra (Intel) 
Signed-off-by: Like Xu 
Signed-off-by: Zhu Lingshan 
[sean: extracted from static_call patch, fixed get_ip() bug, wrote changelog]
Signed-off-by: Sean Christopherson 
---
 arch/arm64/kernel/perf_callchain.c | 13 +-
 arch/arm64/kvm/perf.c  | 35 +++---
 arch/x86/events/core.c | 13 +-
 arch/x86/events/intel/core.c   |  5 +---
 arch/x86/include/asm/kvm_host.h|  2 +-
 arch/x86/kvm/pmu.c |  2 +-
 arch/x86/kvm/x86.c | 40 --
 arch/x86/xen/pmu.c | 32 ++--
 include/linux/perf_event.h | 10 +---
 kernel/events/core.c   |  1 +
 10 files changed, 74 insertions(+), 79 deletions(-)

diff --git a/arch/arm64/kernel/perf_callchain.c 
b/arch/arm64/kernel/perf_callchain.c
index 86d9f2013172..274dc3e11b6d 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -104,7 +104,7 @@ void perf_callchain_user(struct perf_callchain_entry_ctx 
*entry,
 {
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -152,7 +152,7 @@ void perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry,
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe frame;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->state()) {
/* We don't support guest os callchain now */
return;
}
@@ -165,8 +165,8 @@ unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
 
-   if (guest_cbs && guest_cbs->is_in_guest())
-   return guest_cbs->get_guest_ip();
+   if (guest_cbs && guest_cbs->state())
+   return guest_cbs->get_ip();
 
return instruction_pointer(regs);
 }
@@ -174,10 +174,11 @@ unsigned long perf_instruction_pointer(struct pt_regs 
*regs)
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
+   unsigned int guest_state = guest_cbs ? guest_cbs->state() : 0;
int misc = 0;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   if (guest_cbs->is_user_mode())
+   if (guest_state) {
+   if (guest_state & PERF_GUEST_USER)
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 039fe59399a2..893de1a51fea 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,39 +13,34 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
-static int kvm_is_in_guest(void)
+static unsigned int kvm_guest_state(void)
 {
-return kvm_get_running_vcpu() != NULL;
-}
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+   unsigned int state;
 
-static int kvm_is_user_mode(void)
-{
-   struct kvm_vcpu *vcpu;
-
-   vcpu = kvm_get_running_vcpu();
+   if (!vcpu)
+   return 0;
 
-   if (vcpu)
-   return !vcpu_mode_priv(vcpu);
+   state = PERF_GUEST_ACTIVE;
+   if (!vcpu_mode_priv(vcpu))
+   state |= PERF_GUEST_USER;
 
-   return 0;
+   return state;
 }
 
 static unsigned long kvm_get_guest_ip(void)
 {
-   struct kvm_vcpu *vcpu;
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   vcpu = kvm_get_running_vcpu();
+   if (WARN_ON_ONCE(!vcpu))
+   return 0;
 
-   if (vcpu)
-   return *vcpu_pc(vcpu);
-
-   return 0;
+   return *vcpu_pc(vcpu);
 }
 
 static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .is_in_guest= kvm_is_in_guest,
-   .is_user_mode   = kvm_is_user_mode,
-   .get_guest_ip   = kvm_get_guest_ip,
+   .state  = kvm_guest_state,
+   .get_ip = kvm_get_guest_ip,
 };
 
 void kvm_perf_init(void)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index f

[PATCH v2 05/13] perf: Force architectures to opt-in to guest callbacks

2021-08-27 Thread Sean Christopherson
Introduce HAVE_GUEST_PERF_EVENTS and require architectures to select it
to allow registering guest callbacks in perf.  Future patches will convert
the callbacks to static_call.  Rather than churn a bunch of arch code (that
was presumably copy+pasted from x86), remove it wholesale as it's useless
and at best wasting cycles.

Wrap even the stubs with an #ifdef to avoid an arch sneaking in a bogus
registration with CONFIG_PERF_EVENTS=n.

Signed-off-by: Sean Christopherson 
---
 arch/arm/kernel/perf_callchain.c   | 33 -
 arch/arm64/Kconfig |  1 +
 arch/csky/kernel/perf_callchain.c  | 12 ---
 arch/nds32/kernel/perf_event_cpu.c | 34 --
 arch/riscv/kernel/perf_callchain.c | 13 
 arch/x86/Kconfig   |  1 +
 include/linux/perf_event.h |  4 
 init/Kconfig   |  3 +++
 kernel/events/core.c   |  2 ++
 9 files changed, 19 insertions(+), 84 deletions(-)

diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 1626dfc6f6ce..bc6b246ab55e 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -62,14 +62,8 @@ user_backtrace(struct frame_tail __user *tail,
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct frame_tail __user *tail;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   /* We don't support guest os callchain now */
-   return;
-   }
-
perf_callchain_store(entry, regs->ARM_pc);
 
if (!current->mm)
@@ -99,44 +93,25 @@ callchain_trace(struct stackframe *fr,
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe fr;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   /* We don't support guest os callchain now */
-   return;
-   }
-
arm_get_current_stackframe(regs, &fr);
walk_stackframe(&fr, callchain_trace, entry);
 }
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
-
-   if (guest_cbs && guest_cbs->is_in_guest())
-   return guest_cbs->get_guest_ip();
-
return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
int misc = 0;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   if (guest_cbs->is_user_mode())
-   misc |= PERF_RECORD_MISC_GUEST_USER;
-   else
-   misc |= PERF_RECORD_MISC_GUEST_KERNEL;
-   } else {
-   if (user_mode(regs))
-   misc |= PERF_RECORD_MISC_USER;
-   else
-   misc |= PERF_RECORD_MISC_KERNEL;
-   }
+   if (user_mode(regs))
+   misc |= PERF_RECORD_MISC_USER;
+   else
+   misc |= PERF_RECORD_MISC_KERNEL;
 
return misc;
 }
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index b5b13a932561..72a201a686c5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -190,6 +190,7 @@ config ARM64
select HAVE_NMI
select HAVE_PATA_PLATFORM
select HAVE_PERF_EVENTS
+   select HAVE_GUEST_PERF_EVENTS
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
select HAVE_REGS_AND_STACK_ACCESS_API
diff --git a/arch/csky/kernel/perf_callchain.c 
b/arch/csky/kernel/perf_callchain.c
index 35318a635a5f..92057de08f4f 100644
--- a/arch/csky/kernel/perf_callchain.c
+++ b/arch/csky/kernel/perf_callchain.c
@@ -86,13 +86,8 @@ static unsigned long user_backtrace(struct 
perf_callchain_entry_ctx *entry,
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
unsigned long fp = 0;
 
-   /* C-SKY does not support virtualization. */
-   if (guest_cbs && guest_cbs->is_in_guest())
-   return;
-
fp = regs->regs[4];
perf_callchain_store(entry, regs->pc);
 
@@ -111,15 +106,8 @@ void perf_callchain_user(struct perf_callchain_entry_ctx 
*entry,
 void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry,
   struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe fr;
 
-   /* C-SKY does not support virtualization. */
-   if (guest_cbs && guest_cbs->is_in_guest()) {
-   pr_warn("C-SKY does not suppor

[PATCH v2 04/13] perf: Stop pretending that perf can handle multiple guest callbacks

2021-08-27 Thread Sean Christopherson
Drop the 'int' return value from the perf (un)register callbacks helpers
and stop pretending perf can support multiple callbacks.  The 'int'
returns are not future proofing anything as none of the callers take
action on an error.  It's also not obvious that there will ever be
co-tenant hypervisors, and if there are, that allowing multiple callbacks
to be registered is desirable or even correct.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  4 ++--
 arch/arm64/kvm/perf.c |  8 
 arch/x86/kvm/x86.c|  2 +-
 include/linux/perf_event.h| 11 +--
 kernel/events/core.c  | 14 +++---
 5 files changed, 15 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 41911585ae0c..ed940aec89e0 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,8 +670,8 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned 
int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-int kvm_perf_init(void);
-int kvm_perf_teardown(void);
+void kvm_perf_init(void);
+void kvm_perf_teardown(void);
 
 long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
 gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 151c31fb9860..039fe59399a2 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -48,15 +48,15 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.get_guest_ip   = kvm_get_guest_ip,
 };
 
-int kvm_perf_init(void)
+void kvm_perf_init(void)
 {
if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
static_branch_enable(&kvm_arm_pmu_available);
 
-   return perf_register_guest_info_callbacks(&kvm_guest_cbs);
+   perf_register_guest_info_callbacks(&kvm_guest_cbs);
 }
 
-int kvm_perf_teardown(void)
+void kvm_perf_teardown(void)
 {
-   return perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   perf_unregister_guest_info_callbacks();
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ffc6c2d73508..bae951344e28 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11092,7 +11092,7 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
-   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   perf_unregister_guest_info_callbacks();
kvm_guest_cbs.handle_intel_pt_intr = NULL;
 
static_call(kvm_x86_hardware_unsetup)();
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 6b0405e578c1..2b77e2154b61 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1245,8 +1245,8 @@ static inline struct perf_guest_info_callbacks 
*perf_get_guest_cbs(void)
/* Prevent reloading between a !NULL check and dereferences. */
return READ_ONCE(perf_guest_cbs);
 }
-extern int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks 
*callbacks);
-extern int perf_unregister_guest_info_callbacks(struct 
perf_guest_info_callbacks *callbacks);
+extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *callbacks);
+extern void perf_unregister_guest_info_callbacks(void);
 
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
@@ -1489,10 +1489,9 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs 
*regs, u64 addr)  { }
 static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
-static inline int perf_register_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { 
return 0; }
-static inline int perf_unregister_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { 
return 0; }
+static inline void perf_register_guest_info_callbacks
+(struct perf_guest_info_callbacks *callbacks)  { }
+static inline void perf_unregister_guest_info_callbacks(void)  { }
 
 static inline void perf_event_mmap(struct vm_area_struct *vma) { }
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2126f6327321..1be95d9ace46 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6482,29 +6482,21 @@ static void perf_pending_event(struct irq_work *entry)
perf_swevent_put_recursion_context(rctx);
 }
 
-/*
- * We assume there is only KVM supporting the callbacks.
- * Later on, we might change it to a list if there is
- * another virtualization implementation supporting the callbacks.
- */
 struct perf_guest_info_callbacks *perf_guest_cbs;
-
-int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
+void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
if (WARN_ON_ONCE(perf_g
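
The kernel/events/core.c hunk is truncated above.  Combined with the
READ_ONCE/WRITE_ONCE + synchronize_rcu() conversion from patch 01, the
resulting helpers are presumably along these lines (a reconstruction for
readability, not the literal hunk):

void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
	if (WARN_ON_ONCE(perf_guest_cbs))
		return;

	WRITE_ONCE(perf_guest_cbs, cbs);
}
EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);

void perf_unregister_guest_info_callbacks(void)
{
	WRITE_ONCE(perf_guest_cbs, NULL);
	synchronize_rcu();
}
EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);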

[PATCH v2 03/13] KVM: x86: Register Processor Trace interrupt hook iff PT enabled in guest

2021-08-27 Thread Sean Christopherson
Override the Processor Trace (PT) interrupt handler for guest mode if and
only if PT is configured for host+guest mode, i.e. is being used
independently by both host and guest.  If PT is configured for system
mode, the host fully controls PT and must handle all events.

Fixes: 8479e04e7d6b ("KVM: x86: Inject PMI for KVM guest")
Cc: sta...@vger.kernel.org
Cc: Like Xu 
Reported-by: Alexander Shishkin 
Reported-by: Artem Kashkanov 
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx/vmx.c  | 1 +
 arch/x86/kvm/x86.c  | 5 -
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09b256db394a..1ea4943a73d7 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1494,6 +1494,7 @@ struct kvm_x86_init_ops {
int (*disabled_by_bios)(void);
int (*check_processor_compatibility)(void);
int (*hardware_setup)(void);
+   bool (*intel_pt_intr_in_guest)(void);
 
struct kvm_x86_ops *runtime_ops;
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fada1055f325..f19d72136f77 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7896,6 +7896,7 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
+   .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest,
 
.runtime_ops = &vmx_x86_ops,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb6015f97f9e..ffc6c2d73508 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8305,7 +8305,7 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.is_in_guest= kvm_is_in_guest,
.is_user_mode   = kvm_is_user_mode,
.get_guest_ip   = kvm_get_guest_ip,
-   .handle_intel_pt_intr   = kvm_handle_intel_pt_intr,
+   .handle_intel_pt_intr   = NULL,
 };
 
 #ifdef CONFIG_X86_64
@@ -11061,6 +11061,8 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
+   kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
perf_register_guest_info_callbacks(&kvm_guest_cbs);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
@@ -11091,6 +11093,7 @@ int kvm_arch_hardware_setup(void *opaque)
 void kvm_arch_hardware_unsetup(void)
 {
perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+   kvm_guest_cbs.handle_intel_pt_intr = NULL;
 
static_call(kvm_x86_hardware_unsetup)();
 }
-- 
2.33.0.259.gc128427fd7-goog
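
For reference, the new init hook wired up above presumably reduces to a check
of the configured PT mode, roughly as follows (a sketch, not part of this
diff):

static inline bool vmx_pt_mode_is_host_guest(void)
{
	/* KVM handles PT PMIs only when the guest has its own PT context. */
	return pt_mode == PT_MODE_HOST_GUEST;
}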




[PATCH v2 02/13] KVM: x86: Register perf callbacks after calling vendor's hardware_setup()

2021-08-27 Thread Sean Christopherson
Wait to register perf callbacks until after doing vendor hardware setup.
VMX's hardware_setup() configures Intel Processor Trace (PT) mode, and a
future fix to register the Intel PT guest interrupt hook if and only if
Intel PT is exposed to the guest will consume the configured PT mode.

Delaying registration to hardware setup is effectively a nop as KVM's perf
hooks all pivot on the per-CPU current_vcpu, which is non-NULL only when
KVM is handling an IRQ/NMI in a VM-Exit path.  I.e. current_vcpu will be
NULL throughout both kvm_arch_init() and kvm_arch_hardware_setup().

Cc: Alexander Shishkin 
Cc: Artem Kashkanov 
Cc: sta...@vger.kernel.org
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 86539c1686fa..fb6015f97f9e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8426,8 +8426,6 @@ int kvm_arch_init(void *opaque)
 
kvm_timer_init();
 
-   perf_register_guest_info_callbacks(&kvm_guest_cbs);
-
if (boot_cpu_has(X86_FEATURE_XSAVE)) {
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
@@ -8461,7 +8459,6 @@ void kvm_arch_exit(void)
clear_hv_tscchange_cb();
 #endif
kvm_lapic_exit();
-   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
cpufreq_unregister_notifier(_cpufreq_notifier_block,
@@ -11064,6 +11061,8 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
+   perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
 
@@ -11091,6 +11090,8 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
+   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
static_call(kvm_x86_hardware_unsetup)();
 }
 
-- 
2.33.0.259.gc128427fd7-goog




[PATCH v2 01/13] perf: Ensure perf_guest_cbs aren't reloaded between !NULL check and deref

2021-08-27 Thread Sean Christopherson
Protect perf_guest_cbs with READ_ONCE/WRITE_ONCE to ensure it's not
reloaded between a !NULL check and a dereference, and wait for all
readers via synchronize_rcu() to prevent use-after-free, e.g. if the
callbacks are being unregistered during module unload.  Because the
callbacks are global, it's possible for readers to run in parallel with
an unregister operation.

The bug has escaped notice because all dereferences of perf_guest_cbs
follow the same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern,
and it's extremely unlikely a compiler will reload perf_guest_cbs in this
sequence.  Compilers do reload perf_guest_cbs for future derefs, e.g. for
->is_user_mode(), but the ->is_in_guest() guard all but guarantees the
PMI handler will win the race, e.g. to nullify perf_guest_cbs, KVM has to
completely exit the guest and tear down all VMs before KVM starts its
module unload / unregister sequence.

But with help, unloading kvm_intel can trigger a NULL pointer dereference,
e.g. wrapping perf_guest_cbs with READ_ONCE in perf_misc_flags() while
spamming kvm_intel module load/unload leads to:

  BUG: kernel NULL pointer dereference, address: 
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 0 P4D 0
  Oops:  [#1] PREEMPT SMP
  CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:perf_misc_flags+0x1c/0x70
  Call Trace:
   perf_prepare_sample+0x53/0x6b0
   perf_event_output_forward+0x67/0x160
   __perf_event_overflow+0x52/0xf0
   handle_pmi_common+0x207/0x300
   intel_pmu_handle_irq+0xcf/0x410
   perf_event_nmi_handler+0x28/0x50
   nmi_handle+0xc7/0x260
   default_do_nmi+0x6b/0x170
   exc_nmi+0x103/0x130
   asm_exc_nmi+0x76/0xbf

Fixes: 39447b386c84 ("perf: Enhance perf to allow for guest statistic 
collection from host")
Cc: sta...@vger.kernel.org
Signed-off-by: Sean Christopherson 
---
 arch/arm/kernel/perf_callchain.c   | 17 +++--
 arch/arm64/kernel/perf_callchain.c | 18 --
 arch/csky/kernel/perf_callchain.c  |  6 --
 arch/nds32/kernel/perf_event_cpu.c | 17 +++--
 arch/riscv/kernel/perf_callchain.c |  7 +--
 arch/x86/events/core.c | 17 +++--
 arch/x86/events/intel/core.c   |  9 ++---
 include/linux/perf_event.h |  8 
 kernel/events/core.c   |  9 +++--
 9 files changed, 75 insertions(+), 33 deletions(-)

diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 3b69a76d341e..1626dfc6f6ce 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -62,9 +62,10 @@ user_backtrace(struct frame_tail __user *tail,
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct frame_tail __user *tail;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* We don't support guest os callchain now */
return;
}
@@ -98,9 +99,10 @@ callchain_trace(struct stackframe *fr,
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stackframe fr;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+   if (guest_cbs && guest_cbs->is_in_guest()) {
/* We don't support guest os callchain now */
return;
}
@@ -111,18 +113,21 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
*entry, struct pt_regs *re
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-   return perf_guest_cbs->get_guest_ip();
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
+
+   if (guest_cbs && guest_cbs->is_in_guest())
+   return guest_cbs->get_guest_ip();
 
return instruction_pointer(regs);
 }
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
int misc = 0;
 
-   if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-   if (perf_guest_cbs->is_user_mode())
+   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs->is_user_mode())
misc |= PERF_RECORD_MISC_GUEST_USER;
else
misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/arm64/kernel/perf_callchain.c 
b/arch/arm64/kernel/perf_callchain.c
index 4a72c2727309..86d9f2013172 100644
--- a/arch/arm64/kernel/

[PATCH v2 00/13] perf: KVM: Fix, optimize, and clean up callbacks

2021-08-27 Thread Sean Christopherson
This is a combination of ~2 series to fix bugs in the perf+KVM callbacks,
optimize the callbacks by employing static_call, and do a variety of
cleanup in both perf and KVM.

Patch 1 fixes a mostly-theoretical bug where perf can deref a NULL
pointer if KVM unregisters its callbacks while they're being accessed.
In practice, compilers tend to avoid problematic reloads of the pointer
and the PMI handler doesn't lose the race against module unloading,
i.e. doesn't hit a use-after-free.

Patches 2 and 3 fix an Intel PT handling bug where KVM incorrectly
eats PT interrupts when PT is supposed to be owned entirely by the host.

Patches 4-7 clean up perf's callback infrastructure and switch to
static_call for arm64 and x86 (the only survivors).

Patches 8-13 clean up related KVM code and unify the arm64/x86 callbacks.

Based on "git://git.kernel.org/pub/scm/virt/kvm/kvm.git queue", commit
680c7e3be6a3 ("KVM: x86: Exit to userspace ...").

v2 (relatively to static_call v10)
  - Split the patch into the semantic change (multiplexed ->state) and
introduction of static_call.
  - Don't use '0' for "not a guest RIP".
  - Handle unregister path.
  - Drop changes for architectures that can be culled entirely.

v2 (relative to v1)
  - Drop per-cpu approach. [Peter]
  - Fix mostly-theoretical reload and use-after-free with READ_ONCE(),
WRITE_ONCE(), and synchronize_rcu(). [Peter]
  - Avoid new exports like the plague. [Peter]

v1:
  - https://lkml.kernel.org/r/20210827005718.585190-1-sea...@google.com

v10 static_call:
  - https://lkml.kernel.org/r/20210806133802.3528-2-lingshan@intel.com

Like Xu (2):
  perf/core: Rework guest callbacks to prepare for static_call support
  perf/core: Use static_call to optimize perf_guest_info_callbacks

Sean Christopherson (11):
  perf: Ensure perf_guest_cbs aren't reloaded between !NULL check and
deref
  KVM: x86: Register perf callbacks after calling vendor's
hardware_setup()
  KVM: x86: Register Processor Trace interrupt hook iff PT enabled in
guest
  perf: Stop pretending that perf can handle multiple guest callbacks
  perf: Force architectures to opt-in to guest callbacks
  KVM: x86: Drop current_vcpu for kvm_running_vcpu + kvm_arch_vcpu
variable
  KVM: x86: More precisely identify NMI from guest when handling PMI
  KVM: Move x86's perf guest info callbacks to generic KVM
  KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c
  KVM: arm64: Convert to the generic perf callbacks
  KVM: arm64: Drop perf.c and fold its tiny bits of code into arm.c /
pmu.c

 arch/arm/kernel/perf_callchain.c   | 28 ++
 arch/arm64/Kconfig |  1 +
 arch/arm64/include/asm/kvm_host.h  | 13 ++-
 arch/arm64/kernel/perf_callchain.c | 28 +++---
 arch/arm64/kvm/Makefile|  2 +-
 arch/arm64/kvm/arm.c   | 11 +-
 arch/arm64/kvm/perf.c  | 62 --
 arch/arm64/kvm/pmu.c   |  8 
 arch/csky/kernel/perf_callchain.c  | 10 -
 arch/nds32/kernel/perf_event_cpu.c | 29 ++
 arch/riscv/kernel/perf_callchain.c | 10 -
 arch/x86/Kconfig   |  1 +
 arch/x86/events/core.c | 36 ++---
 arch/x86/events/intel/core.c   |  7 ++--
 arch/x86/include/asm/kvm_host.h|  8 +++-
 arch/x86/kvm/pmu.c |  2 +-
 arch/x86/kvm/svm/svm.c |  2 +-
 arch/x86/kvm/vmx/vmx.c | 25 +++-
 arch/x86/kvm/x86.c | 58 +---
 arch/x86/kvm/x86.h | 17 ++--
 arch/x86/xen/pmu.c | 32 +++
 include/kvm/arm_pmu.h  |  1 +
 include/linux/kvm_host.h   | 10 +
 include/linux/perf_event.h | 26 -
 init/Kconfig   |  3 ++
 kernel/events/core.c   | 24 ++--
 virt/kvm/kvm_main.c| 40 +++
 27 files changed, 245 insertions(+), 249 deletions(-)
 delete mode 100644 arch/arm64/kvm/perf.c

-- 
2.33.0.259.gc128427fd7-goog




Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks

2021-08-27 Thread Sean Christopherson
On Fri, Aug 06, 2021, Zhu Lingshan wrote:
> @@ -2944,18 +2966,21 @@ static unsigned long code_segment_base(struct pt_regs 
> *regs)
>  
>  unsigned long perf_instruction_pointer(struct pt_regs *regs)
>  {
> - if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
> - return perf_guest_cbs->get_guest_ip();
> + unsigned long ip = static_call(x86_guest_get_ip)();
> +
> + if (likely(!ip))

Pivoting on ip==0 isn't correct, it's perfectly legal for a guest to execute
from %rip=0.  Unless there's some static_call() magic that supports this with a
default function:

if (unlikely(!static_call(x86_guest_get_ip)()))
regs->ip + code_segment_base(regs)

return ip;

The easiest thing is keep the existing:

if (unlikely(static_call(x86_guest_state)()))
return static_call(x86_guest_get_ip)();

return regs->ip + code_segment_base(regs);

It's an extra call for PMIs in guest, but I don't think any of the KVM folks 
care
_that_ much about the performance in this case.

> + ip = regs->ip + code_segment_base(regs);
>  
> - return regs->ip + code_segment_base(regs);
> + return ip;
>  }



Re: [PATCH 05/15] perf: Track guest callbacks on a per-CPU basis

2021-08-27 Thread Sean Christopherson
On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> On Fri, Aug 27, 2021 at 02:49:50PM +0000, Sean Christopherson wrote:
> > On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> > > On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> > > > Use a per-CPU pointer to track perf's guest callbacks so that KVM can 
> > > > set
> > > > the callbacks more precisely and avoid a lurking NULL pointer 
> > > > dereference.
> > > 
> > > I'm completely failing to see how per-cpu helps anything here...
> > 
> > It doesn't help until KVM is converted to set the per-cpu pointer in flows 
> > that
> > are protected against preemption, and more specifically when KVM only 
> > writes to
> > the pointer from the owning CPU.  
> 
> So the 'problem' I have with this is that sane (!KVM using) people, will
> still have to suffer that load, whereas with the static_call() we patch
> in an 'xor %rax,%rax' and only have immediate code flow.

Again, I've no objection to the static_call() approach.  I didn't even see the
patch until I had finished testing my series :-/

> > Ignoring static call for the moment, I don't see how the unreg side can be 
> > safe
> > using a bare single global pointer.  There is no way for KVM to prevent an 
> > NMI
> > from running in parallel on a different CPU.  If there's a more elegant 
> > solution,
> > especially something that can be backported, e.g. an rcu-protected pointer, 
> > I'm
> > all for it.  I went down the per-cpu path because it allowed for cleanups 
> > in KVM,
> > but similar cleanups can be done without per-cpu perf callbacks.
> 
> If all the perf_guest_cbs dereferences are with preemption disabled
> (IRQs disabled, IRQ context, NMI context included), then the sequence:
> 
>   WRITE_ONCE(perf_guest_cbs, NULL);
>   synchronize_rcu();
> 
> Ensures that all prior observers of perf_guest_cbs will have completed
> and future observes must observe the NULL value.

That alone won't be sufficient, as the read side also needs to ensure it doesn't
reload perf_guest_cbs between NULL checks and dereferences.  But that's easy
enough to solve with a READ_ONCE and maybe a helper to make it more cumbersome
to use perf_guest_cbs directly.
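
Concretely, something like this on the read side (a sketch of the helper being
suggested, assuming all readers run with preemption/IRQs disabled so that
synchronize_rcu() can flush them):

static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
{
	/* Prevent reloading perf_guest_cbs between a !NULL check and dereferences. */
	return READ_ONCE(perf_guest_cbs);
}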

How about this for a series?

  1. Use READ_ONCE/WRITE_ONCE + synchronize_rcu() to fix the underlying bug
  2. Fix KVM PT interrupt handler bug
  3. Kill off perf_guest_cbs usage in architectures that don't need the 
callbacks
  4. Replace ->is_in_guest()/->is_user_mode() with ->state(), and 
s/get_guest_ip/get_ip
  5. Implement static_call() support
  6. Cleanups, if there are any
  6..N KVM cleanups, e.g. to eliminate current_vcpu and share x86+arm64 
callbacks



Re: [PATCH 07/15] KVM: Use dedicated flag to track if KVM is handling an NMI from guest

2021-08-27 Thread Sean Christopherson
On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> On Thu, Aug 26, 2021 at 05:57:10PM -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index 5cedc0e8a5d5..4c5ba4128b38 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -395,9 +395,10 @@ static inline void kvm_unregister_perf_callbacks(void)
> >  
> >  DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
> >  
> > -static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
> > +static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi)
> >  {
> > __this_cpu_write(current_vcpu, vcpu);
> > +   WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi);
> >  
> > kvm_register_perf_callbacks();
> >  }
> > @@ -406,6 +407,7 @@ static inline void kvm_after_interrupt(struct kvm_vcpu 
> > *vcpu)
> >  {
> > kvm_unregister_perf_callbacks();
> >  
> > +   WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, false);
> > __this_cpu_write(current_vcpu, NULL);
> >  }
> 
> Does this rely on kvm_{,un}register_perf_callback() being a function
> call and thus implying a sequence point to order the stores? 

No, I'm just terrible at remembering which macros provide what ordering 
guarantees,
i.e. I was thinking WRITE_ONCE provided guarantees against compiler reordering.



Re: [PATCH 05/15] perf: Track guest callbacks on a per-CPU basis

2021-08-27 Thread Sean Christopherson
On Fri, Aug 27, 2021, Peter Zijlstra wrote:
> On Thu, Aug 26, 2021 at 05:57:08PM -0700, Sean Christopherson wrote:
> > Use a per-CPU pointer to track perf's guest callbacks so that KVM can set
> > the callbacks more precisely and avoid a lurking NULL pointer dereference.
> 
> I'm completely failing to see how per-cpu helps anything here...

It doesn't help until KVM is converted to set the per-cpu pointer in flows that
are protected against preemption, and more specifically when KVM only writes to
the pointer from the owning CPU.  

> > On x86, KVM supports being built as a module and thus can be unloaded.
> > And because the shared callbacks are referenced from IRQ/NMI context,
> > unloading KVM can run concurrently with perf, and thus all of perf's
> > checks for a NULL perf_guest_cbs are flawed as perf_guest_cbs could be
> > nullified between the check and dereference.
> 
> No longer allowing KVM to be a module would be *AWESOME*. I detest how
> much we have to export for KVM :/
> 
> Still, what stops KVM from doing a coherent unreg? Even the
> static_call() proposed in the other patch, unreg can do
> static_call_update() + synchronize_rcu() to ensure everybody sees the
> updated pointer (would require a quick audit to see all users are with
> preempt disabled, but I think your using per-cpu here already imposes
> the same).

Ignoring static call for the moment, I don't see how the unreg side can be safe
using a bare single global pointer.  There is no way for KVM to prevent an NMI
from running in parallel on a different CPU.  If there's a more elegant 
solution,
especially something that can be backported, e.g. an rcu-protected pointer, I'm
all for it.  I went down the per-cpu path because it allowed for cleanups in 
KVM,
but similar cleanups can be done without per-cpu perf callbacks.

As for static calls, I certainly have no objection to employing static calls for
the callbacks, but IMO we should not be relying on static call for correctness,
i.e. the existing bug needs to be fixed first.



[PATCH 14/15] perf: Disallow bulk unregistering of guest callbacks and do cleanup

2021-08-26 Thread Sean Christopherson
Drop the helper that allows bulk unregistering of the per-CPU callbacks
now that KVM, the only entity that actually unregisters callbacks, uses
the per-CPU helpers.  Bulk unregistering is inherently unsafe as there
are no protections against nullifying a pointer for a CPU that is using
said pointer in a PMI handler.

Opportunistically tweak names to better reflect reality.

Signed-off-by: Sean Christopherson 
---
 arch/x86/xen/pmu.c |  2 +-
 include/linux/kvm_host.h   |  2 +-
 include/linux/perf_event.h |  9 +++--
 kernel/events/core.c   | 31 +++
 virt/kvm/kvm_main.c|  2 +-
 5 files changed, 17 insertions(+), 29 deletions(-)

diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c
index e13b0b49fcdf..57834de043c3 100644
--- a/arch/x86/xen/pmu.c
+++ b/arch/x86/xen/pmu.c
@@ -548,7 +548,7 @@ void xen_pmu_init(int cpu)
per_cpu(xenpmu_shared, cpu).flags = 0;
 
if (cpu == 0) {
-   perf_register_guest_info_callbacks(_guest_cbs);
+   perf_register_guest_info_callbacks_all_cpus(_guest_cbs);
xen_pmu_arch_init();
}
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0db9af0b628c..d68a49d5fc53 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1171,7 +1171,7 @@ unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
 void kvm_register_perf_callbacks(void);
 static inline void kvm_unregister_perf_callbacks(void)
 {
-   __perf_unregister_guest_info_callbacks();
+   perf_unregister_guest_info_callbacks();
 }
 #endif
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7a367bf1b78d..db701409a62f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1238,10 +1238,9 @@ extern void perf_event_bpf_event(struct bpf_prog *prog,
 
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
 DECLARE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);
-extern void __perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
-extern void __perf_unregister_guest_info_callbacks(void);
-extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *callbacks);
+extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
 extern void perf_unregister_guest_info_callbacks(void);
+extern void perf_register_guest_info_callbacks_all_cpus(struct 
perf_guest_info_callbacks *cbs);
 #endif /* CONFIG_HAVE_GUEST_PERF_EVENTS */
 
 extern void perf_event_exec(void);
@@ -1486,9 +1485,7 @@ static inline void
 perf_bp_event(struct perf_event *event, void *data){ }
 
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
-static inline void perf_register_guest_info_callbacks
-(struct perf_guest_info_callbacks *callbacks)  { }
-static inline void perf_unregister_guest_info_callbacks(void)  { }
+extern void perf_register_guest_info_callbacks_all_cpus(struct 
perf_guest_info_callbacks *cbs);
 #endif
 
 static inline void perf_event_mmap(struct vm_area_struct *vma) { }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2f28d9d8dc94..f1964096c4c2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6485,35 +6485,26 @@ static void perf_pending_event(struct irq_work *entry)
 #ifdef CONFIG_HAVE_GUEST_PERF_EVENTS
 DEFINE_PER_CPU(struct perf_guest_info_callbacks *, perf_guest_cbs);
 
-void __perf_register_guest_info_callbacks(struct perf_guest_info_callbacks 
*cbs)
-{
-   __this_cpu_write(perf_guest_cbs, cbs);
-}
-EXPORT_SYMBOL_GPL(__perf_register_guest_info_callbacks);
-
-void __perf_unregister_guest_info_callbacks(void)
-{
-   __this_cpu_write(perf_guest_cbs, NULL);
-}
-EXPORT_SYMBOL_GPL(__perf_unregister_guest_info_callbacks);
-
 void perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
 {
-   int cpu;
-
-   for_each_possible_cpu(cpu)
-   per_cpu(perf_guest_cbs, cpu) = cbs;
+   __this_cpu_write(perf_guest_cbs, cbs);
 }
 EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks);
 
 void perf_unregister_guest_info_callbacks(void)
 {
-   int cpu;
-
-   for_each_possible_cpu(cpu)
-   per_cpu(perf_guest_cbs, cpu) = NULL;
+   __this_cpu_write(perf_guest_cbs, NULL);
 }
 EXPORT_SYMBOL_GPL(perf_unregister_guest_info_callbacks);
+
+void perf_register_guest_info_callbacks_all_cpus(struct 
perf_guest_info_callbacks *cbs)
+{
+   int cpu;
+
+   for_each_possible_cpu(cpu)
+   per_cpu(perf_guest_cbs, cpu) = cbs;
+}
+EXPORT_SYMBOL_GPL(perf_register_guest_info_callbacks_all_cpus);
 #endif
 
 static void
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e0b1c9386926..1bcc3eab510b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5502,7 +5502,7 @@ EXPORT_SYMBOL_GPL(kvm_set_intel_pt_intr_handler);
 
 void kvm_register_perf_callbacks(void)
 {
-   __perf_register_guest_info_callbacks(_guest_cbs

[PATCH 13/15] KVM: arm64: Drop perf.c and fold its tiny bit of code into pmu.c

2021-08-26 Thread Sean Christopherson
Fold that last few remnants of perf.c into pmu.c and rename the init
helper as appropriate.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  2 --
 arch/arm64/kvm/Makefile   |  2 +-
 arch/arm64/kvm/arm.c  |  3 ++-
 arch/arm64/kvm/perf.c | 20 
 arch/arm64/kvm/pmu.c  |  8 
 include/kvm/arm_pmu.h |  1 +
 6 files changed, 12 insertions(+), 24 deletions(-)
 delete mode 100644 arch/arm64/kvm/perf.c

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 12e8d789e1ac..86c0fdd11ad2 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -670,8 +670,6 @@ unsigned long kvm_mmio_read_buf(const void *buf, unsigned 
int len);
 int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
-void kvm_perf_init(void);
-
 #ifdef CONFIG_PERF_EVENTS
 #define __KVM_WANT_PERF_CALLBACKS
 #else
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index 989bb5dad2c8..0bcc378b7961 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_KVM) += hyp/
 
 kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
-arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
+arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 inject_fault.o va_layout.o handle_exit.o \
 guest.o debug.o reset.o sys_regs.o \
 vgic-sys-reg-v3.o fpsimd.o pmu.o \
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index dfc8078dd4f9..57e637dee71d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -1747,7 +1747,8 @@ static int init_subsystems(void)
if (err)
goto out;
 
-   kvm_perf_init();
+   kvm_pmu_init();
+
kvm_sys_reg_table_init();
 
 out:
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
deleted file mode 100644
index ad9fdc2f2f70..
--- a/arch/arm64/kvm/perf.c
+++ /dev/null
@@ -1,20 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Based on the x86 implementation.
- *
- * Copyright (C) 2012 ARM Ltd.
- * Author: Marc Zyngier 
- */
-
-#include 
-#include 
-
-#include 
-
-DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
-
-void kvm_perf_init(void)
-{
-   if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
-   static_branch_enable(&kvm_arm_pmu_available);
-}
diff --git a/arch/arm64/kvm/pmu.c b/arch/arm64/kvm/pmu.c
index 03a6c1f4a09a..d98b57a17043 100644
--- a/arch/arm64/kvm/pmu.c
+++ b/arch/arm64/kvm/pmu.c
@@ -7,6 +7,14 @@
 #include 
 #include 
 
+DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+
+void kvm_pmu_init(void)
+{
+   if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
+   static_branch_enable(&kvm_arm_pmu_available);
+}
+
 /*
  * Given the perf event attributes and system type, determine
  * if we are going to need to switch counters at guest entry/exit.
diff --git a/include/kvm/arm_pmu.h b/include/kvm/arm_pmu.h
index 864b9997efb2..42270676498d 100644
--- a/include/kvm/arm_pmu.h
+++ b/include/kvm/arm_pmu.h
@@ -14,6 +14,7 @@
 #define ARMV8_PMU_MAX_COUNTER_PAIRS((ARMV8_PMU_MAX_COUNTERS + 1) >> 1)
 
 DECLARE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
+void kvm_pmu_init(void);
 
 static __always_inline bool kvm_arm_support_pmu_v3(void)
 {
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 12/15] KVM: arm64: Convert to the generic perf callbacks

2021-08-26 Thread Sean Christopherson
Drop arm64's version of the callbacks in favor of the callbacks provided
by generic KVM, which are semantically identical.  Implement the "get ip"
hook as needed.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h |  6 +
 arch/arm64/kvm/arm.c  |  5 
 arch/arm64/kvm/perf.c | 38 ---
 3 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 007c38d77fd9..12e8d789e1ac 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -673,11 +673,7 @@ int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t 
fault_ipa);
 void kvm_perf_init(void);
 
 #ifdef CONFIG_PERF_EVENTS
-void kvm_register_perf_callbacks(void);
-static inline void kvm_unregister_perf_callbacks(void)
-{
-   __perf_unregister_guest_info_callbacks();
-}
+#define __KVM_WANT_PERF_CALLBACKS
 #else
 static inline void kvm_register_perf_callbacks(void) {}
 static inline void kvm_unregister_perf_callbacks(void) {}
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ec386971030d..dfc8078dd4f9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -503,6 +503,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
return vcpu_mode_priv(vcpu);
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+   return *vcpu_pc(vcpu);
+}
+
 /* Just ensure a guest exit from a particular CPU */
 static void exit_vm_noop(void *info)
 {
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 2556b0a3b096..ad9fdc2f2f70 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,44 +13,6 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
-#ifdef CONFIG_PERF_EVENTS
-static int kvm_is_in_guest(void)
-{
-   return true;
-}
-
-static int kvm_is_user_mode(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return !vcpu_mode_priv(vcpu);
-}
-
-static unsigned long kvm_get_guest_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return *vcpu_pc(vcpu);
-}
-
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .is_in_guest= kvm_is_in_guest,
-   .is_user_mode   = kvm_is_user_mode,
-   .get_guest_ip   = kvm_get_guest_ip,
-};
-
-void kvm_register_perf_callbacks(void)
-{
-   __perf_register_guest_info_callbacks(&kvm_guest_cbs);
-}
-#endif /* CONFIG_PERF_EVENTS*/
-
 void kvm_perf_init(void)
 {
if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 10/15] KVM: Move x86's perf guest info callbacks to generic KVM

2021-08-26 Thread Sean Christopherson
Move x86's perf guest callbacks into common KVM, as they are semantically
identical to arm64's callbacks (the only other such KVM callbacks).
arm64 will convert to the common versions in a future patch.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c  | 48 +
 arch/x86/kvm/x86.h  |  6 -
 include/linux/kvm_host.h| 12 +
 virt/kvm/kvm_main.c | 46 +++
 5 files changed, 66 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 465b35736d9b..63553a1f43ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -36,6 +36,7 @@
 #include 
 
 #define __KVM_HAVE_ARCH_VCPU_DEBUGFS
+#define __KVM_WANT_PERF_CALLBACKS
 
 #define KVM_MAX_VCPUS 288
 #define KVM_SOFT_MAX_VCPUS 240
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e337aef60793..7cb0f04e24ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,32 +8264,6 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static int kvm_is_in_guest(void)
-{
-   /* x86's callbacks are registered only when handling a guest NMI. */
-   return true;
-}
-
-static int kvm_is_user_mode(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return static_call(kvm_x86_get_cpl)(vcpu) != 0;
-}
-
-static unsigned long kvm_get_guest_ip(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return 0;
-
-   return kvm_rip_read(vcpu);
-}
-
 static void kvm_handle_intel_pt_intr(void)
 {
struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -8302,19 +8276,6 @@ static void kvm_handle_intel_pt_intr(void)
(unsigned long *)&vcpu->arch.pmu.global_status);
 }
 
-static struct perf_guest_info_callbacks kvm_guest_cbs = {
-   .is_in_guest= kvm_is_in_guest,
-   .is_user_mode   = kvm_is_user_mode,
-   .get_guest_ip   = kvm_get_guest_ip,
-   .handle_intel_pt_intr   = NULL,
-};
-
-void kvm_register_perf_callbacks(void)
-{
-   __perf_register_guest_info_callbacks(&kvm_guest_cbs);
-}
-EXPORT_SYMBOL_GPL(kvm_register_perf_callbacks);
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11069,7 +11030,7 @@ int kvm_arch_hardware_setup(void *opaque)
kvm_ops_static_call_update();
 
if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_guest_cbs.handle_intel_pt_intr = kvm_handle_intel_pt_intr;
+   kvm_set_intel_pt_intr_handler(kvm_handle_intel_pt_intr);
 
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
@@ -11098,7 +11059,7 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
-   kvm_guest_cbs.handle_intel_pt_intr = NULL;
+   kvm_set_intel_pt_intr_handler(NULL);
 
static_call(kvm_x86_hardware_unsetup)();
 }
@@ -11725,6 +11686,11 @@ bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
return vcpu->arch.preempted_in_kernel;
 }
 
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
+{
+   return kvm_rip_read(vcpu);
+}
+
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
 {
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f13f15d2fab8..e1fe738c3827 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -387,12 +387,6 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
return kvm->arch.cstate_in_guest;
 }
 
-void kvm_register_perf_callbacks(void);
-static inline void kvm_unregister_perf_callbacks(void)
-{
-   __perf_unregister_guest_info_callbacks();
-}
-
 static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi)
 {
WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index e4d712e9f760..0db9af0b628c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1163,6 +1163,18 @@ static inline bool kvm_arch_intc_initialized(struct kvm 
*kvm)
 }
 #endif
 
+#ifdef __KVM_WANT_PERF_CALLBACKS
+
+void kvm_set_intel_pt_intr_handler(void (*handler)(void));
+unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu);
+
+void kvm_register_perf_callbacks(void);
+static inline void kvm_unregister_perf_callbacks(void)
+{
+   __perf_unregister_guest_info_callbacks();
+}
+#endif
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type);
 void kvm_arch_destroy_vm(struct kvm *kvm);
 void kvm_arch_sync_events(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3e67c93ca403..13c4f58a75e5 100644
--- a/virt/kvm/kvm_main.c
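
The virt/kvm/kvm_main.c hunk is truncated here.  A rough sketch of the generic
callbacks it presumably adds, mirroring the per-arch implementations removed
above and built on kvm_get_running_vcpu() plus kvm_arch_vcpu_in_kernel() and
the new kvm_arch_vcpu_get_ip() hook; the function names below are
reconstructed for illustration, not taken from the patch, though
kvm_set_intel_pt_intr_handler() and kvm_register_perf_callbacks() do appear
in the virt/kvm/kvm_main.c hunk of patch 11/15 further down:

static int kvm_guest_is_in_guest(void)
{
	/* Callbacks are registered only while a vCPU is loaded / handling an NMI. */
	return true;
}

static int kvm_guest_is_user_mode(void)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (WARN_ON_ONCE(!vcpu))
		return 0;

	return !kvm_arch_vcpu_in_kernel(vcpu);
}

static unsigned long kvm_guest_get_ip(void)
{
	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

	if (WARN_ON_ONCE(!vcpu))
		return 0;

	return kvm_arch_vcpu_get_ip(vcpu);
}

static struct perf_guest_info_callbacks kvm_guest_cbs = {
	.is_in_guest	= kvm_guest_is_in_guest,
	.is_user_mode	= kvm_guest_is_user_mode,
	.get_guest_ip	= kvm_guest_get_ip,
};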

[PATCH 15/15] perf: KVM: Indicate "in guest" via NULL ->is_in_guest callback

2021-08-26 Thread Sean Christopherson
Interpret a null ->is_in_guest callback as meaning "in guest" and use
the new semantics in KVM, which currently returns 'true' unconditionally
in its implementation of ->is_in_guest().  This avoids a retpoline on
the indirect call for PMIs that arrive in a KVM guest, and also provides
a handy excuse for a retrieval wrapper, perf_get_guest_cbs(),
e.g. to reduce the probability of an errant direct read of perf_guest_cbs.

Signed-off-by: Sean Christopherson 
---
 arch/x86/events/core.c   | 16 
 arch/x86/events/intel/core.c |  5 ++---
 include/linux/perf_event.h   | 17 +
 virt/kvm/kvm_main.c  |  9 ++---
 4 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 34155a52e498..b60c339ae06b 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2761,11 +2761,11 @@ static bool perf_hw_regs(struct pt_regs *regs)
 void
 perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = 
this_cpu_read(perf_guest_cbs);
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct unwind_state state;
unsigned long addr;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2865,11 +2865,11 @@ perf_callchain_user32(struct pt_regs *regs, struct 
perf_callchain_entry_ctx *ent
 void
 perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs 
*regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = 
this_cpu_read(perf_guest_cbs);
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
struct stack_frame frame;
const struct stack_frame __user *fp;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs) {
/* TODO: We don't support guest os callchain now */
return;
}
@@ -2946,9 +2946,9 @@ static unsigned long code_segment_base(struct pt_regs 
*regs)
 
 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = 
this_cpu_read(perf_guest_cbs);
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
 
-   if (guest_cbs && guest_cbs->is_in_guest())
+   if (guest_cbs)
return guest_cbs->get_guest_ip();
 
return regs->ip + code_segment_base(regs);
@@ -2956,10 +2956,10 @@ unsigned long perf_instruction_pointer(struct pt_regs 
*regs)
 
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
-   struct perf_guest_info_callbacks *guest_cbs = 
this_cpu_read(perf_guest_cbs);
+   struct perf_guest_info_callbacks *guest_cbs = perf_get_guest_cbs();
int misc = 0;
 
-   if (guest_cbs && guest_cbs->is_in_guest()) {
+   if (guest_cbs) {
if (guest_cbs->is_user_mode())
misc |= PERF_RECORD_MISC_GUEST_USER;
else
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 96001962c24d..9a8c18b51a96 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2853,9 +2853,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 
status)
 */
if (__test_and_clear_bit(GLOBAL_STATUS_TRACE_TOPAPMI_BIT, (unsigned 
long *)&status)) {
handled++;
-   guest_cbs = this_cpu_read(perf_guest_cbs);
-   if (unlikely(guest_cbs && guest_cbs->is_in_guest() &&
-guest_cbs->handle_intel_pt_intr))
+   guest_cbs = perf_get_guest_cbs();
+   if (unlikely(guest_cbs && guest_cbs->handle_intel_pt_intr))
guest_cbs->handle_intel_pt_intr();
else
intel_pt_interrupt();
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index db701409a62f..6e3a10784d24 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1241,6 +1241,23 @@ DECLARE_PER_CPU(struct perf_guest_info_callbacks *, 
perf_guest_cbs);
 extern void perf_register_guest_info_callbacks(struct 
perf_guest_info_callbacks *cbs);
 extern void perf_unregister_guest_info_callbacks(void);
 extern void perf_register_guest_info_callbacks_all_cpus(struct 
perf_guest_info_callbacks *cbs);
+/*
+ * Returns guest callbacks for the current CPU if callbacks are registered and
+ * the PMI fired while a guest was running, otherwise returns NULL.
+ */
+static inline struct perf_guest_info_callbacks *perf_get_guest_cbs(void)
+{
+   struct perf_guest_info_callbacks *guest_cbs = 
this_cpu_read(perf_guest_cbs);
+
+   /*
+* Implementing is_in_guest is optional if the callbacks are registered
+* only

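
The rest of this hunk is cut off in the archive.  Given the commit message (a
NULL ->is_in_guest callback now means "in guest"), the helper presumably
finishes roughly as follows; this is a reconstruction, not the patch text:

	 * ... registered only while a guest is loaded, in which case a NULL
	 * ->is_in_guest is interpreted as "in guest".
	 */
	if (!guest_cbs || (guest_cbs->is_in_guest && !guest_cbs->is_in_guest()))
		return NULL;

	return guest_cbs;
}
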
[PATCH 11/15] KVM: x86: Move Intel Processor Trace interrupt handler to vmx.c

2021-08-26 Thread Sean Christopherson
Now that all state needed for VMX's PT interrupt handler is exposed to
vmx.c (specifically the currently running vCPU), move the handler into
vmx.c where it belongs.

Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/vmx/vmx.c  | 24 +---
 arch/x86/kvm/x86.c  | 17 -
 virt/kvm/kvm_main.c |  1 +
 4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 63553a1f43ee..daa33147650a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1496,7 +1496,6 @@ struct kvm_x86_init_ops {
int (*disabled_by_bios)(void);
int (*check_processor_compatibility)(void);
int (*hardware_setup)(void);
-   bool (*intel_pt_intr_in_guest)(void);
 
struct kvm_x86_ops *runtime_ops;
 };
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f08980ef7c44..4665a272249a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7535,6 +7535,8 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 
 static void hardware_unsetup(void)
 {
+   kvm_set_intel_pt_intr_handler(NULL);
+
if (nested)
nested_vmx_hardware_unsetup();
 
@@ -7685,6 +7687,18 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
 };
 
+static void vmx_handle_intel_pt_intr(void)
+{
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+
+   if (WARN_ON_ONCE(!vcpu))
+   return;
+
+   kvm_make_request(KVM_REQ_PMI, vcpu);
+   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
+   (unsigned long *)&vcpu->arch.pmu.global_status);
+}
+
 static __init void vmx_setup_user_return_msrs(void)
 {
 
@@ -7886,9 +7900,14 @@ static __init int hardware_setup(void)
vmx_set_cpu_caps();
 
r = alloc_kvm_area();
-   if (r)
+   if (r) {
nested_vmx_hardware_unsetup();
-   return r;
+   return r;
+   }
+
+   if (pt_mode == PT_MODE_HOST_GUEST)
+   kvm_set_intel_pt_intr_handler(vmx_handle_intel_pt_intr);
+   return 0;
 }
 
 static struct kvm_x86_init_ops vmx_init_ops __initdata = {
@@ -7896,7 +7915,6 @@ static struct kvm_x86_init_ops vmx_init_ops __initdata = {
.disabled_by_bios = vmx_disabled_by_bios,
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_setup = hardware_setup,
-   .intel_pt_intr_in_guest = vmx_pt_mode_is_host_guest,
 
.runtime_ops = &vmx_x86_ops,
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7cb0f04e24ee..11c7a02f839c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,18 +8264,6 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-static void kvm_handle_intel_pt_intr(void)
-{
-   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
-
-   if (WARN_ON_ONCE(!vcpu))
-   return;
-
-   kvm_make_request(KVM_REQ_PMI, vcpu);
-   __set_bit(MSR_CORE_PERF_GLOBAL_OVF_CTRL_TRACE_TOPA_PMI_BIT,
-   (unsigned long *)&vcpu->arch.pmu.global_status);
-}
-
 #ifdef CONFIG_X86_64
 static void pvclock_gtod_update_fn(struct work_struct *work)
 {
@@ -11029,9 +11017,6 @@ int kvm_arch_hardware_setup(void *opaque)
memcpy(&kvm_x86_ops, ops->runtime_ops, sizeof(kvm_x86_ops));
kvm_ops_static_call_update();
 
-   if (ops->intel_pt_intr_in_guest && ops->intel_pt_intr_in_guest())
-   kvm_set_intel_pt_intr_handler(kvm_handle_intel_pt_intr);
-
if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
supported_xss = 0;
 
@@ -11059,8 +11044,6 @@ int kvm_arch_hardware_setup(void *opaque)
 
 void kvm_arch_hardware_unsetup(void)
 {
-   kvm_set_intel_pt_intr_handler(NULL);
-
static_call(kvm_x86_hardware_unsetup)();
 }
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 13c4f58a75e5..e0b1c9386926 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5498,6 +5498,7 @@ void kvm_set_intel_pt_intr_handler(void (*handler)(void))
 {
kvm_guest_cbs.handle_intel_pt_intr = handler;
 }
+EXPORT_SYMBOL_GPL(kvm_set_intel_pt_intr_handler);
 
 void kvm_register_perf_callbacks(void)
 {
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 08/15] KVM: x86: Drop current_vcpu in favor of kvm_running_vcpu

2021-08-26 Thread Sean Christopherson
Now that KVM registers perf callbacks only when the CPU is "in guest",
use kvm_running_vcpu instead of current_vcpu to retrieve the associated
vCPU and drop current_vcpu.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 12 +---
 arch/x86/kvm/x86.h |  4 
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d4d91944fde7..e337aef60793 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8264,17 +8264,15 @@ static void kvm_timer_init(void)
  kvmclock_cpu_online, kvmclock_cpu_down_prep);
 }
 
-DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);
-
 static int kvm_is_in_guest(void)
 {
-   return __this_cpu_read(current_vcpu) != NULL;
+   /* x86's callbacks are registered only when handling a guest NMI. */
+   return true;
 }
 
 static int kvm_is_user_mode(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
if (WARN_ON_ONCE(!vcpu))
return 0;
@@ -8284,7 +8282,7 @@ static int kvm_is_user_mode(void)
 
 static unsigned long kvm_get_guest_ip(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
if (WARN_ON_ONCE(!vcpu))
return 0;
@@ -8294,7 +8292,7 @@ static unsigned long kvm_get_guest_ip(void)
 
 static void kvm_handle_intel_pt_intr(void)
 {
-   struct kvm_vcpu *vcpu = __this_cpu_read(current_vcpu);
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
if (WARN_ON_ONCE(!vcpu))
return;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 4c5ba4128b38..f13f15d2fab8 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -393,11 +393,8 @@ static inline void kvm_unregister_perf_callbacks(void)
__perf_unregister_guest_info_callbacks();
 }
 
-DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
-
 static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi)
 {
-   __this_cpu_write(current_vcpu, vcpu);
WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi);
 
kvm_register_perf_callbacks();
@@ -408,7 +405,6 @@ static inline void kvm_after_interrupt(struct kvm_vcpu 
*vcpu)
kvm_unregister_perf_callbacks();
 
WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, false);
-   __this_cpu_write(current_vcpu, NULL);
 }
 
 
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 09/15] KVM: arm64: Register/unregister perf callbacks at vcpu load/put

2021-08-26 Thread Sean Christopherson
Register/unregister perf callbacks at vcpu_load()/vcpu_put() instead of
keeping the callbacks registered for all eternity after loading KVM.
This will allow future cleanups and optimizations as the registration
of the callbacks signifies "in guest".  This will also allow moving the
callbacks into common KVM, as arm64 and x86 now have semantically
identical callback implementations.

Note, KVM could likely be more precise in its registration, but that's a
cleanup for the future.

Signed-off-by: Sean Christopherson 
---
 arch/arm64/include/asm/kvm_host.h | 12 ++-
 arch/arm64/kvm/arm.c  |  5 -
 arch/arm64/kvm/perf.c | 36 ++-
 3 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index ed940aec89e0..007c38d77fd9 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -671,7 +671,17 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu);
 int io_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa);
 
 void kvm_perf_init(void);
-void kvm_perf_teardown(void);
+
+#ifdef CONFIG_PERF_EVENTS
+void kvm_register_perf_callbacks(void);
+static inline void kvm_unregister_perf_callbacks(void)
+{
+   __perf_unregister_guest_info_callbacks();
+}
+#else
+static inline void kvm_register_perf_callbacks(void) {}
+static inline void kvm_unregister_perf_callbacks(void) {}
+#endif
 
 long kvm_hypercall_pv_features(struct kvm_vcpu *vcpu);
 gpa_t kvm_init_stolen_time(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e9a2b8f27792..ec386971030d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -429,10 +429,13 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (vcpu_has_ptrauth(vcpu))
vcpu_ptrauth_disable(vcpu);
kvm_arch_vcpu_load_debug_state_flags(vcpu);
+
+   kvm_register_perf_callbacks();
 }
 
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 {
+   kvm_unregister_perf_callbacks();
kvm_arch_vcpu_put_debug_state_flags(vcpu);
kvm_arch_vcpu_put_fp(vcpu);
if (has_vhe())
@@ -2155,7 +2158,7 @@ int kvm_arch_init(void *opaque)
 /* NOP: Compiling as a module not supported */
 void kvm_arch_exit(void)
 {
-   kvm_perf_teardown();
+
 }
 
 static int __init early_kvm_mode_cfg(char *arg)
diff --git a/arch/arm64/kvm/perf.c b/arch/arm64/kvm/perf.c
index 039fe59399a2..2556b0a3b096 100644
--- a/arch/arm64/kvm/perf.c
+++ b/arch/arm64/kvm/perf.c
@@ -13,33 +13,30 @@
 
 DEFINE_STATIC_KEY_FALSE(kvm_arm_pmu_available);
 
+#ifdef CONFIG_PERF_EVENTS
 static int kvm_is_in_guest(void)
 {
-   return kvm_get_running_vcpu() != NULL;
+   return true;
 }
 
 static int kvm_is_user_mode(void)
 {
-   struct kvm_vcpu *vcpu;
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   vcpu = kvm_get_running_vcpu();
+   if (WARN_ON_ONCE(!vcpu))
+   return 0;
 
-   if (vcpu)
-   return !vcpu_mode_priv(vcpu);
-
-   return 0;
+   return !vcpu_mode_priv(vcpu);
 }
 
 static unsigned long kvm_get_guest_ip(void)
 {
-   struct kvm_vcpu *vcpu;
+   struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
 
-   vcpu = kvm_get_running_vcpu();
+   if (WARN_ON_ONCE(!vcpu))
+   return 0;
 
-   if (vcpu)
-   return *vcpu_pc(vcpu);
-
-   return 0;
+   return *vcpu_pc(vcpu);
 }
 
 static struct perf_guest_info_callbacks kvm_guest_cbs = {
@@ -48,15 +45,14 @@ static struct perf_guest_info_callbacks kvm_guest_cbs = {
.get_guest_ip   = kvm_get_guest_ip,
 };
 
+void kvm_register_perf_callbacks(void)
+{
+   __perf_register_guest_info_callbacks(&kvm_guest_cbs);
+}
+#endif /* CONFIG_PERF_EVENTS*/
+
 void kvm_perf_init(void)
 {
if (kvm_pmu_probe_pmuver() != 0xf && !is_protected_kvm_enabled())
static_branch_enable(_arm_pmu_available);
-
-   perf_register_guest_info_callbacks(&kvm_guest_cbs);
-}
-
-void kvm_perf_teardown(void)
-{
-   perf_unregister_guest_info_callbacks();
 }
-- 
2.33.0.259.gc128427fd7-goog




[PATCH 07/15] KVM: Use dedicated flag to track if KVM is handling an NMI from guest

2021-08-26 Thread Sean Christopherson
Add a dedicated flag to detect the case where KVM's PMC overflow
callback was originally invoked in response to an NMI that arrived while
the guest was running.  Using current_vcpu is less precise, as IRQs also
set current_vcpu (though presumably KVM's callback should not be reached
in that case).  More importantly, this allows dropping current_vcpu: now
that the perf callbacks are precisely registered, they can switch to
kvm_running_vcpu, i.e. kvm_running_vcpu doesn't need to be used to detect
whether a PMI arrived in the guest.

Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting")
Signed-off-by: Sean Christopherson 
---
 arch/x86/include/asm/kvm_host.h | 3 +--
 arch/x86/kvm/pmu.c  | 2 +-
 arch/x86/kvm/svm/svm.c  | 2 +-
 arch/x86/kvm/vmx/vmx.c  | 2 +-
 arch/x86/kvm/x86.c  | 4 ++--
 arch/x86/kvm/x86.h  | 4 +++-
 6 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1ea4943a73d7..465b35736d9b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -763,6 +763,7 @@ struct kvm_vcpu_arch {
unsigned nmi_pending; /* NMI queued after currently running handler */
bool nmi_injected;/* Trying to inject an NMI this entry */
bool smi_pending;/* SMI queued after currently running handler */
+   bool handling_nmi_from_guest;
 
struct kvm_mtrr mtrr_state;
u64 pat;
@@ -1874,8 +1875,6 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu);
 int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err);
 void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu);
 
-int kvm_is_in_guest(void);
-
 void __user *__x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
 u32 size);
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 0772bad9165c..2b8934b452ea 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -87,7 +87,7 @@ static void kvm_perf_overflow_intr(struct perf_event 
*perf_event,
 * woken up. So we should wake it, but this is impossible from
 * NMI context. Do it from irq work instead.
 */
-   if (!kvm_is_in_guest())
+   if (!pmc->vcpu->arch.handling_nmi_from_guest)
irq_work_queue(&pmc_to_pmu(pmc)->irq_work);
else
kvm_make_request(KVM_REQ_PMI, pmc->vcpu);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 1a70e11f0487..3fc6767e5fd8 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3843,7 +3843,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu 
*vcpu)
}
 
if (unlikely(svm->vmcb->control.exit_code == SVM_EXIT_NMI))
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, true);
 
kvm_load_host_xsave_state(vcpu);
stgi();
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f19d72136f77..f08980ef7c44 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6344,7 +6344,7 @@ void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
 static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
unsigned long entry)
 {
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, entry == (unsigned long)asm_exc_nmi_noist);
vmx_do_interrupt_nmi_irqoff(entry);
kvm_after_interrupt(vcpu);
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bc4ee6ea7752..d4d91944fde7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8267,7 +8267,7 @@ static void kvm_timer_init(void)
 DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
 EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);
 
-int kvm_is_in_guest(void)
+static int kvm_is_in_guest(void)
 {
return __this_cpu_read(current_vcpu) != NULL;
 }
@@ -9678,7 +9678,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 * interrupts on processors that implement an interrupt shadow, the
 * stat.exits increment will do nicely.
 */
-   kvm_before_interrupt(vcpu);
+   kvm_before_interrupt(vcpu, false);
local_irq_enable();
++vcpu->stat.exits;
local_irq_disable();
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 5cedc0e8a5d5..4c5ba4128b38 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -395,9 +395,10 @@ static inline void kvm_unregister_perf_callbacks(void)
 
 DECLARE_PER_CPU(struct kvm_vcpu *, current_vcpu);
 
-static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu)
+static inline void kvm_before_interrupt(struct kvm_vcpu *vcpu, bool is_nmi)
 {
__this_cpu_write(current_vcpu, vcpu);
+   WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, is_nmi);
 
kvm_register_perf_callbacks();
 }
@@ -
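The final hunk is cut off here.  Based on the kvm_after_interrupt() shown in
patch 08/15 above, the missing half presumably clears the new flag on the way
out, roughly:

 static inline void kvm_after_interrupt(struct kvm_vcpu *vcpu)
 {
 	kvm_unregister_perf_callbacks();
 
+	WRITE_ONCE(vcpu->arch.handling_nmi_from_guest, false);
 	__this_cpu_write(current_vcpu, NULL);
 }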
