Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-19 Thread Sean Christopherson
On Fri, Apr 19, 2024, Will Deacon wrote:
> > @@ -663,10 +669,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> > break;
> > }
> > r.ret |= range->handler(kvm, &gfn_range);
> > +
> > +  /*
> > +   * Use a precise gfn-based TLB flush when possible, as
> > +   * most mmu_notifier events affect a small-ish range.
> > +   * Fall back to a full TLB flush if the gfn-based flush
> > +   * fails, and don't bother trying the gfn-based flush
> > +   * if a full flush is already pending.
> > +   */
> > +  if (range->flush_on_ret && !need_flush && r.ret &&
> > +      kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
> > +                                       gfn_range.end - gfn_range.start + 1))
> 
> What's that '+ 1' needed for here?

 (a) To see if you're paying attention.
 (b) Because more is always better.
 (c) Because math is hard.
 (d) Because I haven't tested this.
 (e) Both (c) and (d).
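
For what it's worth, gfn_range.end is exclusive, so the page count is just the
difference; a corrected call, assuming that convention, would look like:

	if (range->flush_on_ret && !need_flush && r.ret &&
	    kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
					     gfn_range.end - gfn_range.start))
		need_flush = true;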



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Sean Christopherson
On Thu, Apr 18, 2024, Will Deacon wrote:
> On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson 
> > >  wrote:
> > > > 
> > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  
> > > > > wrote:
> > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > > would be really great to change the .clear_flush_young() callback so
> > > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > > moment,
> > > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > > 'flush_on_ret'
> > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > lighter-weight and targetted TLBI in the architecture page-table 
> > > > > > code
> > > > > > when we actually update the ptes for small ranges.
> > > > > 
> > > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > > costly consequences).
> > > > > 
> > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > code that deals with the page tables, as it has a pretty good idea of
> > > > > what needs to be invalidated and how -- specially on architectures
> > > > > that have a HW-broadcast facility like arm64.
> > > > 
> > > > Would this be roughly on par with an in-line flush on arm64?  The simpler,
> > > > more straightforward solution would be to let architectures override
> > > > flush_on_ret, but I would prefer something like the below as x86 can also
> > > > utilize a range-based flush when running as a nested hypervisor.
> > 
> > ...
> > 
> > > I think this works for us on HW that has range invalidation, which
> > > would already be a positive move.
> > > 
> > > For the lesser HW that isn't range capable, it also gives the
> > > opportunity to perform the iteration ourselves or go for the nuclear
> > > option if the range is larger than some arbitrary constant (though
> > > this is additional work).
> > > 
> > > But this still considers the whole range as being affected by
> > > range->handler(). It'd be interesting to try and see whether more
> > > precise tracking is (or isn't) generally beneficial.
> > 
> > I assume the idea would be to let arch code do single-page invalidations of
> > stage-2 entries for each gfn?
> 
> Right, as it's the only code which knows which ptes actually ended up
> being aged.
> 
> > Unless I'm having a brain fart, x86 can't make use of that functionality.
> > Intel doesn't provide any way to do targeted invalidation of stage-2 mappings.
> > AMD provides an instruction to do broadcast invalidations, but it takes a
> > virtual address, i.e. a stage-1 address.  I can't tell if it's a host virtual
> > address or a guest virtual address, but it's a moot point because KVM doesn't
> > have the guest virtual address, and if it's a host virtual address, there would
> > need to be valid mappings in the host page tables for it to work, which KVM
> > can't guarantee.
> 
> Ah, so it sounds like it would need to be an arch opt-in then.

Even if x86 (or some other arch code) could use the precise tracking, I think it
would make sense to have the behavior be arch specific.  Adding infrastructure
to get information from arch code, only to turn around and give it back to arch
code would be odd.

Unless arm64 can't do the invalidation immediately after aging the stage-2 PTE,
the best/easiest solution would be to let arm64 opt out of the common TLB flush
when a SPTE is made young.
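
For illustration only, such an opt-out could be little more than an arch-overridable
constant consulted when wiring up the aging notifiers.  The macro name below is made
up for this sketch and is not an existing KVM interface, and it assumes
kvm_handle_hva_range() grows a flush_on_ret parameter (today it hardcodes true):

	/* Hypothetical knob; arm64 would define this as false and do its own TLBI. */
	#ifndef KVM_ARCH_FLUSH_TLBS_ON_AGE
	#define KVM_ARCH_FLUSH_TLBS_ON_AGE	true
	#endif

	static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
						      struct mm_struct *mm,
						      unsigned long start,
						      unsigned long end)
	{
		trace_kvm_age_hva(start, end);

		return kvm_handle_hva_range(mn, start, end, kvm_age_gfn,
					    KVM_ARCH_FLUSH_TLBS_ON_AGE);
	}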

With the range-based flushing bundled in, this?

---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c  | 40 +---
 2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index afbc99264ffa..8fe5f5e16919 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2010,6 +2010,8 @@ extern const struct k

Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-15 Thread Sean Christopherson
On Sat, Apr 13, 2024, Marc Zyngier wrote:
> On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson  
> wrote:
> > 
> > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > would be really great to change the .clear_flush_young() callback so
> > > > that the architecture could handle the TLB invalidation. At the moment,
> > > > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > when we actually update the ptes for small ranges.
> > > 
> > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > costly consequences).
> > > 
> > > In general, it feels like the TLB invalidation should stay with the
> > > code that deals with the page tables, as it has a pretty good idea of
> > > what needs to be invalidated and how -- specially on architectures
> > > that have a HW-broadcast facility like arm64.
> > 
> > Would this be roughly on par with an in-line flush on arm64?  The simpler, 
> > more
> > straightforward solution would be to let architectures override 
> > flush_on_ret,
> > but I would prefer something like the below as x86 can also utilize a 
> > range-based
> > flush when running as a nested hypervisor.

...

> I think this works for us on HW that has range invalidation, which
> would already be a positive move.
> 
> For the lesser HW that isn't range capable, it also gives the
> opportunity to perform the iteration ourselves or go for the nuclear
> option if the range is larger than some arbitrary constant (though
> this is additional work).
> 
> But this still considers the whole range as being affected by
> range->handler(). It'd be interesting to try and see whether more
> precise tracking is (or isn't) generally beneficial.

I assume the idea would be to let arch code do single-page invalidations of
stage-2 entries for each gfn?

Unless I'm having a brain fart, x86 can't make use of that functionality.  Intel
doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
provides an instruction to do broadcast invalidations, but it takes a virtual
address, i.e. a stage-1 address.  I can't tell if it's a host virtual address or
a guest virtual address, but it's a moot point because KVM doesn't have the guest
virtual address, and if it's a host virtual address, there would need to be valid
mappings in the host page tables for it to work, which KVM can't guarantee.



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-12 Thread Sean Christopherson
On Fri, Apr 12, 2024, Marc Zyngier wrote:
> On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > Also, if you're in the business of hacking the MMU notifier code, it
> > would be really great to change the .clear_flush_young() callback so
> > that the architecture could handle the TLB invalidation. At the moment,
> > the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
> > being set by kvm_handle_hva_range(), whereas we could do a much
> > lighter-weight and targeted TLBI in the architecture page-table code
> > when we actually update the ptes for small ranges.
> 
> Indeed, and I was looking at this earlier this week as it has a pretty
> devastating effect with NV (it blows the shadow S2 for that VMID, with
> costly consequences).
> 
> In general, it feels like the TLB invalidation should stay with the
> code that deals with the page tables, as it has a pretty good idea of
> what needs to be invalidated and how -- specially on architectures
> that have a HW-broadcast facility like arm64.

Would this be roughly on par with an in-line flush on arm64?  The simpler, more
straightforward solution would be to let architectures override flush_on_ret,
but I would prefer something like the below as x86 can also utilize a range-based
flush when running as a nested hypervisor.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff0a20565f90..b65116294efe 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -601,6 +601,7 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
struct kvm_gfn_range gfn_range;
struct kvm_memory_slot *slot;
struct kvm_memslots *slots;
+   bool need_flush = false;
int i, idx;
 
if (WARN_ON_ONCE(range->end <= range->start))
@@ -653,10 +654,22 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
break;
}
r.ret |= range->handler(kvm, &gfn_range);
+
+   /*
+* Use a precise gfn-based TLB flush when possible, as
+* most mmu_notifier events affect a small-ish range.
+* Fall back to a full TLB flush if the gfn-based flush
+* fails, and don't bother trying the gfn-based flush
+* if a full flush is already pending.
+*/
+   if (range->flush_on_ret && !need_flush && r.ret &&
+       kvm_arch_flush_remote_tlbs_range(kvm, gfn_range.start,
+                                        gfn_range.end - gfn_range.start + 1))
+   need_flush = true;
}
}
 
-   if (range->flush_on_ret && r.ret)
+   if (need_flush)
kvm_flush_remote_tlbs(kvm);
 
if (r.found_memslot)
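
As an aside, the fallback above leans on the arch hook's return code: an
architecture that doesn't implement a ranged flush gets (something like) the
generic stub below, paraphrasing the include/linux/kvm_host.h pattern, and so
every aging event degenerates to the full kvm_flush_remote_tlbs().

	#ifndef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
	static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm,
							   gfn_t gfn, u64 nr_pages)
	{
		return -EOPNOTSUPP;
	}
	#endif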




Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-17 Thread Sean Christopherson
On Mon, Oct 16, 2023, Haitao Huang wrote:
> Hi Sean
> 
> On Mon, 16 Oct 2023 16:32:31 -0500, Sean Christopherson 
> wrote:
> 
> > On Mon, Oct 16, 2023, Haitao Huang wrote:
> > > From this perspective, I think the current implementation is
> > > "well-defined":
> > > EPC cgroup limits for VMs are only enforced at VM launch time, not
> > > runtime.  In practice, an SGX VM can be launched only with a fixed EPC size
> > > and all those EPCs are fully committed to the VM once launched.
> > 
> > Fully committed doesn't mean those numbers are reflected in the cgroup.  A
> > VM scheduler can easily "commit" EPC to a guest, but allocate EPC on
> > demand, i.e.  when the guest attempts to actually access a page.
> > Preallocating memory isn't free, e.g. it can slow down guest boot, so it's
> > entirely reasonable to have virtual EPC be allocated on-demand.  Enforcing
> > at launch time doesn't work for such setups, because from the cgroup's
> > perspective, the VM is using 0 pages of EPC at launch.
> > 
> Maybe I understood the current implementation wrong. From what I see, it's
> impossible for vEPC not to be fully committed at launch time. The guest would
> EREMOVE all pages during initialization, resulting in #PFs and all pages being
> allocated. This essentially makes "prealloc=off" the same as "prealloc=on".
> Unless you are talking about some custom OS or kernel other than upstream
> Linux here?

Yes, a customer could be running an older kernel, something other than Linux, a
custom kernel, an out-of-tree SGX driver, etc.  The host should never assume
anything about the guest kernel when it comes to correctness (unless the guest
kernel is controlled by the host).

Doing EREMOVE on all pages is definitely not mandatory, especially if the kernel
detects a hypervisor, i.e. knows it's running as a guest.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-16 Thread Sean Christopherson
On Mon, Oct 16, 2023, Haitao Huang wrote:
> From this perspective, I think the current implementation is "well-defined":
> EPC cgroup limits for VMs are only enforced at VM launch time, not runtime.
> In practice, an SGX VM can be launched only with a fixed EPC size and all those
> EPCs are fully committed to the VM once launched.

Fully committed doesn't mean those numbers are reflected in the cgroup.  A VM
scheduler can easily "commit" EPC to a guest, but allocate EPC on demand, i.e.
when the guest attempts to actually access a page.  Preallocating memory isn't
free, e.g. it can slow down guest boot, so it's entirely reasonable to have virtual
EPC be allocated on-demand.  Enforcing at launch time doesn't work for such setups,
because from the cgroup's perspective, the VM is using 0 pages of EPC at launch.

> Because of that, I imagine people are using VMs to primarily partition the
> physical EPCs, i.e., the static size itself is the 'limit' for the workload of
> a single VM and are not expecting EPC to be taken away at runtime.

If everything goes exactly as planned, sure.  But it's not hard to imagine some
configuration change way up the stack resulting in the hard limit for an EPC cgroup
being lowered.

> So killing does not really add much value for the existing usages IIUC.

As I said earlier, the behavior doesn't have to result in terminating a VM, e.g.
the virtual EPC code could provide a knob to send a signal/notification if the
owning cgroup has gone above the limit and the VM is targeted for forced reclaim.
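
Purely as a sketch of that idea (hypothetical code, not part of the posted series),
the vEPC fault path could signal the faulting VMM task when a charge would push the
owning cgroup over its limit, and let userspace decide whether to reclaim, migrate,
or shut the VM down:

	static struct sgx_epc_page *sgx_vepc_charge_or_notify(struct sgx_vepc *vepc)
	{
		struct sgx_epc_page *epc_page;

		epc_page = sgx_alloc_epc_page(vepc, false);
		if (IS_ERR(epc_page))
			/* Over the EPC cgroup limit: tell the VMM, let it decide. */
			send_sig(SIGBUS, current, 0);

		return epc_page;
	}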

> That said, I don't anticipate adding the enforcement of killing VMs at
> runtime would break such usages as admin/user can simply choose to set the
> limit equal to the static size to launch the VM and forget about it.
> 
> Given that, I'll propose an add-on patch to this series as RFC and have some
> feedback from the community before we decide if that needs to be included in the
> first version or we can skip it until we have EPC reclaiming for VMs.

Gracefully *swapping* virtual EPC isn't required for oversubscribing virtual EPC.
Think of it like airlines overselling tickets.  The airline sells more tickets
than they have seats, and banks on some passengers canceling.  If too many people
show up, the airline doesn't swap passengers to the cargo bay, they just shunt them
to a different plane.

The same could be easily be done for hosts and virtual EPC.  E.g. if every VM
*might* use 1GiB, but in practice 99% of VMs only consume 128MiB, then it's not
too crazy to advertise 1GiB to each VM, but only actually carve out 256MiB per VM
in order to pack more VMs on a host.  If the host needs to free up EPC, then the
most problematic VMs can be migrated to a different host.

Genuinely curious, who is asking for EPC cgroup support that *isn't* running VMs?
AFAIK, these days, SGX is primarily targeted at cloud.  I assume virtual EPC is
the primary use case for an EPC cgroup.

I don't have any skin in the game beyond my name being attached to some of the
patches, i.e. I certainly won't stand in the way.  I just don't understand why
you would go through all the effort of adding an EPC cgroup and then not go the
extra few steps to enforce limits for virtual EPC.  Compared to the complexity
of the rest of the series, that little bit seems quite trivial.


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-10 Thread Sean Christopherson
On Tue, Oct 10, 2023, Haitao Huang wrote:
> On Mon, 09 Oct 2023 21:23:12 -0500, Huang, Kai  wrote:
> 
> > On Mon, 2023-10-09 at 20:42 -0500, Haitao Huang wrote:
> > > Hi Sean
> > > 
> > > On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson
> > >  wrote:
> > > > I can see userspace wanting to explicitly terminate the VM instead of
> > > > "silently" killing the VM's enclaves, but that seems like it should be a
> > > > knob in the virtual EPC code.
> > > 
> > > If my understanding above is correct and I'm understanding your statement
> > > above correctly, then I don't see that we really need a separate knob for the
> > > vEPC code. Reaching a cgroup limit from a running guest (assuming dynamic
> > > allocation is implemented) should not automatically translate to killing
> > > the VM. Instead, it's user space's job to work with the guest to handle
> > > allocation failure. The guest could page out and kill enclaves.
> > > failure. Guest could page and kill enclaves.
> > > 
> > 
> > IIUC Sean was talking about changing misc.max _after_ you launch SGX VMs:
> > 
> > 1) misc.max = 100M
> > 2) Launch VMs with total virtual EPC size = 100M  <- success
> > 3) misc.max = 50M
> > 
> > 3) will also succeed, but nothing will happen, the VMs will be still
> > holding 100M EPC.
> > 
> > You need to somehow track virtual EPC and kill VM instead.
> > 
> > (or somehow fail to do 3) if it is also an acceptable option.)
> > 
> Thanks for explaining it.
> 
> There is an error code to return from max_write. I can add that too to the
> callback definition and fail it when it can't be enforced for any reason.
> Would like some community feedback if this is acceptable though.

That likely isn't acceptable.  E.g. create a cgroup with both a host enclave and
virtual EPC, set the hard limit to 100MiB.  Virtual EPC consumes 50MiB, and the
host enclave consumes 50MiB.  Userspace lowers the limit to 49MiB.  The cgroup
code would reclaim all of the enclave's reclaimable EPC, and then kill the enclave
because it's still over the limit.  And then fail the max_write because the cgroup
is *still* over the limit.  So in addition to burning a lot of cycles, from
userspace's perspective its enclave was killed for no reason, as the new limit
wasn't actually set.

> I think to solve it ultimately, we need to be able to adjust the 'capacity' of VMs,
> not to just kill them, which is basically the same as dynamic allocation
> support for VMs (being able to increase/decrease epc size when it is
> running). For now, we only have static allocation so max can't be enforced
> once it is launched.

No, reclaiming virtual EPC is not a requirement.  VMM EPC oversubscription is
insanely complex, and I highly doubt any users actually want to oversubscribe VMs.

There are use cases for cgroups beyond oversubscribing/swapping, e.g. privileged
userspace may set limits on a container to ensure the container doesn't *accidentally*
consume more EPC than it was allotted, e.g. due to a configuration bug that created
a VM with more EPC than it was supposed to have.

My comments on virtual EPC vs. cgroups are much more about having sane, well-defined
behavior, not about saying the kernel actually needs to support oversubscribing EPC
for KVM guests.


Re: [PATCH] KVM: deprecate KVM_WERROR in favor of general WERROR

2023-10-10 Thread Sean Christopherson
On Tue, Oct 10, 2023, Jakub Kicinski wrote:
> On Tue, 10 Oct 2023 11:04:18 +0300 Jani Nikula wrote:
> > > If you do invest in build testing automation, why can't your automation
> > > count warnings rather than depend on WERROR? I don't understand.  
> > 
> > Because having both CI and the subsystem/driver developers enable a
> > local WERROR actually works in keeping the subsystem/driver clean of
> > warnings.
> > 
> > For i915, we also enable W=1 warnings and kernel-doc -Werror with it,
> > keeping all of them warning clean. I don't much appreciate calling that
> > anti-social.
> 
> Anti-social is not the right word, that's fair.
> 
> Werror makes your life easier while increasing the blast radius 
> of your mistakes. So you're trading off your convenience for risk
> of breakage to others. Note that you can fix issues locally very
> quickly and move on. Others have to wait to get your patches thru
> Linus.
> 
> > >> I disagree.  WERROR simply doesn't provide the same coverage.  E.g. it can't
> > >> be enabled for i386 without tuning FRAME_WARN, which (a) won't be at all
> > >> obvious to the average contributor and (b) increasing FRAME_WARN effectively
> > >> reduces the test coverage of KVM i386.
> > >> 
> > >> For KVM x86, I want the rules for contributing to be clearly documented, and
> > >> as simple as possible.  I don't see a sane way to achieve that with WERROR=y.
> > 
> > The DRM_I915_WERROR config depends on EXPERT and !COMPILE_TEST, and to
> > my knowledge this has never caused issues outside of i915 developers and
> > CI.
> 
> Ack, I think you do it right. I was trying to establish a precedent
> so that we can delete these as soon as they cause an issue, not sooner.

So isn't the underlying problem simply that KVM_WERROR is enabled by default for
some configurations?  If that's the case, then my proposal to make KVM_WERROR
always off by default, and "depends on KVM && EXPERT && !KASAN", would make this
go away, no?


Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

2023-10-09 Thread Sean Christopherson
On Mon, Oct 09, 2023, Kai Huang wrote:
> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > +/**
> > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > + * @lru:   LRU that is low
> > + *
> > + * Return: %true if a victim was found and kicked.
> > + */
> > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > +{
> > +   struct sgx_epc_page *victim;
> > +
> > +   spin_lock(&lru->lock);
> > +   victim = sgx_oom_get_victim(lru);
> > +   spin_unlock(&lru->lock);
> > +
> > +   if (!victim)
> > +   return false;
> > +
> > +   if (victim->flags & SGX_EPC_OWNER_PAGE)
> > +   return sgx_oom_encl_page(victim->encl_page);
> > +
> > +   if (victim->flags & SGX_EPC_OWNER_ENCL)
> > +   return sgx_oom_encl(victim->encl);
> 
> I hate to bring this up, at least at this stage, but I am wondering why we need
> to put VA and SECS pages to the unreclaimable list, but cannot keep an
> "enclave_list" instead?

The motivation for tracking EPC pages instead of enclaves was so that the EPC
OOM-killer could "kill" VMs as well as host-owned enclaves.  The virtual EPC code
didn't actually kill the VM process, it instead just freed all of the EPC pages
and abused the SGX architecture to effectively make the guest recreate all its
enclaves (IIRC, QEMU does the same thing to "support" live migration).

Looks like y'all punted on that with:

  The EPC pages allocated for KVM guests by the virtual EPC driver are not
  reclaimable by the host kernel [5]. Therefore they are not tracked by any
  LRU lists for reclaiming purposes in this implementation, but they are
  charged toward the cgroup of the user process (e.g., QEMU) launching the
  guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
  driver will stop allocating more EPC for the VM, and return SIGBUS to the
  user process which would abort the VM launch.

which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  Relying
on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
enforcement.  E.g. if the hard limit for an EPC cgroup is lowered, userspace running
enclaves in a VM could continue on and refuse to give up its EPC, and thus run above
its limit in perpetuity.

I can see userspace wanting to explicitly terminate the VM instead of "silently"
killing the VM's enclaves, but that seems like it should be a knob in the virtual EPC
code.


Re: [PATCH] KVM: deprecate KVM_WERROR in favor of general WERROR

2023-10-09 Thread Sean Christopherson
On Mon, Oct 09, 2023, Jakub Kicinski wrote:
> On Mon, 9 Oct 2023 10:43:43 -0700 Sean Christopherson wrote:
> > On Fri, Oct 06, 2023, Jakub Kicinski wrote:
> > On a related topic, this is comically stale as WERROR is on by default for both
> > allmodconfig and allyesconfig, which work because they trigger 64-bit builds.
> > And KASAN on x86 is 64-bit only.
> > 
> > Rather than yank out KVM_WERROR entirely, what if we make default=n and trim
> > the depends down to "KVM && EXPERT && !KASAN"?  E.g.
> 
> IMO setting WERROR is a bit perverse. The way I see it WERROR is a
> crutch for people who don't have the time / infra to properly build
> test changes they send to Linus. Or wait for build bots to do their job.

KVM_WERROR reduces the probability of issues in patches being sent to *me*.  The
reality is that most contributors do not have the knowledge and/or resources to
"properly" build test changes without specific guidance on what/how to test, or
what configs to prioritize.

Nor is it realistic to expect that build bots will detect every issue in every
possible configuration in every patch that's posted.

Call -Werror a crutch if you will, but for me it's a crutch that I'm more than
willing to lean on in order to increase the overall quality of KVM x86 submissions.

> We do have sympathy for these folks, we are mostly volunteers after
> all. At the same time someone's under-investment should not be causing
> pain to those of us who _do_ build test stuff carefully.

This is a bit over the top.  Yeah, I need to add W=1 to my build scripts, but that's
not a lack of investment, just an oversight.  Though in this case it likely wouldn't
have made any difference since Paolo grabbed the patches directly and might have
even bypassed linux-next.  But again I would argue that's bad process, not a lack
of investment.

> Rather than tweak stuff I'd prefer if we could agree that local -Werror
> is anti-social :(
> 
> The global WERROR seems to be a good compromise.

I disagree.  WERROR simply doesn't provide the same coverage.  E.g. it can't be
enabled for i386 without tuning FRAME_WARN, which (a) won't be at all obvious to
the average contributor and (b) increasing FRAME_WARN effectively reduces the
test coverage of KVM i386.

For KVM x86, I want the rules for contributing to be clearly documented, and as
simple as possible.  I don't see a sane way to achieve that with WERROR=y.


Re: [PATCH] KVM: deprecate KVM_WERROR in favor of general WERROR

2023-10-09 Thread Sean Christopherson
On Fri, Oct 06, 2023, Jakub Kicinski wrote:
> Setting WERROR for random subsystems make life really hard
> for subsystems which want to build-test their stuff with W=1.
> WERROR for the entire kernel now exists and can be used
> instead. W=1 people probably know how to deal with the global
> W=1 already, tracking all per-subsystem WERRORs is too much...

I assume s/W=1/WERROR=y in this line?

> Link: 
> https://lore.kernel.org/all/0da9874b6e9fcbaaa5edeb345d7e2a7c859fc818.1696271334.git.thomas.lenda...@amd.com/
> Signed-off-by: Jakub Kicinski 
> ---
>  Documentation/process/maintainer-kvm-x86.rst |  2 +-
>  arch/x86/kvm/Kconfig | 14 --
>  arch/x86/kvm/Makefile|  1 -
>  3 files changed, 1 insertion(+), 16 deletions(-)
> 
> diff --git a/Documentation/process/maintainer-kvm-x86.rst 
> b/Documentation/process/maintainer-kvm-x86.rst
> index 9183bd449762..cd70c0351108 100644
> --- a/Documentation/process/maintainer-kvm-x86.rst
> +++ b/Documentation/process/maintainer-kvm-x86.rst
> @@ -243,7 +243,7 @@ context and disambiguate the reference.
>  Testing
>  ---
>  At a bare minimum, *all* patches in a series must build cleanly for KVM_INTEL=m
> -KVM_AMD=m, and KVM_WERROR=y.  Building every possible combination of Kconfigs
> +KVM_AMD=m, and WERROR=y.  Building every possible combination of Kconfigs
>  isn't feasible, but the more the merrier.  KVM_SMM, KVM_XEN, PROVE_LOCKING, and
>  X86_64 are particularly interesting knobs to turn.
>  
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index ed90f148140d..12929324ac3e 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -63,20 +63,6 @@ config KVM
>  
> If unsure, say N.
>  
> -config KVM_WERROR
> - bool "Compile KVM with -Werror"
> - # KASAN may cause the build to fail due to larger frames
> - default y if X86_64 && !KASAN

Hrm, I am loath to give up KVM's targeted -Werror as it allows for more aggressive
enabling, e.g. enabling CONFIG_WERROR for i386 builds with other defaults doesn't
work because of CONFIG_FRAME_WARN=1024.  That in turn means making WERROR=y a
requirement in maintainer-kvm-x86.rst is likely unreasonable.

And arguably KVM_WERROR is doing its job by flagging the linked W=1 error.  The
problem there lies more in my build testing, which I'll go fix by adding a W=1
configuration or three.  As the changelog notes, I highly doubt W=1 builds work
with WERROR, whereas keeping KVM x86 warning-free even with W=1 is feasible.

> - # We use the dependency on !COMPILE_TEST to not be enabled
> - # blindly in allmodconfig or allyesconfig configurations
> - depends on KVM
> - depends on (X86_64 && !KASAN) || !COMPILE_TEST

On a related topic, this is comically stale as WERROR is on by default for both
allmodconfig and allyesconfig, which work because they trigger 64-bit builds.
And KASAN on x86 is 64-bit only.

Rather than yank out KVM_WERROR entirely, what if we make default=n and trim the
depends down to "KVM && EXPERT && !KASAN"?  E.g.

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8452ed0228cb..c2466304aa6a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -65,13 +65,12 @@ config KVM
 
 config KVM_WERROR
bool "Compile KVM with -Werror"
-   # KASAN may cause the build to fail due to larger frames
-   default y if X86_64 && !KASAN
-   # We use the dependency on !COMPILE_TEST to not be enabled
-   # blindly in allmodconfig or allyesconfig configurations
-   depends on KVM
-   depends on (X86_64 && !KASAN) || !COMPILE_TEST
-   depends on EXPERT
+   # Disallow KVM's -Werror if KASAN=y, e.g. to guard against randomized
+   # configs from selecting KVM_WERROR=y.  KASAN builds generate warnings
+   # for the default FRAME_WARN, i.e. KVM_WERROR=y with KASAN=y requires
+   # special tuning.  Building KVM with -Werror and KASAN is still doable
+   # via enabling the kernel-wide WERROR=y.
+   depends on KVM && EXPERT && !KASAN
help
  Add -Werror to the build flags for KVM.


[PATCH] KVM: x86: Fix implicit enum conversion goof in scattered reverse CPUID code

2021-04-20 Thread Sean Christopherson
Take "enum kvm_only_cpuid_leafs" in scattered specific CPUID helpers
(which is obvious in hindsight), and use "unsigned int" for leafs that
can be the kernel's standard "enum cpuid_leaf" or the aforementioned
KVM-only variant.  Loss of the enum params is a bit disappointing, but
gcc obviously isn't providing any extra sanity checks, and the various
BUILD_BUG_ON() assertions ensure the input is in range.

This fixes implicit enum conversions that are detected by clang-11.

Fixes: 4e66c0cb79b7 ("KVM: x86: Add support for reverse CPUID lookup of scattered features")
Cc: Kai Huang 
Signed-off-by: Sean Christopherson 
---

Hopefully it's not too late to squash this...

 arch/x86/kvm/cpuid.c | 5 +++--
 arch/x86/kvm/cpuid.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 96e41e1a1bde..e9d644147bf5 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -365,7 +365,7 @@ int kvm_vcpu_ioctl_get_cpuid2(struct kvm_vcpu *vcpu,
 }
 
 /* Mask kvm_cpu_caps for @leaf with the raw CPUID capabilities of this CPU. */
-static __always_inline void __kvm_cpu_cap_mask(enum cpuid_leafs leaf)
+static __always_inline void __kvm_cpu_cap_mask(unsigned int leaf)
 {
const struct cpuid_reg cpuid = x86_feature_cpuid(leaf * 32);
struct kvm_cpuid_entry2 entry;
@@ -378,7 +378,8 @@ static __always_inline void __kvm_cpu_cap_mask(enum cpuid_leafs leaf)
kvm_cpu_caps[leaf] &= *__cpuid_entry_get_reg(&entry, cpuid.reg);
 }
 
-static __always_inline void kvm_cpu_cap_init_scattered(enum cpuid_leafs leaf, u32 mask)
+static __always_inline
+void kvm_cpu_cap_init_scattered(enum kvm_only_cpuid_leafs leaf, u32 mask)
 {
/* Use kvm_cpu_cap_mask for non-scattered leafs. */
BUILD_BUG_ON(leaf < NCAPINTS);
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index eeb4a3020e1b..7bb4504a2944 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -236,7 +236,7 @@ static __always_inline void cpuid_entry_change(struct kvm_cpuid_entry2 *entry,
 }
 
static __always_inline void cpuid_entry_override(struct kvm_cpuid_entry2 *entry,
-enum cpuid_leafs leaf)
+unsigned int leaf)
 {
u32 *reg = cpuid_entry_get_reg(entry, leaf * 32);
 
-- 
2.31.1.368.gbe11c130af-goog



Re: [PATCH v3 3/9] KVM: x86: Defer tick-based accounting 'til after IRQ handling

2021-04-20 Thread Sean Christopherson
On Wed, Apr 21, 2021, Frederic Weisbecker wrote:
> On Thu, Apr 15, 2021 at 03:21:00PM -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 16fb39503296..e4d475df1d4a 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9230,6 +9230,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > local_irq_disable();
> > kvm_after_interrupt(vcpu);
> >  
> > +   /*
> > +* When using tick-based accounting, wait until after servicing IRQs to
> > +* account guest time so that any ticks that occurred while running the
> > +* guest are properly accounted to the guest.
> > +*/
> > +   if (!vtime_accounting_enabled_this_cpu())
> > +   vtime_account_guest_exit();
> 
> Can we rather have instead:
> 
> static inline void tick_account_guest_exit(void)
> {
>   if (!vtime_accounting_enabled_this_cpu())
>   current->flags &= ~PF_VCPU;
> }
> 
> It duplicates a bit of code but I think this will read less confusing.

Either way works for me.  I used vtime_account_guest_exit() to try to keep as
many details as possible inside vtime, e.g. in case the implementation is tweaked
in the future.  But I agree that pretending KVM isn't already deeply intertwined
with the details is a lie.
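
For reference, the shape being discussed is roughly the below (paraphrased; the
final form lives in include/linux/vtime.h), which is why hiding the PF_VCPU
detail inside vtime_account_guest_exit() is appealing:

	static inline void vtime_account_guest_exit(void)
	{
		if (vtime_accounting_enabled_this_cpu())
			vtime_guest_exit(current);
		else
			current->flags &= ~PF_VCPU;	/* tick-based accounting */
	}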


Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Sean Christopherson wrote:
> On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> > On 20/04/21 22:16, Sean Christopherson wrote:
> > > On Tue, Apr 20, 2021, Sean Christopherson wrote:
> > > > On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> > > > > In this particular case, if userspace sets the bit in CPUID2 but doesn't
> > > > > handle KVM_EXIT_HYPERCALL, the guest will probably trigger some kind of
> > > > > assertion failure as soon as it invokes the HC_PAGE_ENC_STATUS hypercall.
> > > 
> > > Oh!  Almost forgot my hail mary idea.  Instead of a new capability, can we
> > > reject the hypercall if userspace has _not_ set KVM_CAP_ENFORCE_PV_FEATURE_CPUID?
> > > 
> > >   if (vcpu->arch.pv_cpuid.enforce &&
> > >   !guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
> > >   break;
> > 
> > Couldn't userspace enable that capability and _still_ copy the supported
> > CPUID blindly to the guest CPUID, without supporting the hypercall?
> 
> Yes.  I was going to argue that we get to define the behavior, but that's not
> true because it would break existing VMMs that blindly copy.  Capability it is...

Hrm, that won't quite work though.  If userspace blindly copies CPUID, but doesn't
enable the capability, the guest will think the hypercall is supported.  The
guest hopefully won't freak out too much on the resulting -KVM_ENOSYS, but it
does make the CPUID flag rather useless.

We can make it work with:

u64 gpa = a0, npages = a1, enc = a2;

if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
break;

if (!PAGE_ALIGNED(gpa) || !npages ||
gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) {
ret = -EINVAL;
break;
}

if (!vcpu->kvm->arch.hypercall_exit_enabled) {
ret = 0;
break;
}

vcpu->run->exit_reason    = KVM_EXIT_HYPERCALL;
vcpu->run->hypercall.nr   = KVM_HC_PAGE_ENC_STATUS;
vcpu->run->hypercall.args[0]  = gpa;
vcpu->run->hypercall.args[1]  = npages;
vcpu->run->hypercall.args[2]  = enc;
vcpu->run->hypercall.longmode = op_64_bit;
vcpu->arch.complete_userspace_io = complete_hypercall_exit;
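
For completeness, the complete_userspace_io callback named above would presumably
just propagate userspace's result back to the guest, e.g. (sketch, assuming the
run->hypercall.ret convention):

	static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
	{
		u64 ret = vcpu->run->hypercall.ret;

		if (!is_64_bit_mode(vcpu))
			ret = (u32)ret;
		kvm_rax_write(vcpu, ret);

		++vcpu->stat.hypercalls;
		return kvm_skip_emulated_instruction(vcpu);
	}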

That's dancing pretty close to hypercall filtering, which I was hoping to avoid.
I guess it's not really filtering since the exit check happens after the
validity checks.

> > > (BTW, it's better to return a bitmask of hypercalls that will exit to
> > > userspace from KVM_CHECK_EXTENSION.  Userspace can still reject with -ENOSYS
> > > those that it doesn't know, but it's important that it knows in general how
> > > to handle KVM_EXIT_HYPERCALL).

Speaking of bitmasks, what about also accepting a bitmask for enabling the
capability?  (not sure if the above implies that).  E.g.

if (!(vcpu->kvm->arch.hypercall_exit_enabled & BIT_ULL(nr))) {
ret = 0;
break;
}
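
On the userspace side, enabling that would presumably be a single KVM_ENABLE_CAP
on the VM; a sketch, assuming the bitmask-in-args[0] ABI floated above and the
KVM_HC_PAGE_ENC_STATUS number from this series' uapi headers:

	#include <linux/kvm.h>
	#include <linux/kvm_para.h>
	#include <sys/ioctl.h>

	static int enable_hypercall_exits(int vm_fd)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_EXIT_HYPERCALL,
			.args[0] = 1ULL << KVM_HC_PAGE_ENC_STATUS,
		};

		/* 0 on success, -1/errno if the kernel doesn't know the cap. */
		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}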


Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> On 20/04/21 22:16, Sean Christopherson wrote:
> > On Tue, Apr 20, 2021, Sean Christopherson wrote:
> > > On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> > > > In this particular case, if userspace sets the bit in CPUID2 but doesn't
> > > > handle KVM_EXIT_HYPERCALL, the guest will probably trigger some kind of
> > > > assertion failure as soon as it invokes the HC_PAGE_ENC_STATUS hypercall.
> > 
> > Oh!  Almost forgot my hail mary idea.  Instead of a new capability, can we
> > reject the hypercall if userspace has _not_ set KVM_CAP_ENFORCE_PV_FEATURE_CPUID?
> > 
> > if (vcpu->arch.pv_cpuid.enforce &&
> > !guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
> > break;
> 
> Couldn't userspace enable that capability and _still_ copy the supported
> CPUID blindly to the guest CPUID, without supporting the hypercall?

Yes.  I was going to argue that we get to define the behavior, but that's not
true because it would break existing VMMs that blindly copy.  Capability it is...


Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Sean Christopherson wrote:
> On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> > On 20/04/21 19:31, Sean Christopherson wrote:
> > > > +   case KVM_HC_PAGE_ENC_STATUS: {
> > > > +   u64 gpa = a0, npages = a1, enc = a2;
> > > > +
> > > > +   ret = -KVM_ENOSYS;
> > > > +   if (!vcpu->kvm->arch.hypercall_exit_enabled)
> > > 
> > > I don't follow, why does the hypercall need to be gated by a capability?
> > > What would break if this were changed to?
> > > 
> > >   if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
> > 
> > The problem is that it's valid to take KVM_GET_SUPPORTED_CPUID and send it
> > unmodified to KVM_SET_CPUID2.  For this reason, features that are
> > conditional on other ioctls, or that require some kind of userspace support,
> > must not be in KVM_GET_SUPPORTED_CPUID.  For example:
> > 
> > - TSC_DEADLINE because it is only implemented after KVM_CREATE_IRQCHIP (or
> > after KVM_ENABLE_CAP of KVM_CAP_IRQCHIP_SPLIT)
> > 
> > - MONITOR only makes sense if userspace enables KVM_CAP_X86_DISABLE_EXITS
> > 
> > X2APIC is reported even though it shouldn't be.  Too late to fix that, I
> > think.
> > 
> > In this particular case, if userspace sets the bit in CPUID2 but doesn't
> > handle KVM_EXIT_HYPERCALL, the guest will probably trigger some kind of
> > assertion failure as soon as it invokes the HC_PAGE_ENC_STATUS hypercall.
> 
> Gah, I was thinking of the MSR behavior and forgot that the hypercall exiting
> behavior intentionally doesn't require extra filtering.
> 
> It's also worth noting that guest_pv_has() is particularly useless since it
> will unconditionally return true for older VMMs that dont' enable
> KVM_CAP_ENFORCE_PV_FEATURE_CPUID.
> 
> Bummer.

Oh!  Almost forgot my hail mary idea.  Instead of a new capability, can we
reject the hypercall if userspace has _not_ set KVM_CAP_ENFORCE_PV_FEATURE_CPUID?

if (vcpu->arch.pv_cpuid.enforce &&
!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
break;


Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> On 20/04/21 19:31, Sean Christopherson wrote:
> > > + case KVM_HC_PAGE_ENC_STATUS: {
> > > + u64 gpa = a0, npages = a1, enc = a2;
> > > +
> > > + ret = -KVM_ENOSYS;
> > > + if (!vcpu->kvm->arch.hypercall_exit_enabled)
> > 
> > I don't follow, why does the hypercall need to be gated by a capability?
> > What would break if this were changed to?
> > 
> > if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
> 
> The problem is that it's valid to take KVM_GET_SUPPORTED_CPUID and send it
> unmodified to KVM_SET_CPUID2.  For this reason, features that are
> conditional on other ioctls, or that require some kind of userspace support,
> must not be in KVM_GET_SUPPORTED_CPUID.  For example:
> 
> - TSC_DEADLINE because it is only implemented after KVM_CREATE_IRQCHIP (or
> after KVM_ENABLE_CAP of KVM_CAP_IRQCHIP_SPLIT)
> 
> - MONITOR only makes sense if userspace enables KVM_CAP_X86_DISABLE_EXITS
> 
> X2APIC is reported even though it shouldn't be.  Too late to fix that, I
> think.
> 
> In this particular case, if userspace sets the bit in CPUID2 but doesn't
> handle KVM_EXIT_HYPERCALL, the guest will probably trigger some kind of
> assertion failure as soon as it invokes the HC_PAGE_ENC_STATUS hypercall.

Gah, I was thinking of the MSR behavior and forgot that the hypercall exiting
behavior intentionally doesn't require extra filtering.

It's also worth noting that guest_pv_has() is particularly useless since it
will unconditionally return true for older VMMs that don't enable
KVM_CAP_ENFORCE_PV_FEATURE_CPUID.

Bummer.


Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Sean Christopherson wrote:
> On Tue, Apr 20, 2021, Ashish Kalra wrote:
> > On Tue, Apr 20, 2021 at 05:31:07PM +0000, Sean Christopherson wrote:
> > > On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> > > > +   case KVM_HC_PAGE_ENC_STATUS: {
> > > > +   u64 gpa = a0, npages = a1, enc = a2;
> > > > +
> > > > +   ret = -KVM_ENOSYS;
> > > > +   if (!vcpu->kvm->arch.hypercall_exit_enabled)
> > > 
> > > I don't follow, why does the hypercall need to be gated by a capability?
> > > What would break if this were changed to?
> > > 
> > >   if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
> > > 
> > 
> > But, the above indicates host support for page_enc_status_hc, so we want
> > to ensure that host supports and has enabled support for the hypercall
> > exit, i.e., hypercall has been enabled.
> 
> I still don't see how parroting back KVM_GET_SUPPORTED_CPUID, i.e. "unintentionally"
> setting KVM_FEATURE_HC_PAGE_ENC_STATUS, would break anything.  Sure, the guest
> does unnecessary hypercalls, but they're eaten by KVM.  On the flip side, gating
> the hypercall on the capability, and especially only the capability, creates
> weird scenarios where the guest can observe KVM_FEATURE_HC_PAGE_ENC_STATUS=1
> but fail the hypercall.  Those would be fairly clearcut VMM bugs, but at the
> same time KVM is essentially going out of its way to manufacture the problem.

Doh, I was thinking of the MSR behavior, not the hypercall.  I'll respond to
Paolo's mail, I have one more hail mary idea.


Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Ashish Kalra wrote:
> On Tue, Apr 20, 2021 at 05:31:07PM +0000, Sean Christopherson wrote:
> > On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> > > + case KVM_HC_PAGE_ENC_STATUS: {
> > > + u64 gpa = a0, npages = a1, enc = a2;
> > > +
> > > + ret = -KVM_ENOSYS;
> > > + if (!vcpu->kvm->arch.hypercall_exit_enabled)
> > 
> > I don't follow, why does the hypercall need to be gated by a capability?
> > What would break if this were changed to?
> > 
> > if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
> > 
> 
> But, the above indicates host support for page_enc_status_hc, so we want
> to ensure that host supports and has enabled support for the hypercall
> exit, i.e., hypercall has been enabled.

I still don't see how parroting back KVM_GET_SUPPORTED_CPUID, i.e. "unintentionally"
setting KVM_FEATURE_HC_PAGE_ENC_STATUS, would break anything.  Sure, the guest
does unnecessary hypercalls, but they're eaten by KVM.  On the flip side, gating
the hypercall on the capability, and especially only the capability, creates
weird scenarios where the guest can observe KVM_FEATURE_HC_PAGE_ENC_STATUS=1
but fail the hypercall.  Those would be fairly clearcut VMM bugs, but at the
same time KVM is essentially going out of its way to manufacture the problem.


Re: [PATCH v5 1/3] KVM: nVMX: Sync L2 guest CET states between L1/L2

2021-04-20 Thread Sean Christopherson
On Fri, Apr 09, 2021, Yang Weijiang wrote:
> These fields are rarely updated by L1 QEMU/KVM, sync them when L1 is trying to
> read/write them and after they're changed. If CET guest entry-load bit is not
> set by the L1 guest, migrate them to L2 manually.
> 
> Opportunistically remove one blank line in previous patch.
> 
> Suggested-by: Sean Christopherson 
> Signed-off-by: Yang Weijiang 
> ---
>  arch/x86/kvm/cpuid.c  |  1 -
>  arch/x86/kvm/vmx/nested.c | 30 ++
>  arch/x86/kvm/vmx/vmx.h|  3 +++
>  3 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index d191de769093..8692f53b8cd0 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -143,7 +143,6 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
>   }
>   vcpu->arch.guest_supported_xss =
>   (((u64)best->edx << 32) | best->ecx) & supported_xss;
> -
>   } else {
>   vcpu->arch.guest_supported_xss = 0;
>   }
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 9728efd529a1..87beb1c034e1 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2516,6 +2516,13 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>   vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
>  
>   set_cr4_guest_host_mask(vmx);
> +
> + if (kvm_cet_supported() && vmx->nested.nested_run_pending &&
> + (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)) {
> + vmcs_writel(GUEST_SSP, vmcs12->guest_ssp);
> + vmcs_writel(GUEST_S_CET, vmcs12->guest_s_cet);
> + vmcs_writel(GUEST_INTR_SSP_TABLE, vmcs12->guest_ssp_tbl);
> + }
>  }
>  
>  /*
> @@ -2556,6 +2563,15 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>   if (kvm_mpx_supported() && (!vmx->nested.nested_run_pending ||
>   !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)))
>   vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
> +
> + if (kvm_cet_supported() && (!vmx->nested.nested_run_pending ||
> + !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE))) {
> + vmcs_writel(GUEST_SSP, vmx->nested.vmcs01_guest_ssp);
> + vmcs_writel(GUEST_S_CET, vmx->nested.vmcs01_guest_s_cet);
> + vmcs_writel(GUEST_INTR_SSP_TABLE,
> + vmx->nested.vmcs01_guest_ssp_tbl);
> + }
> +
>   vmx_set_rflags(vcpu, vmcs12->guest_rflags);
>  
>   /* EXCEPTION_BITMAP and CR0_GUEST_HOST_MASK should basically be the
> @@ -3375,6 +3391,11 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>   if (kvm_mpx_supported() &&
>   !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
>   vmx->nested.vmcs01_guest_bndcfgs = vmcs_read64(GUEST_BNDCFGS);
> + if (kvm_cet_supported() && !vmx->nested.nested_run_pending) {

This needs to be:

if (kvm_cet_supported() && (!vmx->nested.nested_run_pending ||
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_CET_STATE)))

otherwise the vmcs01_* members will be stale when emulating VM-Enter with
vcmc12.vm_entry_controls.LOAD_CET_STATE=0.

> + vmx->nested.vmcs01_guest_ssp = vmcs_readl(GUEST_SSP);
> + vmx->nested.vmcs01_guest_s_cet = vmcs_readl(GUEST_S_CET);
> + vmx->nested.vmcs01_guest_ssp_tbl = vmcs_readl(GUEST_INTR_SSP_TABLE);
> + }


Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> From ef78673f78e3f2eedc498c1fbf9271146caa83cb Mon Sep 17 00:00:00 2001
> From: Ashish Kalra 
> Date: Thu, 15 Apr 2021 15:57:02 +
> Subject: [PATCH 2/3] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall
> 
> This hypercall is used by the SEV guest to notify a change in the page
> encryption status to the hypervisor. The hypercall should be invoked
> only when the encryption attribute is changed from encrypted -> decrypted
> and vice versa. By default all guest pages are considered encrypted.
> 
> The hypercall exits to userspace to manage the guest shared regions and
> integrate with the userspace VMM's migration code.

...

> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index fd4a84911355..2bc353d1f356 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6766,3 +6766,14 @@ they will get passed on to user space. So user space still has to have
>  an implementation for these despite the in kernel acceleration.
>  
>  This capability is always enabled.
> +
> +8.32 KVM_CAP_EXIT_HYPERCALL
> +---
> +
> +:Capability: KVM_CAP_EXIT_HYPERCALL
> +:Architectures: x86
> +:Type: vm
> +
> +This capability, if enabled, will cause KVM to exit to userspace
> +with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.
> +Right now, the only such hypercall is KVM_HC_PAGE_ENC_STATUS.
> diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
> index cf62162d4be2..c9378d163b5a 100644
> --- a/Documentation/virt/kvm/cpuid.rst
> +++ b/Documentation/virt/kvm/cpuid.rst
> @@ -96,6 +96,11 @@ KVM_FEATURE_MSI_EXT_DEST_ID        15          guest checks this feature bit
>                                                before using extended destination
>                                                ID bits in MSI address bits 11-5.
>  
> +KVM_FEATURE_HC_PAGE_ENC_STATUS     16          guest checks this feature bit before
> +                                               using the page encryption state
> +                                               hypercall to notify the page state
> +                                               change

...

>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>  {
>   unsigned long nr, a0, a1, a2, a3, ret;
> @@ -8334,6 +8346,28 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   kvm_sched_yield(vcpu, a0);
>   ret = 0;
>   break;
> + case KVM_HC_PAGE_ENC_STATUS: {
> + u64 gpa = a0, npages = a1, enc = a2;
> +
> + ret = -KVM_ENOSYS;
> + if (!vcpu->kvm->arch.hypercall_exit_enabled)

I don't follow, why does the hypercall need to be gated by a capability?  What
would break if this were changed to?

if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))

> + break;
> +
> + if (!PAGE_ALIGNED(gpa) || !npages ||
> + gpa_to_gfn(gpa) + npages <= gpa_to_gfn(gpa)) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + vcpu->run->exit_reason    = KVM_EXIT_HYPERCALL;
> + vcpu->run->hypercall.nr   = KVM_HC_PAGE_ENC_STATUS;
> + vcpu->run->hypercall.args[0]  = gpa;
> + vcpu->run->hypercall.args[1]  = npages;
> + vcpu->run->hypercall.args[2]  = enc;
> + vcpu->run->hypercall.longmode = op_64_bit;
> + vcpu->arch.complete_userspace_io = complete_hypercall_exit;
> + return 0;
> + }
>   default:
>   ret = -KVM_ENOSYS;
>   break;

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 590cc811c99a..d696a9f13e33 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3258,6 +3258,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   vcpu->arch.msr_kvm_poll_control = data;
>   break;
>  
> + case MSR_KVM_MIGRATION_CONTROL:
> + if (data & ~KVM_PAGE_ENC_STATUS_UPTODATE)
> + return 1;
> +
> + if (data && !guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))

Why let the guest write '0'?  Letting the guest do WRMSR but not RDMSR is
bizarre.

> + return 1;
> + break;
> +
>   case MSR_IA32_MCG_CTL:
>   case MSR_IA32_MCG_STATUS:
>   case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
> @@ -3549,6 +3557,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF))
>   return 1;
>  
> + msr_info->data = 0;
> + break;
> + case MSR_KVM_MIGRATION_CONTROL:
> + if (!guest_pv_has(vcpu, KVM_FEATURE_HC_PAGE_ENC_STATUS))
> + return 1;
> +
>   msr_info->data = 0;
>   

Re: [PATCH v13 08/12] KVM: X86: Introduce KVM_HC_PAGE_ENC_STATUS hypercall

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> On 15/04/21 17:57, Ashish Kalra wrote:
> > From: Ashish Kalra 
> > 
> > This hypercall is used by the SEV guest to notify a change in the page
> > encryption status to the hypervisor. The hypercall should be invoked
> > only when the encryption attribute is changed from encrypted -> decrypted
> > and vice versa. By default all guest pages are considered encrypted.
> > 
> > The hypercall exits to userspace to manage the guest shared regions and
> > integrate with the userspace VMM's migration code.
> 
> I think this should be exposed to userspace as a capability, rather than as
> a CPUID bit.  Userspace then can enable the capability and set the CPUID bit
> if it wants.
> 
> The reason is that userspace could pass KVM_GET_SUPPORTED_CPUID to
> KVM_SET_CPUID2 and the hypercall then would break the guest.

Right, and that's partly why I was advocating that KVM emulate the MSR as a nop.


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Kirill A. Shutemov wrote:
> On Mon, Apr 19, 2021 at 08:09:13PM +0000, Sean Christopherson wrote:
> > On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > > The critical question is whether we ever need to translate hva->pfn after
> > > the page is added to the guest private memory. I believe we do, but I
> > > never checked. And that's the reason we need to keep hwpoison entries
> > > around, which encode pfn.
> > 
> > As proposed in the TDX RFC, KVM would "need" the hva->pfn translation if the
> > guest private EPT entry was zapped, e.g. by NUMA balancing (which will fail on
> > the backend).  But in that case, KVM still has the original PFN, the "new"
> > translation becomes a sanity check to make sure that the zapped translation
> > wasn't moved unexpectedly.
> > 
> > Regardless, I don't see what that has to do with kvm_pfn_map.  At some point,
> > gup() has to fault in the page or look at the host PTE value.  For the latter,
> > at least on x86, we can throw info into the PTE itself to tag it as guest-only.
> > No matter what implementation we settle on, I think we've failed if we end up in
> > a situation where the primary MMU has pages it doesn't know are guest-only.
> 
> I try to understand if it's a problem if KVM sees a guest-only PTE, but
> it's for other VM. Like two VM's try to use the same tmpfs file as guest
> memory. We cannot insert the pfn into two TD/SEV guest at once, but can it
> cause other problems? I'm not sure.

For TDX and SNP, "firmware" will prevent assigning the same PFN to multiple VMs.

For SEV and SEV-ES, the PSP (what I'm calling "firmware") will not prevent
assigning the same page to multiple guests.  But the failure mode in that case,
assuming the guests have different ASIDs, is limited to corruption of the guest.

On the other hand, for SEV/SEV-ES it's not invalid to assign the same ASID to
multiple guests (there's an in-flight patch to do exactly that[*]), and sharing
PFNs between guests with the same ASID would also be valid.  In other words, if
we want to enforce PFN association in the kernel, I think the association should
be per-ASID, not per-KVM guest.

So, I don't think we _need_ to rely on the TDX/SNP behavior, but if leveraging
firmware to handle those checks means avoiding additional complexity in the
kernel, then I think it's worth leaning on firmware even if it means SEV/SEV-ES
don't enjoy the same level of robustness.

[*] https://lkml.kernel.org/r/20210408223214.2582277-1-na...@google.com


Re: [PATCH 0/3] KVM: x86: guest interface for SEV live migration

2021-04-20 Thread Sean Christopherson
On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> From 547d4d4edcd05fdfac6ce650d65db1d42bcd2807 Mon Sep 17 00:00:00 2001
> From: Paolo Bonzini 
> Date: Tue, 20 Apr 2021 05:49:11 -0400
> Subject: [PATCH 1/3] KVM: SEV: mask CPUID[0x801F].eax according to
>  supported features

Your mailer is obviously a bit wonky, took me a while to find this patch :-)
 
> Do not return the SEV-ES bit from KVM_GET_SUPPORTED_CPUID unless
> the corresponding module parameter is 1, and clear the memory encryption
> leaf completely if SEV is disabled.

Impeccable timing, I was planning on refreshing my SEV cleanup series[*] today.
There's going to be an annoying conflict with the svm_set_cpu_caps() change
(see below), any objection to folding your unintentional feedback into my
series?

[*] https://lkml.kernel.org/r/20210306015905.186698-1-sea...@google.com

> Signed-off-by: Paolo Bonzini 
> ---
>  arch/x86/kvm/cpuid.c   | 5 -
>  arch/x86/kvm/cpuid.h   | 1 +
>  arch/x86/kvm/svm/svm.c | 7 +++
>  3 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 2ae061586677..d791d1f093ab 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -944,8 +944,11 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array 
> *array, u32 function)
>   break;
>   /* Support memory encryption cpuid if host supports it */
>   case 0x8000001F:
> - if (!boot_cpu_has(X86_FEATURE_SEV))
> + if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
>   entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
> + break;
> + }
> + cpuid_entry_override(entry, CPUID_8000_001F_EAX);

I find this easier to read:

if (!kvm_cpu_cap_has(X86_FEATURE_SEV))
entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
else
cpuid_entry_override(entry, CPUID_8000_001F_EAX);

>   break;
>   /*Add support for Centaur's CPUID instruction*/
>   case 0xC0000000:
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index 888e88b42e8d..e873a60a4830 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -99,6 +99,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
>   [CPUID_7_EDX] = { 7, 0, CPUID_EDX},
>   [CPUID_7_1_EAX]   = { 7, 1, CPUID_EAX},
>   [CPUID_12_EAX]= {0x00000012, 0, CPUID_EAX},
> + [CPUID_8000_001F_EAX] = {0x8000001F, 0, CPUID_EAX},
>  };
>  
>  /*
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index cd8c333ed2dc..acdb8457289e 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -923,6 +923,13 @@ static __init void svm_set_cpu_caps(void)
>   if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
>   boot_cpu_has(X86_FEATURE_AMD_SSBD))
>   kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
> +
> + /* CPUID 0x8000001F */
> + if (sev) {
> + kvm_cpu_cap_set(X86_FEATURE_SEV);
> + if (sev_es)
> + kvm_cpu_cap_set(X86_FEATURE_SEV_ES);

Gah, I completely spaced on the module params in my series, which is more
problematic than normal because it also moves "sev" and "sev_es" to sev.c.  The
easy solution is to add sev_set_cpu_caps().

On the other hand, this misses SME_COHERENT.  I also think it makes sense to call
kvm_cpu_cap_mask() for the leaf, even if it's just to crush KVM's caps to zero.
However, because of SME_COHERENT and other potential bits in the future, I think
I prefer starting with the bits carried over from boot_cpu_data.  E.g.

kvm_cpu_cap_mask(CPUID_8000_001F_EAX,
0 /* SME */ | F(SEV) | 0 /* VM_PAGE_FLUSH */ | F(SEV_ES) |
F(SME_COHERENT));

and (with renamed module params):

if (!sev_enabled)
kvm_cpu_cap_clear(X86_FEATURE_SEV);
if (!sev_es_enabled)
kvm_cpu_cap_clear(X86_FEATURE_SEV_ES);

> + }
>  }
>  
>  static __init int svm_hardware_setup(void)
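
For reference, the suggested sev_set_cpu_caps() might look something like the
below.  This is a sketch assuming the module params get renamed to sev_enabled
and sev_es_enabled and move to sev.c; it is not the final code.

	static __init void sev_set_cpu_caps(void)
	{
		/*
		 * Start from the boot CPU's 0x8000001F bits so that e.g.
		 * SME_COHERENT is carried over, then clear whatever the
		 * module params disable.
		 */
		kvm_cpu_cap_mask(CPUID_8000_001F_EAX,
				 0 /* SME */ | F(SEV) | 0 /* VM_PAGE_FLUSH */ |
				 F(SEV_ES) | F(SME_COHERENT));

		if (!sev_enabled)
			kvm_cpu_cap_clear(X86_FEATURE_SEV);
		if (!sev_es_enabled)
			kvm_cpu_cap_clear(X86_FEATURE_SEV_ES);
	}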


Re: [PATCH v4 4/4] pinctrl: add rsel setting on MT8195

2021-04-19 Thread Sean Wang
On Mon, Apr 12, 2021 at 10:57 PM Zhiyong Tao  wrote:


> @@ -176,6 +180,12 @@ static int mtk_pinconf_get(struct pinctrl_dev *pctldev,
> else
> err = -ENOTSUPP;
> break;
> +   case MTK_PIN_CONFIG_RSEL:
> +   if (hw->soc->rsel_get)
> +   err = hw->soc->rsel_get(hw, desc, &ret);
> +   else
> +   err = -EOPNOTSUPP;

I think that should be -ENOTSUPP to align with the other occurrences.

> +   break;
> default:
> err = -ENOTSUPP;
> }
> @@ -295,6 +305,12 @@ static int mtk_pinconf_set(struct pinctrl_dev *pctldev, 
> unsigned int pin,
> else
> err = -ENOTSUPP;
> break;
> +   case MTK_PIN_CONFIG_RSEL:
> +   if (hw->soc->rsel_set)
> +   err = hw->soc->rsel_set(hw, desc, arg);
> +   else
> +   err = -EOPNOTSUPP;

Ditto

> +   break;
> default:
> err = -ENOTSUPP;
> }
> --
> 2.18.0
>


Re: [PATCH v4 3/4] pinctrl: add drive for I2C related pins on MT8195

2021-04-19 Thread Sean Wang
On Mon, Apr 12, 2021 at 10:57 PM Zhiyong Tao  wrote:
>
> This patch provides the advanced drive raw data setting version
> for I2C used pins on MT8195.
>
> Signed-off-by: Zhiyong Tao 

Acked-by: Sean Wang 

> ---
>  drivers/pinctrl/mediatek/pinctrl-mt8195.c | 22 +++
>  .../pinctrl/mediatek/pinctrl-mtk-common-v2.c  | 14 
>  .../pinctrl/mediatek/pinctrl-mtk-common-v2.h  |  5 +
>  3 files changed, 41 insertions(+)
>
> diff --git a/drivers/pinctrl/mediatek/pinctrl-mt8195.c 
> b/drivers/pinctrl/mediatek/pinctrl-mt8195.c
> index 063f164d7c9b..a7500e18bb1d 100644
> --- a/drivers/pinctrl/mediatek/pinctrl-mt8195.c
> +++ b/drivers/pinctrl/mediatek/pinctrl-mt8195.c
> @@ -760,6 +760,25 @@ static const struct mtk_pin_field_calc 
> mt8195_pin_drv_range[] = {
> PIN_FIELD_BASE(143, 143, 1, 0x020, 0x10, 24, 3),
>  };
>
> +static const struct mtk_pin_field_calc mt8195_pin_drv_adv_range[] = {
> +   PIN_FIELD_BASE(8, 8, 4, 0x020, 0x10, 15, 3),
> +   PIN_FIELD_BASE(9, 9, 4, 0x020, 0x10, 0, 3),
> +   PIN_FIELD_BASE(10, 10, 4, 0x020, 0x10, 18, 3),
> +   PIN_FIELD_BASE(11, 11, 4, 0x020, 0x10, 3, 3),
> +   PIN_FIELD_BASE(12, 12, 4, 0x020, 0x10, 21, 3),
> +   PIN_FIELD_BASE(13, 13, 4, 0x020, 0x10, 6, 3),
> +   PIN_FIELD_BASE(14, 14, 4, 0x020, 0x10, 24, 3),
> +   PIN_FIELD_BASE(15, 15, 4, 0x020, 0x10, 9, 3),
> +   PIN_FIELD_BASE(16, 16, 4, 0x020, 0x10, 27, 3),
> +   PIN_FIELD_BASE(17, 17, 4, 0x020, 0x10, 12, 3),
> +   PIN_FIELD_BASE(29, 29, 2, 0x020, 0x10, 0, 3),
> +   PIN_FIELD_BASE(30, 30, 2, 0x020, 0x10, 3, 3),
> +   PIN_FIELD_BASE(34, 34, 1, 0x040, 0x10, 0, 3),
> +   PIN_FIELD_BASE(35, 35, 1, 0x040, 0x10, 3, 3),
> +   PIN_FIELD_BASE(44, 44, 1, 0x040, 0x10, 6, 3),
> +   PIN_FIELD_BASE(45, 45, 1, 0x040, 0x10, 9, 3),
> +};
> +
>  static const struct mtk_pin_reg_calc mt8195_reg_cals[PINCTRL_PIN_REG_MAX] = {
> [PINCTRL_PIN_REG_MODE] = MTK_RANGE(mt8195_pin_mode_range),
> [PINCTRL_PIN_REG_DIR] = MTK_RANGE(mt8195_pin_dir_range),
> @@ -773,6 +792,7 @@ static const struct mtk_pin_reg_calc 
> mt8195_reg_cals[PINCTRL_PIN_REG_MAX] = {
> [PINCTRL_PIN_REG_PUPD] = MTK_RANGE(mt8195_pin_pupd_range),
> [PINCTRL_PIN_REG_R0] = MTK_RANGE(mt8195_pin_r0_range),
> [PINCTRL_PIN_REG_R1] = MTK_RANGE(mt8195_pin_r1_range),
> +   [PINCTRL_PIN_REG_DRV_ADV] = MTK_RANGE(mt8195_pin_drv_adv_range),
>  };
>
>  static const char * const mt8195_pinctrl_register_base_names[] = {
> @@ -801,6 +821,8 @@ static const struct mtk_pin_soc mt8195_data = {
> .bias_get_combo = mtk_pinconf_bias_get_combo,
> .drive_set = mtk_pinconf_drive_set_rev1,
> .drive_get = mtk_pinconf_drive_get_rev1,
> +   .adv_drive_get = mtk_pinconf_adv_drive_get_raw,
> +   .adv_drive_set = mtk_pinconf_adv_drive_set_raw,
>  };
>
>  static const struct of_device_id mt8195_pinctrl_of_match[] = {
> diff --git a/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c 
> b/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c
> index 72f17f26acd8..2b51f4a9b860 100644
> --- a/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c
> +++ b/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.c
> @@ -1027,6 +1027,20 @@ int mtk_pinconf_adv_drive_get(struct mtk_pinctrl *hw,
>  }
>  EXPORT_SYMBOL_GPL(mtk_pinconf_adv_drive_get);
>
> +int mtk_pinconf_adv_drive_set_raw(struct mtk_pinctrl *hw,
> + const struct mtk_pin_desc *desc, u32 arg)
> +{
> +   return mtk_hw_set_value(hw, desc, PINCTRL_PIN_REG_DRV_ADV, arg);
> +}
> +EXPORT_SYMBOL_GPL(mtk_pinconf_adv_drive_set_raw);
> +
> +int mtk_pinconf_adv_drive_get_raw(struct mtk_pinctrl *hw,
> + const struct mtk_pin_desc *desc, u32 *val)
> +{
> +   return mtk_hw_get_value(hw, desc, PINCTRL_PIN_REG_DRV_ADV, val);
> +}
> +EXPORT_SYMBOL_GPL(mtk_pinconf_adv_drive_get_raw);
> +
>  MODULE_LICENSE("GPL v2");
>  MODULE_AUTHOR("Sean Wang ");
>  MODULE_DESCRIPTION("Pin configuration library module for mediatek SoCs");
> diff --git a/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.h 
> b/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.h
> index e2aae285b5fc..fd5ce9c5dcbd 100644
> --- a/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.h
> +++ b/drivers/pinctrl/mediatek/pinctrl-mtk-common-v2.h
> @@ -66,6 +66,7 @@ enum {
> PINCTRL_PIN_REG_DRV_EN,
> PINCTRL_PIN_REG_DRV_E0,
> PINCTRL_PIN_REG_DRV_E1,
> +   PINCTRL_PIN_REG_DRV_ADV,
> PINCTRL_PIN_REG_MAX,
>  };
>
> @@ -314,6 +315,10 @@ int mtk_pinconf_adv_drive_set(struct mtk_pinctrl *hw,
>   const struct mtk_pin_desc *desc, u3

Re: [PATCH v4 2/4] pinctrl: add pinctrl driver on mt8195

2021-04-19 Thread Sean Wang
On Mon, Apr 12, 2021 at 10:57 PM Zhiyong Tao  wrote:
>
> This commit includes pinctrl driver for mt8195.
>
> Signed-off-by: Zhiyong Tao 

Acked-by: Sean Wang 

> ---
>  drivers/pinctrl/mediatek/Kconfig  |6 +
>  drivers/pinctrl/mediatek/Makefile |1 +
>  drivers/pinctrl/mediatek/pinctrl-mt8195.c |  828 
>  drivers/pinctrl/mediatek/pinctrl-mtk-mt8195.h | 1669 +
>  4 files changed, 2504 insertions(+)
>  create mode 100644 drivers/pinctrl/mediatek/pinctrl-mt8195.c
>  create mode 100644 drivers/pinctrl/mediatek/pinctrl-mtk-mt8195.h
>
> diff --git a/drivers/pinctrl/mediatek/Kconfig 
> b/drivers/pinctrl/mediatek/Kconfig
> index eef17f228669..90f0c8255eaf 100644
> --- a/drivers/pinctrl/mediatek/Kconfig
> +++ b/drivers/pinctrl/mediatek/Kconfig
> @@ -147,6 +147,12 @@ config PINCTRL_MT8192
> default ARM64 && ARCH_MEDIATEK
> select PINCTRL_MTK_PARIS
>
> +config PINCTRL_MT8195
> +   bool "Mediatek MT8195 pin control"
> +   depends on OF
> +   depends on ARM64 || COMPILE_TEST
> +   select PINCTRL_MTK_PARIS
> +
>  config PINCTRL_MT8516
> bool "Mediatek MT8516 pin control"
> depends on OF
> diff --git a/drivers/pinctrl/mediatek/Makefile 
> b/drivers/pinctrl/mediatek/Makefile
> index 01218bf4dc30..06fde993ace2 100644
> --- a/drivers/pinctrl/mediatek/Makefile
> +++ b/drivers/pinctrl/mediatek/Makefile
> @@ -21,5 +21,6 @@ obj-$(CONFIG_PINCTRL_MT8167)  += pinctrl-mt8167.o
>  obj-$(CONFIG_PINCTRL_MT8173)   += pinctrl-mt8173.o
>  obj-$(CONFIG_PINCTRL_MT8183)   += pinctrl-mt8183.o
>  obj-$(CONFIG_PINCTRL_MT8192)   += pinctrl-mt8192.o
> +obj-$(CONFIG_PINCTRL_MT8195)+= pinctrl-mt8195.o
>  obj-$(CONFIG_PINCTRL_MT8516)   += pinctrl-mt8516.o
>  obj-$(CONFIG_PINCTRL_MT6397)   += pinctrl-mt6397.o
> diff --git a/drivers/pinctrl/mediatek/pinctrl-mt8195.c 
> b/drivers/pinctrl/mediatek/pinctrl-mt8195.c
> new file mode 100644
> index ..063f164d7c9b
> --- /dev/null
> +++ b/drivers/pinctrl/mediatek/pinctrl-mt8195.c
> @@ -0,0 +1,828 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2020 MediaTek Inc.
> + *
> + * Author: Zhiyong Tao 
> + *
> + */
> +
> +#include "pinctrl-mtk-mt8195.h"
> +#include "pinctrl-paris.h"
> +
> +/* MT8195 have multiple bases to program pin configuration listed as the 
> below:
> + * iocfg[0]:0x10005000, iocfg[1]:0x11d1, iocfg[2]:0x11d3,
> + * iocfg[3]:0x11d4, iocfg[4]:0x11e2, iocfg[5]:0x11eb,
> + * iocfg[6]:0x11f4.
> + * _i_based could be used to indicate what base the pin should be mapped 
> into.
> + */
> +
> +#define PIN_FIELD_BASE(s_pin, e_pin, i_base, s_addr, x_addrs, s_bit, x_bits) 
> \
> +   PIN_FIELD_CALC(s_pin, e_pin, i_base, s_addr, x_addrs, s_bit, x_bits, \
> +  32, 0)
> +
> +#define PINS_FIELD_BASE(s_pin, e_pin, i_base, s_addr, x_addrs, s_bit, 
> x_bits) \
> +   PIN_FIELD_CALC(s_pin, e_pin, i_base, s_addr, x_addrs, s_bit, x_bits,  
> \
> +  32, 1)
> +
> +static const struct mtk_pin_field_calc mt8195_pin_mode_range[] = {
> +   PIN_FIELD(0, 144, 0x300, 0x10, 0, 4),
> +};
> +
> +static const struct mtk_pin_field_calc mt8195_pin_dir_range[] = {
> +   PIN_FIELD(0, 144, 0x0, 0x10, 0, 1),
> +};
> +
> +static const struct mtk_pin_field_calc mt8195_pin_di_range[] = {
> +   PIN_FIELD(0, 144, 0x200, 0x10, 0, 1),
> +};
> +
> +static const struct mtk_pin_field_calc mt8195_pin_do_range[] = {
> +   PIN_FIELD(0, 144, 0x100, 0x10, 0, 1),
> +};
> +
> +static const struct mtk_pin_field_calc mt8195_pin_ies_range[] = {
> +   PIN_FIELD_BASE(0, 0, 4, 0x040, 0x10, 0, 1),
> +   PIN_FIELD_BASE(1, 1, 4, 0x040, 0x10, 1, 1),
> +   PIN_FIELD_BASE(2, 2, 4, 0x040, 0x10, 2, 1),
> +   PIN_FIELD_BASE(3, 3, 4, 0x040, 0x10, 3, 1),
> +   PIN_FIELD_BASE(4, 4, 4, 0x040, 0x10, 4, 1),
> +   PIN_FIELD_BASE(5, 5, 4, 0x040, 0x10, 5, 1),
> +   PIN_FIELD_BASE(6, 6, 4, 0x040, 0x10, 6, 1),
> +   PIN_FIELD_BASE(7, 7, 4, 0x040, 0x10, 7, 1),
> +   PIN_FIELD_BASE(8, 8, 4, 0x040, 0x10, 13, 1),
> +   PIN_FIELD_BASE(9, 9, 4, 0x040, 0x10, 8, 1),
> +   PIN_FIELD_BASE(10, 10, 4, 0x040, 0x10, 14, 1),
> +   PIN_FIELD_BASE(11, 11, 4, 0x040, 0x10, 9, 1),
> +   PIN_FIELD_BASE(12, 12, 4, 0x040, 0x10, 15, 1),
> +   PIN_FIELD_BASE(13, 13, 4, 0x040, 0x10, 10, 1),
> +   PIN_FIELD_BASE(14, 14, 4, 0x040, 0x10, 16, 1),
> +   PIN_FIELD_BASE(15, 15, 4, 0x040, 0x10, 11, 1),
> +   PIN_FIELD_BASE(16, 16, 4, 0x040, 0x10, 17, 1),
> +   PIN_FIELD_BASE(17, 17, 4, 0x040, 0x10, 12, 

Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-19 Thread Sean Christopherson
On Tue, Apr 20, 2021, Paolo Bonzini wrote:
> On 19/04/21 17:09, Sean Christopherson wrote:
> > > - this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
> > > own interval-tree-based filter is also using a similar mechanism that is
> > > likewise not fair, so it should be okay.
> > 
> > The one concern I had with an unfair mechanism of this nature is that, in 
> > theory,
> > the memslot update could be blocked indefinitely.
> 
> Yep, that's why I mentioned it.
> 
> > > @@ -1333,9 +1351,22 @@ static struct kvm_memslots 
> > > *install_new_memslots(struct kvm *kvm,
> > >   WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
> > >   slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
> > > - down_write(&kvm->mmu_notifier_slots_lock);
> > > + /*
> > > +  * This cannot be an rwsem because the MMU notifier must not run
> > > +  * inside the critical section.  A sleeping rwsem cannot exclude
> > > +  * that.
> > 
> > How on earth did you decipher that from the splat?  I stared at it for a 
> > good
> > five minutes and was completely befuddled.
> 
> Just scratch that, it makes no sense.  It's much simpler, but you have
> to look at include/linux/mmu_notifier.h to figure it out:

LOL, glad you could figure it out, I wasn't getting anywhere, mmu_notifier.h or
not.

> invalidate_range_start
>   take pseudo lock
>   down_read()   (*)
>   release pseudo lock
> invalidate_range_end
>   take pseudo lock  (**)
>   up_read()
>   release pseudo lock
> 
> At point (*) we take the mmu_notifiers_slots_lock inside the pseudo lock;
> at point (**) we take the pseudo lock inside the mmu_notifiers_slots_lock.
> 
> This could cause a deadlock (ignoring for a second that the pseudo lock
> is not a lock):
> 
> - invalidate_range_start waits on down_read(), because the rwsem is
> held by install_new_memslots
> 
> - install_new_memslots waits on down_write(), because the rwsem is
> held till (another) invalidate_range_end finishes
> 
> - invalidate_range_end sits waits on the pseudo lock, held by
> invalidate_range_start.
> 
> Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
> it would change the *shared* rwsem readers into *shared recursive*
> readers).  This also means that there's no need for a raw spinlock.

Ahh, thanks, this finally made things click.

> Given this simple explanation, I think it's okay to include this

LOL, "simple".

> patch in the merge window pull request, with the fix after my
> signature squashed in.  The fix actually undoes a lot of the
> changes to __kvm_handle_hva_range that this patch made, so the
> result is relatively simple.  You can already find the result
> in kvm/queue.

...

>  static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> const struct kvm_hva_range 
> *range)
>  {
> @@ -515,10 +495,6 @@ static __always_inline int __kvm_handle_hva_range(struct 
> kvm *kvm,
>   idx = srcu_read_lock(&kvm->srcu);
> - if (range->must_lock &&
> - kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> - goto out_unlock;
> -
>   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>   slots = __kvm_memslots(kvm, i);
>   kvm_for_each_memslot(slot, slots) {
> @@ -547,8 +523,14 @@ static __always_inline int __kvm_handle_hva_range(struct 
> kvm *kvm,
>   gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE 
> - 1, slot);
>   gfn_range.slot = slot;
> - if (kvm_mmu_lock_and_check_handler(kvm, range, &locked))
> - goto out_unlock;
> + if (!locked) {
> + locked = true;
> + KVM_MMU_LOCK(kvm);
> + if (!IS_KVM_NULL_FN(range->on_lock))
> + range->on_lock(kvm, range->start, 
> range->end);
> + if (IS_KVM_NULL_FN(range->handler))
> + break;

This can/should be "goto out_unlock", "break" only takes us out of the memslots
walk, we want to get out of the address space loop.  Not a functional problem,
but we might walk all SMM memslots unnecessarily.

> + }
>   ret |= range->handler(kvm, &gfn_range);
>   }
> @@ -557,7 +539,6 @@ static __always_inline int __kvm_handle_hva_range(struct 
> kvm *kvm,
>   if (range->flush_on_ret 

Re: [PATCH v13 10/12] KVM: x86: Introduce new KVM_FEATURE_SEV_LIVE_MIGRATION feature & Custom MSR.

2021-04-19 Thread Sean Christopherson
On Thu, Apr 15, 2021, Ashish Kalra wrote:
> From: Ashish Kalra 
> 
> Add new KVM_FEATURE_SEV_LIVE_MIGRATION feature for guest to check
> for host-side support for SEV live migration. Also add a new custom
> MSR_KVM_SEV_LIVE_MIGRATION for guest to enable the SEV live migration
> feature.
> 
> MSR is handled by userspace using MSR filters.
> 
> Signed-off-by: Ashish Kalra 
> Reviewed-by: Steve Rutherford 
> ---
>  Documentation/virt/kvm/cpuid.rst |  5 +
>  Documentation/virt/kvm/msr.rst   | 12 
>  arch/x86/include/uapi/asm/kvm_para.h |  4 
>  arch/x86/kvm/cpuid.c |  3 ++-
>  4 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/cpuid.rst 
> b/Documentation/virt/kvm/cpuid.rst
> index cf62162d4be2..0bdb6cdb12d3 100644
> --- a/Documentation/virt/kvm/cpuid.rst
> +++ b/Documentation/virt/kvm/cpuid.rst
> @@ -96,6 +96,11 @@ KVM_FEATURE_MSI_EXT_DEST_ID15  guest 
> checks this feature bit
> before using extended 
> destination
> ID bits in MSI address bits 
> 11-5.
>  
> +KVM_FEATURE_SEV_LIVE_MIGRATION 16  guest checks this feature bit 
> before
> +   using the page encryption 
> state
> +   hypercall to notify the page 
> state
> +   change

Hrm, I think there are two separate things being intertwined: the hypercall to
communicate private/shared pages, and the MSR to control live migration.  More
thoughts below.

>  KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24  host will warn if no 
> guest-side
> per-cpu warps are expected in
> kvmclock
> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> index e37a14c323d2..020245d16087 100644
> --- a/Documentation/virt/kvm/msr.rst
> +++ b/Documentation/virt/kvm/msr.rst
> @@ -376,3 +376,15 @@ data:
>   write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
>   and check if there are more notifications pending. The MSR is available
>   if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
> +
> +MSR_KVM_SEV_LIVE_MIGRATION:
> +0x4b564d08
> +
> + Control SEV Live Migration features.
> +
> +data:
> +Bit 0 enables (1) or disables (0) host-side SEV Live Migration 
> feature,
> +in other words, this is guest->host communication that it's properly
> +handling the shared pages list.
> +
> +All other bits are reserved.
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
> b/arch/x86/include/uapi/asm/kvm_para.h
> index 950afebfba88..f6bfa138874f 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -33,6 +33,7 @@
>  #define KVM_FEATURE_PV_SCHED_YIELD   13
>  #define KVM_FEATURE_ASYNC_PF_INT 14
>  #define KVM_FEATURE_MSI_EXT_DEST_ID  15
> +#define KVM_FEATURE_SEV_LIVE_MIGRATION   16
>  
>  #define KVM_HINTS_REALTIME  0
>  
> @@ -54,6 +55,7 @@
>  #define MSR_KVM_POLL_CONTROL 0x4b564d05
>  #define MSR_KVM_ASYNC_PF_INT 0x4b564d06
>  #define MSR_KVM_ASYNC_PF_ACK 0x4b564d07
> +#define MSR_KVM_SEV_LIVE_MIGRATION   0x4b564d08
>  
>  struct kvm_steal_time {
>   __u64 steal;
> @@ -136,4 +138,6 @@ struct kvm_vcpu_pv_apf_data {
>  #define KVM_PV_EOI_ENABLED KVM_PV_EOI_MASK
>  #define KVM_PV_EOI_DISABLED 0x0
>  
> +#define KVM_SEV_LIVE_MIGRATION_ENABLED BIT_ULL(0)

Even though the intent is to "force" userspace to intercept the MSR, I think KVM
should at least emulate the legal bits as a nop.  Deferring completely to
userspace is rather bizarre as there's not really anything to justify KVM
getting involved.  It would also force userspace to filter the MSR just to
support the hypercall.

Somewhat of a nit, but I think we should do something like s/ENABLED/READY,
or maybe s/ENABLED/SAFE, in the bit name so that the semantics are more along
the lines of an announcement from the guest, as opposed to a command.  Treating
the bit as a hint/announcement makes it easier to bundle the hypercall and the
MSR together under a single feature, e.g. it's slightly more obvious that
userspace can ignore the MSR if it knows its use case doesn't need migration or
that it can't migrate its guest at will.

I also think we should drop the "SEV" part, especially since it sounds like the
feature flag also enumerates that the hypercall is available.

E.g. for the WRMSR side

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eca63625ae..10f90f8491 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3229,6 +3229,13 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)

vcpu->arch.msr_kvm_poll_control = data;
break;
+   case MSR_KVM_LIVE_MIGRATION_CONTROL:
+   
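+		/*
+		 * (Sketch, not part of the truncated diff above: the names
+		 * follow the READY/feature renames suggested earlier in this
+		 * mail and are placeholders, not final UAPI.)
+		 */
+		if (!guest_pv_has(vcpu, KVM_FEATURE_LIVE_MIGRATION_CONTROL))
+			return 1;
+
+		/* Only the READY bit is defined; reject everything else. */
+		if (data & ~KVM_LIVE_MIGRATION_READY)
+			return 1;
+
+		/*
+		 * Otherwise a nop: the bit is purely an announcement from the
+		 * guest; userspace intercepts the MSR via an MSR filter if it
+		 * actually cares about migration.
+		 */
+		break;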

Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Sean Christopherson
On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> On Mon, Apr 19, 2021 at 06:09:29PM +0000, Sean Christopherson wrote:
> > On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> > > On Mon, Apr 19, 2021 at 04:01:46PM +, Sean Christopherson wrote:
> > > > But fundamentally the private pages, are well, private.  They can't be 
> > > > shared
> > > > across processes, so I think we could (should?) require the VMA to 
> > > > always be
> > > > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. 
> > > > is that
> > > > enough to prevent userspace and unaware kernel code from acquiring a 
> > > > reference
> > > > to the underlying page?
> > > 
> > > Shared pages should be fine too (you folks wanted tmpfs support).
> > 
> > Is that a conflict though?  If the private->shared conversion request is 
> > kicked
> > out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?
> > 
> > Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
> > shared, and dirty data can't be written back to the file.
> 
> It can be remapped, but faulting in the page would produce hwpoison entry.

It sounds like you're thinking the whole tmpfs file is poisoned.  My thought is
that userspace would need to do something like for guest private memory:

mmap(NULL, guest_size, PROT_READ|PROT_WRITE, MAP_PRIVATE | 
MAP_GUEST_ONLY, fd, 0);

The MAP_GUEST_ONLY would be used by the kernel to ensure the resulting VMA can
only point at private/poisoned memory, e.g. on fault, the associated PFN would
be tagged with PG_hwpoison or whatever.  @fd in this case could point at tmpfs,
but I don't think it's a hard requirement.

On conversion to shared, userspace could then do:

munmap(<hva>, <size>)
mmap(<hva>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | 
MAP_FIXED_NOREPLACE, fd, <offset>);

or

mmap(<hva>, <size>, PROT_READ|PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 
<offset>);

or

ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <delete private memslot>);
mmap(NULL, <size>, PROT_READ|PROT_WRITE, MAP_SHARED, fd, <offset>);
ioctl(kvm, KVM_SET_USER_MEMORY_REGION, <add shared memslot>);

Combinations would also work, e.g. unmap the private range and move the memslot.
The private and shared memory regions could also be backed differently, e.g.
tmpfs for shared memory, anonymous for private memory.

> I don't see other way to make Google's use-case with tmpfs-backed guest
> memory work.

The underlying use-case is to be able to access guest memory from more than one
process, e.g. so that communication with the guest isn't limited to the VMM
process associated with the KVM instances.  By definition, guest private memory
can't be accessed by the host; I don't see how anyone, Google included, can have
any real requirements about host access to guest private memory.

> > > The poisoned pages must be useless outside of the process with the blessed
> > > struct kvm. See kvm_pfn_map in the patch.
> > 
> > The big requirement for kernel TDX support is that the pages are useless in 
> > the
> > host.  Regarding the guest, for TDX, the TDX Module guarantees that at most 
> > a
> > single KVM guest can have access to a page at any given time.  I believe 
> > the RMP
> > provides the same guarantees for SEV-SNP.
> > 
> > SEV/SEV-ES could still end up with corruption if multiple guests map the 
> > same
> > private page, but that's obviously not the end of the world since it's the 
> > status
> > quo today.  Living with that shortcoming might be a worthy tradeoff if 
> > punting
> > mutual exclusion between guests to firmware/hardware allows us to simplify 
> > the
> > kernel implementation.
> 
> The critical question is whether we ever need to translate hva->pfn after
> the page is added to the guest private memory. I believe we do, but I
> never checked. And that's the reason we need to keep hwpoison entries
> around, which encode pfn.

As proposed in the TDX RFC, KVM would "need" the hva->pfn translation if the
guest private EPT entry was zapped, e.g. by NUMA balancing (which will fail on
the backend).  But in that case, KVM still has the original PFN, the "new"
translation becomes a sanity check to make sure that the zapped translation
wasn't moved unexpectedly.

Regardless, I don't see what that has to do with kvm_pfn_map.  At some point,
gup() has to fault in the page or look at the host PTE value.  For the latter,
at least on x86, we can throw info into the PTE itself to tag it as guest-only.
No matter what implementation we settle on, I think we've failed if we end up in
a situation where the primary MMU has pages it doesn't know are guest-only.

> If we don't, it would simplify the solution: kvm_pfn_map is not needed.
> Single bit-per page would be enough.


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Sean Christopherson
On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> On Mon, Apr 19, 2021 at 04:01:46PM +0000, Sean Christopherson wrote:
> > But fundamentally the private pages, are well, private.  They can't be 
> > shared
> > across processes, so I think we could (should?) require the VMA to always be
> > MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is 
> > that
> > enough to prevent userspace and unaware kernel code from acquiring a 
> > reference
> > to the underlying page?
> 
> Shared pages should be fine too (you folks wanted tmpfs support).

Is that a conflict though?  If the private->shared conversion request is kicked
out to userspace, then userspace can re-mmap() the files as MAP_SHARED, no?

Allowing MAP_SHARED for guest private memory feels wrong.  The data can't be
shared, and dirty data can't be written back to the file.

> The poisoned pages must be useless outside of the process with the blessed
> struct kvm. See kvm_pfn_map in the patch.

The big requirement for kernel TDX support is that the pages are useless in the
host.  Regarding the guest, for TDX, the TDX Module guarantees that at most a
single KVM guest can have access to a page at any given time.  I believe the RMP
provides the same guarantees for SEV-SNP.

SEV/SEV-ES could still end up with corruption if multiple guests map the same
private page, but that's obviously not the end of the world since it's the 
status
quo today.  Living with that shortcoming might be a worthy tradeoff if punting
mutual exclusion between guests to firmware/hardware allows us to simplify the
kernel implementation.

> > >  - Add a new GUP flag to retrive such pages from the userspace mapping.
> > >Used only for private mapping population.
> > 
> > >  - Shared gfn ranges managed by userspace, based on hypercalls from the
> > >guest.
> > > 
> > >  - Shared mappings get populated via normal VMA. Any poisoned pages here
> > >would lead to SIGBUS.
> > > 
> > > So far it looks pretty straight-forward.
> > > 
> > > The only thing that I don't understand is at way point the page gets tied
> > > to the KVM instance. Currently we do it just before populating shadow
> > > entries, but it would not work with the new scheme: as we poison pages
> > > on fault it they may never get inserted into shadow entries. That's not
> > > good as we rely on the info to unpoison page on free.
> > 
> > Can you elaborate on what you mean by "unpoison"?  If the page is never 
> > actually
> > mapped into the guest, then its poisoned status is nothing more than a 
> > software
> > flag, i.e. nothing extra needs to be done on free.
> 
> Normally, poisoned flag preserved for freed pages as it usually indicate
> hardware issue. In this case we need return page to the normal circulation.
> So we need a way to differentiate two kinds of page poison. Current patch
> does this by adding page's pfn to kvm_pfn_map. But this will not work if
> we uncouple poisoning and adding to shadow PTE.

Why use PG_hwpoison then?


Re: [PATCH] KVM: Boost vCPU candidiate in user mode which is delivering interrupt

2021-04-19 Thread Sean Christopherson
On Mon, Apr 19, 2021, Wanpeng Li wrote:
> On Sat, 17 Apr 2021 at 21:09, Paolo Bonzini  wrote:
> >
> > On 16/04/21 05:08, Wanpeng Li wrote:
> > > From: Wanpeng Li 
> > >
> > > Both lock holder vCPU and IPI receiver that has halted are condidate for
> > > boost. However, the PLE handler was originally designed to deal with the
> > > lock holder preemption problem. The Intel PLE occurs when the spinlock
> > > waiter is in kernel mode. This assumption doesn't hold for IPI receiver,
> > > they can be in either kernel or user mode. the vCPU candidate in user mode
> > > will not be boosted even if they should respond to IPIs. Some benchmarks
> > > like pbzip2, swaptions etc do the TLB shootdown in kernel mode and most
> > > of the time they are running in user mode. It can lead to a large number
> > > of continuous PLE events because the IPI sender causes PLE events
> > > repeatedly until the receiver is scheduled while the receiver is not
> > > candidate for a boost.
> > >
> > > This patch boosts the vCPU candidiate in user mode which is delivery
> > > interrupt. We can observe the speed of pbzip2 improves 10% in 96 vCPUs
> > > VM in over-subscribe scenario (The host machine is 2 socket, 48 cores,
> > > 96 HTs Intel CLX box). There is no performance regression for other
> > > benchmarks like Unixbench spawn (most of the time contend read/write
> > > lock in kernel mode), ebizzy (most of the time contend read/write sem
> > > and TLB shoodtdown in kernel mode).
> > >
> > > +bool kvm_arch_interrupt_delivery(struct kvm_vcpu *vcpu)
> > > +{
> > > + if (vcpu->arch.apicv_active && 
> > > static_call(kvm_x86_dy_apicv_has_pending_interrupt)(vcpu))
> > > + return true;
> > > +
> > > + return false;
> > > +}
> >
> > Can you reuse vcpu_dy_runnable instead of this new function?
> 
> I have some concerns. For x86 arch, vcpu_dy_runnable() will add extra
> vCPU candidates by KVM_REQ_EVENT

Is bringing in KVM_REQ_EVENT a bad thing though?  I don't see how using apicv is
special in this case.  apicv is more precise and so there will be fewer false
positives, but it's still just a guess on KVM's part since the interrupt could
be for something completely unrelated.

If false positives are a big concern, what about adding another pass to the loop
and only yielding to usermode vCPUs with interrupts in the second full pass?
I.e. give vCPUs that are already in kernel mode priority, and only yield to
handle an interrupt if there are no vCPUs in kernel mode.
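
Something like this (a rough sketch against kvm_vcpu_on_spin(), eliding the
existing round-robin/eligibility bookkeeping; kvm_arch_interrupt_delivery() is
the helper from this patch, whatever it ends up being named):

	for (pass = 0; pass < 2 && !yielded; pass++) {
		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (vcpu == me || !READ_ONCE(vcpu->preempted))
				continue;
			/* Pass 0: only yield to vCPUs preempted in kernel mode. */
			if (pass == 0 && !kvm_arch_vcpu_in_kernel(vcpu))
				continue;
			/* Pass 1: usermode vCPUs with a pending interrupt. */
			if (pass == 1 && !kvm_arch_interrupt_delivery(vcpu))
				continue;
			if (kvm_vcpu_yield_to(vcpu) > 0) {
				yielded = true;
				break;
			}
		}
	}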

kvm_arch_dy_runnable() pulls in pv_unhalted, which seems like a good thing.

> and async pf(which has already opportunistically made the guest do other 
> stuff).

Any reason not to use kvm_arch_dy_runnable() directly?

> For other arches, kvm_arch_dy_runnale() is equal to kvm_arch_vcpu_runnable()
> except powerpc which has too many events and is not conservative. In general,
> vcpu_dy_runnable() will loose the conditions and add more vCPU candidates.
> 
> Wanpeng


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-19 Thread Sean Christopherson
On Mon, Apr 19, 2021, Kirill A. Shutemov wrote:
> On Fri, Apr 16, 2021 at 05:30:30PM +0000, Sean Christopherson wrote:
> > I like the idea of using "special" PTE value to denote guest private memory,
> > e.g. in this RFC, HWPOISON.  But I strongly dislike having KVM involved in 
> > the
> > manipulation of the special flag/value.
> > 
> > Today, userspace owns the gfn->hva translations and the kernel effectively 
> > owns
> > the hva->pfn translations (with input from userspace).  KVM just connects 
> > the
> > dots.
> > 
> > Having KVM own the shared/private transitions means KVM is now part owner 
> > of the
> > entire gfn->hva->pfn translation, i.e. KVM is effectively now a secondary 
> > MMU
> > and a co-owner of the primary MMU.  This creates locking madness, e.g. KVM 
> > taking
> > mmap_sem for write, mmu_lock under page lock, etc..., and also takes 
> > control away
> > from userspace.  E.g. userspace strategy could be to use a separate 
> > backing/pool
> > for shared memory and change the gfn->hva translation (memslots) in 
> > reaction to
> > a shared/private conversion.  Automatically swizzling things in KVM takes 
> > away
> > that option.
> > 
> > IMO, KVM should be entirely "passive" in this process, e.g. the guest 
> > shares or
> > protects memory, userspace calls into the kernel to change state, and the 
> > kernel
> > manages the page tables to prevent bad actors.  KVM simply does the 
> > plumbing for
> > the guest page tables.
> 
> That's a new perspective for me. Very interesting.
> 
> Let's see how it can look like:
> 
>  - KVM only allows poisoned pages (or whatever flag we end up using for
>protection) in the private mappings. SIGBUS otherwise.
> 
>  - Poisoned pages must be tied to the KVM instance to be allowed in the
>private mappings. Like kvm->id in the current prototype. SIGBUS
>otherwise.
> 
>  - Pages get poisoned on fault in if the VMA has a new vmflag set.
> 
>  - Fault in of a poisoned page leads to hwpoison entry. Userspace cannot
>access such pages.
> 
>  - Poisoned pages produced this way get unpoisoned on free.
> 
>  - The new VMA flag set by userspace. mprotect(2)?

Ya, or mmap(), though I'm not entirely sure a VMA flag would suffice.  The
notion of the page being private is tied to the PFN, which would suggest "struct
page" needs to be involved.

But fundamentally the private pages, are well, private.  They can't be shared
across processes, so I think we could (should?) require the VMA to always be
MAP_PRIVATE.  Does that buy us enough to rely on the VMA alone?  I.e. is that
enough to prevent userspace and unaware kernel code from acquiring a reference
to the underlying page?

>  - Add a new GUP flag to retrive such pages from the userspace mapping.
>Used only for private mapping population.

>  - Shared gfn ranges managed by userspace, based on hypercalls from the
>guest.
> 
>  - Shared mappings get populated via normal VMA. Any poisoned pages here
>would lead to SIGBUS.
> 
> So far it looks pretty straight-forward.
> 
> The only thing that I don't understand is at way point the page gets tied
> to the KVM instance. Currently we do it just before populating shadow
> entries, but it would not work with the new scheme: as we poison pages
> on fault it they may never get inserted into shadow entries. That's not
> good as we rely on the info to unpoison page on free.

Can you elaborate on what you mean by "unpoison"?  If the page is never actually
mapped into the guest, then its poisoned status is nothing more than a software
flag, i.e. nothing extra needs to be done on free.  If the page is mapped into
the guest, then KVM can be made responsible for reinitializing the page with
keyid=0 when the page is removed from the guest.

The TDX Module prevents mapping the same PFN into multiple guests, so the kernel
doesn't actually have to care _which_ KVM instance(s) is associated with a page,
it only needs to prevent installing valid PTEs in the host page tables.

> Maybe we should tie VMA to the KVM instance on setting the vmflags?
> I donno.
> 
> Any comments?
> 
> -- 
>  Kirill A. Shutemov


Re: [PATCH v2 09/10] KVM: Don't take mmu_lock for range invalidation unless necessary

2021-04-19 Thread Sean Christopherson
On Mon, Apr 19, 2021, Paolo Bonzini wrote:
> On 19/04/21 10:49, Wanpeng Li wrote:
> > I saw this splatting:
> > 
> >   ==
> >   WARNING: possible circular locking dependency detected
> >   5.12.0-rc3+ #6 Tainted: G   OE
> >   --
> >   qemu-system-x86/3069 is trying to acquire lock:
> >   9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0},
> > at: __mmu_notifier_invalidate_range_end+0x5/0x190
> > 
> >   but task is already holding lock:
>   aff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at:
> > kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
> 
> I guess it is possible to open-code the wait using a readers count and a
> spinlock (see patch after signature).  This allows including the
> rcu_assign_pointer in the same critical section that checks the number
> of readers.  Also on the plus side, the init_rwsem() is replaced by
> slightly nicer code.

Ugh, the count approach is nearly identical to Ben's original code.  Using a
rwsem seemed so clever :-/

> IIUC this could be extended to non-sleeping invalidations too, but I
> am not really sure about that.

Yes, that should be fine.

> There are some issues with the patch though:
> 
> - I am not sure if this should be a raw spin lock to avoid the same issue
> on PREEMPT_RT kernel.  That said the critical section is so tiny that using
> a raw spin lock may make sense anyway

If using spinlock_t is problematic, wouldn't mmu_lock already be an issue?  Or
am I misunderstanding your concern?

> - this loses the rwsem fairness.  On the other hand, mm/mmu_notifier.c's
> own interval-tree-based filter is also using a similar mechanism that is
> likewise not fair, so it should be okay.

The one concern I had with an unfair mechanism of this nature is that, in 
theory,
the memslot update could be blocked indefinitely.

> Any opinions?  For now I placed the change below in kvm/queue, but I'm
> leaning towards delaying this optimization to the next merge window.

I think delaying it makes sense.

> @@ -1333,9 +1351,22 @@ static struct kvm_memslots 
> *install_new_memslots(struct kvm *kvm,
>   WARN_ON(gen & KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS);
>   slots->generation = gen | KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS;
> - down_write(&kvm->mmu_notifier_slots_lock);
> + /*
> +  * This cannot be an rwsem because the MMU notifier must not run
> +  * inside the critical section.  A sleeping rwsem cannot exclude
> +  * that.

How on earth did you decipher that from the splat?  I stared at it for a good
five minutes and was completely befuddled.

> +  */
> + spin_lock(&kvm->mn_invalidate_lock);
> + prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
> + while (kvm->mn_active_invalidate_count) {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> + spin_unlock(&kvm->mn_invalidate_lock);
> + schedule();
> + spin_lock(&kvm->mn_invalidate_lock);
> + }
> + finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
>   rcu_assign_pointer(kvm->memslots[as_id], slots);
> - up_write(&kvm->mmu_notifier_slots_lock);
> + spin_unlock(&kvm->mn_invalidate_lock);
>   synchronize_srcu_expedited(&kvm->srcu);
> 


Re: [RFCv2 13/13] KVM: unmap guest memory using poisoned pages

2021-04-16 Thread Sean Christopherson
On Fri, Apr 16, 2021, Kirill A. Shutemov wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1b404e4d7dd8..f8183386abe7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   kvm_sched_yield(vcpu->kvm, a0);
>   ret = 0;
>   break;
> + case KVM_HC_ENABLE_MEM_PROTECTED:
> + ret = kvm_protect_memory(vcpu->kvm);
> + break;
> + case KVM_HC_MEM_SHARE:
> + ret = kvm_share_memory(vcpu->kvm, a0, a1);

Can you take a look at a proposed hypercall interface for SEV live migration and
holler if you (or anyone else) thinks it will have extensibility issues?

https://lkml.kernel.org/r/93d7f2c2888315adc48905722574d89699edde33.1618498113.git.ashish.ka...@amd.com

> + break;
>   default:
>   ret = -KVM_ENOSYS;
>   break;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index fadaccb95a4c..cd2374802702 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct 
> kvm_vcpu *vcpu)
>  }
>  #endif
>  
> +#define KVM_NR_SHARED_RANGES 32
> +
>  /*
>   * Note:
>   * memslots are not sorted by id anymore, please use id_to_memslot()
> @@ -513,6 +515,10 @@ struct kvm {
>   pid_t userspace_pid;
>   unsigned int max_halt_poll_ns;
>   u32 dirty_ring_size;
> + bool mem_protected;
> + void *id;
> + int nr_shared_ranges;
> + struct range shared_ranges[KVM_NR_SHARED_RANGES];

Hard no for me.  IMO, anything that requires KVM to track shared/pinned pages in
a separate tree/array is non-starter.  More specific to TDX #MCs, KVM should not
be the canonical reference for the state of a page.

>  };
>  
>  #define kvm_err(fmt, ...) \

...

> @@ -1868,11 +1874,17 @@ static int hva_to_pfn_slow(unsigned long addr, bool 
> *async, bool write_fault,
>   flags |= FOLL_WRITE;
>   if (async)
>   flags |= FOLL_NOWAIT;
> + if (kvm->mem_protected)
> + flags |= FOLL_ALLOW_POISONED;

This is unsafe, only the flows that are mapping the PFN into the guest should
use ALLOW_POISONED, e.g. __kvm_map_gfn() should fail on a poisoned page.

>  
>   npages = get_user_pages_unlocked(addr, 1, &page, flags);
>   if (npages != 1)
>   return npages;
>  
> + if (IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) &&
> + kvm->mem_protected && !kvm_page_allowed(kvm, page))
> + return -EHWPOISON;
> +
>   /* map read fault as writable if possible */
>   if (unlikely(!write_fault) && writable) {
>   struct page *wpage;

...

> @@ -2338,19 +2350,93 @@ static int next_segment(unsigned long len, int offset)
>   return len;
>  }
>  
> -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
> -  void *data, int offset, int len)
> +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len)
> +{
> + int offset = offset_in_page(hva);
> + struct page *page;
> + int npages, seg;
> + void *vaddr;
> +
> + if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) ||
> + !kvm->mem_protected) {
> + return __copy_from_user(data, (void __user *)hva, len);
> + }
> +
> + might_fault();
> + kasan_check_write(data, len);
> + check_object_size(data, len, false);
> +
> + while ((seg = next_segment(len, offset)) != 0) {
> + npages = get_user_pages_unlocked(hva, 1, &page,
> +  FOLL_ALLOW_POISONED);
> + if (npages != 1)
> + return -EFAULT;
> +
> + if (!kvm_page_allowed(kvm, page))
> + return -EFAULT;
> +
> + vaddr = kmap_atomic(page);
> + memcpy(data, vaddr + offset, seg);
> + kunmap_atomic(vaddr);

Why is KVM allowed to access a poisoned page?  I would expect shared pages to
_not_ be poisoned.  Except for pure software emulation of SEV, KVM can't access
guest private memory.

> +
> + put_page(page);
> + len -= seg;
> + hva += seg;
> + data += seg;
> + offset = 0;
> + }
> +
> + return 0;
> +}

...
  
> @@ -2693,6 +2775,41 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, 
> gfn_t gfn)
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty);
>  
> +int kvm_protect_memory(struct kvm *kvm)
> +{
> + if (mmap_write_lock_killable(kvm->mm))
> + return -KVM_EINTR;
> +
> + kvm->mem_protected = true;
> + kvm_arch_flush_shadow_all(kvm);
> + mmap_write_unlock(kvm->mm);
> +
> + return 0;
> +}
> +
> +int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long 
> npages)
> +{
> + unsigned long end = gfn + npages;
> +
> + if (!npages || !IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY))

[PATCH v3 9/9] KVM: Move instrumentation-safe annotations for enter/exit to x86 code

2021-04-15 Thread Sean Christopherson
Drop the instrumentation_{begin,end}() annotations from the common KVM
guest enter/exit helpers, and massage the x86 code as needed to preserve
the necessary annotations.  x86 is the only architecture whose transition
flow is tagged as noinstr, and more specifically, it is the only
architecture for which instrumentation_{begin,end}() can be non-empty.

No other architecture supports CONFIG_STACK_VALIDATION=y, and s390 is the
only other architecture that supports CONFIG_DEBUG_ENTRY=y.  For
instrumentation annotations to be meaningful, both aforementioned configs
must be enabled.

Letting x86 deal with the annotations avoids unnecessary nops by
squashing back-to-back instrumentation-safe sequences.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.h   | 4 ++--
 include/linux/kvm_host.h | 9 +
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 285953e81777..b17857ac540b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -25,9 +25,9 @@ static __always_inline void kvm_guest_enter_irqoff(void)
instrumentation_begin();
trace_hardirqs_on_prepare();
lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-   instrumentation_end();
-
guest_enter_irqoff();
+   instrumentation_end();
+
lockdep_hardirqs_on(CALLER_ADDR0);
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 444d5f0225cb..e5eb64019f47 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -339,9 +339,7 @@ static __always_inline void guest_enter_irqoff(void)
 * This is running in ioctl context so its safe to assume that it's the
 * stime pending cputime to flush.
 */
-   instrumentation_begin();
vtime_account_guest_enter();
-   instrumentation_end();
 
/*
 * KVM does not hold any references to rcu protected data when it
@@ -351,21 +349,16 @@ static __always_inline void guest_enter_irqoff(void)
 * one time slice). Lets treat guest mode as quiescent state, just like
 * we do with user-mode execution.
 */
-   if (!context_tracking_guest_enter_irqoff()) {
-   instrumentation_begin();
+   if (!context_tracking_guest_enter_irqoff())
rcu_virt_note_context_switch(smp_processor_id());
-   instrumentation_end();
-   }
 }
 
 static __always_inline void guest_exit_irqoff(void)
 {
context_tracking_guest_exit_irqoff();
 
-   instrumentation_begin();
/* Flush the guest cputime we spent on the guest */
vtime_account_guest_exit();
-   instrumentation_end();
 }
 
 static inline void guest_exit(void)
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v3 8/9] KVM: x86: Consolidate guest enter/exit logic to common helpers

2021-04-15 Thread Sean Christopherson
Move the enter/exit logic in {svm,vmx}_vcpu_enter_exit() to common
helpers.  Opportunistically update the somewhat stale comment about the
updates needing to occur immediately after VM-Exit.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/svm/svm.c | 46 ++---
 arch/x86/kvm/vmx/vmx.c | 46 ++---
 arch/x86/kvm/x86.h | 52 ++
 3 files changed, 56 insertions(+), 88 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index bb2aa0dde7c5..0677595d07e5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3713,25 +3713,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu 
*vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
-   /*
-* VMENTER enables interrupts (host state), but the kernel state is
-* interrupts disabled when this is invoked. Also tell RCU about
-* it. This is the same logic as for exit_to_user_mode().
-*
-* This ensures that e.g. latency analysis on the host observes
-* guest mode as interrupt enabled.
-*
-* guest_enter_irqoff() informs context tracking about the
-* transition to guest mode and if enabled adjusts RCU state
-* accordingly.
-*/
-   instrumentation_begin();
-   trace_hardirqs_on_prepare();
-   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-   instrumentation_end();
-
-   guest_enter_irqoff();
-   lockdep_hardirqs_on(CALLER_ADDR0);
+   kvm_guest_enter_irqoff();
 
if (sev_es_guest(vcpu->kvm)) {
__svm_sev_es_vcpu_run(svm->vmcb_pa);
@@ -3745,31 +3727,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu 
*vcpu)
vmload(__sme_page_pa(sd->save_area));
}
 
-   /*
-* VMEXIT disables interrupts (host state), but tracing and lockdep
-* have them in state 'on' as recorded before entering guest mode.
-* Same as enter_from_user_mode().
-*
-* context_tracking_guest_exit_irqoff() restores host context and
-* reinstates RCU if enabled and required.
-*
-* This needs to be done before the below as native_read_msr()
-* contains a tracepoint and x86_spec_ctrl_restore_host() calls
-* into world and some more.
-*/
-   lockdep_hardirqs_off(CALLER_ADDR0);
-   context_tracking_guest_exit_irqoff();
-
-   instrumentation_begin();
-   /*
-* Account guest time when precise accounting based on context tracking
-* is enabled.  Tick based accounting is deferred until after IRQs that
-* occurred within the VM-Enter/VM-Exit "window" are handled.
-*/
-   if (vtime_accounting_enabled_this_cpu())
-   vtime_account_guest_exit();
-   trace_hardirqs_off_finish();
-   instrumentation_end();
+   kvm_guest_exit_irqoff();
 }
 
 static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5ae9dc197048..19b0e25bf598 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6600,25 +6600,7 @@ static fastpath_t vmx_exit_handlers_fastpath(struct 
kvm_vcpu *vcpu)
 static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
struct vcpu_vmx *vmx)
 {
-   /*
-* VMENTER enables interrupts (host state), but the kernel state is
-* interrupts disabled when this is invoked. Also tell RCU about
-* it. This is the same logic as for exit_to_user_mode().
-*
-* This ensures that e.g. latency analysis on the host observes
-* guest mode as interrupt enabled.
-*
-* guest_enter_irqoff() informs context tracking about the
-* transition to guest mode and if enabled adjusts RCU state
-* accordingly.
-*/
-   instrumentation_begin();
-   trace_hardirqs_on_prepare();
-   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-   instrumentation_end();
-
-   guest_enter_irqoff();
-   lockdep_hardirqs_on(CALLER_ADDR0);
+   kvm_guest_enter_irqoff();
 
/* L1D Flush includes CPU buffer clear to mitigate MDS */
if (static_branch_unlikely(_l1d_should_flush))
@@ -6634,31 +6616,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu 
*vcpu,
 
vcpu->arch.cr2 = native_read_cr2();
 
-   /*
-* VMEXIT disables interrupts (host state), but tracing and lockdep
-* have them in state 'on' as recorded before entering guest mode.
-* Same as enter_from_user_mode().
-*
-* context_tracking_guest_exit_irqoff() restores host context and
-* reinstates RCU if enabled and required.
-*
-* This needs to be done before the below as native_read_msr()
-* contains a tracepoint and x86_spec_ctr

[PATCH v3 2/9] context_tracking: Move guest exit vtime accounting to separate helpers

2021-04-15 Thread Sean Christopherson
From: Wanpeng Li 

Provide separate vtime accounting functions for guest exit instead of
open coding the logic within the context tracking code.  This will allow
KVM x86 to handle vtime accounting slightly differently when using
tick-based accounting.

Suggested-by: Thomas Gleixner 
Cc: Thomas Gleixner 
Cc: Michael Tokarev 
Cc: Christian Borntraeger 
Signed-off-by: Wanpeng Li 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 include/linux/context_tracking.h | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 200d30cb3a82..7cf03a8e5708 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -137,15 +137,20 @@ static __always_inline void 
context_tracking_guest_exit_irqoff(void)
__context_tracking_exit(CONTEXT_GUEST);
 }
 
-static __always_inline void guest_exit_irqoff(void)
+static __always_inline void vtime_account_guest_exit(void)
 {
-   context_tracking_guest_exit_irqoff();
-
-   instrumentation_begin();
if (vtime_accounting_enabled_this_cpu())
vtime_guest_exit(current);
else
current->flags &= ~PF_VCPU;
+}
+
+static __always_inline void guest_exit_irqoff(void)
+{
+   context_tracking_guest_exit_irqoff();
+
+   instrumentation_begin();
+   vtime_account_guest_exit();
instrumentation_end();
 }
 
@@ -166,12 +171,17 @@ static __always_inline void guest_enter_irqoff(void)
 
 static __always_inline void context_tracking_guest_exit_irqoff(void) { }
 
-static __always_inline void guest_exit_irqoff(void)
+static __always_inline void vtime_account_guest_exit(void)
 {
-   instrumentation_begin();
-   /* Flush the guest cputime we spent on the guest */
vtime_account_kernel(current);
current->flags &= ~PF_VCPU;
+}
+
+static __always_inline void guest_exit_irqoff(void)
+{
+   instrumentation_begin();
+   /* Flush the guest cputime we spent on the guest */
+   vtime_account_guest_exit();
instrumentation_end();
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v3 6/9] context_tracking: Consolidate guest enter/exit wrappers

2021-04-15 Thread Sean Christopherson
Consolidate the guest enter/exit wrappers, providing and tweaking stubs
as needed.  This will allow moving the wrappers under KVM without having
to bleed #ifdefs into the soon-to-be KVM code.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 include/linux/context_tracking.h | 65 
 1 file changed, 24 insertions(+), 41 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 1c05035396ad..e172a547b2d0 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -71,6 +71,19 @@ static inline void exception_exit(enum ctx_state prev_ctx)
}
 }
 
+static __always_inline bool context_tracking_guest_enter_irqoff(void)
+{
+   if (context_tracking_enabled())
+   __context_tracking_enter(CONTEXT_GUEST);
+
+   return context_tracking_enabled_this_cpu();
+}
+
+static __always_inline void context_tracking_guest_exit_irqoff(void)
+{
+   if (context_tracking_enabled())
+   __context_tracking_exit(CONTEXT_GUEST);
+}
 
 /**
  * ct_state() - return the current context tracking state if known
@@ -92,6 +105,9 @@ static inline void user_exit_irqoff(void) { }
 static inline enum ctx_state exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
+static inline bool context_tracking_guest_enter_irqoff(void) { return false; }
+static inline void context_tracking_guest_exit_irqoff(void) { }
+
 #endif /* !CONFIG_CONTEXT_TRACKING */
 
 #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
@@ -102,74 +118,41 @@ extern void context_tracking_init(void);
 static inline void context_tracking_init(void) { }
 #endif /* CONFIG_CONTEXT_TRACKING_FORCE */
 
-
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 /* must be called with irqs disabled */
 static __always_inline void guest_enter_irqoff(void)
 {
+   /*
+* This is running in ioctl context so its safe to assume that it's the
+* stime pending cputime to flush.
+*/
instrumentation_begin();
-   if (vtime_accounting_enabled_this_cpu())
-   vtime_guest_enter(current);
-   else
-   current->flags |= PF_VCPU;
+   vtime_account_guest_enter();
instrumentation_end();
 
-   if (context_tracking_enabled())
-   __context_tracking_enter(CONTEXT_GUEST);
-
-   /* KVM does not hold any references to rcu protected data when it
+   /*
+* KVM does not hold any references to rcu protected data when it
 * switches CPU into a guest mode. In fact switching to a guest mode
 * is very similar to exiting to userspace from rcu point of view. In
 * addition CPU may stay in a guest mode for quite a long time (up to
 * one time slice). Lets treat guest mode as quiescent state, just like
 * we do with user-mode execution.
 */
-   if (!context_tracking_enabled_this_cpu()) {
+   if (!context_tracking_guest_enter_irqoff()) {
instrumentation_begin();
rcu_virt_note_context_switch(smp_processor_id());
instrumentation_end();
}
 }
 
-static __always_inline void context_tracking_guest_exit_irqoff(void)
-{
-   if (context_tracking_enabled())
-   __context_tracking_exit(CONTEXT_GUEST);
-}
-
 static __always_inline void guest_exit_irqoff(void)
 {
context_tracking_guest_exit_irqoff();
 
-   instrumentation_begin();
-   vtime_account_guest_exit();
-   instrumentation_end();
-}
-
-#else
-static __always_inline void guest_enter_irqoff(void)
-{
-   /*
-* This is running in ioctl context so its safe
-* to assume that it's the stime pending cputime
-* to flush.
-*/
-   instrumentation_begin();
-   vtime_account_guest_enter();
-   rcu_virt_note_context_switch(smp_processor_id());
-   instrumentation_end();
-}
-
-static __always_inline void context_tracking_guest_exit_irqoff(void) { }
-
-static __always_inline void guest_exit_irqoff(void)
-{
instrumentation_begin();
/* Flush the guest cputime we spent on the guest */
vtime_account_guest_exit();
instrumentation_end();
 }
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
 
 static inline void guest_exit(void)
 {
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v3 7/9] context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain

2021-04-15 Thread Sean Christopherson
Move the guest enter/exit wrappers to kvm_host.h so that KVM can manage
its context tracking vs. vtime accounting without bleeding too many KVM
details into the context tracking code.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 include/linux/context_tracking.h | 45 
 include/linux/kvm_host.h | 45 
 2 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index e172a547b2d0..d4dc9c4d79aa 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -118,49 +118,4 @@ extern void context_tracking_init(void);
 static inline void context_tracking_init(void) { }
 #endif /* CONFIG_CONTEXT_TRACKING_FORCE */
 
-/* must be called with irqs disabled */
-static __always_inline void guest_enter_irqoff(void)
-{
-   /*
-* This is running in ioctl context so its safe to assume that it's the
-* stime pending cputime to flush.
-*/
-   instrumentation_begin();
-   vtime_account_guest_enter();
-   instrumentation_end();
-
-   /*
-* KVM does not hold any references to rcu protected data when it
-* switches CPU into a guest mode. In fact switching to a guest mode
-* is very similar to exiting to userspace from rcu point of view. In
-* addition CPU may stay in a guest mode for quite a long time (up to
-* one time slice). Lets treat guest mode as quiescent state, just like
-* we do with user-mode execution.
-*/
-   if (!context_tracking_guest_enter_irqoff()) {
-   instrumentation_begin();
-   rcu_virt_note_context_switch(smp_processor_id());
-   instrumentation_end();
-   }
-}
-
-static __always_inline void guest_exit_irqoff(void)
-{
-   context_tracking_guest_exit_irqoff();
-
-   instrumentation_begin();
-   /* Flush the guest cputime we spent on the guest */
-   vtime_account_guest_exit();
-   instrumentation_end();
-}
-
-static inline void guest_exit(void)
-{
-   unsigned long flags;
-
-   local_irq_save(flags);
-   guest_exit_irqoff();
-   local_irq_restore(flags);
-}
-
 #endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3b06d12ec37e..444d5f0225cb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -332,6 +332,51 @@ struct kvm_vcpu {
struct kvm_dirty_ring dirty_ring;
 };
 
+/* must be called with irqs disabled */
+static __always_inline void guest_enter_irqoff(void)
+{
+   /*
+* This is running in ioctl context so its safe to assume that it's the
+* stime pending cputime to flush.
+*/
+   instrumentation_begin();
+   vtime_account_guest_enter();
+   instrumentation_end();
+
+   /*
+* KVM does not hold any references to rcu protected data when it
+* switches CPU into a guest mode. In fact switching to a guest mode
+* is very similar to exiting to userspace from rcu point of view. In
+* addition CPU may stay in a guest mode for quite a long time (up to
+* one time slice). Lets treat guest mode as quiescent state, just like
+* we do with user-mode execution.
+*/
+   if (!context_tracking_guest_enter_irqoff()) {
+   instrumentation_begin();
+   rcu_virt_note_context_switch(smp_processor_id());
+   instrumentation_end();
+   }
+}
+
+static __always_inline void guest_exit_irqoff(void)
+{
+   context_tracking_guest_exit_irqoff();
+
+   instrumentation_begin();
+   /* Flush the guest cputime we spent on the guest */
+   vtime_account_guest_exit();
+   instrumentation_end();
+}
+
+static inline void guest_exit(void)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   guest_exit_irqoff();
+   local_irq_restore(flags);
+}
+
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 {
/*
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v3 1/9] context_tracking: Move guest exit context tracking to separate helpers

2021-04-15 Thread Sean Christopherson
From: Wanpeng Li 

Provide separate context tracking helpers for guest exit, the standalone
helpers will be called separately by KVM x86 in later patches to fix
tick-based accounting.

Suggested-by: Thomas Gleixner 
Cc: Thomas Gleixner 
Cc: Sean Christopherson 
Cc: Michael Tokarev 
Cc: Christian Borntraeger 
Signed-off-by: Wanpeng Li 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 include/linux/context_tracking.h | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index bceb06498521..200d30cb3a82 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -131,10 +131,15 @@ static __always_inline void guest_enter_irqoff(void)
}
 }
 
-static __always_inline void guest_exit_irqoff(void)
+static __always_inline void context_tracking_guest_exit_irqoff(void)
 {
if (context_tracking_enabled())
__context_tracking_exit(CONTEXT_GUEST);
+}
+
+static __always_inline void guest_exit_irqoff(void)
+{
+   context_tracking_guest_exit_irqoff();
 
instrumentation_begin();
if (vtime_accounting_enabled_this_cpu())
@@ -159,6 +164,8 @@ static __always_inline void guest_enter_irqoff(void)
instrumentation_end();
 }
 
+static __always_inline void context_tracking_guest_exit_irqoff(void) { }
+
 static __always_inline void guest_exit_irqoff(void)
 {
instrumentation_begin();
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v3 4/9] sched/vtime: Move vtime accounting external declarations above inlines

2021-04-15 Thread Sean Christopherson
Move the blob of external declarations (and their stubs) above the set of
inline definitions (and their stubs) for vtime accounting.  This will
allow a future patch to bring in more inline definitions without also
having to shuffle large chunks of code.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 include/linux/vtime.h | 94 +--
 1 file changed, 47 insertions(+), 47 deletions(-)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 041d6524d144..6a4317560539 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -10,53 +10,6 @@
 
 struct task_struct;
 
-/*
- * vtime_accounting_enabled_this_cpu() definitions/declarations
- */
-#if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE)
-
-static inline bool vtime_accounting_enabled_this_cpu(void) { return true; }
-extern void vtime_task_switch(struct task_struct *prev);
-
-#elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
-
-/*
- * Checks if vtime is enabled on some CPU. Cputime readers want to be careful
- * in that case and compute the tickless cputime.
- * For now vtime state is tied to context tracking. We might want to decouple
- * those later if necessary.
- */
-static inline bool vtime_accounting_enabled(void)
-{
-   return context_tracking_enabled();
-}
-
-static inline bool vtime_accounting_enabled_cpu(int cpu)
-{
-   return context_tracking_enabled_cpu(cpu);
-}
-
-static inline bool vtime_accounting_enabled_this_cpu(void)
-{
-   return context_tracking_enabled_this_cpu();
-}
-
-extern void vtime_task_switch_generic(struct task_struct *prev);
-
-static inline void vtime_task_switch(struct task_struct *prev)
-{
-   if (vtime_accounting_enabled_this_cpu())
-   vtime_task_switch_generic(prev);
-}
-
-#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
-
-static inline bool vtime_accounting_enabled_cpu(int cpu) {return false; }
-static inline bool vtime_accounting_enabled_this_cpu(void) { return false; }
-static inline void vtime_task_switch(struct task_struct *prev) { }
-
-#endif
-
 /*
  * Common vtime APIs
  */
@@ -94,6 +47,53 @@ static inline void vtime_account_hardirq(struct task_struct *tsk) { }
 static inline void vtime_flush(struct task_struct *tsk) { }
 #endif
 
+/*
+ * vtime_accounting_enabled_this_cpu() definitions/declarations
+ */
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE)
+
+static inline bool vtime_accounting_enabled_this_cpu(void) { return true; }
+extern void vtime_task_switch(struct task_struct *prev);
+
+#elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
+
+/*
+ * Checks if vtime is enabled on some CPU. Cputime readers want to be careful
+ * in that case and compute the tickless cputime.
+ * For now vtime state is tied to context tracking. We might want to decouple
+ * those later if necessary.
+ */
+static inline bool vtime_accounting_enabled(void)
+{
+   return context_tracking_enabled();
+}
+
+static inline bool vtime_accounting_enabled_cpu(int cpu)
+{
+   return context_tracking_enabled_cpu(cpu);
+}
+
+static inline bool vtime_accounting_enabled_this_cpu(void)
+{
+   return context_tracking_enabled_this_cpu();
+}
+
+extern void vtime_task_switch_generic(struct task_struct *prev);
+
+static inline void vtime_task_switch(struct task_struct *prev)
+{
+   if (vtime_accounting_enabled_this_cpu())
+   vtime_task_switch_generic(prev);
+}
+
+#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
+
+static inline bool vtime_accounting_enabled_cpu(int cpu) {return false; }
+static inline bool vtime_accounting_enabled_this_cpu(void) { return false; }
+static inline void vtime_task_switch(struct task_struct *prev) { }
+
+#endif
+
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 extern void irqtime_account_irq(struct task_struct *tsk, unsigned int offset);
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v3 5/9] sched/vtime: Move guest enter/exit vtime accounting to vtime.h

2021-04-15 Thread Sean Christopherson
Provide separate helpers for guest enter vtime accounting (in addition to
the existing guest exit helpers), and move all vtime accounting helpers
to vtime.h where the existing #ifdef infrastructure can be leveraged to
better delineate the different types of accounting.  This will also allow
future cleanups via deduplication of context tracking code.

Opportunistically delete the vtime_account_kernel() stub now that all
callers are wrapped with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 include/linux/context_tracking.h | 17 +---
 include/linux/vtime.h| 46 +++-
 2 files changed, 41 insertions(+), 22 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 7cf03a8e5708..1c05035396ad 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -137,14 +137,6 @@ static __always_inline void context_tracking_guest_exit_irqoff(void)
__context_tracking_exit(CONTEXT_GUEST);
 }
 
-static __always_inline void vtime_account_guest_exit(void)
-{
-   if (vtime_accounting_enabled_this_cpu())
-   vtime_guest_exit(current);
-   else
-   current->flags &= ~PF_VCPU;
-}
-
 static __always_inline void guest_exit_irqoff(void)
 {
context_tracking_guest_exit_irqoff();
@@ -163,20 +155,13 @@ static __always_inline void guest_enter_irqoff(void)
 * to flush.
 */
instrumentation_begin();
-   vtime_account_kernel(current);
-   current->flags |= PF_VCPU;
+   vtime_account_guest_enter();
rcu_virt_note_context_switch(smp_processor_id());
instrumentation_end();
 }
 
 static __always_inline void context_tracking_guest_exit_irqoff(void) { }
 
-static __always_inline void vtime_account_guest_exit(void)
-{
-   vtime_account_kernel(current);
-   current->flags &= ~PF_VCPU;
-}
-
 static __always_inline void guest_exit_irqoff(void)
 {
instrumentation_begin();
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 6a4317560539..3684487d01e1 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -3,21 +3,18 @@
 #define _LINUX_KERNEL_VTIME_H
 
 #include 
+#include 
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 #include 
 #endif
 
-
-struct task_struct;
-
 /*
  * Common vtime APIs
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
 extern void vtime_account_kernel(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
-#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
-static inline void vtime_account_kernel(struct task_struct *tsk) { }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -55,6 +52,18 @@ static inline void vtime_flush(struct task_struct *tsk) { }
 static inline bool vtime_accounting_enabled_this_cpu(void) { return true; }
 extern void vtime_task_switch(struct task_struct *prev);
 
+static __always_inline void vtime_account_guest_enter(void)
+{
+   vtime_account_kernel(current);
+   current->flags |= PF_VCPU;
+}
+
+static __always_inline void vtime_account_guest_exit(void)
+{
+   vtime_account_kernel(current);
+   current->flags &= ~PF_VCPU;
+}
+
 #elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
 
 /*
@@ -86,12 +95,37 @@ static inline void vtime_task_switch(struct task_struct *prev)
vtime_task_switch_generic(prev);
 }
 
+static __always_inline void vtime_account_guest_enter(void)
+{
+   if (vtime_accounting_enabled_this_cpu())
+   vtime_guest_enter(current);
+   else
+   current->flags |= PF_VCPU;
+}
+
+static __always_inline void vtime_account_guest_exit(void)
+{
+   if (vtime_accounting_enabled_this_cpu())
+   vtime_guest_exit(current);
+   else
+   current->flags &= ~PF_VCPU;
+}
+
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
-static inline bool vtime_accounting_enabled_cpu(int cpu) {return false; }
 static inline bool vtime_accounting_enabled_this_cpu(void) { return false; }
 static inline void vtime_task_switch(struct task_struct *prev) { }
 
+static __always_inline void vtime_account_guest_enter(void)
+{
+   current->flags |= PF_VCPU;
+}
+
+static __always_inline void vtime_account_guest_exit(void)
+{
+   current->flags &= ~PF_VCPU;
+}
+
 #endif
 
 
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v3 0/9] KVM: Fix tick-based accounting for x86 guests

2021-04-15 Thread Sean Christopherson
This is a continuation of Wanpeng's series[1] to fix tick-based CPU time
accounting on x86, with my cleanups[2] bolted on top.  The core premise of
Wanpeng's patches is preserved, but they are heavily stripped down.
Specifically, only the "guest exit" paths are split, and no code is
consolidated.  The intent is to do as little as possible in the three
patches that need to be backported.  Keeping those changes as small as
possible also meant that my cleanups did not need to unwind much 
refactoring.

On x86, tested CONFIG_VIRT_CPU_ACCOUNTING_GEN=y and =n, and with
CONFIG_DEBUG_ENTRY=y && CONFIG_VALIDATE_STACKS=y.  Compile tested arm64,
MIPS, PPC, and s390, the latter with CONFIG_DEBUG_ENTRY=y for giggles.

One last note: I elected to use vtime_account_guest_exit() in the x86 code
instead of open coding these equivalents:

if (vtime_accounting_enabled_this_cpu())
vtime_guest_exit(current);
...
if (!vtime_accounting_enabled_this_cpu())
current->flags &= ~PF_VCPU;

With CONFIG_VIRT_CPU_ACCOUNTING_GEN=n, this is a complete non-issue, but
for the =y case it means context_tracking_enabled_this_cpu() is being
checked back-to-back.  The redundant checks bug me, but open coding the
gory details in x86 or providing funky variants in vtime.h felt worse.
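
For the curious, the exit path with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y boils
down to roughly the following (illustrative sketch only, written against the
helpers introduced in this series; kvm_guest_exit_sketch() is a made-up name,
not something any patch adds):

        static __always_inline void kvm_guest_exit_sketch(void)
        {
                context_tracking_guest_exit_irqoff();

                /*
                 * Check #1: KVM checks so that tick-based accounting can be
                 * deferred until IRQs have been serviced ...
                 */
                if (vtime_accounting_enabled_this_cpu())
                        /*
                         * ... and check #2: with ACCOUNTING_GEN=y this helper
                         * re-evaluates the same per-CPU condition internally
                         * before calling vtime_guest_exit().
                         */
                        vtime_account_guest_exit();
        }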

Delta from Wanpeng's v2:

  - s/context_guest/context_tracking_guest, purely to match the existing
functions.  I have no strong opinion either way.
  - Split only the "exit" functions.
  - Partially open code vcpu_account_guest_exit() and
__vtime_account_guest_exit() in x86 to avoid churn when segueing into
my cleanups (see above).

[1] 
https://lkml.kernel.org/r/1618298169-3831-1-git-send-email-wanpen...@tencent.com
[2] https://lkml.kernel.org/r/20210413182933.1046389-1-sea...@google.com

Sean Christopherson (6):
  sched/vtime: Move vtime accounting external declarations above inlines
  sched/vtime: Move guest enter/exit vtime accounting to vtime.h
  context_tracking: Consolidate guest enter/exit wrappers
  context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain
  KVM: x86: Consolidate guest enter/exit logic to common helpers
  KVM: Move instrumentation-safe annotations for enter/exit to x86 code

Wanpeng Li (3):
  context_tracking: Move guest exit context tracking to separate helpers
  context_tracking: Move guest exit vtime accounting to separate helpers
  KVM: x86: Defer tick-based accounting 'til after IRQ handling

 arch/x86/kvm/svm/svm.c   |  39 +
 arch/x86/kvm/vmx/vmx.c   |  39 +
 arch/x86/kvm/x86.c   |   8 ++
 arch/x86/kvm/x86.h   |  52 
 include/linux/context_tracking.h |  92 -
 include/linux/kvm_host.h |  38 +
 include/linux/vtime.h| 138 +++
 7 files changed, 204 insertions(+), 202 deletions(-)

-- 
2.31.1.368.gbe11c130af-goog



[PATCH v3 3/9] KVM: x86: Defer tick-based accounting 'til after IRQ handling

2021-04-15 Thread Sean Christopherson
From: Wanpeng Li 

When using tick-based accounting, defer the call to account guest time
until after servicing any IRQ(s) that happened in the guest or
immediately after VM-Exit.  Tick-based accounting of vCPU time relies on
PF_VCPU being set when the tick IRQ handler runs, and IRQs are blocked
throughout {svm,vmx}_vcpu_enter_exit().

This fixes a bug[*] where reported guest time remains '0', even when
running an infinite loop in the guest.

[*] https://bugzilla.kernel.org/show_bug.cgi?id=209831
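
For reference, the tick path only credits guest time when it observes PF_VCPU
at the moment the tick fires.  A simplified sketch of that check, loosely
modeled on the accounting code in kernel/sched/cputime.c (account_tick_sketch()
is illustrative, not an actual kernel function):

        /* Illustrative: how a tick is credited while PF_VCPU is set. */
        static void account_tick_sketch(struct task_struct *p, u64 cputime)
        {
                /*
                 * The real code also requires that the tick did not interrupt
                 * another hardirq before crediting guest time.
                 */
                if (p->flags & PF_VCPU)
                        account_guest_time(p, cputime);
                else
                        account_system_time(p, HARDIRQ_OFFSET, cputime);
        }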

Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs")
Cc: Thomas Gleixner 
Cc: Sean Christopherson 
Cc: Michael Tokarev 
Cc: sta...@vger.kernel.org#v5.9-rc1+
Suggested-by: Thomas Gleixner 
Signed-off-by: Wanpeng Li 
Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/svm/svm.c | 13 ++---
 arch/x86/kvm/vmx/vmx.c | 13 ++---
 arch/x86/kvm/x86.c |  8 
 3 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 48b396f33bee..bb2aa0dde7c5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3750,17 +3750,24 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 * have them in state 'on' as recorded before entering guest mode.
 * Same as enter_from_user_mode().
 *
-* guest_exit_irqoff() restores host context and reinstates RCU if
-* enabled and required.
+* context_tracking_guest_exit_irqoff() restores host context and
+* reinstates RCU if enabled and required.
 *
 * This needs to be done before the below as native_read_msr()
 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
 * into world and some more.
 */
lockdep_hardirqs_off(CALLER_ADDR0);
-   guest_exit_irqoff();
+   context_tracking_guest_exit_irqoff();
 
instrumentation_begin();
+   /*
+* Account guest time when precise accounting based on context tracking
+* is enabled.  Tick based accounting is deferred until after IRQs that
+* occurred within the VM-Enter/VM-Exit "window" are handled.
+*/
+   if (vtime_accounting_enabled_this_cpu())
+   vtime_account_guest_exit();
trace_hardirqs_off_finish();
instrumentation_end();
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c05e6e2854b5..5ae9dc197048 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6639,17 +6639,24 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 * have them in state 'on' as recorded before entering guest mode.
 * Same as enter_from_user_mode().
 *
-* guest_exit_irqoff() restores host context and reinstates RCU if
-* enabled and required.
+* context_tracking_guest_exit_irqoff() restores host context and
+* reinstates RCU if enabled and required.
 *
 * This needs to be done before the below as native_read_msr()
 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
 * into world and some more.
 */
lockdep_hardirqs_off(CALLER_ADDR0);
-   guest_exit_irqoff();
+   context_tracking_guest_exit_irqoff();
 
instrumentation_begin();
+   /*
+* Account guest time when precise accounting based on context tracking
+* is enabled.  Tick based accounting is deferred until after IRQs that
+* occurred within the VM-Enter/VM-Exit "window" are handled.
+*/
+   if (vtime_accounting_enabled_this_cpu())
+   vtime_account_guest_exit();
trace_hardirqs_off_finish();
instrumentation_end();
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 16fb39503296..e4d475df1d4a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9230,6 +9230,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
local_irq_disable();
kvm_after_interrupt(vcpu);
 
+   /*
+* When using tick-based accounting, wait until after servicing IRQs to
+* account guest time so that any ticks that occurred while running the
+* guest are properly accounted to the guest.
+*/
+   if (!vtime_accounting_enabled_this_cpu())
+   vtime_account_guest_exit();
+
if (lapic_in_kernel(vcpu)) {
s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
if (delta != S64_MIN) {
-- 
2.31.1.368.gbe11c130af-goog



Re: [PATCH 3/3] KVM: Add proper lockdep assertion in I/O bus unregister

2021-04-15 Thread Sean Christopherson
On Thu, Apr 15, 2021, Jim Mattson wrote:
> On Mon, Apr 12, 2021 at 3:23 PM Sean Christopherson  wrote:
> >
> > Convert a comment above kvm_io_bus_unregister_dev() into an actual
> > lockdep assertion, and opportunistically add curly braces to a multi-line
> > for-loop.
> >
> > Signed-off-by: Sean Christopherson 
> > ---
> >  virt/kvm/kvm_main.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ab1fa6f92c82..ccc2ef1dbdda 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4485,21 +4485,23 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
> > return 0;
> >  }
> >
> > -/* Caller must hold slots_lock. */
> >  int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
> >   struct kvm_io_device *dev)
> >  {
> > int i, j;
> > struct kvm_io_bus *new_bus, *bus;
> >
> > +   lockdep_assert_held(&kvm->slots_lock);
> > +
> > bus = kvm_get_bus(kvm, bus_idx);
> > if (!bus)
> > return 0;
> >
> > -   for (i = 0; i < bus->dev_count; i++)
> > +   for (i = 0; i < bus->dev_count; i++) {
> > if (bus->range[i].dev == dev) {
> > break;
> > }
> > +   }
> Per coding-style.rst, neither the for loop nor the if-block should have 
> braces.
> 
> "Do not unnecessarily use braces where a single statement will do."

Doh, the if-statement should indeed not use braces.  I think I meant to clean
that up, and then saw something shiny...

But the for-loop... keep reading :-D

Also, use braces when a loop contains more than a single simple statement:

.. code-block:: c

while (condition) {
if (test)
do_something();
}


Re: [PATCH v2 0/3] KVM: Properly account for guest CPU time

2021-04-15 Thread Sean Christopherson
On Thu, Apr 15, 2021, Wanpeng Li wrote:
> On Thu, 15 Apr 2021 at 08:49, Sean Christopherson  wrote:
> >
> > On Wed, Apr 14, 2021, Wanpeng Li wrote:
> > > > On Wed, 14 Apr 2021 at 01:25, Sean Christopherson  wrote:
> > > >
> > > > On Tue, Apr 13, 2021, Wanpeng Li wrote:
> > > > > The bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=209831
> > > > > reported that the guest time remains 0 when running a while true
> > > > > loop in the guest.
> > > > >
> > > > > The commit 87fa7f3e98a131 ("x86/kvm: Move context tracking where it
> > > > > belongs") moves guest_exit_irqoff() close to vmexit breaks the
> > > > > tick-based time accounting when the ticks that happen after IRQs are
> > > > > disabled are incorrectly accounted to the host/system time. This is
> > > > > because we exit the guest state too early.
> > > > >
> > > > > This patchset splits both context tracking logic and the time 
> > > > > accounting
> > > > > logic from guest_enter/exit_irqoff(), keep context tracking around the
> > > > > actual vmentry/exit code, have the virt time specific helpers which
> > > > > can be placed at the proper spots in kvm. In addition, it will not
> > > > > break the world outside of x86.
> > > >
> > > > IMO, this is going in the wrong direction.  Rather than separate 
> > > > context tracking,
> > > > vtime accounting, and KVM logic, this further intertwines the three.  
> > > > E.g. the
> > > > context tracking code has even more vtime accounting NATIVE vs. GEN vs. 
> > > > TICK
> > > > logic baked into it.
> > > >
> > > > Rather than smush everything into context_tracking.h, I think we can 
> > > > cleanly
> > > > split the context tracking and vtime accounting code into separate 
> > > > pieces, which
> > > > will in turn allow moving the wrapping logic to linux/kvm_host.h.  Once 
> > > > that is
> > > > done, splitting the context tracking and time accounting logic for KVM 
> > > > x86
> > > > becomes a KVM detail as opposed to requiring dedicated logic in the 
> > > > context
> > > > tracking code.
> > > >
> > > > I have untested code that compiles on x86, I'll send an RFC shortly.
> > >
> > > We need an easy to backport fix and then we might have some further
> > > cleanups on top.
> >
> > I fiddled with this a bit today, I think I have something workable that 
> > will be
> > a relatively clean and short backport.  With luck, I'll get it posted 
> > tomorrow.
> 
> I think we should improve my posted version instead of posting a lot
> of alternative versions to save everybody's time.

Ya, definitely not looking to throw out more variants.  I'm trying to stack my
cleanups on your code, while also stripping down your patches to the bare 
minimum
to minimize both the backports and the churn across the cleanups.  It looks like
it's going to work?  Fingers crossed :-)


Re: [PATCH v2 0/3] KVM: Properly account for guest CPU time

2021-04-14 Thread Sean Christopherson
On Wed, Apr 14, 2021, Wanpeng Li wrote:
> On Wed, 14 Apr 2021 at 01:25, Sean Christopherson  wrote:
> >
> > On Tue, Apr 13, 2021, Wanpeng Li wrote:
> > > The bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=209831
> > > reported that the guest time remains 0 when running a while true
> > > loop in the guest.
> > >
> > > The commit 87fa7f3e98a131 ("x86/kvm: Move context tracking where it
> > > belongs") moves guest_exit_irqoff() close to vmexit breaks the
> > > tick-based time accounting when the ticks that happen after IRQs are
> > > disabled are incorrectly accounted to the host/system time. This is
> > > because we exit the guest state too early.
> > >
> > > This patchset splits both context tracking logic and the time accounting
> > > logic from guest_enter/exit_irqoff(), keep context tracking around the
> > > actual vmentry/exit code, have the virt time specific helpers which
> > > can be placed at the proper spots in kvm. In addition, it will not
> > > break the world outside of x86.
> >
> > IMO, this is going in the wrong direction.  Rather than separate context 
> > tracking,
> > vtime accounting, and KVM logic, this further intertwines the three.  E.g. 
> > the
> > context tracking code has even more vtime accounting NATIVE vs. GEN vs. TICK
> > logic baked into it.
> >
> > Rather than smush everything into context_tracking.h, I think we can cleanly
> > split the context tracking and vtime accounting code into separate pieces, 
> > which
> > will in turn allow moving the wrapping logic to linux/kvm_host.h.  Once 
> > that is
> > done, splitting the context tracking and time accounting logic for KVM x86
> > becomes a KVM detail as opposed to requiring dedicated logic in the context
> > tracking code.
> >
> > I have untested code that compiles on x86, I'll send an RFC shortly.
> 
> We need an easy to backport fix and then we might have some further
> cleanups on top.

I fiddled with this a bit today, I think I have something workable that will be
a relatively clean and short backport.  With luck, I'll get it posted tomorrow.


Re: [RFC PATCH 0/7] KVM: Fix tick-based vtime accounting on x86

2021-04-14 Thread Sean Christopherson
On Wed, Apr 14, 2021, Thomas Gleixner wrote:
> On Tue, Apr 13 2021 at 11:29, Sean Christopherson wrote:
> > This is an alternative to Wanpeng's series[*] to fix tick-based accounting
> > on x86.  The approach for fixing the bug is identical: defer accounting
> > until after tick IRQs are handled.  The difference is purely in how the
> > context tracking and vtime code is refactored in order to give KVM the
> > hooks it needs to fix the x86 bug.
> >
> > x86 compile tested only, hence the RFC.  If folks like the direction and
> > there are no unsolvable issues, I'll cross-compile, properly test on x86,
> > and post an "official" series.
> 
> I like the final outcome of this, but we really want a small set of
> patches first which actually fix the bug and is easy to backport and
> then the larger consolidation on top.
> 
> Can you sort that out with Wanpeng please?

Will do.


Re: [PATCH v8] x86/sgx: Maintain encl->refcount for each encl->mm_list entry

2021-04-14 Thread Sean Christopherson
On Tue, Apr 13, 2021, Haitao Huang wrote:
> On Sun, 07 Feb 2021 16:14:01 -0600, Jarkko Sakkinen  wrote:
> 
> > This has been shown in tests:
> > 
> > [  +0.08] WARNING: CPU: 3 PID: 7620 at kernel/rcu/srcutree.c:374
> > cleanup_srcu_struct+0xed/0x100
> > 
> > This is essentially a use-after free, although SRCU notices it as
> > an SRCU cleanup in an invalid context.
> > 
> The comments in code around this warning indicate a potential memory leak.
> Not sure how use-after-free comes into play. Anyway, this fix seems to work
> for the warning above.
> 
> However, I still have doubts on another potential race. See below.
> 
> 
> > diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
> > index f2eac41bb4ff..8ce6d8371cfb 100644
> > --- a/arch/x86/kernel/cpu/sgx/driver.c
> > +++ b/arch/x86/kernel/cpu/sgx/driver.c
> > @@ -72,6 +72,9 @@ static int sgx_release(struct inode *inode, struct file *file)
> > synchronize_srcu(&encl->srcu);
> > mmu_notifier_unregister(&encl_mm->mmu_notifier, encl_mm->mm);
> > kfree(encl_mm);
> 
> Note here you are freeing the encl_mm, outside protection of encl->refcount.
> 
> > +
> > +   /* 'encl_mm' is gone, put encl_mm->encl reference: */
> > +   kref_put(&encl->refcount, sgx_encl_release);
> > }
> > kref_put(&encl->refcount, sgx_encl_release);
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 20a2dd5ba2b4..7449ef33f081 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -473,6 +473,9 @@ static void sgx_mmu_notifier_free(struct mmu_notifier *mn)
> >  {
> > struct sgx_encl_mm *encl_mm = container_of(mn, struct sgx_encl_mm,
> > mmu_notifier);
> > +   /* 'encl_mm' is going away, put encl_mm->encl reference: */
> > +   kref_put(&encl_mm->encl->refcount, sgx_encl_release);
> > +
> > kfree(encl_mm);
> 
> Could this access to and kfree of encl_mm possibly be after the
> kfree(encl_mm) noted above?

No, the mmu_notifier_unregister() ensures that all in-progress notifiers 
complete
before it returns, i.e. SGX's notifier call back is not reachable after it's
unregistered.

> Also is there a reason we do kfree(encl_mm) in notifier_free not directly in
> notifier_release?

Because encl_mm is the anchor to the enclave reference

/* 'encl_mm' is going away, put encl_mm->encl reference: */
kref_put(&encl_mm->encl->refcount, sgx_encl_release);

as well as the mmu notifier reference (the mmu_notifier_put(mn) call chain).
Freeing encl_mm immediately would prevent sgx_mmu_notifier_free() from dropping
the enclave reference.  And the mmu notifier reference needs to be dropped in
sgx_mmu_notifier_release() because the encl_mm has been taken off encl->mm_list.
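
If it helps, the ordering constraint can be reduced to the following generic
kref pattern (sketch only, assuming the usual kref/slab includes, i.e.
linux/kref.h and linux/slab.h; 'parent', 'child' and parent_release() are
made-up names, not SGX code):

        struct parent {
                struct kref refcount;
        };

        static void parent_release(struct kref *kref)
        {
                kfree(container_of(kref, struct parent, refcount));
        }

        struct child {
                struct parent *parent;  /* child anchors a reference on parent */
        };

        static void child_free(struct child *c)
        {
                /* Drop the anchored parent reference first ... */
                kref_put(&c->parent->refcount, parent_release);
                /* ... then it is safe to free the anchor itself. */
                kfree(c);
        }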


[RFC PATCH 7/7] KVM: x86: Defer tick-based accounting 'til after IRQ handling

2021-04-13 Thread Sean Christopherson
When using tick-based accounting, defer the call to account guest time
until after servicing any IRQ(s) that happened in the guest (or
immediately after VM-Exit).  With tick-based accounting, time is
accounted to the guest only if PF_VCPU is set when the tick IRQ handler
runs.  The current approach of unconditionally accounting time in
kvm_guest_exit_irqoff() prevents IRQs that occur in the guest from ever
being processed with PF_VCPU set, since PF_VCPU ends up being set only
during the relatively short VM-Enter sequence, which runs entirely with
IRQs disabled.

Fixes: 87fa7f3e98a131 ("x86/kvm: Move context tracking where it belongs")
Cc: Thomas Gleixner 
Cc: Michael Tokarev 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/x86.c | 8 
 arch/x86/kvm/x86.h | 9 ++---
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 16fb39503296..096bbf50b7a9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9230,6 +9230,14 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
local_irq_disable();
kvm_after_interrupt(vcpu);
 
+   /*
+* When using tick-based accounting, wait until after servicing IRQs to
+* account guest time so that any ticks that occurred while running the
+* guest are properly accounted to the guest.
+*/
+   if (!IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_GEN))
+   kvm_vtime_account_guest_exit();
+
if (lapic_in_kernel(vcpu)) {
s64 delta = vcpu->arch.apic->lapic_timer.advance_expire_delta;
if (delta != S64_MIN) {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 74ef92f47db8..039a7d585925 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -38,15 +38,18 @@ static __always_inline void kvm_guest_exit_irqoff(void)
 * have them in state 'on' as recorded before entering guest mode.
 * Same as enter_from_user_mode().
 *
-* guest_exit_irqoff() restores host context and reinstates RCU if
-* enabled and required.
+* context_tracking_guest_exit_irqoff() restores host context and
+* reinstates RCU if enabled and required.
 *
 * This needs to be done before the below as native_read_msr()
 * contains a tracepoint and x86_spec_ctrl_restore_host() calls
 * into world and some more.
 */
lockdep_hardirqs_off(CALLER_ADDR0);
-   guest_exit_irqoff();
+   context_tracking_guest_exit_irqoff();
+
+   if (IS_ENABLED(CONFIG_VIRT_CPU_ACCOUNTING_GEN))
+   kvm_vtime_account_guest_exit();
 
instrumentation_begin();
trace_hardirqs_off_finish();
-- 
2.31.1.295.g9ea45b61b8-goog



[RFC PATCH 6/7] KVM: x86: Consolidate guest enter/exit logic to common helpers

2021-04-13 Thread Sean Christopherson
Move the enter/exit logic in {svm,vmx}_vcpu_enter_exit() to common
helpers.  In addition to deduplicating code, this will allow tweaking the
vtime accounting in the VM-Exit path without splitting logic across x86,
VMX, and SVM.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/svm/svm.c | 39 ++--
 arch/x86/kvm/vmx/vmx.c | 39 ++--
 arch/x86/kvm/x86.h | 45 ++
 3 files changed, 49 insertions(+), 74 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 48b396f33bee..0677595d07e5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3713,25 +3713,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
-   /*
-* VMENTER enables interrupts (host state), but the kernel state is
-* interrupts disabled when this is invoked. Also tell RCU about
-* it. This is the same logic as for exit_to_user_mode().
-*
-* This ensures that e.g. latency analysis on the host observes
-* guest mode as interrupt enabled.
-*
-* guest_enter_irqoff() informs context tracking about the
-* transition to guest mode and if enabled adjusts RCU state
-* accordingly.
-*/
-   instrumentation_begin();
-   trace_hardirqs_on_prepare();
-   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-   instrumentation_end();
-
-   guest_enter_irqoff();
-   lockdep_hardirqs_on(CALLER_ADDR0);
+   kvm_guest_enter_irqoff();
 
if (sev_es_guest(vcpu->kvm)) {
__svm_sev_es_vcpu_run(svm->vmcb_pa);
@@ -3745,24 +3727,7 @@ static noinstr void svm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
vmload(__sme_page_pa(sd->save_area));
}
 
-   /*
-* VMEXIT disables interrupts (host state), but tracing and lockdep
-* have them in state 'on' as recorded before entering guest mode.
-* Same as enter_from_user_mode().
-*
-* guest_exit_irqoff() restores host context and reinstates RCU if
-* enabled and required.
-*
-* This needs to be done before the below as native_read_msr()
-* contains a tracepoint and x86_spec_ctrl_restore_host() calls
-* into world and some more.
-*/
-   lockdep_hardirqs_off(CALLER_ADDR0);
-   guest_exit_irqoff();
-
-   instrumentation_begin();
-   trace_hardirqs_off_finish();
-   instrumentation_end();
+   kvm_guest_exit_irqoff();
 }
 
 static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c05e6e2854b5..19b0e25bf598 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6600,25 +6600,7 @@ static fastpath_t vmx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
 static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
struct vcpu_vmx *vmx)
 {
-   /*
-* VMENTER enables interrupts (host state), but the kernel state is
-* interrupts disabled when this is invoked. Also tell RCU about
-* it. This is the same logic as for exit_to_user_mode().
-*
-* This ensures that e.g. latency analysis on the host observes
-* guest mode as interrupt enabled.
-*
-* guest_enter_irqoff() informs context tracking about the
-* transition to guest mode and if enabled adjusts RCU state
-* accordingly.
-*/
-   instrumentation_begin();
-   trace_hardirqs_on_prepare();
-   lockdep_hardirqs_on_prepare(CALLER_ADDR0);
-   instrumentation_end();
-
-   guest_enter_irqoff();
-   lockdep_hardirqs_on(CALLER_ADDR0);
+   kvm_guest_enter_irqoff();
 
/* L1D Flush includes CPU buffer clear to mitigate MDS */
if (static_branch_unlikely(_l1d_should_flush))
@@ -6634,24 +6616,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 
vcpu->arch.cr2 = native_read_cr2();
 
-   /*
-* VMEXIT disables interrupts (host state), but tracing and lockdep
-* have them in state 'on' as recorded before entering guest mode.
-* Same as enter_from_user_mode().
-*
-* guest_exit_irqoff() restores host context and reinstates RCU if
-* enabled and required.
-*
-* This needs to be done before the below as native_read_msr()
-* contains a tracepoint and x86_spec_ctrl_restore_host() calls
-* into world and some more.
-*/
-   lockdep_hardirqs_off(CALLER_ADDR0);
-   guest_exit_irqoff();
-
-   instrumentation_begin();
-   trace_hardirqs_off_finish();
-   instrumentation_end();
+   kvm_guest_exit_irqoff();
 }
 
 static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm

[RFC PATCH 5/7] KVM: Move vtime accounting of guest exit to separate helper

2021-04-13 Thread Sean Christopherson
Provide a standalone helper for guest exit vtime accounting so that x86
can defer tick-based accounting until the appropriate time, while still
updating context tracking immediately after VM-Exit.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 include/linux/kvm_host.h | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 444d5f0225cb..20604bfae5a8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -358,16 +358,21 @@ static __always_inline void guest_enter_irqoff(void)
}
 }
 
-static __always_inline void guest_exit_irqoff(void)
+static __always_inline void kvm_vtime_account_guest_exit(void)
 {
-   context_tracking_guest_exit_irqoff();
-
instrumentation_begin();
/* Flush the guest cputime we spent on the guest */
vtime_account_guest_exit();
instrumentation_end();
 }
 
+static __always_inline void guest_exit_irqoff(void)
+{
+   context_tracking_guest_exit_irqoff();
+
+   kvm_vtime_account_guest_exit();
+}
+
 static inline void guest_exit(void)
 {
unsigned long flags;
-- 
2.31.1.295.g9ea45b61b8-goog



[RFC PATCH 4/7] context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain

2021-04-13 Thread Sean Christopherson
Move the guest enter/exit wrappers to kvm_host.h so that KVM can manage
its context tracking vs. vtime accounting without bleeding too many KVM
details into the context tracking code.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 include/linux/context_tracking.h | 45 
 include/linux/kvm_host.h | 45 
 2 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index ded56aed539a..2f4538380a8d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -126,49 +126,4 @@ extern void context_tracking_init(void);
 static inline void context_tracking_init(void) { }
 #endif /* CONFIG_CONTEXT_TRACKING_FORCE */
 
-/* must be called with irqs disabled */
-static __always_inline void guest_enter_irqoff(void)
-{
-   /*
-* This is running in ioctl context so its safe to assume that it's the
-* stime pending cputime to flush.
-*/
-   instrumentation_begin();
-   vtime_account_guest_enter();
-   instrumentation_end();
-
-   /*
-* KVM does not hold any references to rcu protected data when it
-* switches CPU into a guest mode. In fact switching to a guest mode
-* is very similar to exiting to userspace from rcu point of view. In
-* addition CPU may stay in a guest mode for quite a long time (up to
-* one time slice). Lets treat guest mode as quiescent state, just like
-* we do with user-mode execution.
-*/
-   if (!context_tracking_guest_enter_irqoff()) {
-   instrumentation_begin();
-   rcu_virt_note_context_switch(smp_processor_id());
-   instrumentation_end();
-   }
-}
-
-static __always_inline void guest_exit_irqoff(void)
-{
-   context_tracking_guest_exit_irqoff();
-
-   instrumentation_begin();
-   /* Flush the guest cputime we spent on the guest */
-   vtime_account_guest_exit();
-   instrumentation_end();
-}
-
-static inline void guest_exit(void)
-{
-   unsigned long flags;
-
-   local_irq_save(flags);
-   guest_exit_irqoff();
-   local_irq_restore(flags);
-}
-
 #endif
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3b06d12ec37e..444d5f0225cb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -332,6 +332,51 @@ struct kvm_vcpu {
struct kvm_dirty_ring dirty_ring;
 };
 
+/* must be called with irqs disabled */
+static __always_inline void guest_enter_irqoff(void)
+{
+   /*
+* This is running in ioctl context so its safe to assume that it's the
+* stime pending cputime to flush.
+*/
+   instrumentation_begin();
+   vtime_account_guest_enter();
+   instrumentation_end();
+
+   /*
+* KVM does not hold any references to rcu protected data when it
+* switches CPU into a guest mode. In fact switching to a guest mode
+* is very similar to exiting to userspace from rcu point of view. In
+* addition CPU may stay in a guest mode for quite a long time (up to
+* one time slice). Lets treat guest mode as quiescent state, just like
+* we do with user-mode execution.
+*/
+   if (!context_tracking_guest_enter_irqoff()) {
+   instrumentation_begin();
+   rcu_virt_note_context_switch(smp_processor_id());
+   instrumentation_end();
+   }
+}
+
+static __always_inline void guest_exit_irqoff(void)
+{
+   context_tracking_guest_exit_irqoff();
+
+   instrumentation_begin();
+   /* Flush the guest cputime we spent on the guest */
+   vtime_account_guest_exit();
+   instrumentation_end();
+}
+
+static inline void guest_exit(void)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   guest_exit_irqoff();
+   local_irq_restore(flags);
+}
+
 static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 {
/*
-- 
2.31.1.295.g9ea45b61b8-goog



[RFC PATCH 2/7] context_tracking: Move guest enter/exit logic to standalone helpers

2021-04-13 Thread Sean Christopherson
Move guest enter/exit context tracking to standalone helpers, so that the
existing wrappers can be moved under KVM.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 include/linux/context_tracking.h | 43 +++-
 1 file changed, 26 insertions(+), 17 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 58f9a7251d3b..89a1a5ccb2ab 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -71,6 +71,30 @@ static inline void exception_exit(enum ctx_state prev_ctx)
}
 }
 
+static __always_inline void context_tracking_guest_enter_irqoff(void)
+{
+   if (context_tracking_enabled())
+   __context_tracking_enter(CONTEXT_GUEST);
+
+   /* KVM does not hold any references to rcu protected data when it
+* switches CPU into a guest mode. In fact switching to a guest mode
+* is very similar to exiting to userspace from rcu point of view. In
+* addition CPU may stay in a guest mode for quite a long time (up to
+* one time slice). Lets treat guest mode as quiescent state, just like
+* we do with user-mode execution.
+*/
+   if (!context_tracking_enabled_this_cpu()) {
+   instrumentation_begin();
+   rcu_virt_note_context_switch(smp_processor_id());
+   instrumentation_end();
+   }
+}
+
+static __always_inline void context_tracking_guest_exit_irqoff(void)
+{
+   if (context_tracking_enabled())
+   __context_tracking_exit(CONTEXT_GUEST);
+}
 
 /**
  * ct_state() - return the current context tracking state if known
@@ -110,27 +134,12 @@ static __always_inline void guest_enter_irqoff(void)
vtime_account_guest_enter();
instrumentation_end();
 
-   if (context_tracking_enabled())
-   __context_tracking_enter(CONTEXT_GUEST);
-
-   /* KVM does not hold any references to rcu protected data when it
-* switches CPU into a guest mode. In fact switching to a guest mode
-* is very similar to exiting to userspace from rcu point of view. In
-* addition CPU may stay in a guest mode for quite a long time (up to
-* one time slice). Lets treat guest mode as quiescent state, just like
-* we do with user-mode execution.
-*/
-   if (!context_tracking_enabled_this_cpu()) {
-   instrumentation_begin();
-   rcu_virt_note_context_switch(smp_processor_id());
-   instrumentation_end();
-   }
+   context_tracking_guest_enter_irqoff();
 }
 
 static __always_inline void guest_exit_irqoff(void)
 {
-   if (context_tracking_enabled())
-   __context_tracking_exit(CONTEXT_GUEST);
+   context_tracking_guest_exit_irqoff();
 
instrumentation_begin();
vtime_account_guest_exit();
-- 
2.31.1.295.g9ea45b61b8-goog



[RFC PATCH 3/7] context_tracking: Consolidate guest enter/exit wrappers

2021-04-13 Thread Sean Christopherson
Consolidate the guest enter/exit wrappers by providing stubs for the
context tracking helpers as necessary.  This will allow moving the
wrappers under KVM without having to bleed too many #ifdefs into the
soon-to-be KVM code.

No functional change intended.

Signed-off-by: Sean Christopherson 
---
 include/linux/context_tracking.h | 65 ++--
 1 file changed, 29 insertions(+), 36 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 89a1a5ccb2ab..ded56aed539a 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -76,18 +76,7 @@ static __always_inline void context_tracking_guest_exit_irqoff(void)
if (context_tracking_enabled())
__context_tracking_enter(CONTEXT_GUEST);
 
-   /* KVM does not hold any references to rcu protected data when it
-* switches CPU into a guest mode. In fact switching to a guest mode
-* is very similar to exiting to userspace from rcu point of view. In
-* addition CPU may stay in a guest mode for quite a long time (up to
-* one time slice). Lets treat guest mode as quiescent state, just like
-* we do with user-mode execution.
-*/
-   if (!context_tracking_enabled_this_cpu()) {
-   instrumentation_begin();
-   rcu_virt_note_context_switch(smp_processor_id());
-   instrumentation_end();
-   }
+   return context_tracking_enabled_this_cpu();
 }
 
 static __always_inline void context_tracking_guest_exit_irqoff(void)
@@ -116,6 +105,17 @@ static inline void user_exit_irqoff(void) { }
 static inline enum ctx_state exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
+
+static __always_inline bool context_tracking_guest_enter_irqoff(void)
+{
+   return false;
+}
+
+static __always_inline void context_tracking_guest_exit_irqoff(void)
+{
+
+}
+
 #endif /* !CONFIG_CONTEXT_TRACKING */
 
 #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
@@ -126,48 +126,41 @@ extern void context_tracking_init(void);
 static inline void context_tracking_init(void) { }
 #endif /* CONFIG_CONTEXT_TRACKING_FORCE */
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 /* must be called with irqs disabled */
 static __always_inline void guest_enter_irqoff(void)
 {
+   /*
+* This is running in ioctl context so its safe to assume that it's the
+* stime pending cputime to flush.
+*/
instrumentation_begin();
vtime_account_guest_enter();
instrumentation_end();
 
-   context_tracking_guest_enter_irqoff();
+   /*
+* KVM does not hold any references to rcu protected data when it
+* switches CPU into a guest mode. In fact switching to a guest mode
+* is very similar to exiting to userspace from rcu point of view. In
+* addition CPU may stay in a guest mode for quite a long time (up to
+* one time slice). Lets treat guest mode as quiescent state, just like
+* we do with user-mode execution.
+*/
+   if (!context_tracking_guest_enter_irqoff()) {
+   instrumentation_begin();
+   rcu_virt_note_context_switch(smp_processor_id());
+   instrumentation_end();
+   }
 }
 
 static __always_inline void guest_exit_irqoff(void)
 {
context_tracking_guest_exit_irqoff();
 
-   instrumentation_begin();
-   vtime_account_guest_exit();
-   instrumentation_end();
-}
-
-#else
-static __always_inline void guest_enter_irqoff(void)
-{
-   /*
-* This is running in ioctl context so its safe
-* to assume that it's the stime pending cputime
-* to flush.
-*/
-   instrumentation_begin();
-   vtime_account_guest_enter();
-   rcu_virt_note_context_switch(smp_processor_id());
-   instrumentation_end();
-}
-
-static __always_inline void guest_exit_irqoff(void)
-{
instrumentation_begin();
/* Flush the guest cputime we spent on the guest */
vtime_account_guest_exit();
instrumentation_end();
 }
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
 
 static inline void guest_exit(void)
 {
-- 
2.31.1.295.g9ea45b61b8-goog



[RFC PATCH 1/7] sched/vtime: Move guest enter/exit vtime accounting to separate helpers

2021-04-13 Thread Sean Christopherson
Provide separate helpers for guest enter/exit vtime accounting instead of
open coding the logic within the context tracking code.  This will allow
KVM x86 to handle vtime accounting slightly differently when using tick-
based accounting.

Opportunistically delete the vtime_account_kernel() stub now that all
callers are wrapped with CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y.

No functional change intended.

Suggested-by: Thomas Gleixner 
Cc: Christian Borntraeger 
Signed-off-by: Sean Christopherson 
---
 include/linux/context_tracking.h | 17 +++-
 include/linux/vtime.h| 45 +---
 2 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index bceb06498521..58f9a7251d3b 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -102,16 +102,12 @@ extern void context_tracking_init(void);
 static inline void context_tracking_init(void) { }
 #endif /* CONFIG_CONTEXT_TRACKING_FORCE */
 
-
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 /* must be called with irqs disabled */
 static __always_inline void guest_enter_irqoff(void)
 {
instrumentation_begin();
-   if (vtime_accounting_enabled_this_cpu())
-   vtime_guest_enter(current);
-   else
-   current->flags |= PF_VCPU;
+   vtime_account_guest_enter();
instrumentation_end();
 
if (context_tracking_enabled())
@@ -137,10 +133,7 @@ static __always_inline void guest_exit_irqoff(void)
__context_tracking_exit(CONTEXT_GUEST);
 
instrumentation_begin();
-   if (vtime_accounting_enabled_this_cpu())
-   vtime_guest_exit(current);
-   else
-   current->flags &= ~PF_VCPU;
+   vtime_account_guest_exit();
instrumentation_end();
 }
 
@@ -153,8 +146,7 @@ static __always_inline void guest_enter_irqoff(void)
 * to flush.
 */
instrumentation_begin();
-   vtime_account_kernel(current);
-   current->flags |= PF_VCPU;
+   vtime_account_guest_enter();
rcu_virt_note_context_switch(smp_processor_id());
instrumentation_end();
 }
@@ -163,8 +155,7 @@ static __always_inline void guest_exit_irqoff(void)
 {
instrumentation_begin();
/* Flush the guest cputime we spent on the guest */
-   vtime_account_kernel(current);
-   current->flags &= ~PF_VCPU;
+   vtime_account_guest_exit();
instrumentation_end();
 }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 041d6524d144..f30b472a2201 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -3,6 +3,8 @@
 #define _LINUX_KERNEL_VTIME_H
 
 #include 
+#include 
+
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 #include 
 #endif
@@ -18,6 +20,17 @@ struct task_struct;
 static inline bool vtime_accounting_enabled_this_cpu(void) { return true; }
 extern void vtime_task_switch(struct task_struct *prev);
 
+static __always_inline void vtime_account_guest_enter(void)
+{
+   vtime_account_kernel(current);
+   current->flags |= PF_VCPU;
+}
+
+static __always_inline void vtime_account_guest_exit(void)
+{
+
+}
+
 #elif defined(CONFIG_VIRT_CPU_ACCOUNTING_GEN)
 
 /*
@@ -49,12 +62,38 @@ static inline void vtime_task_switch(struct task_struct *prev)
vtime_task_switch_generic(prev);
 }
 
+static __always_inline void vtime_account_guest_enter(void)
+{
+   if (vtime_accounting_enabled_this_cpu())
+   vtime_guest_enter(current);
+   else
+   current->flags |= PF_VCPU;
+}
+
+static __always_inline void vtime_account_guest_exit(void)
+{
+   if (vtime_accounting_enabled_this_cpu())
+   vtime_guest_exit(current);
+   else
+   current->flags &= ~PF_VCPU;
+}
+
+
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
-static inline bool vtime_accounting_enabled_cpu(int cpu) {return false; }
 static inline bool vtime_accounting_enabled_this_cpu(void) { return false; }
 static inline void vtime_task_switch(struct task_struct *prev) { }
 
+static __always_inline void vtime_account_guest_enter(void)
+{
+   current->flags |= PF_VCPU;
+}
+
+static __always_inline void vtime_account_guest_exit(void)
+{
+   current->flags &= ~PF_VCPU;
+}
+
 #endif
 
 /*
@@ -63,9 +102,7 @@ static inline void vtime_task_switch(struct task_struct *prev) { }
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
 extern void vtime_account_kernel(struct task_struct *tsk);
 extern void vtime_account_idle(struct task_struct *tsk);
-#else /* !CONFIG_VIRT_CPU_ACCOUNTING */
-static inline void vtime_account_kernel(struct task_struct *tsk) { }
-#endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
+#endif /* CONFIG_VIRT_CPU_ACCOUNTING */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 extern void arch_vtime_task_switch(struct task_struct *tsk);
-- 
2.31.1.295.g9ea45b61b8-goog



[RFC PATCH 0/7] KVM: Fix tick-based vtime accounting on x86

2021-04-13 Thread Sean Christopherson
This is an alternative to Wanpeng's series[*] to fix tick-based accounting
on x86.  The approach for fixing the bug is identical: defer accounting
until after tick IRQs are handled.  The difference is purely in how the
context tracking and vtime code is refactored in order to give KVM the
hooks it needs to fix the x86 bug.

x86 compile tested only, hence the RFC.  If folks like the direction and
there are no unsolvable issues, I'll cross-compile, properly test on x86,
and post an "official" series.

Sean Christopherson (7):
  sched/vtime: Move guest enter/exit vtime accounting to separate
helpers
  context_tracking: Move guest enter/exit logic to standalone helpers
  context_tracking: Consolidate guest enter/exit wrappers
  context_tracking: KVM: Move guest enter/exit wrappers to KVM's domain
  KVM: Move vtime accounting of guest exit to separate helper
  KVM: x86: Consolidate guest enter/exit logic to common helpers
  KVM: x86: Defer tick-based accounting 'til after IRQ handling

 arch/x86/kvm/svm/svm.c   |  39 +---
 arch/x86/kvm/vmx/vmx.c   |  39 +---
 arch/x86/kvm/x86.c   |   8 +++
 arch/x86/kvm/x86.h   |  48 +++
 include/linux/context_tracking.h | 100 ---
 include/linux/kvm_host.h |  50 
 include/linux/vtime.h|  45 --
 7 files changed, 175 insertions(+), 154 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH v2 0/3] KVM: Properly account for guest CPU time

2021-04-13 Thread Sean Christopherson
On Tue, Apr 13, 2021, Wanpeng Li wrote:
> The bugzilla https://bugzilla.kernel.org/show_bug.cgi?id=209831
> reported that the guest time remains 0 when running a while true
> loop in the guest.
> 
> The commit 87fa7f3e98a131 ("x86/kvm: Move context tracking where it
> belongs"), which moves guest_exit_irqoff() close to vmexit, breaks the
> tick-based time accounting: the ticks that happen after IRQs are
> disabled are incorrectly accounted to the host/system time. This is
> because we exit the guest state too early.
> 
> This patchset splits both context tracking logic and the time accounting 
> logic from guest_enter/exit_irqoff(), keep context tracking around the 
> actual vmentry/exit code, have the virt time specific helpers which 
> can be placed at the proper spots in kvm. In addition, it will not 
> break the world outside of x86.

IMO, this is going in the wrong direction.  Rather than separate context 
tracking,
vtime accounting, and KVM logic, this further intertwines the three.  E.g. the
context tracking code has even more vtime accounting NATIVE vs. GEN vs. TICK
logic baked into it.

Rather than smush everything into context_tracking.h, I think we can cleanly
split the context tracking and vtime accounting code into separate pieces, which
will in turn allow moving the wrapping logic to linux/kvm_host.h.  Once that is
done, splitting the context tracking and time accounting logic for KVM x86
becomes a KVM detail as opposed to requiring dedicated logic in the context
tracking code.
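
Concretely, the wrappers that end up in linux/kvm_host.h would look something
like the below (a sketch only, assuming the context tracking and vtime pieces
have been split into standalone helpers as described above; the helper names
are approximate):

static __always_inline void guest_enter_irqoff(void)
{
	/* Flush pending host time, then start charging time to the guest. */
	vtime_account_guest_enter();

	/*
	 * Enter guest context for RCU purposes; KVM must not touch
	 * RCU-protected data until the corresponding guest_exit_irqoff().
	 */
	context_tracking_guest_enter();
}

static __always_inline void guest_exit_irqoff(void)
{
	context_tracking_guest_exit();
	vtime_account_guest_exit();
}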

I have untested code that compiles on x86, I'll send an RFC shortly.

> v1 -> v2:
>  * split context_tracking from guest_enter/exit_irqoff
>  * provide separate vtime accounting functions for consistent
>  * place the virt time specific helpers at the proper spot
> 
> Suggested-by: Thomas Gleixner 
> Cc: Thomas Gleixner 
> Cc: Sean Christopherson 
> Cc: Michael Tokarev 
> 
> Wanpeng Li (3):
>   context_tracking: Split guest_enter/exit_irqoff
>   context_tracking: Provide separate vtime accounting functions
>   x86/kvm: Fix vtime accounting
> 
>  arch/x86/kvm/svm/svm.c   |  6 ++-
>  arch/x86/kvm/vmx/vmx.c   |  6 ++-
>  arch/x86/kvm/x86.c   |  1 +
>  include/linux/context_tracking.h | 84 
> +++-
>  4 files changed, 74 insertions(+), 23 deletions(-)
> 
> -- 
> 2.7.4
> 


[PATCH 3/3] KVM: Add proper lockdep assertion in I/O bus unregister

2021-04-12 Thread Sean Christopherson
Convert a comment above kvm_io_bus_unregister_dev() into an actual
lockdep assertion, and opportunistically add curly braces to a multi-line
for-loop.

Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ab1fa6f92c82..ccc2ef1dbdda 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4485,21 +4485,23 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum 
kvm_bus bus_idx, gpa_t addr,
return 0;
 }
 
-/* Caller must hold slots_lock. */
 int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
  struct kvm_io_device *dev)
 {
int i, j;
struct kvm_io_bus *new_bus, *bus;
 
+   lockdep_assert_held(&kvm->slots_lock);
+
bus = kvm_get_bus(kvm, bus_idx);
if (!bus)
return 0;
 
-   for (i = 0; i < bus->dev_count; i++)
+   for (i = 0; i < bus->dev_count; i++) {
if (bus->range[i].dev == dev) {
break;
}
+   }
 
if (i == bus->dev_count)
return 0;
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH 2/3] KVM: Stop looking for coalesced MMIO zones if the bus is destroyed

2021-04-12 Thread Sean Christopherson
Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev()
fails to allocate memory for the new instance of the bus.  If it can't
instantiate a new bus, unregister_dev() destroys all devices _except_ the
target device.   But, it doesn't tell the caller that it obliterated the
bus and invoked the destructor for all devices that were on the bus.  In
the coalesced MMIO case, this can result in a deleted list entry
dereference due to attempting to continue iterating on coalesced_zones
after future entries (in the walk) have been deleted.

Opportunistically add curly braces to the for-loop, which encompasses
many lines but sneaks by without braces due to the guts being a single
if statement.

Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
Cc: sta...@vger.kernel.org
Reported-by: Hao Sun 
Signed-off-by: Sean Christopherson 
---
 include/linux/kvm_host.h  |  4 ++--
 virt/kvm/coalesced_mmio.c | 19 +--
 virt/kvm/kvm_main.c   | 10 +-
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1b65e7204344..99dccea4293c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -192,8 +192,8 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus 
bus_idx, gpa_t addr,
int len, void *val);
 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
int len, struct kvm_io_device *dev);
-void kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
-  struct kvm_io_device *dev);
+int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
+ struct kvm_io_device *dev);
 struct kvm_io_device *kvm_io_bus_get_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 gpa_t addr);
 
diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
index 62bd908ecd58..f08f5e82460b 100644
--- a/virt/kvm/coalesced_mmio.c
+++ b/virt/kvm/coalesced_mmio.c
@@ -174,21 +174,36 @@ int kvm_vm_ioctl_unregister_coalesced_mmio(struct kvm 
*kvm,
   struct kvm_coalesced_mmio_zone *zone)
 {
struct kvm_coalesced_mmio_dev *dev, *tmp;
+   int r;
 
if (zone->pio != 1 && zone->pio != 0)
return -EINVAL;
 
mutex_lock(&kvm->slots_lock);
 
-   list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list)
+   list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
if (zone->pio == dev->zone.pio &&
coalesced_mmio_in_range(dev, zone->addr, zone->size)) {
-   kvm_io_bus_unregister_dev(kvm,
+   r = kvm_io_bus_unregister_dev(kvm,
zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS, 
&dev->dev);
kvm_iodevice_destructor(&dev->dev);
+
+   /*
+* On failure, unregister destroys all devices on the
+* bus _except_ the target device, i.e. coalesced_zones
+* has been modified.  No need to restart the walk as
+* there aren't any zones left.
+*/
+   if (r)
+   break;
}
+   }
 
mutex_unlock(&kvm->slots_lock);
 
+   /*
+* Ignore the result of kvm_io_bus_unregister_dev(), from userspace's
+* perspective, the coalesced MMIO is most definitely unregistered.
+*/
return 0;
 }
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d6e2b570e430..ab1fa6f92c82 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4486,15 +4486,15 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum 
kvm_bus bus_idx, gpa_t addr,
 }
 
 /* Caller must hold slots_lock. */
-void kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
-  struct kvm_io_device *dev)
+int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
+ struct kvm_io_device *dev)
 {
int i, j;
struct kvm_io_bus *new_bus, *bus;
 
bus = kvm_get_bus(kvm, bus_idx);
if (!bus)
-   return;
+   return 0;
 
for (i = 0; i < bus->dev_count; i++)
if (bus->range[i].dev == dev) {
@@ -4502,7 +4502,7 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
}
 
if (i == bus->dev_count)
-   return;
+   return 0;
 
new_bus = kmalloc(struct_size(bus, range, bus->dev_count - 1),
  GFP_KERNEL_ACCOUNT);
@@ -4527,7 +4527,7 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
}
 
kfree(bus);
-   return;
+   return new_bus ? 0 : -ENOMEM;

[PATCH 1/3] KVM: Destroy I/O bus devices on unregister failure _after_ sync'ing SRCU

2021-04-12 Thread Sean Christopherson
If allocating a new instance of an I/O bus fails when unregistering a
device, wait to destroy the device until after all readers are guaranteed
to see the new null bus.  Destroying devices before the bus is nullified
could lead to use-after-free since readers expect the devices on their
reference of the bus to remain valid.

Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
Cc: sta...@vger.kernel.org
Signed-off-by: Sean Christopherson 
---
 virt/kvm/kvm_main.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 383df23514b9..d6e2b570e430 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4511,7 +4511,13 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
new_bus->dev_count--;
memcpy(new_bus->range + i, bus->range + i + 1,
flex_array_size(new_bus, range, 
new_bus->dev_count - i));
-   } else {
+   }
+
+   rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
+   synchronize_srcu_expedited(&kvm->srcu);
+
+   /* Destroy the old bus _after_ installing the (null) bus. */
+   if (!new_bus) {
pr_err("kvm: failed to shrink bus, removing it completely\n");
for (j = 0; j < bus->dev_count; j++) {
if (j == i)
@@ -4520,8 +4526,6 @@ void kvm_io_bus_unregister_dev(struct kvm *kvm, enum 
kvm_bus bus_idx,
}
}
 
-   rcu_assign_pointer(kvm->buses[bus_idx], new_bus);
-   synchronize_srcu_expedited(&kvm->srcu);
kfree(bus);
return;
 }
-- 
2.31.1.295.g9ea45b61b8-goog



[PATCH 0/3] KVM: Fixes and a cleanup for coalesced MMIO

2021-04-12 Thread Sean Christopherson
Fix two bugs that are exposed if unregistering a device on an I/O bus fails
due to OOM.  Tack on an opportunistic cleanup in the related code.
 
Sean Christopherson (3):
  KVM: Destroy I/O bus devices on unregister failure _after_ sync'ing
SRCU
  KVM: Stop looking for coalesced MMIO zones if the bus is destroyed
  KVM: Add proper lockdep assertion in I/O bus unregister

 include/linux/kvm_host.h  |  4 ++--
 virt/kvm/coalesced_mmio.c | 19 +--
 virt/kvm/kvm_main.c   | 26 --
 3 files changed, 35 insertions(+), 14 deletions(-)

-- 
2.31.1.295.g9ea45b61b8-goog



Re: [PATCH 2/2] KVM: x86: Fix split-irqchip vs interrupt injection window request

2021-04-12 Thread Sean Christopherson
On Fri, Apr 09, 2021, Lai Jiangshan wrote:
> On Fri, Nov 27, 2020 at 7:26 PM Paolo Bonzini  wrote:
> >
> > kvm_cpu_accept_dm_intr and kvm_vcpu_ready_for_interrupt_injection are
> > a hodge-podge of conditions, hacked together to get something that
> > more or less works.  But what is actually needed is much simpler;
> > in both cases the fundamental question is, do we have a place to stash
> > an interrupt if userspace does KVM_INTERRUPT?
> >
> > In userspace irqchip mode, that is !vcpu->arch.interrupt.injected.
> > Currently kvm_event_needs_reinjection(vcpu) covers it, but it is
> > unnecessarily restrictive.
> >
> > In split irqchip mode it's a bit more complicated, we need to check
> > kvm_apic_accept_pic_intr(vcpu) (the IRQ window exit is basically an INTACK
> > cycle and thus requires ExtINTs not to be masked) as well as
> > !pending_userspace_extint(vcpu).  However, there is no need to
> > check kvm_event_needs_reinjection(vcpu), since split irqchip keeps
> > pending ExtINT state separate from event injection state, and checking
> > kvm_cpu_has_interrupt(vcpu) is wrong too since ExtINT has higher
> > priority than APIC interrupts.  In fact the latter fixes a bug:
> > when userspace requests an IRQ window vmexit, an interrupt in the
> > local APIC can cause kvm_cpu_has_interrupt() to be true and thus
> > kvm_vcpu_ready_for_interrupt_injection() to return false.  When this
> > happens, vcpu_run does not exit to userspace but the interrupt window
> > vmexits keep occurring.  The VM loops without any hope of making progress.
> >
> > Once we try to fix these with something like
> >
> >  return kvm_arch_interrupt_allowed(vcpu) &&
> > -!kvm_cpu_has_interrupt(vcpu) &&
> > -!kvm_event_needs_reinjection(vcpu) &&
> > -kvm_cpu_accept_dm_intr(vcpu);
> > +(!lapic_in_kernel(vcpu)
> > + ? !vcpu->arch.interrupt.injected
> > + : (kvm_apic_accept_pic_intr(vcpu)
> > +&& !pending_userspace_extint(v)));
> >
> > we realize two things.  First, thanks to the previous patch the complex
> > conditional can reuse !kvm_cpu_has_extint(vcpu).  Second, the interrupt
> > window request in vcpu_enter_guest()
> >
> > bool req_int_win =
> > dm_request_for_irq_injection(vcpu) &&
> > kvm_cpu_accept_dm_intr(vcpu);
> >
> > should be kept in sync with kvm_vcpu_ready_for_interrupt_injection():
> > it is unnecessary to ask the processor for an interrupt window
> > if we would not be able to return to userspace.  Therefore, the
> > complex conditional is really the correct implementation of
> > kvm_cpu_accept_dm_intr(vcpu).  It all makes sense:
> >
> > - we can accept an interrupt from userspace if there is a place
> >   to stash it (and, for irqchip split, ExtINTs are not masked).
> >   Interrupts from userspace _can_ be accepted even if right now
> >   EFLAGS.IF=0.
> 
> Hello, Paolo
> 
> If userspace does KVM_INTERRUPT, vcpu->arch.interrupt.injected is
> set immediately, and in inject_pending_event(), we have
> 
> else if (!vcpu->arch.exception.pending) {
> if (vcpu->arch.nmi_injected) {
> kvm_x86_ops.set_nmi(vcpu);
> can_inject = false;
> } else if (vcpu->arch.interrupt.injected) {
> kvm_x86_ops.set_irq(vcpu);
> can_inject = false;
> }
> }
> 
> I'm curious about that can the kvm_x86_ops.set_irq() here be possible
> to queue the irq with EFLAGS.IF=0? If not, which code prevents it?

The interrupt is only directly injected if the local APIC is _not_ in-kernel.
If userspace is managing the local APIC, my understanding is that userspace is
also responsible for honoring EFLAGS.IF, though KVM aids userspace by updating
vcpu->run->ready_for_interrupt_injection when exiting to userspace.  When
userspace is modeling the local APIC, that resolves to
kvm_vcpu_ready_for_interrupt_injection():

return kvm_arch_interrupt_allowed(vcpu) &&
kvm_cpu_accept_dm_intr(vcpu);

where kvm_arch_interrupt_allowed() checks EFLAGS.IF (and an edge case related to
nested virtualization).  KVM also captures EFLAGS.IF in vcpu->run->if_flag.
For whatever reason, QEMU checks both vcpu->run flags before injecting an IRQ,
maybe to handle a case where QEMU itself clears EFLAGS.IF?
 
> I'm asking about this because I just noticed that interrupt can
> be queued when exception pending, and this patch relaxed it even
> more.
> 
> Note: interrupt can NOT be queued when exception pending
> until 664f8e26b00c7 ("KVM: X86: Fix loss of exception which
> has not yet been injected") which I think is dangerous.


Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-12 Thread Sean Christopherson
On Sun, Apr 11, 2021, Len Brown wrote:
> On Fri, Apr 9, 2021 at 5:44 PM Andy Lutomirski  wrote:
> >
> > On Fri, Apr 9, 2021 at 1:53 PM Len Brown  wrote:
> > >
> > > On Wed, Mar 31, 2021 at 6:45 PM Andy Lutomirski  wrote:
> > > >
> > > > On Wed, Mar 31, 2021 at 3:28 PM Len Brown  wrote:
> > > > > We've also established that when running in a VMM, every update to
> > > > > XCR0 causes a VMEXIT.
> > > >
> > > > This is true, it sucks, and Intel could fix it going forward.
> > >
> > > What hardware fix do you suggest?
> > > If a guest is permitted to set XCR0 bits without notifying the VMM,
> > > what happens when it sets bits that the VMM doesn't know about?
> >
> > The VM could have a mask of allowed XCR0 bits that don't exist.
> >
> > TDX solved this problem *somehow* -- XSETBV doesn't (visibly?) exit on
> > TDX.  Surely plain VMX could fix it too.
> 
> There are two cases.
> 
> 1. Hardware that exists today and in the foreseeable future.
> 
> VM modification of XCR0 results in VMEXIT to VMM.
> The VMM sees bits set by the guest, and so it can accept what
> it supports, or send the VM a fault for non-support.
> 
> Here it is not possible for the VMM to change XCR0 without the VMM knowing.
> 
> 2. Future Hardware that allows guests to write XCR0 w/o VMEXIT.
> 
> Not sure I follow your proposal.
> 
> Yes, the VM effectively has a mask of what is supported,
> because it can issue CPUID.
> 
> The VMM virtualizes CPUID, and needs to know it must not
> expose to the VM any state features it doesn't support.
> Also, the VMM needs to audit XCR0 before it uses XSAVE,
> else the guest could attack or crash the VMM through
> buffer overrun.

The VMM already needs to context switch XCR0 and XSS, so this is a non-issue.

> Is this what you suggest?

Yar.  In TDX, XSETBV exits, but only to the TDX module.  I.e. TDX solves the
problem in software by letting the VMM tell the TDX module what features the
guest can set in XCR0/XSS via the XFAM (Extended Features Allowed Mask).

But, that software "fix" can also be pushed into ucode, e.g. add an XFAM VMCS
field, the guest can set any XCR0 bits that are '1' in VMCS.XFAM without 
exiting.

Note, SGX has similar functionality in the form of XFRM (XSAVE-Feature Request
Mask).  The enclave author can specify what features will be enabled in XCR0
when the enclave is running.  Not that relevant, other than to reinforce that
this is a solvable problem.

> If yes, what do you suggest in the years between now and when
> that future hardware and VMM exist?

Burn some patch space? :-)


Re: general protection fault in kvm_vm_ioctl_unregister_coalesced_mmio

2021-04-12 Thread Sean Christopherson
On Mon, Apr 12, 2021, Hao Sun wrote:
> Crash log:
> ==
> kvm: failed to shrink bus, removing it completely
> general protection fault, probably for non-canonical address
> 0xdead0100:  [#1] PREEMPT SMP
> CPU: 3 PID: 7974 Comm: executor Not tainted 5.12.0-rc6+ #14
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.13.0-1ubuntu1.1 04/01/2014
> RIP: 0010:kvm_vm_ioctl_unregister_coalesced_mmio+0x88/0x1e0
> arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:183

Ugh, this code is a mess.  On allocation failure, it nukes the entire bus and
invokes the destructor for all _other_ devices on the bus.  The coalesced MMIO
code is iterating over its list of devices, and while list_for_each_entry_safe()
can handle removal of the current entry, it blows up when future entries are
deleted.
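
To illustrate the hazard (a sketch of the failing pattern, not a proposed
change): list_for_each_entry_safe() only caches the *next* node, so it
survives deletion of the current entry but not destruction of entries that
are still ahead in the walk.

	list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
		/*
		 * On allocation failure this destroys every device on the
		 * bus except 'dev', i.e. frees 'tmp' and everything after
		 * it.  The next iteration then walks a freed, list-poisoned
		 * entry, hence the 0xdead...0100 dereference in the splat.
		 */
		kvm_io_bus_unregister_dev(kvm, zone->pio ? KVM_PIO_BUS :
					  KVM_MMIO_BUS, &dev->dev);
		kvm_iodevice_destructor(&dev->dev);
	}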

The coalesced MMIO code continuing to iterate appears to stem from the fact
that KVM_UNREGISTER_COALESCED_MMIO doesn't require an exact match.  Whether or
not this is intentional is probably a moot point since it's now baked into the
ABI.

Assuming we can't change kvm_vm_ioctl_unregister_coalesced_mmio() to stop
iterating on match, the least awful fix would be to return success/failure from
kvm_io_bus_unregister_dev().

Note, there's a second bug in the error path in kvm_io_bus_unregister_dev(), as
it invokes the destructors before nullifying kvm->buses and synchronizing SRCU.
I.e. it's freeing devices on the bus while readers may be in flight.  That can
be fixed by deferring the destruction until after SRCU synchronization.

I'll send patches unless someone has a better idea for fixing this.

> Code: 00 4c 89 74 24 18 4c 89 6c 24 20 48 8b 44 24 10 48 83 c0 08 48
> 89 44 24 28 48 89 5c 24 08 4c 89 24 24 4c 89 ff e8 d8 9f 49 00 <4d> 8b
> 37 48 89 df e8 3d 9b 49 00 8b 2b 49 8d 7f 2c e8 32 9b 49 00
> RSP: 0018:c90005dfbd58 EFLAGS: 00010246
> RAX: 88800c3e7188 RBX: c90005dfbe3c RCX: 0af0
> RDX: 00010100 RSI: cbab RDI: dead0100
> RBP:  R08:  R09: 00010107
> R10: 0001 R11: 01d2 R12: c90005e7dff8
> R13: 4000 R14: dead0100 R15: dead0100
> FS:  7ff1bb092700() GS:88807ed0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 55d8946c4918 CR3: 12d88000 CR4: 00752ee0
> PKRU: 5554
> Call Trace:
>  kvm_vm_ioctl+0x6e1/0x1860 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3897
>  vfs_ioctl fs/ioctl.c:48 [inline]
>  __do_sys_ioctl fs/ioctl.c:753 [inline]
>  __se_sys_ioctl+0xab/0x110 fs/ioctl.c:739
>  __x64_sys_ioctl+0x3f/0x50 fs/ioctl.c:739
>  do_syscall_64+0x39/0x80 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x47338d


Re: [PATCH 5/6] KVM: SVM: pass a proper reason in kvm_emulate_instruction()

2021-04-12 Thread Sean Christopherson
+Aaron

On Mon, Apr 12, 2021, David Edmondson wrote:
> From: Joao Martins 
> 
> Declare various causes of emulation and use them as appropriate.
> 
> Signed-off-by: Joao Martins 
> Signed-off-by: David Edmondson 
> ---
>  arch/x86/include/asm/kvm_host.h |  6 ++
>  arch/x86/kvm/svm/avic.c |  3 ++-
>  arch/x86/kvm/svm/svm.c  | 26 +++---
>  3 files changed, 23 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 79e9ca756742..e1284680cbdc 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1535,6 +1535,12 @@ enum {
>   EMULREASON_IO_COMPLETE,
>   EMULREASON_UD,
>   EMULREASON_PF,
> + EMULREASON_SVM_NOASSIST,
> + EMULREASON_SVM_RSM,
> + EMULREASON_SVM_RDPMC,
> + EMULREASON_SVM_CR,
> + EMULREASON_SVM_DR,
> + EMULREASON_SVM_AVIC_UNACCEL,

Passing these to userspace arguably makes them ABI, i.e. they need to go into
uapi/kvm.h somewhere.  That said, I don't like passing arbitrary values for what
is effectively the VM-Exit reason.  Why not simply pass the exit reason, 
assuming
we do indeed want to dump this info to userspace?

What is the intended end usage of this information?  Actual emulation?  Debug?
Logging?

Depending on what you're trying to do with the info, maybe there's a better
option.  E.g. Aaron is working on a series that includes passing pass the code
stream (instruction bytes) to userspace on emulation failure, though I'm not
sure if he's planning on providing the VM-Exit reason.


Re: [PATCH v3] KVM: SVM: Make sure GHCB is mapped before updating

2021-04-09 Thread Sean Christopherson
On Fri, Apr 09, 2021, Tom Lendacky wrote:
> From: Tom Lendacky 
> 
> Access to the GHCB is mainly in the VMGEXIT path and it is known that the
> GHCB will be mapped. But there are two paths where it is possible the GHCB
> might not be mapped.
> 
> The sev_vcpu_deliver_sipi_vector() routine will update the GHCB to inform
> the caller of the AP Reset Hold NAE event that a SIPI has been delivered.
> However, if a SIPI is performed without a corresponding AP Reset Hold,
> then the GHCB might not be mapped (depending on the previous VMEXIT),
> which will result in a NULL pointer dereference.
> 
> The svm_complete_emulated_msr() routine will update the GHCB to inform
> the caller of a RDMSR/WRMSR operation about any errors. While it is likely
> that the GHCB will be mapped in this situation, add a safe guard
> in this path to be certain a NULL pointer dereference is not encountered.
> 
> Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts 
> under SEV-ES")
> Fixes: 647daca25d24 ("KVM: SVM: Add support for booting APs in an SEV-ES 
> guest")
> Signed-off-by: Tom Lendacky 
> 
> ---

Reviewed-by: Sean Christopherson 


Re: [PATCH v4 1/4] KVM: x86: Fix a spurious -E2BIG in KVM_GET_EMULATED_CPUID

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, Emanuele Giuseppe Esposito wrote:
> When retrieving emulated CPUID entries, check for an insufficient array
> size if and only if KVM is actually inserting an entry.
> If userspace has a priori knowledge of the exact array size,
> KVM_GET_EMULATED_CPUID will incorrectly fail due to effectively requiring
> an extra, unused entry.
> 
> Fixes: 433f4ba19041 ("KVM: x86: fix out-of-bounds write in 
> KVM_GET_EMULATED_CPUID (CVE-2019-19332)")
> Signed-off-by: Emanuele Giuseppe Esposito 
> ---
>  arch/x86/kvm/cpuid.c | 33 -
>  1 file changed, 16 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 6bd2f8b830e4..d30194081892 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -567,34 +567,33 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct 
> kvm_cpuid_array *array,
>  
>  static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func)
>  {
> - struct kvm_cpuid_entry2 *entry;
> -
> - if (array->nent >= array->maxnent)
> - return -E2BIG;
> + struct kvm_cpuid_entry2 entry;
>  
> - entry = &array->entries[array->nent];
> - entry->function = func;
> - entry->index = 0;
> - entry->flags = 0;
> + memset(&entry, 0, sizeof(entry));
>  
>   switch (func) {
>   case 0:
> - entry->eax = 7;
> - ++array->nent;
> + entry.eax = 7;
>   break;
>   case 1:
> - entry->ecx = F(MOVBE);
> - ++array->nent;
> + entry.ecx = F(MOVBE);
>   break;
>   case 7:
> - entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
> - entry->eax = 0;
> - entry->ecx = F(RDPID);
> - ++array->nent;
> - default:
> + entry.flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
> + entry.ecx = F(RDPID);
>   break;
> + default:
> + goto out;
>   }
>  
> + /* This check is performed only when func is valid */

Sorry to keep nitpicking and bikeshedding.  Funcs aren't really "invalid", KVM
just doesn't have any features it emulates in other leafs.  Maybe be more 
literal
in describing what triggers the check?

/* Check the array capacity iff the entry is being copied over. */

Not a sticking point, so either way:

Reviewed-by: Sean Christopherson 

> + if (array->nent >= array->maxnent)
> + return -E2BIG;
> +
> + entry.function = func;
> + memcpy(&array->entries[array->nent++], &entry, sizeof(entry));
> +
> +out:
>   return 0;
>  }
>  
> -- 
> 2.30.2
> 


Re: [PATCH] x86/kvm: Don't alloc __pv_cpu_mask when !CONFIG_SMP

2021-04-08 Thread Sean Christopherson
On Wed, Apr 07, 2021, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> Enable PV TLB shootdown when !CONFIG_SMP doesn't make sense. Let's move 
> it inside CONFIG_SMP. In addition, we can avoid alloc __pv_cpu_mask when 
> !CONFIG_SMP and get rid of 'alloc' variable in kvm_alloc_cpumask.

...

> +static bool pv_tlb_flush_supported(void) { return false; }
> +static bool pv_ipi_supported(void) { return false; }
> +static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> + const struct flush_tlb_info *info) { }
> +static void kvm_setup_pv_ipi(void) { }

If you shuffle things around a bit more, you can avoid these stubs, and hide the
definition of __pv_cpu_mask behind CONFIG_SMP, too.


diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5e78e01ca3b4..13c6b1c7c01b 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -451,6 +451,8 @@ static void __init sev_map_percpu_data(void)
}
 }

+#ifdef CONFIG_SMP
+
 static bool pv_tlb_flush_supported(void)
 {
return (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
@@ -460,8 +462,6 @@ static bool pv_tlb_flush_supported(void)

 static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);

-#ifdef CONFIG_SMP
-
 static bool pv_ipi_supported(void)
 {
return kvm_para_has_feature(KVM_FEATURE_PV_SEND_IPI);
@@ -574,45 +574,6 @@ static void kvm_smp_send_call_func_ipi(const struct 
cpumask *mask)
}
 }

-static void __init kvm_smp_prepare_boot_cpu(void)
-{
-   /*
-* Map the per-cpu variables as decrypted before kvm_guest_cpu_init()
-* shares the guest physical address with the hypervisor.
-*/
-   sev_map_percpu_data();
-
-   kvm_guest_cpu_init();
-   native_smp_prepare_boot_cpu();
-   kvm_spinlock_init();
-}
-
-static void kvm_guest_cpu_offline(void)
-{
-   kvm_disable_steal_time();
-   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
-   wrmsrl(MSR_KVM_PV_EOI_EN, 0);
-   kvm_pv_disable_apf();
-   apf_task_wake_all();
-}
-
-static int kvm_cpu_online(unsigned int cpu)
-{
-   local_irq_disable();
-   kvm_guest_cpu_init();
-   local_irq_enable();
-   return 0;
-}
-
-static int kvm_cpu_down_prepare(unsigned int cpu)
-{
-   local_irq_disable();
-   kvm_guest_cpu_offline();
-   local_irq_enable();
-   return 0;
-}
-#endif
-
 static void kvm_flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
 {
@@ -639,6 +600,63 @@ static void kvm_flush_tlb_others(const struct cpumask 
*cpumask,
native_flush_tlb_others(flushmask, info);
 }

+static void __init kvm_smp_prepare_boot_cpu(void)
+{
+   /*
+* Map the per-cpu variables as decrypted before kvm_guest_cpu_init()
+* shares the guest physical address with the hypervisor.
+*/
+   sev_map_percpu_data();
+
+   kvm_guest_cpu_init();
+   native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
+}
+
+static void kvm_guest_cpu_offline(void)
+{
+   kvm_disable_steal_time();
+   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
+   wrmsrl(MSR_KVM_PV_EOI_EN, 0);
+   kvm_pv_disable_apf();
+   apf_task_wake_all();
+}
+
+static int kvm_cpu_online(unsigned int cpu)
+{
+   local_irq_disable();
+   kvm_guest_cpu_init();
+   local_irq_enable();
+   return 0;
+}
+
+static int kvm_cpu_down_prepare(unsigned int cpu)
+{
+   local_irq_disable();
+   kvm_guest_cpu_offline();
+   local_irq_enable();
+   return 0;
+}
+
+static __init int kvm_alloc_cpumask(void)
+{
+   int cpu;
+
+   if (!kvm_para_available() || nopv)
+   return 0;
+
+   if (pv_tlb_flush_supported() || pv_ipi_supported())
+   for_each_possible_cpu(cpu) {
+   zalloc_cpumask_var_node(per_cpu_ptr(&__pv_cpu_mask, 
cpu),
+   GFP_KERNEL, cpu_to_node(cpu));
+   }
+
+   return 0;
+}
+arch_initcall(kvm_alloc_cpumask);
+
+#endif
+
 static void __init kvm_guest_init(void)
 {
int i;
@@ -653,21 +671,21 @@ static void __init kvm_guest_init(void)
pv_ops.time.steal_clock = kvm_steal_clock;
}

+   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
+   apic_set_eoi_write(kvm_guest_apic_eoi_write);
+
+   if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf) {
+   static_branch_enable(&kvm_async_pf_enabled);
+   alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, 
asm_sysvec_kvm_asyncpf_interrupt);
+   }
+
+#ifdef CONFIG_SMP
if (pv_tlb_flush_supported()) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
pr_info("KVM setup pv remote TLB flush\n");
}

-   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
-   apic_set_eoi_write(kvm_guest_apic_eoi_write);
-
-   if 

Re: [PATCH v2] KVM: SVM: Make sure GHCB is mapped before updating

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, Tom Lendacky wrote:
> 
> 
> On 4/8/21 12:37 PM, Sean Christopherson wrote:
> > On Thu, Apr 08, 2021, Tom Lendacky wrote:
> >> On 4/8/21 12:10 PM, Sean Christopherson wrote:
> >>> On Thu, Apr 08, 2021, Tom Lendacky wrote:
> >>>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> >>>> index 83e00e524513..7ac67615c070 100644
> >>>> --- a/arch/x86/kvm/svm/sev.c
> >>>> +++ b/arch/x86/kvm/svm/sev.c
> >>>> @@ -2105,5 +2105,8 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu 
> >>>> *vcpu, u8 vector)
> >>>>   * the guest will set the CS and RIP. Set SW_EXIT_INFO_2 to a
> >>>>   * non-zero value.
> >>>>   */
> >>>> +if (WARN_ON_ONCE(!svm->ghcb))
> >>>
> >>> Isn't this guest triggerable?  I.e. send a SIPI without doing the reset 
> >>> hold?
> >>> If so, this should not WARN.
> >>
> >> Yes, it is a guest triggerable event. But a guest shouldn't be doing that,
> >> so I thought adding the WARN_ON_ONCE() just to detect it wasn't bad.
> >> Definitely wouldn't want a WARN_ON().
> > 
> > WARNs are intended only for host issues, e.g. a malicious guest shouldn't be
> > able to crash the host when running with panic_on_warn.
> > 
> 
> Ah, yeah, forgot about panic_on_warn. I can go back to the original patch
> or do a pr_warn_once(), any pref?

No strong preference.  If you think the print would be helpful for ongoing
development, then it's probably worth adding.


Re: [PATCH v2] KVM: SVM: Make sure GHCB is mapped before updating

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, Tom Lendacky wrote:
> On 4/8/21 12:10 PM, Sean Christopherson wrote:
> > On Thu, Apr 08, 2021, Tom Lendacky wrote:
> >> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> >> index 83e00e524513..7ac67615c070 100644
> >> --- a/arch/x86/kvm/svm/sev.c
> >> +++ b/arch/x86/kvm/svm/sev.c
> >> @@ -2105,5 +2105,8 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu 
> >> *vcpu, u8 vector)
> >> * the guest will set the CS and RIP. Set SW_EXIT_INFO_2 to a
> >> * non-zero value.
> >> */
> >> +  if (WARN_ON_ONCE(!svm->ghcb))
> > 
> > Isn't this guest triggerable?  I.e. send a SIPI without doing the reset 
> > hold?
> > If so, this should not WARN.
> 
> Yes, it is a guest triggerable event. But a guest shouldn't be doing that,
> so I thought adding the WARN_ON_ONCE() just to detect it wasn't bad.
> Definitely wouldn't want a WARN_ON().

WARNs are intended only for host issues, e.g. a malicious guest shouldn't be
able to crash the host when running with panic_on_warn.


Re: [PATCH v2] KVM: SVM: Make sure GHCB is mapped before updating

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, Tom Lendacky wrote:
> From: Tom Lendacky 
> 
> Access to the GHCB is mainly in the VMGEXIT path and it is known that the
> GHCB will be mapped. But there are two paths where it is possible the GHCB
> might not be mapped.
> 
> The sev_vcpu_deliver_sipi_vector() routine will update the GHCB to inform
> the caller of the AP Reset Hold NAE event that a SIPI has been delivered.
> However, if a SIPI is performed without a corresponding AP Reset Hold,
> then the GHCB might not be mapped (depending on the previous VMEXIT),
> which will result in a NULL pointer dereference.
> 
> The svm_complete_emulated_msr() routine will update the GHCB to inform
> the caller of a RDMSR/WRMSR operation about any errors. While it is likely
> that the GHCB will be mapped in this situation, add a safe guard
> in this path to be certain a NULL pointer dereference is not encountered.
> 
> Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts 
> under SEV-ES")
> Fixes: 647daca25d24 ("KVM: SVM: Add support for booting APs in an SEV-ES 
> guest")
> Signed-off-by: Tom Lendacky 
> 
> ---
> 
> Changes from v1:
> - Added the svm_complete_emulated_msr() path as suggested by Sean
>   Christopherson
> - Add a WARN_ON_ONCE() to the sev_vcpu_deliver_sipi_vector() path
> ---
>  arch/x86/kvm/svm/sev.c | 3 +++
>  arch/x86/kvm/svm/svm.c | 2 +-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 83e00e524513..7ac67615c070 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2105,5 +2105,8 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu 
> *vcpu, u8 vector)
>* the guest will set the CS and RIP. Set SW_EXIT_INFO_2 to a
>* non-zero value.
>*/
> + if (WARN_ON_ONCE(!svm->ghcb))

Isn't this guest triggerable?  I.e. send a SIPI without doing the reset hold?
If so, this should not WARN.

> + return;
> +
>   ghcb_set_sw_exit_info_2(svm->ghcb, 1);
>  }
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 271196400495..534e52ba6045 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -2759,7 +2759,7 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>  static int svm_complete_emulated_msr(struct kvm_vcpu *vcpu, int err)
>  {
>   struct vcpu_svm *svm = to_svm(vcpu);
> - if (!sev_es_guest(vcpu->kvm) || !err)
> + if (!err || !sev_es_guest(vcpu->kvm) || WARN_ON_ONCE(!svm->ghcb))
>   return kvm_complete_insn_gp(vcpu, err);
>  
>   ghcb_set_sw_exit_info_1(svm->ghcb, 1);
> -- 
> 2.31.0
> 


Re: [PATCH] KVM: X86: Count success and invalid yields

2021-04-08 Thread Sean Christopherson
On Tue, Apr 06, 2021, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> To analyze some performance issues with lock contention and scheduling,
> it is nice to know when directed yield are successful or failing.
> 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/x86.c  | 26 --
>  2 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 44f8930..157bcaa 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1126,6 +1126,8 @@ struct kvm_vcpu_stat {
>   u64 halt_poll_success_ns;
>   u64 halt_poll_fail_ns;
>   u64 nested_run;
> + u64 yield_directed;
> + u64 yield_directed_ignore;
>  };
>  
>  struct x86_instruction_info;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 16fb395..3b475cd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -246,6 +246,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
>   VCPU_STAT("halt_poll_success_ns", halt_poll_success_ns),
>   VCPU_STAT("halt_poll_fail_ns", halt_poll_fail_ns),
>   VCPU_STAT("nested_run", nested_run),
> + VCPU_STAT("yield_directed", yield_directed),

This is ambiguous, it's not clear without looking at the code if it's counting
attempts or actual yields.

> + VCPU_STAT("yield_directed_ignore", yield_directed_ignore),

"ignored" also feels a bit misleading, as that implies KVM deliberately ignored
a valid request, whereas many of the failure paths are due to invalid requests
or errors of some kind.

What about mirroring the halt poll stats, i.e. track "attempted" and 
"successful",
as opposed to "attempted" and "ignored/failed".And maybe switched directed
and yield?  I.e. directed_yield_attempted and directed_yield_successful.
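
E.g. (illustrative only, showing how the suggested names would wire up):

	/* in struct kvm_vcpu_stat */
	u64 directed_yield_attempted;
	u64 directed_yield_successful;

	/* in debugfs_entries[] */
	VCPU_STAT("directed_yield_attempted", directed_yield_attempted),
	VCPU_STAT("directed_yield_successful", directed_yield_successful),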

Alternatively, would it make sense to do s/directed/pv, or is that not worth the
potential risk of being wrong if a non-paravirt use case comes along?

pv_yield_attempted
pv_yield_successful

>   VM_STAT("mmu_shadow_zapped", mmu_shadow_zapped),
>   VM_STAT("mmu_pte_write", mmu_pte_write),
>   VM_STAT("mmu_pde_zapped", mmu_pde_zapped),
> @@ -8211,21 +8213,33 @@ void kvm_apicv_init(struct kvm *kvm, bool enable)
>  }
>  EXPORT_SYMBOL_GPL(kvm_apicv_init);
>  
> -static void kvm_sched_yield(struct kvm *kvm, unsigned long dest_id)
> +static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
>  {
>   struct kvm_vcpu *target = NULL;
>   struct kvm_apic_map *map;
>  
> + vcpu->stat.yield_directed++;
> +
>   rcu_read_lock();
> - map = rcu_dereference(kvm->arch.apic_map);
> + map = rcu_dereference(vcpu->kvm->arch.apic_map);
>  
>   if (likely(map) && dest_id <= map->max_apic_id && 
> map->phys_map[dest_id])
>   target = map->phys_map[dest_id]->vcpu;
>  
>   rcu_read_unlock();
> + if (!target)
> + goto no_yield;
> +
> + if (!READ_ONCE(target->ready))

I vote to keep these checks together.  That'll also make the addition of the
"don't yield to self" check match the order of ready vs. self in 
kvm_vcpu_on_spin().

if (!target || !READ_ONCE(target->ready))

> + goto no_yield;
>  
> - if (target && READ_ONCE(target->ready))
> - kvm_vcpu_yield_to(target);
> + if (kvm_vcpu_yield_to(target) <= 0)
> + goto no_yield;
> + return;
> +
> +no_yield:
> + vcpu->stat.yield_directed_ignore++;
> + return;
>  }
>  
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> @@ -8272,7 +8286,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   break;
>  
>   kvm_pv_kick_cpu_op(vcpu->kvm, a0, a1);
> - kvm_sched_yield(vcpu->kvm, a1);
> + kvm_sched_yield(vcpu, a1);
>   ret = 0;
>   break;
>  #ifdef CONFIG_X86_64
> @@ -8290,7 +8304,7 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>   if (!guest_pv_has(vcpu, KVM_FEATURE_PV_SCHED_YIELD))
>   break;
>  
> - kvm_sched_yield(vcpu->kvm, a0);
> + kvm_sched_yield(vcpu, a0);
>   ret = 0;
>   break;
>   default:
> -- 
> 2.7.4
> 


Re: [PATCH] KVM: X86: Do not yield to self

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, Wanpeng Li wrote:
> From: Wanpeng Li 
> 
> If the target is self we do not need to yield, we can avoid malicious 
> guest to play this.
> 
> Signed-off-by: Wanpeng Li 
> ---
> Rebased on 
> https://lore.kernel.org/kvm/1617697935-4158-1-git-send-email-wanpen...@tencent.com/
> 
>  arch/x86/kvm/x86.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 43c9f9b..260650f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8230,6 +8230,10 @@ static void kvm_sched_yield(struct kvm_vcpu *vcpu, 
> unsigned long dest_id)
>   if (!target)
>   goto no_yield;
>  
> + /* yield to self */

If you're going to bother with a comment, maybe elaborate a bit, e.g.

/* Ignore requests to yield to self. */

> + if (vcpu->vcpu_id == target->vcpu_id)
> + goto no_yield;
> +
>   if (!READ_ONCE(target->ready))
>   goto no_yield;
>  
> -- 
> 2.7.4
> 


Re: [PATCH] KVM: x86: Remove unused function declaration

2021-04-08 Thread Sean Christopherson
On Tue, Apr 06, 2021, Keqian Zhu wrote:
> kvm_mmu_slot_largepage_remove_write_access() is decared but not used,
> just remove it.
> 
> Signed-off-by: Keqian Zhu 

Reviewed-by: Sean Christopherson 


Re: [PATCH v2 07/17] KVM: x86/mmu: Check PDPTRs before allocating PAE roots

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, Paolo Bonzini wrote:
> On 08/04/21 17:48, Sean Christopherson wrote:
> > Freaking PDPTRs.  I was really hoping we could keep the lock and 
> > pages_available()
> > logic outside of the helpers.  What if kvm_mmu_load() reads the PDPTRs and
> > passes them into mmu_alloc_shadow_roots()?  Or is that too ugly?
> 
> The patch I have posted (though untested) tries to do that in a slightly
> less ugly way by pushing make_mmu_pages_available down to mmu_alloc_*_roots.

Yeah, I agree it's less ugly.  It would be nice to not duplicate that code, but
it's probably not worth the ugliness.  :-/

For your approach, can we put the out label after the success path?  Setting
mmu->root_pgd isn't wrong per se, but doing so might mislead future readers into
thinking that it's functionally necessary. 


diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index efb41f31e80a..93f97d0a9e2e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3244,6 +3244,13 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
u8 shadow_root_level = mmu->shadow_root_level;
hpa_t root;
unsigned i;
+   int r;
+
+   write_lock(&vcpu->kvm->mmu_lock);
+
+   r = make_mmu_pages_available(vcpu);
+   if (r)
+   goto out_unlock;

if (is_tdp_mmu_enabled(vcpu->kvm)) {
root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
@@ -3252,8 +3259,10 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
mmu->root_hpa = root;
} else if (shadow_root_level == PT32E_ROOT_LEVEL) {
-   if (WARN_ON_ONCE(!mmu->pae_root))
-   return -EIO;
+   if (WARN_ON_ONCE(!mmu->pae_root)) {
+   r = -EIO;
+   goto out_unlock;
+   }

for (i = 0; i < 4; ++i) {
WARN_ON_ONCE(IS_VALID_PAE_ROOT(mmu->pae_root[i]));
@@ -3266,13 +3275,15 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
mmu->root_hpa = __pa(mmu->pae_root);
} else {
WARN_ONCE(1, "Bad TDP root level = %d\n", shadow_root_level);
-   return -EIO;
+   r = -EIO;
+   goto out_unlock;
}

/* root_pgd is ignored for direct MMUs. */
mmu->root_pgd = 0;
-
-   return 0;
+out_unlock:
+   write_unlock(&vcpu->kvm->mmu_lock);
+   return r;
 }

 static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
@@ -3281,7 +3292,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
u64 pdptrs[4], pm_mask;
gfn_t root_gfn, root_pgd;
hpa_t root;
-   int i;
+   int i, r;

root_pgd = mmu->get_guest_pgd(vcpu);
root_gfn = root_pgd >> PAGE_SHIFT;
@@ -3289,6 +3300,10 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
if (mmu_check_root(vcpu, root_gfn))
return 1;

+   /*
+* On SVM, reading PDPTRs might access guest memory, which might fault
+* and thus might sleep.  Grab the PDPTRs before acquiring mmu_lock.
+*/
if (mmu->root_level == PT32E_ROOT_LEVEL) {
for (i = 0; i < 4; ++i) {
pdptrs[i] = mmu->get_pdptr(vcpu, i);
@@ -3300,6 +3315,12 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
}
}

+   write_lock(&vcpu->kvm->mmu_lock);
+
+   r = make_mmu_pages_available(vcpu);
+   if (r)
+   goto out_unlock;
+
/*
 * Do we shadow a long mode page table? If so we need to
 * write-protect the guests page table root.
@@ -3311,8 +3332,10 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
goto set_root_pgd;
}

-   if (WARN_ON_ONCE(!mmu->pae_root))
-   return -EIO;
+   if (WARN_ON_ONCE(!mmu->pae_root)) {
+   r = -EIO;
+   goto out_unlock;
+   }

/*
 * We shadow a 32 bit page table. This may be a legacy 2-level
@@ -3323,8 +3346,10 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
if (mmu->shadow_root_level == PT64_ROOT_4LEVEL) {
pm_mask |= PT_ACCESSED_MASK | PT_WRITABLE_MASK | PT_USER_MASK;

-   if (WARN_ON_ONCE(!mmu->lm_root))
-   return -EIO;
+   if (WARN_ON_ONCE(!mmu->lm_root)) {
+   r = -EIO;
+   goto out_unlock;
+   }

mmu->lm_root[0] = __pa(mmu->pae_root) | pm_mask;
}
@@ -3352,8 +3377,9 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)

 set_root_pgd:
mmu->root_pgd = root_pgd;
-
-   return 0;
+out_unlock:
+   write_unlock(&vcpu->kvm->mmu_lock);
+   return r;

Re: [PATCH] KVM: vmx: add mismatched size in vmcs_check32

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, lihaiwei.ker...@gmail.com wrote:
> From: Haiwei Li 
> 
> vmcs_check32 misses the check for 64-bit and 64-bit high.

Can you clarify in the changelog that, while it is architecturally legal to
access 64-bit and 64-bit high fields with a 32-bit read/write in 32-bit mode,
KVM should never do partial accesses to VMCS fields.  And/or note that the
32-bit accesses are done in vmcs_{read,write}64() when necessary?  Hmm, maybe:

  Add compile-time assertions in vmcs_check32() to disallow accesses to
  64-bit and 64-bit high fields via vmcs_{read,write}32().  Upper level
  KVM code should never do partial accesses to VMCS fields.  KVM handles
  the split accesses automatically in vmcs_{read,write}64() when running
  as a 32-bit kernel.

With something along those lines:

Reviewed-and-tested-by: Sean Christopherson  

> Signed-off-by: Haiwei Li 
> ---
>  arch/x86/kvm/vmx/vmx_ops.h | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/vmx_ops.h b/arch/x86/kvm/vmx/vmx_ops.h
> index 692b0c3..164b64f 100644
> --- a/arch/x86/kvm/vmx/vmx_ops.h
> +++ b/arch/x86/kvm/vmx/vmx_ops.h
> @@ -37,6 +37,10 @@ static __always_inline void vmcs_check32(unsigned long 
> field)
>  {
>   BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 0,
>"32-bit accessor invalid for 16-bit field");
> + BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6001) == 
> 0x2000,
> +  "32-bit accessor invalid for 64-bit field");
> + BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6001) == 
> 0x2001,
> +  "32-bit accessor invalid for 64-bit high field");
>   BUILD_BUG_ON_MSG(__builtin_constant_p(field) && ((field) & 0x6000) == 
> 0x6000,
>"32-bit accessor invalid for natural width field");
>  }
> -- 
> 1.8.3.1
> 


Re: [RFC PATCH] KVM: x86: Support write protect huge pages lazily

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, Keqian Zhu wrote:
> Hi Ben,
> 
> Do you have any similar idea that can share with us?

Doh, Ben is out this week, he'll be back Monday.  Sorry for gumming up the 
works :-/


Re: [PATCH v2 07/17] KVM: x86/mmu: Check PDPTRs before allocating PAE roots

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, Paolo Bonzini wrote:
> On 08/04/21 13:15, Wanpeng Li wrote:
> > I saw this splatting:
> > 
> >   BUG: sleeping function called from invalid context at
> > arch/x86/kvm/kvm_cache_regs.h:115
> >kvm_pdptr_read+0x20/0x60 [kvm]
> >kvm_mmu_load+0x3bd/0x540 [kvm]
> > 
> > There is a might_sleep() in kvm_pdptr_read(), however, the original
> > commit didn't explain more. I can send a formal one if the below fix
> > is acceptable.

We don't want to drop mmu_lock, even temporarily.  The reason for holding it
across the entire sequence is to ensure kvm_mmu_available_pages() isn't 
violated.

> I think we can just push make_mmu_pages_available down into
> kvm_mmu_load's callees.  This way it's not necessary to hold the lock
> until after the PDPTR check:

...

> @@ -4852,14 +4868,10 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
>   r = mmu_alloc_special_roots(vcpu);
>   if (r)
>   goto out;
> - write_lock(&vcpu->kvm->mmu_lock);
> - if (make_mmu_pages_available(vcpu))
> - r = -ENOSPC;
> - else if (vcpu->arch.mmu->direct_map)
> + if (vcpu->arch.mmu->direct_map)
>   r = mmu_alloc_direct_roots(vcpu);
>   else
>   r = mmu_alloc_shadow_roots(vcpu);
> - write_unlock(&vcpu->kvm->mmu_lock);
>   if (r)
>   goto out;

Freaking PDPTRs.  I was really hoping we could keep the lock and 
pages_available()
logic outside of the helpers.  What if kvm_mmu_load() reads the PDPTRs and
passes them into mmu_alloc_shadow_roots()?  Or is that too ugly?

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index efb41f31e80a..e3c4938cd665 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3275,11 +3275,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
return 0;
 }

-static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
+static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu, u64 pdptrs[4])
 {
struct kvm_mmu *mmu = vcpu->arch.mmu;
-   u64 pdptrs[4], pm_mask;
gfn_t root_gfn, root_pgd;
+   u64 pm_mask;
hpa_t root;
int i;

@@ -3291,11 +3291,8 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)

if (mmu->root_level == PT32E_ROOT_LEVEL) {
for (i = 0; i < 4; ++i) {
-   pdptrs[i] = mmu->get_pdptr(vcpu, i);
-   if (!(pdptrs[i] & PT_PRESENT_MASK))
-   continue;
-
-   if (mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
+   if ((pdptrs[i] & PT_PRESENT_MASK) &&
+   mmu_check_root(vcpu, pdptrs[i] >> PAGE_SHIFT))
return 1;
}
}
@@ -4844,21 +4841,33 @@ EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);

 int kvm_mmu_load(struct kvm_vcpu *vcpu)
 {
-   int r;
+   struct kvm_mmu *mmu = vcpu->arch.mmu;
+   u64 pdptrs[4];
+   int r, i;

-   r = mmu_topup_memory_caches(vcpu, !vcpu->arch.mmu->direct_map);
+   r = mmu_topup_memory_caches(vcpu, !mmu->direct_map);
if (r)
goto out;
r = mmu_alloc_special_roots(vcpu);
if (r)
goto out;
+
+   /*
+* On SVM, reading PDPTRs might access guest memory, which might fault
+* and thus might sleep.  Grab the PDPTRs before acquiring mmu_lock.
+*/
+   if (!mmu->direct_map && mmu->root_level == PT32E_ROOT_LEVEL) {
+   for (i = 0; i < 4; ++i)
+   pdptrs[i] = mmu->get_pdptr(vcpu, i);
+   }
+
write_lock(&vcpu->kvm->mmu_lock);
if (make_mmu_pages_available(vcpu))
r = -ENOSPC;
else if (vcpu->arch.mmu->direct_map)
r = mmu_alloc_direct_roots(vcpu);
else
-   r = mmu_alloc_shadow_roots(vcpu);
+   r = mmu_alloc_shadow_roots(vcpu, pdptrs);
write_unlock(&vcpu->kvm->mmu_lock);
if (r)
goto out;


Re: [PATCH 1/7] hyperv: Detect Nested virtualization support for SVM

2021-04-08 Thread Sean Christopherson
On Thu, Apr 08, 2021, Vineeth Pillai wrote:
> Hi Vitaly,
> 
> On 4/8/21 7:06 AM, Vitaly Kuznetsov wrote:
> > -   if (ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED) {
> > +   /*
> > +* AMD does not need enlightened VMCS as VMCB is already a
> > +* datastructure in memory.
> > Well, VMCS is also a structure in memory, isn't it? It's just that we
> > don't have a 'clean field' concept for it and we can't use normal memory
> > accesses.

Technically, you can use normal memory accesses, so long as software guarantees
the VMCS isn't resident in the VMCS cache and knows the field offsets for the
underlying CPU.  The lack of an architecturally defined layout is the biggest
issue, e.g. tacking on dirty bits through a PV ABI would be trivial.

> Yes, you are right. I was referring to the fact that we can't use normal
> memory accesses, but it is a bit mis-worded.

If you slot in "architectural" it will read nicely, i.e. "VMCB is already an
architectural datastructure in memory".


Re: [PATCH 1/7] hyperv: Detect Nested virtualization support for SVM

2021-04-07 Thread Sean Christopherson
On Wed, Apr 07, 2021, Michael Kelley wrote:
> From: Vineeth Pillai  Sent: Wednesday, April 7, 
> 2021 7:41 AM
> > 
> > Detect nested features exposed by Hyper-V if SVM is enabled.
> > 
> > Signed-off-by: Vineeth Pillai 
> > ---
> >  arch/x86/kernel/cpu/mshyperv.c | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> > index 3546d3e21787..4d364acfe95d 100644
> > --- a/arch/x86/kernel/cpu/mshyperv.c
> > +++ b/arch/x86/kernel/cpu/mshyperv.c
> > @@ -325,9 +325,17 @@ static void __init ms_hyperv_init_platform(void)
> > ms_hyperv.isolation_config_a, 
> > ms_hyperv.isolation_config_b);
> > }
> > 
> > -   if (ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED) {
> > +   /*
> > +* AMD does not need enlightened VMCS as VMCB is already a
> > +* datastructure in memory. We need to get the nested
> > +* features if SVM is enabled.
> > +*/
> > +   if (boot_cpu_has(X86_FEATURE_SVM) ||
> > +   ms_hyperv.hints & HV_X64_ENLIGHTENED_VMCS_RECOMMENDED) {
> > ms_hyperv.nested_features =
> > cpuid_eax(HYPERV_CPUID_NESTED_FEATURES);
> > +   pr_info("Hyper-V nested_features: 0x%x\n",
> 
> Nit:  Most other similar lines put the colon in a different place:
> 
>   pr_info("Hyper-V: nested features 0x%x\n",
> 
> One of these days, I'm going to fix the ones that don't follow this
> pattern. :-)

Any reason not to use pr_fmt?
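
I.e. something like the below in mshyperv.c (a sketch; the define needs to
land above the printk-related includes to take effect):

#define pr_fmt(fmt)	"Hyper-V: " fmt

	/* ...and the message then picks up the prefix automatically: */
	pr_info("nested features 0x%x\n", ms_hyperv.nested_features);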


Re: [PATCH] KVM: SVM: Make sure GHCB is mapped before updating

2021-04-07 Thread Sean Christopherson
On Wed, Apr 07, 2021, Tom Lendacky wrote:
> On 4/7/21 3:08 PM, Sean Christopherson wrote:
> > On Wed, Apr 07, 2021, Tom Lendacky wrote:
> >> From: Tom Lendacky 
> >>
> >> The sev_vcpu_deliver_sipi_vector() routine will update the GHCB to inform
> >> the caller of the AP Reset Hold NAE event that a SIPI has been delivered.
> >> However, if a SIPI is performed without a corresponding AP Reset Hold,
> >> then the GHCB may not be mapped, which will result in a NULL pointer
> >> dereference.
> >>
> >> Check that the GHCB is mapped before attempting the update.
> > 
> > It's tempting to say the ghcb_set_*() helpers should guard against this, but
> > that would add a lot of pollution and the vast majority of uses are very 
> > clearly
> > in the vmgexit path.  svm_complete_emulated_msr() is the only other case 
> > that
> > is non-obvious; would it make sense to sanity check svm->ghcb there as well?
> 
> Hmm... I'm not sure if we can get here without having taken the VMGEXIT
> path to start, but it certainly couldn't hurt to add it.

Yeah, AFAICT it should be impossible to reach the callback without a valid ghcb,
it'd be purely be a sanity check.
 
> I can submit a v2 with that unless you want to submit it (with one small
> change below).

I'd say just throw it into v2.

> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 019ac836dcd0..abe9c765628f 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -2728,7 +2728,8 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct 
> > msr_data *msr_info)
> >  static int svm_complete_emulated_msr(struct kvm_vcpu *vcpu, int err)
> >  {
> > struct vcpu_svm *svm = to_svm(vcpu);
> > -   if (!sev_es_guest(vcpu->kvm) || !err)
> > +
> > +   if (!err || !sev_es_guest(vcpu->kvm) || !WARN_ON_ONCE(svm->ghcb))
> 
> This should be WARN_ON_ONCE(!svm->ghcb), otherwise you'll get the right
> result, but get a stack trace immediately.

Doh, yep.


Re: [PATCH] KVM: SVM: Make sure GHCB is mapped before updating

2021-04-07 Thread Sean Christopherson
On Wed, Apr 07, 2021, Tom Lendacky wrote:
> From: Tom Lendacky 
> 
> The sev_vcpu_deliver_sipi_vector() routine will update the GHCB to inform
> the caller of the AP Reset Hold NAE event that a SIPI has been delivered.
> However, if a SIPI is performed without a corresponding AP Reset Hold,
> then the GHCB may not be mapped, which will result in a NULL pointer
> dereference.
> 
> Check that the GHCB is mapped before attempting the update.

It's tempting to say the ghcb_set_*() helpers should guard against this, but
that would add a lot of pollution and the vast majority of uses are very clearly
in the vmgexit path.  svm_complete_emulated_msr() is the only other case that
is non-obvious; would it make sense to sanity check svm->ghcb there as well?

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 019ac836dcd0..abe9c765628f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2728,7 +2728,8 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
 static int svm_complete_emulated_msr(struct kvm_vcpu *vcpu, int err)
 {
struct vcpu_svm *svm = to_svm(vcpu);
-   if (!sev_es_guest(vcpu->kvm) || !err)
+
+   if (!err || !sev_es_guest(vcpu->kvm) || !WARN_ON_ONCE(svm->ghcb))
return kvm_complete_insn_gp(vcpu, err);

ghcb_set_sw_exit_info_1(svm->ghcb, 1);

> Fixes: 647daca25d24 ("KVM: SVM: Add support for booting APs in an SEV-ES 
> guest")
> Signed-off-by: Tom Lendacky 

Either way:

Reviewed-by: Sean Christopherson  

> ---
>  arch/x86/kvm/svm/sev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 83e00e524513..13758e3b106d 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -2105,5 +2105,6 @@ void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu 
> *vcpu, u8 vector)
>* the guest will set the CS and RIP. Set SW_EXIT_INFO_2 to a
>* non-zero value.
>*/
> - ghcb_set_sw_exit_info_2(svm->ghcb, 1);
> + if (svm->ghcb)
> + ghcb_set_sw_exit_info_2(svm->ghcb, 1);
>  }
> -- 
> 2.31.0
> 


Re: [PATCH 1/1] x86/kvm/svm: Implement support for PSFD

2021-04-07 Thread Sean Christopherson
On Wed, Apr 07, 2021, Ramakrishna Saripalli wrote:
> From: Ramakrishna Saripalli 
> 
> Expose Predictive Store Forwarding capability to guests.

Technically KVM is advertising the capability to userspace, e.g. userspace can
expose the feature to the guest without this patch.

> Guests enable or disable PSF via SPEC_CTRL MSR.

At a (very) quick glance, this requires extra enabling in 
guest_has_spec_ctrl_msr(),
otherwise a vCPU with PSF but not the existing features will not be able to set
MSR_IA32_SPEC_CTRL.PSFD.

That raises a question: should KVM do extra checks for PSFD on top of the "throw
noodles at the wall and see what sticks" approach of kvm_spec_ctrl_test_value()?
The noodle approach is there to handle the mess of cross-vendor features/bits,
but that doesn't seem to apply to PSFD.
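
For illustration only, the "extra enabling" alluded to above would be something
along these lines, assuming guest_has_spec_ctrl_msr() is the usual OR of the
CPUID bits that imply MSR_IA32_SPEC_CTRL exists (treat the exact helper body as
an assumption, not a quote of the source):

    static inline bool guest_has_spec_ctrl_msr(struct kvm_vcpu *vcpu)
    {
            return (guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) ||
                    guest_cpuid_has(vcpu, X86_FEATURE_AMD_STIBP) ||
                    guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS) ||
                    guest_cpuid_has(vcpu, X86_FEATURE_AMD_SSBD) ||
                    guest_cpuid_has(vcpu, X86_FEATURE_AMD_PSFD));
    }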

> Signed-off-by: Ramakrishna Saripalli 
> ---
>  arch/x86/kvm/cpuid.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 6bd2f8b830e4..9c4af0fef6d7 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -448,6 +448,8 @@ void kvm_set_cpu_caps(void)
>   kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
>   if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
>   kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> + if (boot_cpu_has(X86_FEATURE_AMD_PSFD))
> + kvm_cpu_cap_set(X86_FEATURE_AMD_PSFD);

This is unnecessary, it's handled by the F(AMD_PSFD).  The above features have
special handling to enumerate their Intel equivalent.

>   kvm_cpu_cap_mask(CPUID_7_1_EAX,
>   F(AVX_VNNI) | F(AVX512_BF16)
> @@ -482,7 +484,7 @@ void kvm_set_cpu_caps(void)
>   kvm_cpu_cap_mask(CPUID_8000_0008_EBX,
>   F(CLZERO) | F(XSAVEERPTR) |
>   F(WBNOINVD) | F(AMD_IBPB) | F(AMD_IBRS) | F(AMD_SSBD) | 
> F(VIRT_SSBD) |
> - F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON)
> + F(AMD_SSB_NO) | F(AMD_STIBP) | F(AMD_STIBP_ALWAYS_ON) | 
> F(AMD_PSFD)
>   );
>  
>   /*
> -- 
> 2.25.1
> 


Re: [PATCH v2 8/8] KVM: SVM: Allocate SEV command structures on local stack

2021-04-07 Thread Sean Christopherson
On Wed, Apr 07, 2021, Borislav Petkov wrote:
> First of all, I'd strongly suggest you trim your emails when you reply -
> that would be much appreciated.
> 
> On Wed, Apr 07, 2021 at 07:24:54AM +0200, Christophe Leroy wrote:
> > > @@ -258,7 +240,7 @@ static int sev_issue_cmd(struct kvm *kvm, int id, 
> > > void *data, int *error)
> > >   static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
> > >   {
> > >   struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> > > - struct sev_data_launch_start *start;
> > > + struct sev_data_launch_start start;
> > 
> > struct sev_data_launch_start start = {0, 0, 0, 0, 0, 0, 0};
> 
> I don't know how this is any better than using memset...
> 
> Also, you can do
> 
>   ... start = { };
> 
> which is certainly the only other alternative to memset, AFAIK.
> 
> But whatever you do, you need to look at the resulting asm the compiler
> generates. So let's do that:

I'm ok with Boris' version, I'm not a fan of having to count zeros.  I used
memset() to defer initialization until after the various sanity checks, and
out of habit.
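
For reference, a minimal sketch of the three zero-initialization styles under
discussion, using the same struct purely as an example; all three end up with a
zeroed structure, the differences are readability and when the zeroing happens
relative to the sanity checks:

    struct sev_data_launch_start a = {0, 0, 0, 0, 0, 0, 0};  /* positional zeros, must count fields */
    struct sev_data_launch_start b = { };                     /* empty initializer, zeroes everything */
    struct sev_data_launch_start c;

    /* ... sanity checks that may bail out early go here ... */

    memset(&c, 0, sizeof(c));                                 /* zeroing deferred until it's needed */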


[tip: x86/sgx] x86/sgx: Move provisioning device creation out of SGX driver

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: b3754e5d3da320af2bebb7a690002685c7f5c15c
Gitweb:
https://git.kernel.org/tip/b3754e5d3da320af2bebb7a690002685c7f5c15c
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:23:09 +13:00
Committer: Borislav Petkov 
CommitterDate: Tue, 06 Apr 2021 19:18:46 +02:00

x86/sgx: Move provisioning device creation out of SGX driver

And extract sgx_set_attribute() out of sgx_ioc_enclave_provision() and
export it as symbol for KVM to use.

The provisioning key is sensitive. The SGX driver only allows to create
an enclave which can access the provisioning key when the enclave
creator has permission to open /dev/sgx_provision. It should apply to
a VM as well, as the provisioning key is platform-specific, thus an
unrestricted VM can also potentially compromise the provisioning key.

Move the provisioning device creation out of sgx_drv_init() to
sgx_init() as a preparation for adding SGX virtualization support,
so that even if the SGX driver is not enabled due to flexible launch
control not being available, SGX virtualization can still be enabled,
and use it to restrict a VM's capability of being able to access the
provisioning key.

 [ bp: Massage commit message. ]

Signed-off-by: Sean Christopherson 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Reviewed-by: Jarkko Sakkinen 
Acked-by: Dave Hansen 
Link: 
https://lkml.kernel.org/r/0f4d044d621561f26d5f4ef73e8dc6cd18cc7e79.1616136308.git.kai.hu...@intel.com
---
 arch/x86/include/asm/sgx.h   |  3 ++-
 arch/x86/kernel/cpu/sgx/driver.c | 17 +-
 arch/x86/kernel/cpu/sgx/ioctl.c  | 16 +
 arch/x86/kernel/cpu/sgx/main.c   | 57 ++-
 4 files changed, 61 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
index 954042e..a16e2c9 100644
--- a/arch/x86/include/asm/sgx.h
+++ b/arch/x86/include/asm/sgx.h
@@ -372,4 +372,7 @@ int sgx_virt_einit(void __user *sigstruct, void __user 
*token,
   void __user *secs, u64 *lepubkeyhash, int *trapnr);
 #endif
 
+int sgx_set_attribute(unsigned long *allowed_attributes,
+ unsigned int attribute_fd);
+
 #endif /* _ASM_X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c
index 8ce6d83..aa9b8b8 100644
--- a/arch/x86/kernel/cpu/sgx/driver.c
+++ b/arch/x86/kernel/cpu/sgx/driver.c
@@ -136,10 +136,6 @@ static const struct file_operations sgx_encl_fops = {
.get_unmapped_area  = sgx_get_unmapped_area,
 };
 
-const struct file_operations sgx_provision_fops = {
-   .owner  = THIS_MODULE,
-};
-
 static struct miscdevice sgx_dev_enclave = {
.minor = MISC_DYNAMIC_MINOR,
.name = "sgx_enclave",
@@ -147,13 +143,6 @@ static struct miscdevice sgx_dev_enclave = {
.fops = &sgx_encl_fops,
 };
 
-static struct miscdevice sgx_dev_provision = {
-   .minor = MISC_DYNAMIC_MINOR,
-   .name = "sgx_provision",
-   .nodename = "sgx_provision",
-   .fops = &sgx_provision_fops,
-};
-
 int __init sgx_drv_init(void)
 {
unsigned int eax, ebx, ecx, edx;
@@ -187,11 +176,5 @@ int __init sgx_drv_init(void)
if (ret)
return ret;
 
-   ret = misc_register(&sgx_dev_provision);
-   if (ret) {
-   misc_deregister(&sgx_dev_enclave);
-   return ret;
-   }
-
return 0;
 }
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 7be9c06..83df20e 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -2,6 +2,7 @@
 /*  Copyright(c) 2016-20 Intel Corporation. */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -666,24 +667,11 @@ out:
 static long sgx_ioc_enclave_provision(struct sgx_encl *encl, void __user *arg)
 {
struct sgx_enclave_provision params;
-   struct file *file;
 
if (copy_from_user(&params, arg, sizeof(params)))
return -EFAULT;
 
-   file = fget(params.fd);
-   if (!file)
-   return -EINVAL;
-
-   if (file->f_op != &sgx_provision_fops) {
-   fput(file);
-   return -EINVAL;
-   }
-
-   encl->attributes_mask |= SGX_ATTR_PROVISIONKEY;
-
-   fput(file);
-   return 0;
+   return sgx_set_attribute(&encl->attributes_mask, params.fd);
 }
 
 long sgx_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
index 227f1e2..92cb11d 100644
--- a/arch/x86/kernel/cpu/sgx/main.c
+++ b/arch/x86/kernel/cpu/sgx/main.c
@@ -1,14 +1,17 @@
 // SPDX-License-Identifier: GPL-2.0
 /*  Copyright(c) 2016-20 Intel Corporation. */
 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include "driver.h"
 #include "encl.h"
 #include &quo

[tip: x86/sgx] x86/cpufeatures: Add SGX1 and SGX2 sub-features

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: b8921dccf3b25798409d35155b5d127085de72c2
Gitweb:
https://git.kernel.org/tip/b8921dccf3b25798409d35155b5d127085de72c2
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:22:18 +13:00
Committer: Borislav Petkov 
CommitterDate: Thu, 25 Mar 2021 17:33:11 +01:00

x86/cpufeatures: Add SGX1 and SGX2 sub-features

Add SGX1 and SGX2 feature flags, via CPUID.0x12.0x0.EAX, as scattered
features, since adding a new leaf for only two bits would be wasteful.
As part of virtualizing SGX, KVM will expose the SGX CPUID leafs to its
guest, and to do so correctly needs to query hardware and kernel support
for SGX1 and SGX2.

Suppress both SGX1 and SGX2 from /proc/cpuinfo. SGX1 basically means
SGX, and for SGX2 there is no concrete use case of using it in
/proc/cpuinfo.

Signed-off-by: Sean Christopherson 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Acked-by: Dave Hansen 
Acked-by: Jarkko Sakkinen 
Link: 
https://lkml.kernel.org/r/d787827dbfca6b3210ac3e432e3ac1202727e786.1616136308.git.kai.hu...@intel.com
---
 arch/x86/include/asm/cpufeatures.h | 2 ++
 arch/x86/kernel/cpu/cpuid-deps.c   | 2 ++
 arch/x86/kernel/cpu/scattered.c| 2 ++
 3 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index cc96e26..1f918f5 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -290,6 +290,8 @@
 #define X86_FEATURE_FENCE_SWAPGS_KERNEL(11*32+ 5) /* "" LFENCE in 
kernel entry SWAPGS path */
 #define X86_FEATURE_SPLIT_LOCK_DETECT  (11*32+ 6) /* #AC for split lock */
 #define X86_FEATURE_PER_THREAD_MBA (11*32+ 7) /* "" Per-thread Memory 
Bandwidth Allocation */
+#define X86_FEATURE_SGX1   (11*32+ 8) /* "" Basic SGX */
+#define X86_FEATURE_SGX2   (11*32+ 9) /* "" SGX Enclave Dynamic 
Memory Management (EDMM) */
 
 /* Intel-defined CPU features, CPUID level 0x0007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI   (12*32+ 4) /* AVX VNNI instructions */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index d40f8e0..defda61 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -73,6 +73,8 @@ static const struct cpuid_dep cpuid_deps[] = {
{ X86_FEATURE_ENQCMD,   X86_FEATURE_XSAVES},
{ X86_FEATURE_PER_THREAD_MBA,   X86_FEATURE_MBA   },
{ X86_FEATURE_SGX_LC,   X86_FEATURE_SGX   },
+   { X86_FEATURE_SGX1, X86_FEATURE_SGX   },
+   { X86_FEATURE_SGX2, X86_FEATURE_SGX1  },
{}
 };
 
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 972ec3b..21d1f06 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -36,6 +36,8 @@ static const struct cpuid_bit cpuid_bits[] = {
{ X86_FEATURE_CDP_L2,   CPUID_ECX,  2, 0x0010, 2 },
{ X86_FEATURE_MBA,  CPUID_EBX,  3, 0x0010, 0 },
{ X86_FEATURE_PER_THREAD_MBA,   CPUID_ECX,  0, 0x0010, 3 },
+   { X86_FEATURE_SGX1, CPUID_EAX,  0, 0x0012, 0 },
+   { X86_FEATURE_SGX2, CPUID_EAX,  1, 0x0012, 0 },
{ X86_FEATURE_HW_PSTATE,CPUID_EDX,  7, 0x8007, 0 },
{ X86_FEATURE_CPB,  CPUID_EDX,  9, 0x8007, 0 },
{ X86_FEATURE_PROC_FEEDBACK,CPUID_EDX, 11, 0x8007, 0 },


[tip: x86/sgx] x86/sgx: Introduce virtual EPC for use by KVM guests

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: 540745ddbc70eabdc7dbd3fcc00fe4fb17cd59ba
Gitweb:
https://git.kernel.org/tip/540745ddbc70eabdc7dbd3fcc00fe4fb17cd59ba
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:22:21 +13:00
Committer: Borislav Petkov 
CommitterDate: Tue, 06 Apr 2021 09:43:17 +02:00

x86/sgx: Introduce virtual EPC for use by KVM guests

Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw"
Enclave Page Cache (EPC) without an associated enclave. The intended
and only known use case for raw EPC allocation is to expose EPC to a
KVM guest, hence the 'vepc' moniker, virt.{c,h} files and X86_SGX_KVM
Kconfig.

The SGX driver uses the misc device /dev/sgx_enclave to support
userspace in creating an enclave. Each file descriptor returned from
opening /dev/sgx_enclave represents an enclave. Unlike the SGX driver,
KVM doesn't control how the guest uses the EPC, therefore EPC allocated
to a KVM guest is not associated with an enclave, and /dev/sgx_enclave
is not suitable for allocating EPC for a KVM guest.

Having separate device nodes for the SGX driver and KVM virtual EPC also
allows separate permission control for running host SGX enclaves and KVM
SGX guests.

To use /dev/sgx_vepc to allocate a virtual EPC instance with particular
size, the hypervisor opens /dev/sgx_vepc, and uses mmap() with the
intended size to get an address range of virtual EPC. Then it may use
the address range to create one KVM memory slot as virtual EPC for
a guest.
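
As a rough illustration of that flow from the hypervisor side, a hedged
userspace sketch (the device node and mmap() semantics are as described above;
the size and error handling are assumptions for the example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t epc_size = 16 * 1024 * 1024;     /* assumed guest EPC size */
            int fd = open("/dev/sgx_vepc", O_RDWR);
            void *epc;

            if (fd < 0) {
                    perror("open /dev/sgx_vepc");
                    return 1;
            }

            /* This mapping is what would back a KVM memory slot for guest EPC. */
            epc = mmap(NULL, epc_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (epc == MAP_FAILED) {
                    perror("mmap");
                    close(fd);
                    return 1;
            }

            printf("virtual EPC mapped at %p\n", epc);
            munmap(epc, epc_size);
            close(fd);
            return 0;
    }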

Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
/dev/sgx_vepc rather than in KVM. Doing so has two major advantages:

  - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
just another memory backend for guests.

  - EPC management is wholly contained in the SGX subsystem, e.g. SGX
does not have to export any symbols, changes to reclaim flows don't
need to be routed through KVM, SGX's dirty laundry doesn't have to
get aired out for the world to see, and so on and so forth.

The virtual EPC pages allocated to guests are currently not reclaimable.
Reclaiming an EPC page used by enclave requires a special reclaim
mechanism separate from normal page reclaim, and that mechanism is not
supported for virtual EPC pages. Due to the complications of handling
reclaim conflicts between guest and host, reclaiming virtual EPC pages
is significantly more complex than basic support for SGX virtualization.

 [ bp:
   - Massage commit message and comments
   - use cpu_feature_enabled()
   - vertically align struct members init
   - massage Virtual EPC clarification text
   - move Kconfig prompt to Virtualization ]

Signed-off-by: Sean Christopherson 
Co-developed-by: Kai Huang 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Acked-by: Dave Hansen 
Acked-by: Jarkko Sakkinen 
Link: 
https://lkml.kernel.org/r/0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.hu...@intel.com
---
 Documentation/x86/sgx.rst|  16 ++-
 arch/x86/kernel/cpu/sgx/Makefile |   1 +-
 arch/x86/kernel/cpu/sgx/sgx.h|   9 +-
 arch/x86/kernel/cpu/sgx/virt.c   | 259 ++-
 arch/x86/kvm/Kconfig |  12 +-
 5 files changed, 297 insertions(+)
 create mode 100644 arch/x86/kernel/cpu/sgx/virt.c

diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
index f90076e..dd0ac96 100644
--- a/Documentation/x86/sgx.rst
+++ b/Documentation/x86/sgx.rst
@@ -234,3 +234,19 @@ As a result, when this happpens, user should stop running 
any new
 SGX workloads, (or just any new workloads), and migrate all valuable
 workloads. Although a machine reboot can recover all EPC memory, the bug
 should be reported to Linux developers.
+
+
+Virtual EPC
+===
+
+The implementation has also a virtual EPC driver to support SGX enclaves
+in guests. Unlike the SGX driver, an EPC page allocated by the virtual
+EPC driver doesn't have a specific enclave associated with it. This is
+because KVM doesn't track how a guest uses EPC pages.
+
+As a result, the SGX core page reclaimer doesn't support reclaiming EPC
+pages allocated to KVM guests through the virtual EPC driver. If the
+user wants to deploy SGX applications both on the host and in guests
+on the same machine, the user should reserve enough EPC (by taking out
+total virtual EPC size of all SGX VMs from the physical EPC size) for
+host SGX applications so they can run with acceptable performance.
diff --git a/arch/x86/kernel/cpu/sgx/Makefile b/arch/x86/kernel/cpu/sgx/Makefile
index 91d3dc7..9c16567 100644
--- a/arch/x86/kernel/cpu/sgx/Makefile
+++ b/arch/x86/kernel/cpu/sgx/Makefile
@@ -3,3 +3,4 @@ obj-y += \
encl.o \
ioctl.o \
main.o
+obj-$(CONFIG_X86_SGX_KVM)  += virt.o
diff --git a/arch/x86/kernel/cpu/sgx/sgx.h b/arch/x86/kernel/cpu/sgx/sgx.h
index 4aa40c6..4854f39 100644
--- a/arch/x86/kernel/cpu/sgx/sgx.h
+++ b/arch/x86/

[tip: x86/sgx] x86/sgx: Add SGX_CHILD_PRESENT hardware error code

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: 231d3dbdda192e3b3c7b79f4c3b0616f6c7f31b7
Gitweb:
https://git.kernel.org/tip/231d3dbdda192e3b3c7b79f4c3b0616f6c7f31b7
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:22:20 +13:00
Committer: Borislav Petkov 
CommitterDate: Fri, 26 Mar 2021 22:51:36 +01:00

x86/sgx: Add SGX_CHILD_PRESENT hardware error code

SGX driver can accurately track how enclave pages are used.  This
enables SECS to be specifically targeted and EREMOVE'd only after all
child pages have been EREMOVE'd.  This ensures that SGX driver will
never encounter SGX_CHILD_PRESENT in normal operation.

Virtual EPC is different.  The host does not track how EPC pages are
used by the guest, so it cannot guarantee EREMOVE success.  It might,
for instance, encounter a SECS with a non-zero child count.

Add a definition of SGX_CHILD_PRESENT.  It will be used exclusively by
the SGX virtualization driver to handle recoverable EREMOVE errors when
sanitizing EPC pages after they are freed.

Signed-off-by: Sean Christopherson 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Acked-by: Dave Hansen 
Acked-by: Jarkko Sakkinen 
Link: 
https://lkml.kernel.org/r/050b198e882afde7e6eba8e6a0d4da39161dbb5a.1616136308.git.kai.hu...@intel.com
---
 arch/x86/kernel/cpu/sgx/arch.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/arch.h b/arch/x86/kernel/cpu/sgx/arch.h
index dd7602c..abf99bb 100644
--- a/arch/x86/kernel/cpu/sgx/arch.h
+++ b/arch/x86/kernel/cpu/sgx/arch.h
@@ -26,12 +26,14 @@
  * enum sgx_return_code - The return code type for ENCLS, ENCLU and ENCLV
  * %SGX_NOT_TRACKED:   Previous ETRACK's shootdown sequence has not
  * been completed yet.
+ * %SGX_CHILD_PRESENT  SECS has child pages present in the EPC.
  * %SGX_INVALID_EINITTOKEN:EINITTOKEN is invalid and enclave signer's
  * public key does not match IA32_SGXLEPUBKEYHASH.
  * %SGX_UNMASKED_EVENT:An unmasked event, e.g. INTR, was 
received
  */
 enum sgx_return_code {
SGX_NOT_TRACKED = 11,
+   SGX_CHILD_PRESENT   = 13,
SGX_INVALID_EINITTOKEN  = 16,
SGX_UNMASKED_EVENT  = 128,
 };
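
The commit message notes that SGX_CHILD_PRESENT is consumed when sanitizing EPC
pages freed by the virtual EPC driver; a hedged sketch of what such a consumer
might look like (the helper names and WARN policy here are assumptions, not the
exact upstream code):

    static int sgx_vepc_free_page(struct sgx_epc_page *epc_page)
    {
            int ret;

            /*
             * EREMOVE can legitimately fail with SGX_CHILD_PRESENT if the page
             * is a SECS whose child pages haven't been removed yet; any other
             * error is unexpected and worth a warning.
             */
            ret = __eremove(sgx_get_epc_virt_addr(epc_page));
            if (ret) {
                    WARN_ONCE(ret != SGX_CHILD_PRESENT,
                              "EREMOVE returned %d (0x%x)", ret, ret);
                    return ret;
            }

            sgx_free_epc_page(epc_page);
            return 0;
    }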


[tip: x86/sgx] x86/cpu/intel: Allow SGX virtualization without Launch Control support

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: 332bfc7becf479de8a55864cc5ed0024baea28aa
Gitweb:
https://git.kernel.org/tip/332bfc7becf479de8a55864cc5ed0024baea28aa
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:22:58 +13:00
Committer: Borislav Petkov 
CommitterDate: Tue, 06 Apr 2021 09:43:41 +02:00

x86/cpu/intel: Allow SGX virtualization without Launch Control support

The kernel will currently disable all SGX support if the hardware does
not support launch control.  Make it more permissive to allow SGX
virtualization on systems without Launch Control support.  This will
allow KVM to expose SGX to guests that have less-strict requirements on
the availability of flexible launch control.

Improve error message to distinguish between three cases.  There are two
cases where SGX support is completely disabled:
1) SGX has been disabled completely by the BIOS
2) SGX LC is locked by the BIOS.  Bare-metal support is disabled because
   of LC unavailability.  SGX virtualization is unavailable (because of
   Kconfig).
One where it is partially available:
3) SGX LC is locked by the BIOS.  Bare-metal support is disabled because
   of LC unavailability.  SGX virtualization is supported.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kai Huang 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Acked-by: Jarkko Sakkinen 
Acked-by: Dave Hansen 
Link: 
https://lkml.kernel.org/r/b3329777076509b3b601550da288c8f3c406a865.1616136308.git.kai.hu...@intel.com
---
 arch/x86/kernel/cpu/feat_ctl.c | 59 -
 1 file changed, 44 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/feat_ctl.c b/arch/x86/kernel/cpu/feat_ctl.c
index 27533a6..da696eb 100644
--- a/arch/x86/kernel/cpu/feat_ctl.c
+++ b/arch/x86/kernel/cpu/feat_ctl.c
@@ -104,8 +104,9 @@ early_param("nosgx", nosgx);
 
 void init_ia32_feat_ctl(struct cpuinfo_x86 *c)
 {
+   bool enable_sgx_kvm = false, enable_sgx_driver = false;
bool tboot = tboot_enabled();
-   bool enable_sgx;
+   bool enable_vmx;
u64 msr;
 
if (rdmsrl_safe(MSR_IA32_FEAT_CTL, &msr)) {
@@ -114,13 +115,19 @@ void init_ia32_feat_ctl(struct cpuinfo_x86 *c)
return;
}
 
-   /*
-* Enable SGX if and only if the kernel supports SGX and Launch Control
-* is supported, i.e. disable SGX if the LE hash MSRs can't be written.
-*/
-   enable_sgx = cpu_has(c, X86_FEATURE_SGX) &&
-cpu_has(c, X86_FEATURE_SGX_LC) &&
-IS_ENABLED(CONFIG_X86_SGX);
+   enable_vmx = cpu_has(c, X86_FEATURE_VMX) &&
+IS_ENABLED(CONFIG_KVM_INTEL);
+
+   if (cpu_has(c, X86_FEATURE_SGX) && IS_ENABLED(CONFIG_X86_SGX)) {
+   /*
+* Separate out SGX driver enabling from KVM.  This allows KVM
+* guests to use SGX even if the kernel SGX driver refuses to
+* use it.  This happens if flexible Launch Control is not
+* available.
+*/
+   enable_sgx_driver = cpu_has(c, X86_FEATURE_SGX_LC);
+   enable_sgx_kvm = enable_vmx && IS_ENABLED(CONFIG_X86_SGX_KVM);
+   }
 
if (msr & FEAT_CTL_LOCKED)
goto update_caps;
@@ -136,15 +143,18 @@ void init_ia32_feat_ctl(struct cpuinfo_x86 *c)
 * i.e. KVM is enabled, to avoid unnecessarily adding an attack vector
 * for the kernel, e.g. using VMX to hide malicious code.
 */
-   if (cpu_has(c, X86_FEATURE_VMX) && IS_ENABLED(CONFIG_KVM_INTEL)) {
+   if (enable_vmx) {
msr |= FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX;
 
if (tboot)
msr |= FEAT_CTL_VMX_ENABLED_INSIDE_SMX;
}
 
-   if (enable_sgx)
-   msr |= FEAT_CTL_SGX_ENABLED | FEAT_CTL_SGX_LC_ENABLED;
+   if (enable_sgx_kvm || enable_sgx_driver) {
+   msr |= FEAT_CTL_SGX_ENABLED;
+   if (enable_sgx_driver)
+   msr |= FEAT_CTL_SGX_LC_ENABLED;
+   }
 
wrmsrl(MSR_IA32_FEAT_CTL, msr);
 
@@ -167,10 +177,29 @@ update_caps:
}
 
 update_sgx:
-   if (!(msr & FEAT_CTL_SGX_ENABLED) ||
-   !(msr & FEAT_CTL_SGX_LC_ENABLED) || !enable_sgx) {
-   if (enable_sgx)
-   pr_err_once("SGX disabled by BIOS\n");
+   if (!(msr & FEAT_CTL_SGX_ENABLED)) {
+   if (enable_sgx_kvm || enable_sgx_driver)
+   pr_err_once("SGX disabled by BIOS.\n");
clear_cpu_cap(c, X86_FEATURE_SGX);
+   return;
+   }
+
+   /*
+* VMX feature bit may be cleared due to being disabled in BIOS,
+* in which case SGX virtualization cannot be supported either.
+*/
+   if (!cpu_has(c, X86_FEATURE_VMX) &&

[tip: x86/sgx] x86/sgx: Expose SGX architectural definitions to the kernel

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: 8ca52cc38dc8fdcbdbd0c23eafb19db5e5f5c8d0
Gitweb:
https://git.kernel.org/tip/8ca52cc38dc8fdcbdbd0c23eafb19db5e5f5c8d0
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:23:03 +13:00
Committer: Borislav Petkov 
CommitterDate: Tue, 06 Apr 2021 09:43:41 +02:00

x86/sgx: Expose SGX architectural definitions to the kernel

Expose SGX architectural structures, as KVM will use many of the
architectural constants and structs to virtualize SGX.

Name the new header file as asm/sgx.h, rather than asm/sgx_arch.h, to
have a single header to provide SGX facilities to share with other kernel
components. Also update MAINTAINERS to include asm/sgx.h.

Signed-off-by: Sean Christopherson 
Co-developed-by: Kai Huang 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Acked-by: Jarkko Sakkinen 
Acked-by: Dave Hansen 
Link: 
https://lkml.kernel.org/r/6bf47acd91ab4d709e66ad1692c7803e4c9063a0.1616136308.git.kai.hu...@intel.com
---
 MAINTAINERS   |   1 +-
 arch/x86/include/asm/sgx.h| 350 +-
 arch/x86/kernel/cpu/sgx/arch.h| 340 +
 arch/x86/kernel/cpu/sgx/encl.c|   2 +-
 arch/x86/kernel/cpu/sgx/sgx.h |   2 +-
 tools/testing/selftests/sgx/defines.h |   2 +-
 6 files changed, 354 insertions(+), 343 deletions(-)
 create mode 100644 arch/x86/include/asm/sgx.h
 delete mode 100644 arch/x86/kernel/cpu/sgx/arch.h

diff --git a/MAINTAINERS b/MAINTAINERS
index aa84121..0cb606a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9274,6 +9274,7 @@ Q:
https://patchwork.kernel.org/project/intel-sgx/list/
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/sgx
 F: Documentation/x86/sgx.rst
 F: arch/x86/entry/vdso/vsgx.S
+F: arch/x86/include/asm/sgx.h
 F: arch/x86/include/uapi/asm/sgx.h
 F: arch/x86/kernel/cpu/sgx/*
 F: tools/testing/selftests/sgx/*
diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
new file mode 100644
index 000..14bb5f7
--- /dev/null
+++ b/arch/x86/include/asm/sgx.h
@@ -0,0 +1,350 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/**
+ * Copyright(c) 2016-20 Intel Corporation.
+ *
+ * Intel Software Guard Extensions (SGX) support.
+ */
+#ifndef _ASM_X86_SGX_H
+#define _ASM_X86_SGX_H
+
+#include 
+#include 
+
+/*
+ * This file contains both data structures defined by SGX architecture and 
Linux
+ * defined software data structures and functions.  The two should not be mixed
+ * together for better readibility.  The architectural definitions come first.
+ */
+
+/* The SGX specific CPUID function. */
+#define SGX_CPUID  0x12
+/* EPC enumeration. */
+#define SGX_CPUID_EPC  2
+/* An invalid EPC section, i.e. the end marker. */
+#define SGX_CPUID_EPC_INVALID  0x0
+/* A valid EPC section. */
+#define SGX_CPUID_EPC_SECTION  0x1
+/* The bitmask for the EPC section type. */
+#define SGX_CPUID_EPC_MASK GENMASK(3, 0)
+
+/**
+ * enum sgx_return_code - The return code type for ENCLS, ENCLU and ENCLV
+ * %SGX_NOT_TRACKED:   Previous ETRACK's shootdown sequence has not
+ * been completed yet.
+ * %SGX_CHILD_PRESENT  SECS has child pages present in the EPC.
+ * %SGX_INVALID_EINITTOKEN:EINITTOKEN is invalid and enclave signer's
+ * public key does not match IA32_SGXLEPUBKEYHASH.
+ * %SGX_UNMASKED_EVENT:An unmasked event, e.g. INTR, was 
received
+ */
+enum sgx_return_code {
+   SGX_NOT_TRACKED = 11,
+   SGX_CHILD_PRESENT   = 13,
+   SGX_INVALID_EINITTOKEN  = 16,
+   SGX_UNMASKED_EVENT  = 128,
+};
+
+/* The modulus size for 3072-bit RSA keys. */
+#define SGX_MODULUS_SIZE 384
+
+/**
+ * enum sgx_miscselect - additional information to an SSA frame
+ * %SGX_MISC_EXINFO:   Report #PF or #GP to the SSA frame.
+ *
+ * Save State Area (SSA) is a stack inside the enclave used to store processor
+ * state when an exception or interrupt occurs. This enum defines additional
+ * information stored to an SSA frame.
+ */
+enum sgx_miscselect {
+   SGX_MISC_EXINFO = BIT(0),
+};
+
+#define SGX_MISC_RESERVED_MASK GENMASK_ULL(63, 1)
+
+#define SGX_SSA_GPRS_SIZE  184
+#define SGX_SSA_MISC_EXINFO_SIZE   16
+
+/**
+ * enum sgx_attributes - the attributes field in  sgx_secs
+ * %SGX_ATTR_INIT: Enclave can be entered (is initialized).
+ * %SGX_ATTR_DEBUG:Allow ENCLS(EDBGRD) and ENCLS(EDBGWR).
+ * %SGX_ATTR_MODE64BIT:Tell that this a 64-bit enclave.
+ * %SGX_ATTR_PROVISIONKEY:  Allow to use provisioning keys for remote
+ * attestation.
+ * %SGX_ATTR_KSS:  Allow to use key separation and sharing (KSS).
+ * %SGX_ATTR_EINITTOKENKEY:Allow to use token signing key that is used

[tip: x86/sgx] x86/sgx: Add SGX2 ENCLS leaf definitions (EAUG, EMODPR and EMODT)

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: 32ddda8e445df3de477db14d386fb3518042224a
Gitweb:
https://git.kernel.org/tip/32ddda8e445df3de477db14d386fb3518042224a
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:23:05 +13:00
Committer: Borislav Petkov 
CommitterDate: Tue, 06 Apr 2021 09:43:42 +02:00

x86/sgx: Add SGX2 ENCLS leaf definitions (EAUG, EMODPR and EMODT)

Define the ENCLS leafs that are available with SGX2, also referred to as
Enclave Dynamic Memory Management (EDMM).  The leafs will be used by KVM
to conditionally expose SGX2 capabilities to guests.

Signed-off-by: Sean Christopherson 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Acked-by: Jarkko Sakkinen 
Acked-by: Dave Hansen 
Link: 
https://lkml.kernel.org/r/5f0970c251ebcc6d5add132f0d750cc753b7060f.1616136308.git.kai.hu...@intel.com
---
 arch/x86/include/asm/sgx.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
index 34f4423..3b025af 100644
--- a/arch/x86/include/asm/sgx.h
+++ b/arch/x86/include/asm/sgx.h
@@ -40,6 +40,9 @@ enum sgx_encls_function {
EPA = 0x0A,
EWB = 0x0B,
ETRACK  = 0x0C,
+   EAUG= 0x0D,
+   EMODPR  = 0x0E,
+   EMODT   = 0x0F,
 };
 
 /**


[tip: x86/sgx] x86/sgx: Add encls_faulted() helper

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: a67136b458e5e63822b19c35794451122fe2bf3e
Gitweb:
https://git.kernel.org/tip/a67136b458e5e63822b19c35794451122fe2bf3e
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:23:06 +13:00
Committer: Borislav Petkov 
CommitterDate: Tue, 06 Apr 2021 09:43:42 +02:00

x86/sgx: Add encls_faulted() helper

Add a helper to extract the fault indicator from an encoded ENCLS return
value.  SGX virtualization will also need to detect ENCLS faults.

Signed-off-by: Sean Christopherson 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Acked-by: Jarkko Sakkinen 
Acked-by: Dave Hansen 
Link: 
https://lkml.kernel.org/r/c1f955898110de2f669da536fc6cf62e003dff88.1616136308.git.kai.hu...@intel.com
---
 arch/x86/kernel/cpu/sgx/encls.h | 15 ++-
 arch/x86/kernel/cpu/sgx/ioctl.c |  2 +-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encls.h b/arch/x86/kernel/cpu/sgx/encls.h
index be5c496..9b20484 100644
--- a/arch/x86/kernel/cpu/sgx/encls.h
+++ b/arch/x86/kernel/cpu/sgx/encls.h
@@ -40,6 +40,19 @@
} while (0);  \
 }
 
+/*
+ * encls_faulted() - Check if an ENCLS leaf faulted given an error code
+ * @ret:   the return value of an ENCLS leaf function call
+ *
+ * Return:
+ * - true: ENCLS leaf faulted.
+ * - false:Otherwise.
+ */
+static inline bool encls_faulted(int ret)
+{
+   return ret & ENCLS_FAULT_FLAG;
+}
+
 /**
  * encls_failed() - Check if an ENCLS function failed
  * @ret:   the return value of an ENCLS function call
@@ -50,7 +63,7 @@
  */
 static inline bool encls_failed(int ret)
 {
-   if (ret & ENCLS_FAULT_FLAG)
+   if (encls_faulted(ret))
return ENCLS_TRAPNR(ret) != X86_TRAP_PF;
 
return !!ret;
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 354e309..11e3f96 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -568,7 +568,7 @@ static int sgx_encl_init(struct sgx_encl *encl, struct 
sgx_sigstruct *sigstruct,
}
}
 
-   if (ret & ENCLS_FAULT_FLAG) {
+   if (encls_faulted(ret)) {
if (encls_failed(ret))
ENCLS_WARN(ret, "EINIT");
 


[tip: x86/sgx] x86/sgx: Move ENCLS leaf definitions to sgx.h

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: 9c55c78a73ce6e62a1d46ba6e4f242c23c29b812
Gitweb:
https://git.kernel.org/tip/9c55c78a73ce6e62a1d46ba6e4f242c23c29b812
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:23:04 +13:00
Committer: Borislav Petkov 
CommitterDate: Tue, 06 Apr 2021 09:43:41 +02:00

x86/sgx: Move ENCLS leaf definitions to sgx.h

Move the ENCLS leaf definitions to sgx.h so that they can be used by
KVM.

Signed-off-by: Sean Christopherson 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Acked-by: Jarkko Sakkinen 
Acked-by: Dave Hansen 
Link: 
https://lkml.kernel.org/r/2e6cd7c5c1ced620cfcd292c3c6c382827fde6b2.1616136308.git.kai.hu...@intel.com
---
 arch/x86/include/asm/sgx.h  | 15 +++
 arch/x86/kernel/cpu/sgx/encls.h | 15 ---
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
index 14bb5f7..34f4423 100644
--- a/arch/x86/include/asm/sgx.h
+++ b/arch/x86/include/asm/sgx.h
@@ -27,6 +27,21 @@
 /* The bitmask for the EPC section type. */
 #define SGX_CPUID_EPC_MASK GENMASK(3, 0)
 
+enum sgx_encls_function {
+   ECREATE = 0x00,
+   EADD= 0x01,
+   EINIT   = 0x02,
+   EREMOVE = 0x03,
+   EDGBRD  = 0x04,
+   EDGBWR  = 0x05,
+   EEXTEND = 0x06,
+   ELDU= 0x08,
+   EBLOCK  = 0x09,
+   EPA = 0x0A,
+   EWB = 0x0B,
+   ETRACK  = 0x0C,
+};
+
 /**
  * enum sgx_return_code - The return code type for ENCLS, ENCLU and ENCLV
  * %SGX_NOT_TRACKED:   Previous ETRACK's shootdown sequence has not
diff --git a/arch/x86/kernel/cpu/sgx/encls.h b/arch/x86/kernel/cpu/sgx/encls.h
index 443188f..be5c496 100644
--- a/arch/x86/kernel/cpu/sgx/encls.h
+++ b/arch/x86/kernel/cpu/sgx/encls.h
@@ -11,21 +11,6 @@
 #include 
 #include "sgx.h"
 
-enum sgx_encls_function {
-   ECREATE = 0x00,
-   EADD= 0x01,
-   EINIT   = 0x02,
-   EREMOVE = 0x03,
-   EDGBRD  = 0x04,
-   EDGBWR  = 0x05,
-   EEXTEND = 0x06,
-   ELDU= 0x08,
-   EBLOCK  = 0x09,
-   EPA = 0x0A,
-   EWB = 0x0B,
-   ETRACK  = 0x0C,
-};
-
 /**
  * ENCLS_FAULT_FLAG - flag signifying an ENCLS return code is a trapnr
  *


[tip: x86/sgx] x86/sgx: Add helpers to expose ECREATE and EINIT to KVM

2021-04-07 Thread tip-bot2 for Sean Christopherson
The following commit has been merged into the x86/sgx branch of tip:

Commit-ID: d155030b1e7c0e448aab22a803f7a71ea2e117d7
Gitweb:
https://git.kernel.org/tip/d155030b1e7c0e448aab22a803f7a71ea2e117d7
Author:Sean Christopherson 
AuthorDate:Fri, 19 Mar 2021 20:23:08 +13:00
Committer: Borislav Petkov 
CommitterDate: Tue, 06 Apr 2021 19:18:27 +02:00

x86/sgx: Add helpers to expose ECREATE and EINIT to KVM

The host kernel must intercept ECREATE to impose policies on guests, and
intercept EINIT to be able to write guest's virtual SGX_LEPUBKEYHASH MSR
values to hardware before running guest's EINIT so it can run correctly
according to hardware behavior.

Provide wrappers around __ecreate() and __einit() to hide the ugliness
of overloading the ENCLS return value to encode multiple error formats
in a single int.  KVM will trap-and-execute ECREATE and EINIT as part
of SGX virtualization, and reflect ENCLS execution result to guest by
setting up guest's GPRs, or on an exception, injecting the correct fault
based on return value of __ecreate() and __einit().

Use host userspace addresses (provided by KVM based on guest physical
address of ENCLS parameters) to execute ENCLS/EINIT when possible.
Accesses to both EPC and memory originating from ENCLS are subject to
segmentation and paging mechanisms.  It's also possible to generate
kernel mappings for ENCLS parameters by resolving PFN but using
__uaccess_xx() is simpler.

 [ bp: Return early if the __user memory accesses fail, use
   cpu_feature_enabled(). ]

Signed-off-by: Sean Christopherson 
Signed-off-by: Kai Huang 
Signed-off-by: Borislav Petkov 
Acked-by: Jarkko Sakkinen 
Link: 
https://lkml.kernel.org/r/20e09daf559aa5e9e680a0b4b5fba940f1bad86e.1616136308.git.kai.hu...@intel.com
---
 arch/x86/include/asm/sgx.h |   7 ++-
 arch/x86/kernel/cpu/sgx/virt.c | 117 -
 2 files changed, 124 insertions(+)

diff --git a/arch/x86/include/asm/sgx.h b/arch/x86/include/asm/sgx.h
index 3b025af..954042e 100644
--- a/arch/x86/include/asm/sgx.h
+++ b/arch/x86/include/asm/sgx.h
@@ -365,4 +365,11 @@ struct sgx_sigstruct {
  * comment!
  */
 
+#ifdef CONFIG_X86_SGX_KVM
+int sgx_virt_ecreate(struct sgx_pageinfo *pageinfo, void __user *secs,
+int *trapnr);
+int sgx_virt_einit(void __user *sigstruct, void __user *token,
+  void __user *secs, u64 *lepubkeyhash, int *trapnr);
+#endif
+
 #endif /* _ASM_X86_SGX_H */
diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c
index 259cc46..7d221ea 100644
--- a/arch/x86/kernel/cpu/sgx/virt.c
+++ b/arch/x86/kernel/cpu/sgx/virt.c
@@ -257,3 +257,120 @@ int __init sgx_vepc_init(void)
 
return misc_register(_vepc_dev);
 }
+
+/**
+ * sgx_virt_ecreate() - Run ECREATE on behalf of guest
+ * @pageinfo:  Pointer to PAGEINFO structure
+ * @secs:  Userspace pointer to SECS page
+ * @trapnr:trap number injected to guest in case of ECREATE error
+ *
+ * Run ECREATE on behalf of guest after KVM traps ECREATE for the purpose
+ * of enforcing policies of guest's enclaves, and return the trap number
+ * which should be injected to guest in case of any ECREATE error.
+ *
+ * Return:
+ * -  0:   ECREATE was successful.
+ * - <0:   on error.
+ */
+int sgx_virt_ecreate(struct sgx_pageinfo *pageinfo, void __user *secs,
+int *trapnr)
+{
+   int ret;
+
+   /*
+* @secs is an untrusted, userspace-provided address.  It comes from
+* KVM and is assumed to be a valid pointer which points somewhere in
+* userspace.  This can fault and call SGX or other fault handlers when
+* userspace mapping @secs doesn't exist.
+*
+* Add a WARN() to make sure @secs is already valid userspace pointer
+* from caller (KVM), who should already have handled invalid pointer
+* case (for instance, made by malicious guest).  All other checks,
+* such as alignment of @secs, are deferred to ENCLS itself.
+*/
+   if (WARN_ON_ONCE(!access_ok(secs, PAGE_SIZE)))
+   return -EINVAL;
+
+   __uaccess_begin();
+   ret = __ecreate(pageinfo, (void *)secs);
+   __uaccess_end();
+
+   if (encls_faulted(ret)) {
+   *trapnr = ENCLS_TRAPNR(ret);
+   return -EFAULT;
+   }
+
+   /* ECREATE doesn't return an error code, it faults or succeeds. */
+   WARN_ON_ONCE(ret);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(sgx_virt_ecreate);
+
+static int __sgx_virt_einit(void __user *sigstruct, void __user *token,
+   void __user *secs)
+{
+   int ret;
+
+   /*
+* Make sure all userspace pointers from caller (KVM) are valid.
+* All other checks deferred to ENCLS itself.  Also see comment
+* for @secs in sgx_virt_ecreate().
+*/
+#define SGX_EINITTOKEN_SIZE304
+   if (WARN_ON_ONCE(!access_ok(sigstruct, sizeof(struct sgx_sigstr

Re: [RFC PATCH] KVM: x86: Support write protect huge pages lazily

2021-04-06 Thread Sean Christopherson
+Ben

On Tue, Apr 06, 2021, Keqian Zhu wrote:
> Hi Paolo,
> 
> I plan to rework this patch and do full test. What do you think about this 
> idea
> (enable dirty logging for huge pages lazily)?

Ben, don't you also have something similar (or maybe the exact opposite?) in the
hopper?  This sounds very familiar, but I can't quite connect the dots that are
floating around my head...
 
> PS: As dirty log of TDP MMU has been supported, I should add more code.
> 
> On 2020/8/28 16:11, Keqian Zhu wrote:
> > Currently, when enabling dirty logging with init-all-set, we just
> > write protect huge pages and leave normal pages untouched, so that
> > we can enable dirty logging for those pages lazily.
> > 
> > It seems that enabling dirty logging lazily for huge pages is feasible
> > too, which not only reduces the time to start dirty logging, but also
> > greatly reduces the side effects on the guest when there is a high
> > dirty rate.
> > 
> > (These codes are not tested, for RFC purpose :-) ).
> > 
> > Signed-off-by: Keqian Zhu 
> > ---
> >  arch/x86/include/asm/kvm_host.h |  3 +-
> >  arch/x86/kvm/mmu/mmu.c  | 65 ++---
> >  arch/x86/kvm/vmx/vmx.c  |  3 +-
> >  arch/x86/kvm/x86.c  | 22 +--
> >  4 files changed, 62 insertions(+), 31 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 5303dbc5c9bc..201a068cf43d 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1296,8 +1296,7 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
> > accessed_mask,
> >  
> >  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
> >  void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> > - struct kvm_memory_slot *memslot,
> > - int start_level);
> > + struct kvm_memory_slot *memslot);
> >  void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >const struct kvm_memory_slot *memslot);
> >  void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 43fdb0c12a5d..4b7d577de6cd 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -1625,14 +1625,45 @@ static bool __rmap_set_dirty(struct kvm *kvm, 
> > struct kvm_rmap_head *rmap_head)
> >  }
> >  
> >  /**
> > - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
> > + * kvm_mmu_write_protect_largepage_masked - write protect selected 
> > largepages
> >   * @kvm: kvm instance
> >   * @slot: slot to protect
> >   * @gfn_offset: start of the BITS_PER_LONG pages we care about
> >   * @mask: indicates which pages we should protect
> >   *
> > - * Used when we do not need to care about huge page mappings: e.g. during 
> > dirty
> > - * logging we do not have any such mappings.
> > + * @ret: true if all pages are write protected
> > + */
> > +static bool kvm_mmu_write_protect_largepage_masked(struct kvm *kvm,
> > +   struct kvm_memory_slot *slot,
> > +   gfn_t gfn_offset, unsigned long mask)
> > +{
> > +   struct kvm_rmap_head *rmap_head;
> > +   bool protected, all_protected;
> > +   gfn_t start_gfn = slot->base_gfn + gfn_offset;
> > +   int i;
> > +
> > +   all_protected = true;
> > +   while (mask) {
> > +   protected = false;
> > +   for (i = PG_LEVEL_2M; i <= KVM_MAX_HUGEPAGE_LEVEL; ++i) {
> > +   rmap_head = __gfn_to_rmap(start_gfn + __ffs(mask), i, 
> > slot);
> > +   protected |= __rmap_write_protect(kvm, rmap_head, false);
> > +   }
> > +
> > +   all_protected &= protected;
> > +   /* clear the first set bit */
> > +   mask &= mask - 1;
> > +   }
> > +
> > +   return all_protected;
> > +}
> > +
> > +/**
> > + * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
> > + * @kvm: kvm instance
> > + * @slot: slot to protect
> > + * @gfn_offset: start of the BITS_PER_LONG pages we care about
> > + * @mask: indicates which pages we should protect
> >   */
> >  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> >  struct kvm_memory_slot *slot,
> > @@ -1679,18 +1710,25 @@ EXPORT_SYMBOL_GPL(kvm_mmu_clear_dirty_pt_masked);
> >  
> >  /**
> >   * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for 
> > selected
> > - * PT level pages.
> > - *
> > - * It calls kvm_mmu_write_protect_pt_masked to write protect selected 
> > pages to
> > - * enable dirty logging for them.
> > - *
> > - * Used when we do not need to care about huge page mappings: e.g. during 
> > dirty
> > - * logging we do not have any such mappings.
> > + * dirty pages.
> >   */
> >  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> > struct kvm_memory_slot *slot,
> >

Re: [PATCH] KVM: MMU: protect TDP MMU pages only down to required level

2021-04-06 Thread Sean Christopherson
On Tue, Apr 06, 2021, Keqian Zhu wrote:
> Hi Paolo,
> 
> I'm just going to fix this issue, and found that you have done this ;-)

Ha, and meanwhile I'm having a serious case of deja vu[1].  It even received a
variant of the magic "Queued, thanks"[2].  Doesn't appear in either of the 5.12
pull requests though, must have gotten lost along the way.

[1] https://lkml.kernel.org/r/20210213005015.1651772-3-sea...@google.com
[2] https://lkml.kernel.org/r/b5ab72f2-970f-64bd-891c-48f1c3035...@redhat.com

> Please feel free to add:
> 
> Reviewed-by: Keqian Zhu 
> 
> Thanks,
> Keqian
> 
> On 2021/4/2 20:17, Paolo Bonzini wrote:
> > When using manual protection of dirty pages, it is not necessary
> > to protect nested page tables down to the 4K level; instead KVM
> > can protect only hugepages in order to split them lazily, and
> > delay write protection at 4K-granularity until KVM_CLEAR_DIRTY_LOG.
> > This was overlooked in the TDP MMU, so do it there as well.
> > 
> > Fixes: a6a0b05da9f37 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
> > Cc: Ben Gardon 
> > Signed-off-by: Paolo Bonzini 
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index efb41f31e80a..0d92a269c5fa 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5538,7 +5538,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> > flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
> > start_level, KVM_MAX_HUGEPAGE_LEVEL, false);
> > if (is_tdp_mmu_enabled(kvm))
> > -   flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, PG_LEVEL_4K);
> > +   flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
> > write_unlock(&kvm->mmu_lock);
> >  
> > /*
> > 


[PATCH v2 8/8] KVM: SVM: Allocate SEV command structures on local stack

2021-04-06 Thread Sean Christopherson
Use the local stack to "allocate" the structures used to communicate with
the PSP.  The largest struct used by KVM, sev_data_launch_secret, clocks
in at 52 bytes, well within the realm of reasonable stack usage.  The
smallest structs are a mere 4 bytes, i.e. the pointer for the allocation
is larger than the allocation itself.

Now that the PSP driver plays nice with vmalloc pointers, putting the
data on a virtually mapped stack (CONFIG_VMAP_STACK=y) will not cause
explosions.

Cc: Brijesh Singh 
Cc: Tom Lendacky 
Signed-off-by: Sean Christopherson 
---
 arch/x86/kvm/svm/sev.c | 262 +++--
 1 file changed, 96 insertions(+), 166 deletions(-)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 5457138c7347..316fd39c7aef 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -150,35 +150,22 @@ static void sev_asid_free(int asid)
 
 static void sev_unbind_asid(struct kvm *kvm, unsigned int handle)
 {
-   struct sev_data_decommission *decommission;
-   struct sev_data_deactivate *data;
+   struct sev_data_decommission decommission;
+   struct sev_data_deactivate deactivate;
 
if (!handle)
return;
 
-   data = kzalloc(sizeof(*data), GFP_KERNEL);
-   if (!data)
-   return;
-
-   /* deactivate handle */
-   data->handle = handle;
+   deactivate.handle = handle;
 
/* Guard DEACTIVATE against WBINVD/DF_FLUSH used in ASID recycling */
down_read(&sev_deactivate_lock);
-   sev_guest_deactivate(data, NULL);
+   sev_guest_deactivate(&deactivate, NULL);
up_read(&sev_deactivate_lock);
 
-   kfree(data);
-
-   decommission = kzalloc(sizeof(*decommission), GFP_KERNEL);
-   if (!decommission)
-   return;
-
/* decommission handle */
-   decommission->handle = handle;
-   sev_guest_decommission(decommission, NULL);
-
-   kfree(decommission);
+   decommission.handle = handle;
+   sev_guest_decommission(&decommission, NULL);
 }
 
 static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp)
@@ -216,19 +203,14 @@ static int sev_guest_init(struct kvm *kvm, struct 
kvm_sev_cmd *argp)
 
 static int sev_bind_asid(struct kvm *kvm, unsigned int handle, int *error)
 {
-   struct sev_data_activate *data;
+   struct sev_data_activate activate;
int asid = sev_get_asid(kvm);
int ret;
 
-   data = kzalloc(sizeof(*data), GFP_KERNEL_ACCOUNT);
-   if (!data)
-   return -ENOMEM;
-
/* activate ASID on the given handle */
-   data->handle = handle;
-   data->asid   = asid;
-   ret = sev_guest_activate(data, error);
-   kfree(data);
+   activate.handle = handle;
+   activate.asid   = asid;
+   ret = sev_guest_activate(&activate, error);
 
return ret;
 }
@@ -258,7 +240,7 @@ static int sev_issue_cmd(struct kvm *kvm, int id, void 
*data, int *error)
 static int sev_launch_start(struct kvm *kvm, struct kvm_sev_cmd *argp)
 {
struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
-   struct sev_data_launch_start *start;
+   struct sev_data_launch_start start;
struct kvm_sev_launch_start params;
void *dh_blob, *session_blob;
int *error = &argp->error;
@@ -270,20 +252,16 @@ static int sev_launch_start(struct kvm *kvm, struct 
kvm_sev_cmd *argp)
if (copy_from_user(&params, (void __user *)(uintptr_t)argp->data, 
sizeof(params)))
return -EFAULT;
 
-   start = kzalloc(sizeof(*start), GFP_KERNEL_ACCOUNT);
-   if (!start)
-   return -ENOMEM;
+   memset(&start, 0, sizeof(start));
 
dh_blob = NULL;
if (params.dh_uaddr) {
dh_blob = psp_copy_user_blob(params.dh_uaddr, params.dh_len);
-   if (IS_ERR(dh_blob)) {
-   ret = PTR_ERR(dh_blob);
-   goto e_free;
-   }
+   if (IS_ERR(dh_blob))
+   return PTR_ERR(dh_blob);
 
-   start->dh_cert_address = __sme_set(__pa(dh_blob));
-   start->dh_cert_len = params.dh_len;
+   start.dh_cert_address = __sme_set(__pa(dh_blob));
+   start.dh_cert_len = params.dh_len;
}
 
session_blob = NULL;
@@ -294,40 +272,38 @@ static int sev_launch_start(struct kvm *kvm, struct 
kvm_sev_cmd *argp)
goto e_free_dh;
}
 
-   start->session_address = __sme_set(__pa(session_blob));
-   start->session_len = params.session_len;
+   start.session_address = __sme_set(__pa(session_blob));
+   start.session_len = params.session_len;
}
 
-   start->handle = params.handle;
-   start->policy = params.policy;
+   start.handle = params.handle;
+   start.policy = params.policy;
 
/* create memory encryption context */
-   ret = __sev_issue_cmd(argp->sev_

[PATCH v2 7/8] crypto: ccp: Use the stack and common buffer for INIT command

2021-04-06 Thread Sean Christopherson
Drop the dedicated init_cmd_buf and instead use a local variable.  Now
that the low level helper uses an internal buffer for all commands,
using the stack for the upper layers is safe even when running with
CONFIG_VMAP_STACK=y.

Signed-off-by: Sean Christopherson 
---
 drivers/crypto/ccp/sev-dev.c | 10 ++
 drivers/crypto/ccp/sev-dev.h |  1 -
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index e54774b0d637..9ff28df03030 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -233,6 +233,7 @@ static int sev_do_cmd(int cmd, void *data, int *psp_ret)
 static int __sev_platform_init_locked(int *error)
 {
struct psp_device *psp = psp_master;
+   struct sev_data_init data;
struct sev_device *sev;
int rc = 0;
 
@@ -244,6 +245,7 @@ static int __sev_platform_init_locked(int *error)
if (sev->state == SEV_STATE_INIT)
return 0;
 
+   memset(&data, 0, sizeof(data));
if (sev_es_tmr) {
u64 tmr_pa;
 
@@ -253,12 +255,12 @@ static int __sev_platform_init_locked(int *error)
 */
tmr_pa = __pa(sev_es_tmr);
 
-   sev->init_cmd_buf.flags |= SEV_INIT_FLAGS_SEV_ES;
-   sev->init_cmd_buf.tmr_address = tmr_pa;
-   sev->init_cmd_buf.tmr_len = SEV_ES_TMR_SIZE;
+   data.flags |= SEV_INIT_FLAGS_SEV_ES;
+   data.tmr_address = tmr_pa;
+   data.tmr_len = SEV_ES_TMR_SIZE;
}
 
-   rc = __sev_do_cmd_locked(SEV_CMD_INIT, &sev->init_cmd_buf, error);
+   rc = __sev_do_cmd_locked(SEV_CMD_INIT, &data, error);
if (rc)
return rc;
 
diff --git a/drivers/crypto/ccp/sev-dev.h b/drivers/crypto/ccp/sev-dev.h
index 0fd21433f627..666c21eb81ab 100644
--- a/drivers/crypto/ccp/sev-dev.h
+++ b/drivers/crypto/ccp/sev-dev.h
@@ -46,7 +46,6 @@ struct sev_device {
unsigned int int_rcvd;
wait_queue_head_t int_queue;
struct sev_misc_dev *misc;
-   struct sev_data_init init_cmd_buf;
 
u8 api_major;
u8 api_minor;
-- 
2.31.0.208.g409f899ff0-goog


