Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-24 Thread Madhavan T. Venkataraman



On 5/5/23 12:31, Sean Christopherson wrote:
> On Fri, May 05, 2023, Micka�l Sala�n wrote:
>>
>> On 05/05/2023 18:28, Sean Christopherson wrote:
>>> I have no doubt that we'll need to solve performance and scaling issues 
>>> with the
>>> memory attributes implementation, e.g. to utilize xarray multi-range support
>>> instead of storing information on a per-4KiB-page basis, but AFAICT, the 
>>> core
>>> idea is sound.  And a very big positive from a maintenance perspective is 
>>> that
>>> any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should 
>>> also
>>> benefit the other use case.
>>>
>>> [1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
>>> [2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
>>> [3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com
>>
>> I agree, I used this mechanism because it was easier at first to rely on a
>> previous work, but while I was working on the MBEC support, I realized that
>> it's not the optimal way to do it.
>>
>> I was thinking about using a new special EPT bit similar to
>> EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you
>> think?
> 
> On x86, SPTEs are even more ephemeral than memslots.  E.g. for historical 
> reasons,
> KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if the 
> guest
> is moving around BARs, using option ROMs, etc.
> 
> ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to
> otrack attributes, but that works only because pKVM is more privileged than 
> the
> host kernel, and the shared vs. private memory attribute that pKVM cares about
> is very, very restricted in how it can be used and changed.
> 
> I tried shoehorning private vs. shared metadata into x86's SPTEs in the past, 
> and
> it ended up being a constant battle with the kernel, e.g. page migration, and 
> with
> KVM itself, e.g. the above memslot mess.

Sorry for the delay in responding to this. I wanted to study the KVM code and 
fully
understand your comment before responding.

Yes, I quite agree with you. I will make an attempt to address this in the next 
version.
I am working on it right now.

Thanks.

Madhavan



Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Micka�l Sala�n wrote:
> 
> On 05/05/2023 18:28, Sean Christopherson wrote:
> > I have no doubt that we'll need to solve performance and scaling issues 
> > with the
> > memory attributes implementation, e.g. to utilize xarray multi-range support
> > instead of storing information on a per-4KiB-page basis, but AFAICT, the 
> > core
> > idea is sound.  And a very big positive from a maintenance perspective is 
> > that
> > any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should 
> > also
> > benefit the other use case.
> > 
> > [1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
> > [2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
> > [3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com
> 
> I agree, I used this mechanism because it was easier at first to rely on a
> previous work, but while I was working on the MBEC support, I realized that
> it's not the optimal way to do it.
> 
> I was thinking about using a new special EPT bit similar to
> EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you
> think?

On x86, SPTEs are even more ephemeral than memslots.  E.g. for historical 
reasons,
KVM zaps all SPTEs if _any_ memslot is deleted, which is problematic if the 
guest
is moving around BARs, using option ROMs, etc.

ARM's pKVM tracks metadata in its stage-2 PTEs, i.e. doesn't need an xarray to
otrack attributes, but that works only because pKVM is more privileged than the
host kernel, and the shared vs. private memory attribute that pKVM cares about
is very, very restricted in how it can be used and changed.

I tried shoehorning private vs. shared metadata into x86's SPTEs in the past, 
and
it ended up being a constant battle with the kernel, e.g. page migration, and 
with
KVM itself, e.g. the above memslot mess.



Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Mickaël Salaün



On 05/05/2023 18:28, Sean Christopherson wrote:

On Fri, May 05, 2023, Micka�l Sala�n wrote:

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..a7fb4ff888e6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -3,6 +3,7 @@
  #define _ASM_X86_KVM_PAGE_TRACK_H
  
  enum kvm_page_track_mode {

+   KVM_PAGE_TRACK_PREWRITE,


Heh, just when I decide to finally kill off support for multiple modes[1] :-)

My assessment from that changelog still holds true for this case:

   Drop "support" for multiple page-track modes, as there is no evidence
   that array-based and refcounted metadata is the optimal solution for
   other modes, nor is there any evidence that other use cases, e.g. for
   access-tracking, will be a good fit for the page-track machinery in
   general.
   
   E.g. one potential use case of access-tracking would be to prevent guest

   access to poisoned memory (from the guest's perspective).  In that case,
   the number of poisoned pages is likely to be a very small percentage of
   the guest memory, and there is no need to reference count the number of
   access-tracking users, i.e. expanding gfn_track[] for a new mode would be
   grossly inefficient.  And for poisoned memory, host userspace would also
   likely want to trap accesses, e.g. to inject #MC into the guest, and that
   isn't currently supported by the page-track framework.
   
   A better alternative for that poisoned page use case is likely a

   variation of the proposed per-gfn attributes overlay (linked), which
   would allow efficiently tracking the sparse set of poisoned pages, and by
   default would exit to userspace on access.

Of particular relevance:

   - Using the page-track machinery is inefficient because the guest is likely
 going to write-protect a minority of its memory.  And this

   select KVM_EXTERNAL_WRITE_TRACKING if KVM

 is particularly nasty because simply enabling HEKI in the Kconfig will 
cause
 KVM to allocate rmaps and gfn tracking.

   - There's no need to reference count the protection, i.e. 15 of the 16 bits 
of
 gfn_track are dead weight.

   - As proposed, adding a second "mode" would double the cost of gfn tracking.

   - Tying the protections to the memslots will create an impossible-to-maintain
 ABI because the protections will be lost if the owning memslot is deleted 
and
 recreated.

   - The page-track framework provides incomplete protection and will lead to an
 ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by
 adding calls to kvm_page_track_prewrite(), but misses things like 
kvm_vcpu_map().

   - The scaling and maintenance issues will only get worse if/when someone 
tries
 to support dropping read and/or execute permissions, e.g. for execute-only.

   - The code is x86-only, and is likely to stay that way for the foreseeable
 future.

The proposed alternative is to piggyback the memory attributes implementation[2]
that is being added (if all goes according to plan) for confidential VMs.  This
use case (dropping permissions) came up not too long ago[3], which is why I have
a ready-made answer).

I have no doubt that we'll need to solve performance and scaling issues with the
memory attributes implementation, e.g. to utilize xarray multi-range support
instead of storing information on a per-4KiB-page basis, but AFAICT, the core
idea is sound.  And a very big positive from a maintenance perspective is that
any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
benefit the other use case.

[1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
[2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
[3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com


I agree, I used this mechanism because it was easier at first to rely on 
a previous work, but while I was working on the MBEC support, I realized 
that it's not the optimal way to do it.


I was thinking about using a new special EPT bit similar to 
EPT_SPTE_HOST_WRITABLE, but it may not be portable though. What do you 
think?




Re: [PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Sean Christopherson
On Fri, May 05, 2023, Micka�l Sala�n wrote:
> diff --git a/arch/x86/include/asm/kvm_page_track.h 
> b/arch/x86/include/asm/kvm_page_track.h
> index eb186bc57f6a..a7fb4ff888e6 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -3,6 +3,7 @@
>  #define _ASM_X86_KVM_PAGE_TRACK_H
>  
>  enum kvm_page_track_mode {
> + KVM_PAGE_TRACK_PREWRITE,

Heh, just when I decide to finally kill off support for multiple modes[1] :-)

My assessment from that changelog still holds true for this case:

  Drop "support" for multiple page-track modes, as there is no evidence
  that array-based and refcounted metadata is the optimal solution for
  other modes, nor is there any evidence that other use cases, e.g. for
  access-tracking, will be a good fit for the page-track machinery in
  general.
  
  E.g. one potential use case of access-tracking would be to prevent guest
  access to poisoned memory (from the guest's perspective).  In that case,
  the number of poisoned pages is likely to be a very small percentage of
  the guest memory, and there is no need to reference count the number of
  access-tracking users, i.e. expanding gfn_track[] for a new mode would be
  grossly inefficient.  And for poisoned memory, host userspace would also
  likely want to trap accesses, e.g. to inject #MC into the guest, and that
  isn't currently supported by the page-track framework.
  
  A better alternative for that poisoned page use case is likely a
  variation of the proposed per-gfn attributes overlay (linked), which
  would allow efficiently tracking the sparse set of poisoned pages, and by
  default would exit to userspace on access.

Of particular relevance:

  - Using the page-track machinery is inefficient because the guest is likely
going to write-protect a minority of its memory.  And this

  select KVM_EXTERNAL_WRITE_TRACKING if KVM

is particularly nasty because simply enabling HEKI in the Kconfig will cause
KVM to allocate rmaps and gfn tracking.

  - There's no need to reference count the protection, i.e. 15 of the 16 bits of
gfn_track are dead weight.

  - As proposed, adding a second "mode" would double the cost of gfn tracking.

  - Tying the protections to the memslots will create an impossible-to-maintain
ABI because the protections will be lost if the owning memslot is deleted 
and
recreated.

  - The page-track framework provides incomplete protection and will lead to an
ongoing game of whack-a-mole, e.g. this patch catches the obvious cases by
adding calls to kvm_page_track_prewrite(), but misses things like 
kvm_vcpu_map().

  - The scaling and maintenance issues will only get worse if/when someone tries
to support dropping read and/or execute permissions, e.g. for execute-only.

  - The code is x86-only, and is likely to stay that way for the foreseeable
future.

The proposed alternative is to piggyback the memory attributes implementation[2]
that is being added (if all goes according to plan) for confidential VMs.  This
use case (dropping permissions) came up not too long ago[3], which is why I have
a ready-made answer).

I have no doubt that we'll need to solve performance and scaling issues with the
memory attributes implementation, e.g. to utilize xarray multi-range support
instead of storing information on a per-4KiB-page basis, but AFAICT, the core
idea is sound.  And a very big positive from a maintenance perspective is that
any optimizations, fixes, etc. for one use case (CoCo vs. hardening) should also
benefit the other use case.

[1] https://lore.kernel.org/all/20230311002258.852397-22-sea...@google.com
[2] https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
[3] https://lore.kernel.org/all/y1a1i9vbj%2fpvm...@google.com



[PATCH v1 2/9] KVM: x86/mmu: Add support for prewrite page tracking

2023-05-05 Thread Mickaël Salaün
Add a new page tracking mode to deny a page update and throw a page
fault to the guest.  This is useful for KVM to be able to make some
pages non-writable (not read-only because it doesn't imply execution
restrictions), see the next Heki commits.

This kind of synthetic kernel page fault needs to be handled by the
guest, which is not currently the case, making it try again and again.
This will be part of a follow-up patch series.

Update emulator_read_write_onepage() to handle X86EMUL_CONTINUE and
X86EMUL_PROPAGATE_FAULT.

Update page_fault_handle_page_track() to call
kvm_slot_page_track_is_active() whenever this is required for
KVM_PAGE_TRACK_PREWRITE and KVM_PAGE_TRACK_WRITE, even if one tracker
already returned true.

Invert the return code semantic for read_emulate() and write_emulate():
- from 1=Ok 0=Error
- to X86EMUL_* return codes (e.g. X86EMUL_CONTINUE == 0)

Imported the prewrite page tracking support part originally written by
Mihai Donțu, Marian Rotariu, and Ștefan Șicleru:
https://lore.kernel.org/r/20211006173113.26445-27-ala...@bitdefender.com
https://lore.kernel.org/r/20211006173113.26445-28-ala...@bitdefender.com
Removed the GVA changes for page tracking, removed the
X86EMUL_RETRY_INSTR case, and some emulation part for now.

Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: H. Peter Anvin 
Cc: Ingo Molnar 
Cc: Kees Cook 
Cc: Madhavan T. Venkataraman 
Cc: Marian Rotariu 
Cc: Mihai Donțu 
Cc: Paolo Bonzini 
Cc: Sean Christopherson 
Cc: Thomas Gleixner 
Cc: Vitaly Kuznetsov 
Cc: Wanpeng Li 
Cc: Ștefan Șicleru 
Signed-off-by: Mickaël Salaün 
Link: https://lore.kernel.org/r/20230505152046.6575-3-...@digikod.net
---
 arch/x86/include/asm/kvm_page_track.h | 12 +
 arch/x86/kvm/mmu/mmu.c| 64 ++-
 arch/x86/kvm/mmu/page_track.c | 33 +-
 arch/x86/kvm/mmu/spte.c   |  6 +++
 arch/x86/kvm/x86.c| 27 +++
 5 files changed, 122 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_page_track.h 
b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a..a7fb4ff888e6 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_KVM_PAGE_TRACK_H
 
 enum kvm_page_track_mode {
+   KVM_PAGE_TRACK_PREWRITE,
KVM_PAGE_TRACK_WRITE,
KVM_PAGE_TRACK_MAX,
 };
@@ -22,6 +23,16 @@ struct kvm_page_track_notifier_head {
 struct kvm_page_track_notifier_node {
struct hlist_node node;
 
+   /*
+* It is called when guest is writing the write-tracked page
+* and the write emulation didn't happened yet.
+*
+* @vcpu: the vcpu where the write access happened
+* @gpa: the physical address written by guest
+* @node: this nodet
+*/
+   bool (*track_prewrite)(struct kvm_vcpu *vcpu, gpa_t gpa,
+  struct kvm_page_track_notifier_node *node);
/*
 * It is called when guest is writing the write-tracked page
 * and write emulation is finished at that time.
@@ -73,6 +84,7 @@ kvm_page_track_register_notifier(struct kvm *kvm,
 void
 kvm_page_track_unregister_notifier(struct kvm *kvm,
   struct kvm_page_track_notifier_node *n);
+bool kvm_page_track_prewrite(struct kvm_vcpu *vcpu, gpa_t gpa);
 void kvm_page_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
  int bytes);
 void kvm_page_track_flush_slot(struct kvm *kvm, struct kvm_memory_slot *slot);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 835426254e76..e5d1e241ff0f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -793,9 +793,13 @@ static void account_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
slot = __gfn_to_memslot(slots, gfn);
 
/* the non-leaf shadow pages are keeping readonly. */
-   if (sp->role.level > PG_LEVEL_4K)
-   return kvm_slot_page_track_add_page(kvm, slot, gfn,
-   KVM_PAGE_TRACK_WRITE);
+   if (sp->role.level > PG_LEVEL_4K) {
+   kvm_slot_page_track_add_page(kvm, slot, gfn,
+KVM_PAGE_TRACK_PREWRITE);
+   kvm_slot_page_track_add_page(kvm, slot, gfn,
+KVM_PAGE_TRACK_WRITE);
+   return;
+   }
 
kvm_mmu_gfn_disallow_lpage(slot, gfn);
 
@@ -840,9 +844,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct 
kvm_mmu_page *sp)
gfn = sp->gfn;
slots = kvm_memslots_for_spte_role(kvm, sp->role);
slot = __gfn_to_memslot(slots, gfn);
-   if (sp->role.level > PG_LEVEL_4K)
-   return kvm_slot_page_track_remove_page(kvm, slot, gfn,
-  KVM_PAGE_TRACK_WRITE);
+   if (sp->role.level > PG_LEVEL_4K) {
+   kvm_slot_page_track_remove_page(kvm,