Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault

2020-05-28 Thread Gavin Shan

Hi Marc,

On 5/27/20 5:37 PM, Marc Zyngier wrote:

On 2020-05-27 05:05, Gavin Shan wrote:


[...]
 

+struct kvm_vcpu_pv_apf_data {
+    __u32    reason;
+    __u8    pad[60];
+    __u32    enabled;
+};


What's all the padding for?



The padding is to ensure that @reason and @enabled sit in different cache
lines. @reason is shared by host/guest, while @enabled is almost
exclusively owned by the guest.
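
For what it's worth, the intended layout can be sanity-checked with
offsetof(); this host-side mirror of the proposed struct (the helper name
is made up for illustration) shows @enabled landing at byte 64, i.e. the
start of the next 64-byte line:

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of the proposed shared structure, for layout checking only. */
struct kvm_vcpu_pv_apf_data {
    uint32_t reason;        /* written by host, read by guest */
    uint8_t  pad[60];       /* pushes @enabled into the next 64-byte line */
    uint32_t enabled;       /* owned almost exclusively by the guest */
};

/* Byte offset of @enabled; 64 if the padding works as intended. */
static inline size_t apf_enabled_offset(void)
{
    return offsetof(struct kvm_vcpu_pv_apf_data, enabled);
}
```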


So you are assuming that a cache line is at most 64 bytes.
It is actually implementation defined, and you can probe for it
by looking at the CTR_EL0 register. There are implementations
ranging from 32 to 256 bytes in the wild, and let's not mention
broken big.LITTLE implementations here.
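
For reference, such probing decodes CTR_EL0.DminLine (bits [19:16]), which
holds log2 of the smallest D-cache line size in 4-byte words. A minimal
sketch of the decoding, with a made-up helper name (on real hardware the
register is read with `mrs x0, ctr_el0`):

```c
#include <stdint.h>

/* Smallest data cache line size, in bytes, from a CTR_EL0 value.
 * DminLine (bits [19:16]) is log2(line size in 4-byte words). */
static inline uint64_t ctr_dminline_bytes(uint64_t ctr_el0)
{
    return 4ULL << ((ctr_el0 >> 16) & 0xF);
}
```

A DminLine of 4 gives the common 64-byte line, but 3 (32 bytes) and 6
(256 bytes) are equally legal, which is Marc's point.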

[...]



Ok, thanks for your comments and hints.


+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
+{
+    u64 vbar, pc;
+    u32 val;
+    int ret;
+
+    if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+    return false;
+
+    if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
+    return false;
+
+    /* Pending page fault, which isn't acknowledged by guest */
+    ret = kvm_async_pf_read_cache(vcpu, &val);
+    if (ret || val)
+    return false;
+
+    /*
+ * Events can't be injected through data abort because it's
+ * going to clobber ELR_EL1, which might not be consumed (or saved)
+ * by the guest yet.
+ */
+    vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+    pc = *vcpu_pc(vcpu);
+    if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
+    return false;


Ah, so that's what this `no_fault_inst_range` is for.

As-is this is not sufficient, and we'll need to be extremely careful
here.

The vectors themselves typically only have a small amount of stub code,
and the bulk of the non-reentrant exception entry work happens
elsewhere, in a mixture of assembly and C code that isn't even virtually
contiguous with either the vectors or itself.

It's possible in theory that there is code in modules (or perhaps in eBPF
JIT'd code) that isn't safe to take a fault from, so even having a
contiguous range controlled by the kernel isn't ideal.

How does this work on x86?



Yeah, here we just provide a mechanism to forbid injecting data aborts. The
range is fed by the guest through an HVC call. So I think it's a guest-related
issue. You had more comments about this in PATCH[9]. I will explain a bit more
there.

x86 basically relies on the EFLAGS[IF] flag. The async page fault can be
injected if it's set. Otherwise, it's forbidden. It works because, if I'm
correct, exceptions are a special kind of interrupt on x86.

   return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
  !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
    (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));


I really wish this was relying on an architected exception delivery
mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
Trying to guess based on the PC won't fly. But these signals are
pretty hard to multiplex with anything else. Like any form of
non-architected exception injection, I don't see a good path forward
unless we start considering something like SDEI.
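
To illustrate the kind of architected gate meant here, a hypothetical check
on the guest's PSTATE mask bits (D is bit 9, A bit 8, I bit 7, F bit 6)
might look like the sketch below. The helper and the choice of PSTATE.A are
illustrative only, not part of the patch under review:

```c
#include <stdbool.h>
#include <stdint.h>

#define PSR_F_BIT (1UL << 6)  /* FIQ mask */
#define PSR_I_BIT (1UL << 7)  /* IRQ mask */
#define PSR_A_BIT (1UL << 8)  /* SError (asynchronous abort) mask */

/* Hypothetical gate: only inject when the guest has asynchronous
 * aborts unmasked, analogous to x86's EFLAGS.IF test. */
static inline bool guest_can_take_async_abort(uint64_t pstate)
{
    return !(pstate & PSR_A_BIT);
}
```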

     M.


As Paolo mentioned in another reply, there are two types of notifications
sent from host to guest: page_not_present and page_ready. The page_not_present
notification should be delivered synchronously, while page_ready can be
delivered asynchronously. He also suggested reserving an ESR (or DFSC)
subclass for page_not_present. page_ready can be delivered by an interrupt,
like a PPI. x86 is changing its code to deliver page_ready by interrupt,
which was previously done by exception.

When we use an interrupt, instead of an exception, for page_ready, we won't
need the game of guessing the PC.

I assume you prefer to use SDEI for page_not_present, correct? In that case,
what's the current status of SDEI? I mean, has it been fully or partially
supported, or do we need to develop it from scratch? :)

Thanks,
Gavin



Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault

2020-05-27 Thread Marc Zyngier

On 2020-05-27 05:05, Gavin Shan wrote:

Hi Mark,



[...]


+struct kvm_vcpu_pv_apf_data {
+   __u32   reason;
+   __u8pad[60];
+   __u32   enabled;
+};


What's all the padding for?



The padding is to ensure that @reason and @enabled sit in different cache
lines. @reason is shared by host/guest, while @enabled is almost
exclusively owned by the guest.


So you are assuming that a cache line is at most 64 bytes.
It is actually implementation defined, and you can probe for it
by looking at the CTR_EL0 register. There are implementations
ranging from 32 to 256 bytes in the wild, and let's not mention
broken big.LITTLE implementations here.

[...]

+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
+{
+   u64 vbar, pc;
+   u32 val;
+   int ret;
+
+   if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+   return false;
+
+   if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
+   return false;
+
+   /* Pending page fault, which isn't acknowledged by guest */
+   ret = kvm_async_pf_read_cache(vcpu, &val);
+   if (ret || val)
+   return false;
+
+   /*
+* Events can't be injected through data abort because it's
+* going to clobber ELR_EL1, which might not be consumed (or saved)
+* by the guest yet.
+*/
+   vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+   pc = *vcpu_pc(vcpu);
+   if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
+   return false;


Ah, so that's what this `no_fault_inst_range` is for.

As-is this is not sufficient, and we'll need to be extremely careful
here.

The vectors themselves typically only have a small amount of stub code,
and the bulk of the non-reentrant exception entry work happens
elsewhere, in a mixture of assembly and C code that isn't even virtually
contiguous with either the vectors or itself.

It's possible in theory that there is code in modules (or perhaps in eBPF
JIT'd code) that isn't safe to take a fault from, so even having a
contiguous range controlled by the kernel isn't ideal.

How does this work on x86?



Yeah, here we just provide a mechanism to forbid injecting data aborts. The
range is fed by the guest through an HVC call. So I think it's a guest-related
issue. You had more comments about this in PATCH[9]. I will explain a bit more
there.

x86 basically relies on the EFLAGS[IF] flag. The async page fault can be
injected if it's set. Otherwise, it's forbidden. It works because, if I'm
correct, exceptions are a special kind of interrupt on x86.

   return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
  !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
    (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));


I really wish this was relying on an architected exception delivery
mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
Trying to guess based on the PC won't fly. But these signals are
pretty hard to multiplex with anything else. Like any form of
non-architected exception injection, I don't see a good path forward
unless we start considering something like SDEI.

M.
--
Jazz is not dead. It just smells funny...


Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault

2020-05-26 Thread Gavin Shan

Hi Mark,

On 5/26/20 10:34 PM, Mark Rutland wrote:

On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote:

There are two stages of fault pages and the stage one page fault is
handled by guest itself. The guest is trapped to host when the page
fault is caused by stage 2 page table, for example missing. The guest
is suspended until the requested page is populated. To populate the
requested page can be related to IO activities if the page was swapped
out previously. In this case, the guest has to suspend for a few of
milliseconds at least, regardless of the overall system load. There
is no useful work done during the suspended period from guest's view.


This is a bit difficult to read. How about:

| When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that
| is not mapped at Stage-2), the vCPU is suspended until the host has
| handled the fault. It can take the host milliseconds or longer to
| handle the fault as this may require IO, and when the system load is
| low neither the host nor guest perform useful work during such
| periods.



Yes, it's much better.



This adds asychornous page fault to improve the situation. A signal


Nit: typo for `asynchronous` here, and there are a few other typos in
the patch itself. It would be nice if you could run a spellcheck over
that.



Sure.


(PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
to be populated. Guest might reschedule to another running process if
possible. Otherwise, the vCPU is put into power-saving mode, which is
actually to cause vCPU reschedule from host's view. A followup signal
(PAGE_READY) is sent to guest once the requested page is populated.
The suspended task is waken up or scheduled when guest receives the
signal. With this mechanism, the vCPU won't be stuck when the requested
page is being populated by host.


It would probably be best to say 'notification' rather than 'signal'
here, and say 'the guest is notified', etc. As above, it seems that this
is per-vCPU, so it's probably better to say 'vCPU' rather than guest, to
make it clear which context this applies to.



Ok.



There are more details highlighted as below. Note the implementation is
similar to what x86 has to some extent:

* A dedicated SMCCC ID is reserved to enable, disable or configure
  the functionality. The only 64-bits parameter is conveyed by two
  registers (w2/w1). Bits[63:56] is the bitmap used to specify the
  operated functionality like enabling/disabling/configuration. The
  bits[55:6] is the physical address of control block or external
  data abort injection disallowed region. Bit[5:0] are used to pass
  control flags.

* Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
  isn't ready. In the mean while, a work is started to populate the
  page asynchronously in background. The stage 2 page table entry is
  updated accordingly and another signal (PAGE_READY) is fired after
  the request page is populted. The signals is notified by injected
  data abort fault.

* The signals are fired and consumed in sequential fashion. It means
  no more signals will be fired if there is pending one, awaiting the
  guest to consume. It's because the injected data abort faults have
  to be done in sequential fashion.

Signed-off-by: Gavin Shan 
---
  arch/arm64/include/asm/kvm_host.h  |  43 
  arch/arm64/include/asm/kvm_para.h  |  27 ++
  arch/arm64/include/uapi/asm/Kbuild |   2 -
  arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
  arch/arm64/kvm/Kconfig |   1 +
  arch/arm64/kvm/Makefile|   2 +
  include/linux/arm-smccc.h  |   6 +
  virt/kvm/arm/arm.c |  36 ++-
  virt/kvm/arm/async_pf.c| 335 +
  virt/kvm/arm/hypercalls.c  |   8 +
  virt/kvm/arm/mmu.c |  29 ++-
  11 files changed, 506 insertions(+), 5 deletions(-)
  create mode 100644 arch/arm64/include/asm/kvm_para.h
  create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
  create mode 100644 virt/kvm/arm/async_pf.c

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index f77c706777ec..a207728d6f3f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -250,6 +250,23 @@ struct vcpu_reset_state {
boolreset;
  };
  
+#ifdef CONFIG_KVM_ASYNC_PF

+
+/* Should be a power of two number */
+#define ASYNC_PF_PER_VCPU  64


What exactly is this number?



It's the maximal number of async page faults that can be pending on a
specific vCPU.
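
Presumably the power-of-two requirement lets the pending-fault array be
indexed with a mask instead of a modulo, the same trick the x86 async-PF
gfn hash uses. A minimal sketch, with a made-up helper name:

```c
#include <stdint.h>

#define ASYNC_PF_PER_VCPU 64 /* must stay a power of two */

/* Map an arbitrary key (e.g. a gfn hash) to a slot in apf.gfns[].
 * With a power-of-two size, key % size == key & (size - 1). */
static inline uint32_t apf_slot(uint64_t key)
{
    return (uint32_t)(key & (ASYNC_PF_PER_VCPU - 1));
}
```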


+
+/*
+ * The association of gfn and token. The token will be sent to guest as
+ * page fault address. Also, the guest could be in aarch32 mode. So its
+ * length should be 32-bits.
+ */


The length of what should be 32-bit? The token?

The guest sees the token as the fault address? How exactly is that
exposed to the guest, is that via a 

Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault

2020-05-26 Thread Mark Rutland
On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote:
> There are two stages of fault pages and the stage one page fault is
> handled by guest itself. The guest is trapped to host when the page
> fault is caused by stage 2 page table, for example missing. The guest
> is suspended until the requested page is populated. To populate the
> requested page can be related to IO activities if the page was swapped
> out previously. In this case, the guest has to suspend for a few of
> milliseconds at least, regardless of the overall system load. There
> is no useful work done during the suspended period from guest's view.

This is a bit difficult to read. How about:

| When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that
| is not mapped at Stage-2), the vCPU is suspended until the host has
| handled the fault. It can take the host milliseconds or longer to
| handle the fault as this may require IO, and when the system load is
| low neither the host nor guest perform useful work during such
| periods.

> 
> This adds asychornous page fault to improve the situation. A signal

Nit: typo for `asynchronous` here, and there are a few other typos in
the patch itself. It would be nice if you could run a spellcheck over
that.

> (PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
> to be populated. Guest might reschedule to another running process if
> possible. Otherwise, the vCPU is put into power-saving mode, which is
> actually to cause vCPU reschedule from host's view. A followup signal
> (PAGE_READY) is sent to guest once the requested page is populated.
> The suspended task is waken up or scheduled when guest receives the
> signal. With this mechanism, the vCPU won't be stuck when the requested
> page is being populated by host.

It would probably be best to say 'notification' rather than 'signal'
here, and say 'the guest is notified', etc. As above, it seems that this
is per-vCPU, so it's probably better to say 'vCPU' rather than guest, to
make it clear which context this applies to.

> 
> There are more details highlighted as below. Note the implementation is
> similar to what x86 has to some extent:
> 
>* A dedicated SMCCC ID is reserved to enable, disable or configure
>  the functionality. The only 64-bits parameter is conveyed by two
>  registers (w2/w1). Bits[63:56] is the bitmap used to specify the
>  operated functionality like enabling/disabling/configuration. The
>  bits[55:6] is the physical address of control block or external
>  data abort injection disallowed region. Bit[5:0] are used to pass
>  control flags.
> 
>* Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
>  isn't ready. In the mean while, a work is started to populate the
>  page asynchronously in background. The stage 2 page table entry is
>  updated accordingly and another signal (PAGE_READY) is fired after
>  the request page is populted. The signals is notified by injected
>  data abort fault.
> 
>* The signals are fired and consumed in sequential fashion. It means
>  no more signals will be fired if there is pending one, awaiting the
>  guest to consume. It's because the injected data abort faults have
>  to be done in sequential fashion.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/include/asm/kvm_host.h  |  43 
>  arch/arm64/include/asm/kvm_para.h  |  27 ++
>  arch/arm64/include/uapi/asm/Kbuild |   2 -
>  arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
>  arch/arm64/kvm/Kconfig |   1 +
>  arch/arm64/kvm/Makefile|   2 +
>  include/linux/arm-smccc.h  |   6 +
>  virt/kvm/arm/arm.c |  36 ++-
>  virt/kvm/arm/async_pf.c| 335 +
>  virt/kvm/arm/hypercalls.c  |   8 +
>  virt/kvm/arm/mmu.c |  29 ++-
>  11 files changed, 506 insertions(+), 5 deletions(-)
>  create mode 100644 arch/arm64/include/asm/kvm_para.h
>  create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
>  create mode 100644 virt/kvm/arm/async_pf.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index f77c706777ec..a207728d6f3f 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -250,6 +250,23 @@ struct vcpu_reset_state {
>   boolreset;
>  };
>  
> +#ifdef CONFIG_KVM_ASYNC_PF
> +
> +/* Should be a power of two number */
> +#define ASYNC_PF_PER_VCPU  64

What exactly is this number?

> +
> +/*
> + * The association of gfn and token. The token will be sent to guest as
> + * page fault address. Also, the guest could be in aarch32 mode. So its
> + * length should be 32-bits.
> + */

The length of what should be 32-bit? The token?

The guest sees the token as the fault address? How exactly is that
exposed to the guest, is that via a synthetic S1 fault?

> +struct kvm_arch_async_pf {
> + u32 

[PATCH RFCv2 7/9] kvm/arm64: Support async page fault

2020-05-07 Thread Gavin Shan
There are two stages of fault pages and the stage one page fault is
handled by guest itself. The guest is trapped to host when the page
fault is caused by stage 2 page table, for example missing. The guest
is suspended until the requested page is populated. To populate the
requested page can be related to IO activities if the page was swapped
out previously. In this case, the guest has to suspend for a few of
milliseconds at least, regardless of the overall system load. There
is no useful work done during the suspended period from guest's view.

This adds asychornous page fault to improve the situation. A signal
(PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
to be populated. Guest might reschedule to another running process if
possible. Otherwise, the vCPU is put into power-saving mode, which is
actually to cause vCPU reschedule from host's view. A followup signal
(PAGE_READY) is sent to guest once the requested page is populated.
The suspended task is waken up or scheduled when guest receives the
signal. With this mechanism, the vCPU won't be stuck when the requested
page is being populated by host.

There are more details highlighted as below. Note the implementation is
similar to what x86 has to some extent:

   * A dedicated SMCCC ID is reserved to enable, disable or configure
 the functionality. The only 64-bits parameter is conveyed by two
 registers (w2/w1). Bits[63:56] is the bitmap used to specify the
 operated functionality like enabling/disabling/configuration. The
 bits[55:6] is the physical address of control block or external
 data abort injection disallowed region. Bit[5:0] are used to pass
 control flags.

   * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
 isn't ready. In the mean while, a work is started to populate the
 page asynchronously in background. The stage 2 page table entry is
 updated accordingly and another signal (PAGE_READY) is fired after
 the request page is populted. The signals is notified by injected
 data abort fault.

   * The signals are fired and consumed in sequential fashion. It means
 no more signals will be fired if there is pending one, awaiting the
 guest to consume. It's because the injected data abort faults have
 to be done in sequential fashion.
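
The bit carving described above (operation bitmap in bits [63:56], physical
address in bits [55:6], control flags in bits [5:0]) could be sketched as
follows; the macro and helper names are made up for illustration, not taken
from the patch:

```c
#include <stdint.h>

#define APF_OP_SHIFT   56
#define APF_ADDR_MASK  (((1ULL << 50) - 1) << 6)  /* bits [55:6] */
#define APF_FLAGS_MASK 0x3FULL                    /* bits [5:0]  */

/* Pack the single 64-bit hypercall parameter (conveyed in w2/w1). */
static inline uint64_t apf_param_pack(uint8_t op, uint64_t pa, uint8_t flags)
{
    return ((uint64_t)op << APF_OP_SHIFT) |
           (pa & APF_ADDR_MASK) |
           (flags & APF_FLAGS_MASK);
}

/* Unpack the three fields again on the host side. */
static inline uint8_t apf_param_op(uint64_t p)    { return p >> APF_OP_SHIFT; }
static inline uint64_t apf_param_addr(uint64_t p) { return p & APF_ADDR_MASK; }
static inline uint8_t apf_param_flags(uint64_t p) { return p & APF_FLAGS_MASK; }
```

Note that keeping the address in bits [55:6] implies the control block must
be at least 64-byte aligned, which matches the padding discussion elsewhere
in the thread.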

Signed-off-by: Gavin Shan 
---
 arch/arm64/include/asm/kvm_host.h  |  43 
 arch/arm64/include/asm/kvm_para.h  |  27 ++
 arch/arm64/include/uapi/asm/Kbuild |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
 arch/arm64/kvm/Kconfig |   1 +
 arch/arm64/kvm/Makefile|   2 +
 include/linux/arm-smccc.h  |   6 +
 virt/kvm/arm/arm.c |  36 ++-
 virt/kvm/arm/async_pf.c| 335 +
 virt/kvm/arm/hypercalls.c  |   8 +
 virt/kvm/arm/mmu.c |  29 ++-
 11 files changed, 506 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 virt/kvm/arm/async_pf.c

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index f77c706777ec..a207728d6f3f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -250,6 +250,23 @@ struct vcpu_reset_state {
boolreset;
 };
 
+#ifdef CONFIG_KVM_ASYNC_PF
+
+/* Should be a power of two number */
+#define ASYNC_PF_PER_VCPU  64
+
+/*
+ * The association of gfn and token. The token will be sent to guest as
+ * page fault address. Also, the guest could be in aarch32 mode. So its
+ * length should be 32-bits.
+ */
+struct kvm_arch_async_pf {
+   u32 token;
+   gfn_t   gfn;
+   u32 esr;
+};
+#endif /* CONFIG_KVM_ASYNC_PF */
+
 struct kvm_vcpu_arch {
struct kvm_cpu_context ctxt;
void *sve_state;
@@ -351,6 +368,17 @@ struct kvm_vcpu_arch {
u64 last_steal;
gpa_t base;
} steal;
+
+#ifdef CONFIG_KVM_ASYNC_PF
+   struct {
+   struct gfn_to_hva_cache cache;
+   gfn_t   gfns[ASYNC_PF_PER_VCPU];
+   u64 control_block;
+   u16 id;
+   boolsend_user_only;
+   u64 no_fault_inst_range;
+   } apf;
+#endif
 };
 
 /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
@@ -604,6 +632,21 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
 static inline void __cpu_init_stage2(void) {}
 
+#ifdef CONFIG_KVM_ASYNC_PF
+bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn);
+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu);
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
+   gpa_t gpa, gfn_t