Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
Hi Marc,

On 5/27/20 5:37 PM, Marc Zyngier wrote:
> On 2020-05-27 05:05, Gavin Shan wrote:

[...]

>>>> +struct kvm_vcpu_pv_apf_data {
>>>> +	__u32	reason;
>>>> +	__u8	pad[60];
>>>> +	__u32	enabled;
>>>> +};
>>>
>>> What's all the padding for?
>>
>> The padding is to ensure that @reason and @enabled sit in different
>> cache lines. @reason is shared by host/guest, while @enabled is
>> almost exclusively owned by the guest.
>
> So you are assuming that a cache line is at most 64 bytes. It is
> actually implementation defined, and you can probe for it by looking
> at the CTR_EL0 register. There are implementations ranging from 32 to
> 256 bytes in the wild, and let's not mention broken big-little
> implementations here.

[...]

Ok. Thanks for your comments and hints.

>>>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	u64 vbar, pc;
>>>> +	u32 val;
>>>> +	int ret;
>>>> +
>>>> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>>>> +		return false;
>>>> +
>>>> +	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
>>>> +		return false;
>>>> +
>>>> +	/* Pending page fault, which ins't acknowledged by guest */
>>>> +	ret = kvm_async_pf_read_cache(vcpu, &val);
>>>> +	if (ret || val)
>>>> +		return false;
>>>> +
>>>> +	/*
>>>> +	 * Events can't be injected through data abort because it's
>>>> +	 * going to clobber ELR_EL1, which might not consued (or saved)
>>>> +	 * by guest yet.
>>>> +	 */
>>>> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>>>> +	pc = *vcpu_pc(vcpu);
>>>> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>>>> +		return false;
>>>
>>> Ah, so that's what this `no_fault_inst_range` is for.
>>>
>>> As-is this is not sufficient, and we'll need to be extremely careful
>>> here. The vectors themselves typically only have a small amount of
>>> stub code, and the bulk of the non-reentrant exception entry work
>>> happens elsewhere, in a mixture of assembly and C code that isn't
>>> even virtually contiguous with either the vectors or itself.
>>>
>>> In theory there may be code in modules (or perhaps in eBPF JIT'd
>>> code) that isn't safe to take a fault from, so even having a
>>> contiguous range controlled by the kernel isn't ideal.
>>>
>>> How does this work on x86?
>>
>> Yeah, here we just provide a mechanism to forbid injecting data
>> abort. The range is fed by guest through an HVC call, so I think
>> it's a guest-side issue. You had more comments about this in
>> PATCH[9]. I will explain a bit more there.
>>
>> x86 basically relies on the EFLAGS[IF] flag. The async page fault
>> can be injected if it's on. Otherwise, it's forbidden. It's workable
>> because an exception is a special kind of interrupt on x86, if I'm
>> correct.
>>
>>     return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
>>            !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
>>              (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
>
> I really wish this was relying on an architected exception delivery
> mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
> Trying to guess based on the PC won't fly. But these signals are
> pretty hard to multiplex with anything else. Like any form of
> non-architected exception injection, I don't see a good path forward
> unless we start considering something like SDEI.
>
>         M.

As Paolo mentioned in another reply, there are two types of
notifications sent from host to guest: page_not_present and page_ready.
The page_not_present notification should be delivered synchronously,
while page_ready can be delivered asynchronously. He also suggested
reserving an ESR (or DFSC) subclass for page_not_present. page_ready
can be delivered by interrupt, like a PPI. x86 is changing its code to
deliver page_ready by interrupt, which was previously done by
exception. When we use an interrupt, instead of an exception, for
page_ready, we won't need the game of guessing the PC.

I assume you prefer to use SDEI for page_not_present, correct? In that
case, what's the current status of SDEI? I mean, has it been fully or
partially supported, or do we need to develop it from scratch? :)

Thanks,
Gavin
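[Editorial aside: Marc's cache-line point is easy to check mechanically.
The sketch below is plain userspace C with illustrative names — the
patch's struct recreated with stdint types — showing that the 60-byte
pad separates @reason and @enabled only when the line size is at most
64 bytes:]

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of the layout under discussion. The 60-byte pad puts
 * @enabled at offset 64, so the two fields only land in different cache
 * lines when the line size is <= 64 bytes. On arm64 the line size is
 * implementation defined (probe CTR_EL0.DminLine); 32 to 256 bytes
 * exist in the wild, as Marc notes.
 */
struct kvm_vcpu_pv_apf_data {
	uint32_t reason;
	uint8_t  pad[60];
	uint32_t enabled;
};

/* Return 1 if @reason and @enabled fall in different lines of @line_size bytes */
static int fields_in_different_lines(size_t line_size)
{
	size_t reason_line  = offsetof(struct kvm_vcpu_pv_apf_data, reason)  / line_size;
	size_t enabled_line = offsetof(struct kvm_vcpu_pv_apf_data, enabled) / line_size;

	return reason_line != enabled_line;
}
```

[With @enabled at offset 64, a 128- or 256-byte line puts both fields
back into the same line — exactly the sizes CTR_EL0 may report.]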
Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
On 2020-05-27 05:05, Gavin Shan wrote:
> Hi Mark,

[...]

>>> +struct kvm_vcpu_pv_apf_data {
>>> +	__u32	reason;
>>> +	__u8	pad[60];
>>> +	__u32	enabled;
>>> +};
>>
>> What's all the padding for?
>
> The padding is to ensure that @reason and @enabled sit in different
> cache lines. @reason is shared by host/guest, while @enabled is
> almost exclusively owned by the guest.

So you are assuming that a cache line is at most 64 bytes. It is
actually implementation defined, and you can probe for it by looking
at the CTR_EL0 register. There are implementations ranging from 32 to
256 bytes in the wild, and let's not mention broken big-little
implementations here.

[...]

>>> +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu)
>>> +{
>>> +	u64 vbar, pc;
>>> +	u32 val;
>>> +	int ret;
>>> +
>>> +	if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
>>> +		return false;
>>> +
>>> +	if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
>>> +		return false;
>>> +
>>> +	/* Pending page fault, which ins't acknowledged by guest */
>>> +	ret = kvm_async_pf_read_cache(vcpu, &val);
>>> +	if (ret || val)
>>> +		return false;
>>> +
>>> +	/*
>>> +	 * Events can't be injected through data abort because it's
>>> +	 * going to clobber ELR_EL1, which might not consued (or saved)
>>> +	 * by guest yet.
>>> +	 */
>>> +	vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
>>> +	pc = *vcpu_pc(vcpu);
>>> +	if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
>>> +		return false;
>>
>> Ah, so that's what this `no_fault_inst_range` is for.
>>
>> As-is this is not sufficient, and we'll need to be extremely careful
>> here. The vectors themselves typically only have a small amount of
>> stub code, and the bulk of the non-reentrant exception entry work
>> happens elsewhere, in a mixture of assembly and C code that isn't
>> even virtually contiguous with either the vectors or itself.
>>
>> In theory there may be code in modules (or perhaps in eBPF JIT'd
>> code) that isn't safe to take a fault from, so even having a
>> contiguous range controlled by the kernel isn't ideal.
>>
>> How does this work on x86?
>
> Yeah, here we just provide a mechanism to forbid injecting data
> abort. The range is fed by guest through an HVC call, so I think
> it's a guest-side issue. You had more comments about this in
> PATCH[9]. I will explain a bit more there.
>
> x86 basically relies on the EFLAGS[IF] flag. The async page fault
> can be injected if it's on. Otherwise, it's forbidden. It's workable
> because an exception is a special kind of interrupt on x86, if I'm
> correct.
>
>     return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
>            !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
>              (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));

I really wish this was relying on an architected exception delivery
mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
Trying to guess based on the PC won't fly. But these signals are
pretty hard to multiplex with anything else. Like any form of
non-architected exception injection, I don't see a good path forward
unless we start considering something like SDEI.

        M.

-- 
Jazz is not dead. It just smells funny...
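[Editorial aside: the PC-based gate Marc objects to reduces to a single
range check. A minimal userspace sketch of that heuristic — illustrative
names, not the kernel's code — makes his criticism concrete: only PCs
inside one contiguous window after VBAR_EL1 are protected, while the
out-of-line exception-entry code Mark describes is not.]

```c
#include <stdint.h>

/*
 * Sketch of the patch's heuristic: suppress injection while the guest
 * PC sits inside [vbar, vbar + range), i.e. inside the guest-advertised
 * "no fault instruction range" that starts at its exception vectors.
 * Returns 1 when injection would be allowed, 0 when it is forbidden.
 */
static int pc_allows_injection(uint64_t pc, uint64_t vbar, uint64_t range)
{
	if (pc >= vbar && pc < vbar + range)
		return 0;	/* inside the no-fault window: forbid */

	return 1;		/* anywhere else: allow (the risky guess) */
}
```

[Anything non-reentrant that lives outside that one window — C entry
code, module text, JIT'd code — is still treated as safe to interrupt,
which is the gap in the heuristic.]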
Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
Hi Mark,

On 5/26/20 10:34 PM, Mark Rutland wrote:
> On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote:
>> There are two stages of fault pages and the stage one page fault is
>> handled by guest itself. The guest is trapped to host when the page
>> fault is caused by stage 2 page table, for example missing. The guest
>> is suspended until the requested page is populated. To populate the
>> requested page can be related to IO activities if the page was swapped
>> out previously. In this case, the guest has to suspend for a few of
>> milliseconds at least, regardless of the overall system load. There
>> is no useful work done during the suspended period from guest's view.
>
> This is a bit difficult to read. How about:
>
> | When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that
> | is not mapped at Stage-2), the vCPU is suspended until the host has
> | handled the fault. It can take the host milliseconds or longer to
> | handle the fault as this may require IO, and when the system load is
> | low neither the host nor guest perform useful work during such
> | periods.

Yes, it's much better.

>> This adds asychornous page fault to improve the situation. A signal
>
> Nit: typo for `asynchronous` here, and there are a few other typos in
> the patch itself. It would be nice if you could run a spellcheck over
> that.

Sure.

>> (PAGE_NOT_PRESENT) is sent to guest if the requested page needs some
>> time to be populated. Guest might reschedule to another running
>> process if possible. Otherwise, the vCPU is put into power-saving
>> mode, which is actually to cause vCPU reschedule from host's view.
>> A followup signal (PAGE_READY) is sent to guest once the requested
>> page is populated. The suspended task is waken up or scheduled when
>> guest receives the signal. With this mechanism, the vCPU won't be
>> stuck when the requested page is being populated by host.
>
> It would probably be best to say 'notification' rather than 'signal'
> here, and say 'the guest is notified', etc.
>
> As above, it seems that this is per-vCPU, so it's probably better to
> say 'vCPU' rather than 'guest', to make it clear which context this
> applies to.

Ok.

>> There are more details highlighted as below. Note the implementation
>> is similar to what x86 has to some extent:
>>
>>  * A dedicated SMCCC ID is reserved to enable, disable or configure
>>    the functionality. The only 64-bits parameter is conveyed by two
>>    registers (w2/w1). Bits[63:56] is the bitmap used to specify the
>>    operated functionality like enabling/disabling/configuration. The
>>    bits[55:6] is the physical address of control block or external
>>    data abort injection disallowed region. Bit[5:0] are used to pass
>>    control flags.
>>
>>  * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
>>    isn't ready. In the mean while, a work is started to populate the
>>    page asynchronously in background. The stage 2 page table entry is
>>    updated accordingly and another signal (PAGE_READY) is fired after
>>    the request page is populted. The signals is notified by injected
>>    data abort fault.
>>
>>  * The signals are fired and consumed in sequential fashion. It means
>>    no more signals will be fired if there is pending one, awaiting the
>>    guest to consume. It's because the injected data abort faults have
>>    to be done in sequential fashion.
>> Signed-off-by: Gavin Shan
>> ---
>>  arch/arm64/include/asm/kvm_host.h      |  43
>>  arch/arm64/include/asm/kvm_para.h      |  27 ++
>>  arch/arm64/include/uapi/asm/Kbuild     |   2 -
>>  arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
>>  arch/arm64/kvm/Kconfig                 |   1 +
>>  arch/arm64/kvm/Makefile                |   2 +
>>  include/linux/arm-smccc.h              |   6 +
>>  virt/kvm/arm/arm.c                     |  36 ++-
>>  virt/kvm/arm/async_pf.c                | 335 +
>>  virt/kvm/arm/hypercalls.c              |   8 +
>>  virt/kvm/arm/mmu.c                     |  29 ++-
>>  11 files changed, 506 insertions(+), 5 deletions(-)
>>  create mode 100644 arch/arm64/include/asm/kvm_para.h
>>  create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
>>  create mode 100644 virt/kvm/arm/async_pf.c
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index f77c706777ec..a207728d6f3f 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -250,6 +250,23 @@ struct vcpu_reset_state {
>>  	bool		reset;
>>  };
>>
>> +#ifdef CONFIG_KVM_ASYNC_PF
>> +
>> +/* Should be a power of two number */
>> +#define ASYNC_PF_PER_VCPU	64
>
> What exactly is this number?

It's the maximal number of async page faults pending on the specific
vCPU.

>> +
>> +/*
>> + * The association of gfn and token. The token will be sent to guest as
>> + * page fault address. Also, the guest could be in aarch32 mode. So its
>> + * length should be 32-bits.
>> + */
>
> The length of what should be 32-bit? The token? The guest sees the
> token as the fault address? How exactly is that exposed to the guest,
> is that via a
Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote:
> There are two stages of fault pages and the stage one page fault is
> handled by guest itself. The guest is trapped to host when the page
> fault is caused by stage 2 page table, for example missing. The guest
> is suspended until the requested page is populated. To populate the
> requested page can be related to IO activities if the page was swapped
> out previously. In this case, the guest has to suspend for a few of
> milliseconds at least, regardless of the overall system load. There
> is no useful work done during the suspended period from guest's view.

This is a bit difficult to read. How about:

| When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that
| is not mapped at Stage-2), the vCPU is suspended until the host has
| handled the fault. It can take the host milliseconds or longer to
| handle the fault as this may require IO, and when the system load is
| low neither the host nor guest perform useful work during such
| periods.

> This adds asychornous page fault to improve the situation. A signal

Nit: typo for `asynchronous` here, and there are a few other typos in
the patch itself. It would be nice if you could run a spellcheck over
that.

> (PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
> to be populated. Guest might reschedule to another running process if
> possible. Otherwise, the vCPU is put into power-saving mode, which is
> actually to cause vCPU reschedule from host's view. A followup signal
> (PAGE_READY) is sent to guest once the requested page is populated.
> The suspended task is waken up or scheduled when guest receives the
> signal. With this mechanism, the vCPU won't be stuck when the requested
> page is being populated by host.

It would probably be best to say 'notification' rather than 'signal'
here, and say 'the guest is notified', etc.

As above, it seems that this is per-vCPU, so it's probably better to
say 'vCPU' rather than 'guest', to make it clear which context this
applies to.

> There are more details highlighted as below. Note the implementation is
> similar to what x86 has to some extent:
>
>  * A dedicated SMCCC ID is reserved to enable, disable or configure
>    the functionality. The only 64-bits parameter is conveyed by two
>    registers (w2/w1). Bits[63:56] is the bitmap used to specify the
>    operated functionality like enabling/disabling/configuration. The
>    bits[55:6] is the physical address of control block or external
>    data abort injection disallowed region. Bit[5:0] are used to pass
>    control flags.
>
>  * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
>    isn't ready. In the mean while, a work is started to populate the
>    page asynchronously in background. The stage 2 page table entry is
>    updated accordingly and another signal (PAGE_READY) is fired after
>    the request page is populted. The signals is notified by injected
>    data abort fault.
>
>  * The signals are fired and consumed in sequential fashion. It means
>    no more signals will be fired if there is pending one, awaiting the
>    guest to consume. It's because the injected data abort faults have
>    to be done in sequential fashion.
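[Editorial aside: the packed 64-bit parameter in the first bullet above
can be illustrated with a small userspace sketch. Helper names are
illustrative, not the patch's; the field boundaries are the ones the
commit message states — bits[63:56] function bitmap, bits[55:6]
physical address, bits[5:0] control flags:]

```c
#include <stdint.h>

/*
 * Bit layout per the commit message: bits[63:56] select the operation,
 * bits[55:6] carry a physical address (so it must be 64-byte aligned),
 * bits[5:0] carry control flags.
 */
#define APF_FUNC_SHIFT	56
#define APF_ADDR_MASK	(((UINT64_C(1) << 50) - 1) << 6)	/* bits[55:6] */
#define APF_FLAGS_MASK	UINT64_C(0x3f)				/* bits[5:0]  */

static uint64_t apf_param_pack(uint8_t func, uint64_t pa, uint8_t flags)
{
	/* low 6 bits of @pa are dropped: the address field starts at bit 6 */
	return ((uint64_t)func << APF_FUNC_SHIFT) |
	       (pa & APF_ADDR_MASK) |
	       (flags & APF_FLAGS_MASK);
}

static uint8_t apf_param_func(uint64_t param)
{
	return param >> APF_FUNC_SHIFT;
}

static uint64_t apf_param_addr(uint64_t param)
{
	return param & APF_ADDR_MASK;
}

static uint8_t apf_param_flags(uint64_t param)
{
	return param & APF_FLAGS_MASK;
}
```

[Because the address occupies bits[55:6], any control-block address must
be 64-byte aligned or its low bits are silently lost — one consequence
of multiplexing three fields into the single w2/w1 parameter.]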
> Signed-off-by: Gavin Shan
> ---
>  arch/arm64/include/asm/kvm_host.h      |  43
>  arch/arm64/include/asm/kvm_para.h      |  27 ++
>  arch/arm64/include/uapi/asm/Kbuild     |   2 -
>  arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
>  arch/arm64/kvm/Kconfig                 |   1 +
>  arch/arm64/kvm/Makefile                |   2 +
>  include/linux/arm-smccc.h              |   6 +
>  virt/kvm/arm/arm.c                     |  36 ++-
>  virt/kvm/arm/async_pf.c                | 335 +
>  virt/kvm/arm/hypercalls.c              |   8 +
>  virt/kvm/arm/mmu.c                     |  29 ++-
>  11 files changed, 506 insertions(+), 5 deletions(-)
>  create mode 100644 arch/arm64/include/asm/kvm_para.h
>  create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
>  create mode 100644 virt/kvm/arm/async_pf.c
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index f77c706777ec..a207728d6f3f 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -250,6 +250,23 @@ struct vcpu_reset_state {
>  	bool		reset;
>  };
>
> +#ifdef CONFIG_KVM_ASYNC_PF
> +
> +/* Should be a power of two number */
> +#define ASYNC_PF_PER_VCPU	64

What exactly is this number?

> +
> +/*
> + * The association of gfn and token. The token will be sent to guest as
> + * page fault address. Also, the guest could be in aarch32 mode. So its
> + * length should be 32-bits.
> + */

The length of what should be 32-bit? The token? The guest sees the
token as the fault address? How exactly is that exposed to the guest,
is that via a synthetic S1 fault?

> +struct kvm_arch_async_pf {
> +	u32
[PATCH RFCv2 7/9] kvm/arm64: Support async page fault
There are two stages of fault pages and the stage one page fault is
handled by guest itself. The guest is trapped to host when the page
fault is caused by stage 2 page table, for example missing. The guest
is suspended until the requested page is populated. To populate the
requested page can be related to IO activities if the page was swapped
out previously. In this case, the guest has to suspend for a few of
milliseconds at least, regardless of the overall system load. There
is no useful work done during the suspended period from guest's view.

This adds asychornous page fault to improve the situation. A signal
(PAGE_NOT_PRESENT) is sent to guest if the requested page needs some
time to be populated. Guest might reschedule to another running process
if possible. Otherwise, the vCPU is put into power-saving mode, which
is actually to cause vCPU reschedule from host's view. A followup
signal (PAGE_READY) is sent to guest once the requested page is
populated. The suspended task is waken up or scheduled when guest
receives the signal. With this mechanism, the vCPU won't be stuck when
the requested page is being populated by host.

There are more details highlighted as below. Note the implementation is
similar to what x86 has to some extent:

 * A dedicated SMCCC ID is reserved to enable, disable or configure
   the functionality. The only 64-bits parameter is conveyed by two
   registers (w2/w1). Bits[63:56] is the bitmap used to specify the
   operated functionality like enabling/disabling/configuration. The
   bits[55:6] is the physical address of control block or external
   data abort injection disallowed region. Bit[5:0] are used to pass
   control flags.

 * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
   isn't ready. In the mean while, a work is started to populate the
   page asynchronously in background. The stage 2 page table entry is
   updated accordingly and another signal (PAGE_READY) is fired after
   the request page is populted. The signals is notified by injected
   data abort fault.

 * The signals are fired and consumed in sequential fashion. It means
   no more signals will be fired if there is pending one, awaiting the
   guest to consume. It's because the injected data abort faults have
   to be done in sequential fashion.

Signed-off-by: Gavin Shan
---
 arch/arm64/include/asm/kvm_host.h      |  43
 arch/arm64/include/asm/kvm_para.h      |  27 ++
 arch/arm64/include/uapi/asm/Kbuild     |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
 arch/arm64/kvm/Kconfig                 |   1 +
 arch/arm64/kvm/Makefile                |   2 +
 include/linux/arm-smccc.h              |   6 +
 virt/kvm/arm/arm.c                     |  36 ++-
 virt/kvm/arm/async_pf.c                | 335 +
 virt/kvm/arm/hypercalls.c              |   8 +
 virt/kvm/arm/mmu.c                     |  29 ++-
 11 files changed, 506 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 virt/kvm/arm/async_pf.c

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index f77c706777ec..a207728d6f3f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -250,6 +250,23 @@ struct vcpu_reset_state {
 	bool		reset;
 };
 
+#ifdef CONFIG_KVM_ASYNC_PF
+
+/* Should be a power of two number */
+#define ASYNC_PF_PER_VCPU	64
+
+/*
+ * The association of gfn and token. The token will be sent to guest as
+ * page fault address. Also, the guest could be in aarch32 mode. So its
+ * length should be 32-bits.
+ */
+struct kvm_arch_async_pf {
+	u32	token;
+	gfn_t	gfn;
+	u32	esr;
+};
+#endif /* CONFIG_KVM_ASYNC_PF */
+
 struct kvm_vcpu_arch {
 	struct kvm_cpu_context ctxt;
 	void *sve_state;
@@ -351,6 +368,17 @@ struct kvm_vcpu_arch {
 		u64 last_steal;
 		gpa_t base;
 	} steal;
+
+#ifdef CONFIG_KVM_ASYNC_PF
+	struct {
+		struct gfn_to_hva_cache	cache;
+		gfn_t	gfns[ASYNC_PF_PER_VCPU];
+		u64	control_block;
+		u16	id;
+		bool	send_user_only;
+		u64	no_fault_inst_range;
+	} apf;
+#endif
 };
 
 /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
@@ -604,6 +632,21 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
 
 static inline void __cpu_init_stage2(void) {}
 
+#ifdef CONFIG_KVM_ASYNC_PF
+bool kvm_async_pf_hash_find(struct kvm_vcpu *vcpu, gfn_t gfn);
+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu);
+bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
+int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, u32 esr,
+			    gpa_t gpa, gfn_t