Re: [PATCH 1/3] KVM: arm64: Stop writing aarch32's CSSELR into ACTLR
Hi

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.6.14, v5.4.42, v4.19.124, v4.14.181, v4.9.224, v4.4.224.

v5.6.14: Build OK!
v5.4.42: Build OK!
v4.19.124: Failed to apply! Possible dependencies:
    f7f2b15c3d42 ("arm64: KVM: Expose sanitised cache type register to guest")

v4.14.181: Failed to apply! Possible dependencies:
    005781be127f ("arm64: KVM: Move CPU ID reg trap setup off the world switch path")
    93390c0a1b20 ("arm64: KVM: Hide unsupported AArch64 CPU features from guests")
    f7f2b15c3d42 ("arm64: KVM: Expose sanitised cache type register to guest")

v4.9.224: Failed to apply! Possible dependencies:
    005781be127f ("arm64: KVM: Move CPU ID reg trap setup off the world switch path")
    016f98afd050 ("irqchip/gic-v3: Use nops macro for Cavium ThunderX erratum 23154")
    0d449541c185 ("KVM: arm64: use common invariant sysreg definitions")
    0e9884fe63c6 ("arm64: sysreg: subsume GICv3 sysreg definitions")
    14ae7518dd55 ("arm64: sysreg: add register encodings used by KVM")
    47863d41ecf8 ("arm64: sysreg: sort by encoding")
    82e0191a1aa1 ("arm64: Support systems without FP/ASIMD")
    851050a573e1 ("KVM: arm64: Use common sysreg definitions")
    93390c0a1b20 ("arm64: KVM: Hide unsupported AArch64 CPU features from guests")
    bca8f17f57bd ("arm64: Get rid of asm/opcodes.h")
    c7a3c61fc606 ("arm64: sysreg: add performance monitor registers")
    c9a3c58f01fb ("KVM: arm64: Add the EL1 physical timer access handler")
    cd9e1927a525 ("arm64: Work around broken .inst when defective gas is detected")
    f7f2b15c3d42 ("arm64: KVM: Expose sanitised cache type register to guest")

v4.4.224: Failed to apply! Possible dependencies:
    005781be127f ("arm64: KVM: Move CPU ID reg trap setup off the world switch path")
    06282fd2c2bf ("arm64: KVM: Implement vgic-v2 save/restore")
    068a17a5805d ("arm64: mm: create new fine-grained mappings at boot")
    072f0a633838 ("arm64: Introduce raw_{d,i}cache_line_size")
    0a28714c53fd ("arm64: Use PoU cache instr for I/D coherency")
    116c81f427ff ("arm64: Work around systems with mismatched cache line sizes")
    1431af367e52 ("arm64: KVM: Implement timer save/restore")
    157962f5a8f2 ("arm64: decouple early fixmap init from linear mapping")
    1e48ef7fcc37 ("arm64: add support for building vmlinux as a relocatable PIE binary")
    2a803c4db615 ("arm64: head.S: use memset to clear BSS")
    57f4959bad0a ("arm64: kernel: Add support for User Access Override")
    6d6ec20fcf28 ("arm64: KVM: Implement system register save/restore")
    7b7293ae3dbd ("arm64: Fold proc-macros.S into assembler.h")
    82869ac57b5d ("arm64: kernel: Add support for hibernate/suspend-to-disk")
    82e0191a1aa1 ("arm64: Support systems without FP/ASIMD")
    8eb992674c9e ("arm64: KVM: Implement debug save/restore")
    910917bb7db0 ("arm64: KVM: Map the kernel RO section into HYP")
    93390c0a1b20 ("arm64: KVM: Hide unsupported AArch64 CPU features from guests")
    9e8e865bbe29 ("arm64: unify idmap removal")
    a0bf9776cd0b ("arm64: kvm: deal with kernel symbols outside of linear mapping")
    a7f8de168ace ("arm64: allow kernel Image to be loaded anywhere in physical memory")
    ab893fb9f1b1 ("arm64: introduce KIMAGE_VADDR as the virtual base of the kernel region")
    adc9b2dfd009 ("arm64: kernel: Rework finisher callback out of __cpu_suspend_enter()")
    b3122023df93 ("arm64: Fix an enum typo in mm/dump.c")
    b97b66c14b96 ("arm64: KVM: Implement guest entry")
    be901e9b15cd ("arm64: KVM: Implement the core world switch")
    c1a88e9124a4 ("arm64: kasan: avoid TLB conflicts")
    c76a0a6695c6 ("arm64: KVM: Add a HYP-specific header file")
    d5370f754875 ("arm64: prefetch: add alternative pattern for CPUs without a prefetcher")
    f68d2b1b73cc ("arm64: KVM: Implement vgic-v3 save/restore")
    f7f2b15c3d42 ("arm64: KVM: Expose sanitised cache type register to guest")
    f80fb3a3d508 ("arm64: add support for kernel ASLR")
    f9040773b7bb ("arm64: move kernel image to base of vmalloc area")
    fd045f6cd98e ("arm64: add support for module PLTs")

NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

--
Thanks
Sasha
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault
On 27/05/20 09:48, Marc Zyngier wrote:
> My own question is whether this even makes any sense 10 years later.
> The HW has massively changed, and this adds a whole lot of complexity
> to both the hypervisor and the guest.

It still makes sense, but indeed it's for different reasons. One example is
host page cache sharing, where (parts of) the host page cache are visible to
the guest. In this context, async page faults are used for any kind of host
page faults, not just paging out memory due to overcommit.

But I agree that it is very, very important to design the exception model
first, as we're witnessing in x86 land the problems with a poor design.
Nothing major, but just pain all around.

Paolo

> It also plays very ugly games
> with the exception model, which doesn't give me the warm fuzzy feeling
> that it's going to be great.
Re: [PATCH 26/26] KVM: arm64: Parametrize exception entry with a target EL
On Wed, May 27, 2020 at 10:34:09AM +0100, Marc Zyngier wrote:
> Hi Mark,
> 
> On 2020-05-19 11:44, Mark Rutland wrote:
> > On Wed, Apr 22, 2020 at 01:00:50PM +0100, Marc Zyngier wrote:
> > > -static unsigned long get_except64_pstate(struct kvm_vcpu *vcpu)
> > > +static void enter_exception(struct kvm_vcpu *vcpu, unsigned long target_mode,
> > > +			    enum exception_type type)
> > 
> > Since this is all for an AArch64 target, could we keep `64` in the name,
> > e.g. enter_exception64? That'd mirror the callers below.
> > 
> > > {
> > > -	unsigned long sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> > > -	unsigned long old, new;
> > > +	unsigned long sctlr, vbar, old, new, mode;
> > > +	u64 exc_offset;
> > > +
> > > +	mode = *vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT);
> > > +
> > > +	if (mode == target_mode)
> > > +		exc_offset = CURRENT_EL_SP_ELx_VECTOR;
> > > +	else if ((mode | 1) == target_mode)
> > > +		exc_offset = CURRENT_EL_SP_EL0_VECTOR;
> > 
> > It would be nice if we could add a mnemonic for the `1` here, e.g.
> > PSR_MODE_SP0 or PSR_MODE_THREAD_BIT.
> 
> I've addressed both comments as follows:
> 
> diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
> index bf57308fcd63..953b6a1ce549 100644
> --- a/arch/arm64/include/asm/ptrace.h
> +++ b/arch/arm64/include/asm/ptrace.h
> @@ -35,6 +35,7 @@
>  #define GIC_PRIO_PSR_I_SET	(1 << 4)
>  
>  /* Additional SPSR bits not exposed in the UABI */
> +#define PSR_MODE_THREAD_BIT	(1 << 0)
>  #define PSR_IL_BIT		(1 << 20)
>  
>  /* AArch32-specific ptrace requests */
> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
> index 3dbcbc839b9c..ebfdfc27b2bd 100644
> --- a/arch/arm64/kvm/inject_fault.c
> +++ b/arch/arm64/kvm/inject_fault.c
> @@ -43,8 +43,8 @@ enum exception_type {
>   * Here we manipulate the fields in order of the AArch64 SPSR_ELx layout, from
>   * MSB to LSB.
>   */
> -static void enter_exception(struct kvm_vcpu *vcpu, unsigned long target_mode,
> -			    enum exception_type type)
> +static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
> +			      enum exception_type type)
>  {
>  	unsigned long sctlr, vbar, old, new, mode;
>  	u64 exc_offset;
> @@ -53,7 +53,7 @@ static void enter_exception(struct kvm_vcpu *vcpu, unsigned long target_mode,
>  
>  	if (mode == target_mode)
>  		exc_offset = CURRENT_EL_SP_ELx_VECTOR;
> -	else if ((mode | 1) == target_mode)
> +	else if ((mode | PSR_MODE_THREAD_BIT) == target_mode)
>  		exc_offset = CURRENT_EL_SP_EL0_VECTOR;
>  	else if (!(mode & PSR_MODE32_BIT))
>  		exc_offset = LOWER_EL_AArch64_VECTOR;
> @@ -126,7 +126,7 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr
>  	bool is_aarch32 = vcpu_mode_is_32bit(vcpu);
>  	u32 esr = 0;
>  
> -	enter_exception(vcpu, PSR_MODE_EL1h, except_type_sync);
> +	enter_exception64(vcpu, PSR_MODE_EL1h, except_type_sync);
>  
>  	vcpu_write_sys_reg(vcpu, addr, FAR_EL1);
>  
> @@ -156,7 +156,7 @@ static void inject_undef64(struct kvm_vcpu *vcpu)
>  {
>  	u32 esr = (ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT);
>  
> -	enter_exception(vcpu, PSR_MODE_EL1h, except_type_sync);
> +	enter_exception64(vcpu, PSR_MODE_EL1h, except_type_sync);
>  
>  	/*
>  	 * Build an unknown exception, depending on the instruction

Thanks; that all looks good to me, and my R-b stands!

Mark.
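For readers skimming the thread, the mode-to-vector-offset selection that ended up in enter_exception64() can be modeled as a small standalone function. This is an illustrative sketch only: the constants stand in for the kernel's ptrace.h and vector-table definitions, and vector_group_offset() is a hypothetical name, not a kernel symbol.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative encodings, mirroring arm64's ptrace.h values. */
#define PSR_MODE_EL0t        0x0u
#define PSR_MODE_EL1t        0x4u
#define PSR_MODE_EL1h        0x5u
#define PSR_MODE_MASK        0xfu
#define PSR_MODE32_BIT       0x10u
#define PSR_MODE_THREAD_BIT  0x1u

/* Vector group offsets from VBAR_ELx, per the AArch64 vector table layout. */
#define CURRENT_EL_SP_EL0_VECTOR  0x000u
#define CURRENT_EL_SP_ELx_VECTOR  0x200u
#define LOWER_EL_AArch64_VECTOR   0x400u
#define LOWER_EL_AArch32_VECTOR   0x600u

/*
 * Pick the base offset of the vector group to use, given the mode the
 * vCPU is currently in and the (AArch64, SP_ELx) mode targeted by the
 * exception. Mirrors the if/else chain discussed in the thread.
 */
static uint64_t vector_group_offset(unsigned long mode, unsigned long target_mode)
{
	if (mode == target_mode)
		return CURRENT_EL_SP_ELx_VECTOR;   /* same EL, using SP_ELx */
	if ((mode | PSR_MODE_THREAD_BIT) == target_mode)
		return CURRENT_EL_SP_EL0_VECTOR;   /* same EL, using SP_EL0 */
	if (!(mode & PSR_MODE32_BIT))
		return LOWER_EL_AArch64_VECTOR;    /* lower EL, AArch64 */
	return LOWER_EL_AArch32_VECTOR;            /* lower EL, AArch32 */
}
```

The full vector address would then be VBAR_EL1 + this offset + the exception-type offset (sync/irq/fiq/serror), as in the original get_except_vector().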
Re: [PATCH 20/26] KVM: arm64: Move ELR_EL1 to the system register array
On 2020-05-26 17:29, James Morse wrote:
> Hi Marc,
> 
> On 22/04/2020 13:00, Marc Zyngier wrote:
> > As ELR-EL1 is a VNCR-capable register with ARMv8.4-NV, let's move it
> > to the sys_regs array and repaint the accessors. While we're at it,
> > let's kill the now useless accessors used only on the fault injection
> > path.
> 
> Reviewed-by: James Morse
> 
> A curiosity:
> 
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 95977b80265ce..46949fce3e813 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -184,6 +184,8 @@ enum vcpu_sysreg {
> 
> The comment above the enum has some claims about the order, but it's
> already out of order with __vcpu_read_sys_reg_from_cpu()... (PAR_EL1
> being the culprit)

This comment dates back from the original assembly implementation, where I was
paranoid about accessing the sys_regs array in strict order to maximize the
chances of the prefetcher doing the right thing.

As always with premature optimization, it was worthless, and moving things to C
forced us to do things differently anyway (not to mention VHE...).

I'll delete the comment in a separate patch.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...
Re: [PATCH 19/26] KVM: arm64: Make struct kvm_regs userspace-only
On 2020-05-26 17:29, James Morse wrote:
> Hi Marc,
> 
> On 22/04/2020 13:00, Marc Zyngier wrote:
> > struct kvm_regs is used by userspace to indicate which register gets
> > accessed by the {GET,SET}_ONE_REG API. But as we're about to refactor
> > the layout of the in-kernel register structures, we need the kernel to
> > move away from it.
> > 
> > Let's make kvm_regs userspace only, and let the kernel map it to its
> > own internal representation.
> 
> > diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> > index 23ebe51410f06..9fec9231b63e2 100644
> > --- a/arch/arm64/kvm/guest.c
> > +++ b/arch/arm64/kvm/guest.c
> > @@ -102,6 +102,55 @@ static int core_reg_size_from_offset(const struct kvm_vcpu *vcpu, u64 off)
> >  	return size;
> >  }
> >  
> > +static void *core_reg_addr(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
> > +{
> > +	u64 off = core_reg_offset_from_id(reg->id);
> > +
> > +	switch (off) {
> > +	default:
> > +		return NULL;
> 
> Doesn't this switch statement catch an out of range offset, and a
> misaligned offset? ... We still test for those explicitly in the caller.
> Better safe than implicit?

Indeed, this is not supposed to happen at all. Maybe I should just fold
validate_core_offset in there, and make this NULL value the error case.

> > +	}
> > +}
> 
> With the reset thing reported by Zenghui and Zengtao on the previous
> patch fixed:
> 
> Reviewed-by: James Morse
> 
> (otherwise struct kvm_regs isn't userspace-only!)

Indeed!

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...
Re: [PATCH 10/26] KVM: arm64: Refactor vcpu_{read,write}_sys_reg
On 2020-05-26 17:28, James Morse wrote:
> Hi Marc,
> 
> On 22/04/2020 13:00, Marc Zyngier wrote:
> > Extract the direct HW accessors for later reuse.
> 
> > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> > index 51db934702b64..46f218982df8c 100644
> > --- a/arch/arm64/kvm/sys_regs.c
> > +++ b/arch/arm64/kvm/sys_regs.c
> > +u64 vcpu_read_sys_reg(const struct kvm_vcpu *vcpu, int reg)
> > +{
> > +	u64 val = 0x8badf00d8badf00d;
> > +
> > +	if (!vcpu->arch.sysregs_loaded_on_cpu)
> > +		goto memory_read;
> > -immediate_write:
> > +	if (__vcpu_read_sys_reg_from_cpu(reg, &val))
> > +		return val;
> > +
> > +memory_read:
> > +	return __vcpu_sys_reg(vcpu, reg);
> > +}
> 
> The goto here is a bit odd, is it just an artefact of how we got here?

That's because a lot of this changes when NV gets added to the mix, see [1].

> Is this easier on the eye?:
> 
> | u64 vcpu_read_sys_reg(const struct kvm_vcpu *vcpu, int reg)
> | {
> | 	u64 val = 0x8badf00d8badf00d;
> |
> | 	if (vcpu->arch.sysregs_loaded_on_cpu &&
> | 	    __vcpu_read_sys_reg_from_cpu(reg, &val))
> | 		return val;
> |
> | 	return __vcpu_sys_reg(vcpu, reg);
> | }

Definitely. I don't mind reworking the NV branch so that the label gets
introduced there.

> > +void vcpu_write_sys_reg(struct kvm_vcpu *vcpu, u64 val, int reg)
> > +{
> > +	if (!vcpu->arch.sysregs_loaded_on_cpu)
> > +		goto memory_write;
> > +
> > +	if (__vcpu_write_sys_reg_to_cpu(val, reg))
> > +		return;
> > +
> > +memory_write:
> > +	__vcpu_sys_reg(vcpu, reg) = val;
> > +}
> 
> Again I think it's clearer without the goto.
> 
> Regardless:
> 
> Reviewed-by: James Morse

Thanks,

M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=kvm-arm64/nv-5.7-rc1-WIP&id=11f3217d39a602cbfac7d08064c8b31afb57348e

-- 
Jazz is not dead. It just smells funny...
Re: [PATCH 26/26] KVM: arm64: Parametrize exception entry with a target EL
Hi Mark,

On 2020-05-19 11:44, Mark Rutland wrote:
> On Wed, Apr 22, 2020 at 01:00:50PM +0100, Marc Zyngier wrote:
> > We currently assume that an exception is delivered to EL1, always.
> > Once we emulate EL2, this no longer will be the case. To prepare for
> > this, add a target_mode parameter.
> > 
> > While we're at it, merge the computing of the target PC and PSTATE in
> > a single function that updates both PC and CPSR after saving their
> > previous values in the corresponding ELR/SPSR. This ensures that they
> > are updated in the correct order (a pretty common source of bugs...).
> > 
> > Signed-off-by: Marc Zyngier
> > ---
> >  arch/arm64/kvm/inject_fault.c | 75 ++-
> >  1 file changed, 38 insertions(+), 37 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
> > index d3ebf8bca4b89..3dbcbc839b9c3 100644
> > --- a/arch/arm64/kvm/inject_fault.c
> > +++ b/arch/arm64/kvm/inject_fault.c
> > @@ -26,28 +26,12 @@ enum exception_type {
> >  	except_type_serror	= 0x180,
> >  };
> >  
> > -static u64 get_except_vector(struct kvm_vcpu *vcpu, enum exception_type type)
> > -{
> > -	u64 exc_offset;
> > -
> > -	switch (*vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT)) {
> > -	case PSR_MODE_EL1t:
> > -		exc_offset = CURRENT_EL_SP_EL0_VECTOR;
> > -		break;
> > -	case PSR_MODE_EL1h:
> > -		exc_offset = CURRENT_EL_SP_ELx_VECTOR;
> > -		break;
> > -	case PSR_MODE_EL0t:
> > -		exc_offset = LOWER_EL_AArch64_VECTOR;
> > -		break;
> > -	default:
> > -		exc_offset = LOWER_EL_AArch32_VECTOR;
> > -	}
> > -
> > -	return vcpu_read_sys_reg(vcpu, VBAR_EL1) + exc_offset + type;
> > -}
> > -
> >  /*
> > + * This performs the exception entry at a given EL (@target_mode), stashing PC
> > + * and PSTATE into ELR and SPSR respectively, and compute the new PC/PSTATE.
> > + * The EL passed to this function *must* be a non-secure, privileged mode with
> > + * bit 0 being set (PSTATE.SP == 1).
> > + *
> >   * When an exception is taken, most PSTATE fields are left unchanged in the
> >   * handler. However, some are explicitly overridden (e.g. M[4:0]). Luckily all
> >   * of the inherited bits have the same position in the AArch64/AArch32 SPSR_ELx
> > @@ -59,10 +43,35 @@ static u64 get_except_vector(struct kvm_vcpu *vcpu, enum exception_type type)
> >   * Here we manipulate the fields in order of the AArch64 SPSR_ELx layout, from
> >   * MSB to LSB.
> >   */
> > -static unsigned long get_except64_pstate(struct kvm_vcpu *vcpu)
> > +static void enter_exception(struct kvm_vcpu *vcpu, unsigned long target_mode,
> > +			    enum exception_type type)
> 
> Since this is all for an AArch64 target, could we keep `64` in the name,
> e.g. enter_exception64? That'd mirror the callers below.
> 
> >  {
> > -	unsigned long sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> > -	unsigned long old, new;
> > +	unsigned long sctlr, vbar, old, new, mode;
> > +	u64 exc_offset;
> > +
> > +	mode = *vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT);
> > +
> > +	if (mode == target_mode)
> > +		exc_offset = CURRENT_EL_SP_ELx_VECTOR;
> > +	else if ((mode | 1) == target_mode)
> > +		exc_offset = CURRENT_EL_SP_EL0_VECTOR;
> 
> It would be nice if we could add a mnemonic for the `1` here, e.g.
> PSR_MODE_SP0 or PSR_MODE_THREAD_BIT.

I've addressed both comments as follows:

diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index bf57308fcd63..953b6a1ce549 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -35,6 +35,7 @@
 #define GIC_PRIO_PSR_I_SET	(1 << 4)
 
 /* Additional SPSR bits not exposed in the UABI */
+#define PSR_MODE_THREAD_BIT	(1 << 0)
 #define PSR_IL_BIT		(1 << 20)
 
 /* AArch32-specific ptrace requests */
diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index 3dbcbc839b9c..ebfdfc27b2bd 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -43,8 +43,8 @@ enum exception_type {
  * Here we manipulate the fields in order of the AArch64 SPSR_ELx layout, from
  * MSB to LSB.
  */
-static void enter_exception(struct kvm_vcpu *vcpu, unsigned long target_mode,
-			    enum exception_type type)
+static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
+			      enum exception_type type)
 {
 	unsigned long sctlr, vbar, old, new, mode;
 	u64 exc_offset;
@@ -53,7 +53,7 @@ static void enter_exception(struct kvm_vcpu *vcpu, unsigned long target_mode,
 
 	if (mode == target_mode)
 		exc_offset = CURRENT_EL_SP_ELx_VECTOR;
-	else if ((mode | 1) == target_mode)
+	else if ((mode | PSR_MODE_THREAD_BIT) == target_mode)
 		exc_offset = CURRENT_EL_SP_EL0_VECTOR;
 	else if (!(mode & PSR_MODE32_BIT))
 		exc_offset = LOWER_EL_AArch64_VECTOR;
@@ -126,7 +126,7 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr
Re: [RFC PATCH 2/7] KVM: arm64: Set DBM bit of PTEs if hw DBM enabled
Hi Catalin,

On 2020/5/26 19:49, Catalin Marinas wrote:
> On Mon, May 25, 2020 at 07:24:01PM +0800, Keqian Zhu wrote:
>> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
>> index 1305e28225fc..f9910ba2afd8 100644
>> --- a/arch/arm64/include/asm/pgtable-prot.h
>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>> @@ -79,6 +79,7 @@ extern bool arm64_use_ng_mappings;
>>  	})
>>  
>>  #define PAGE_S2			__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN)
>> +#define PAGE_S2_DBM		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN | PTE_DBM)
> 
> You don't need a new page permission (see below).
> 
>>  #define PAGE_S2_DEVICE		__pgprot(_PROT_DEFAULT | PAGE_S2_MEMATTR(DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_S2_XN)
>>  
>>  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index e3b9ee268823..dc97988eb2e0 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1426,6 +1426,10 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>>  	pte = pte_offset_kernel(pmd, addr);
>>  	do {
>>  		if (!pte_none(*pte)) {
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +			if (kvm_hw_dbm_enabled() && !kvm_s2pte_dbm(pte))
>> +				kvm_set_s2pte_dbm(pte);
>> +#endif
>>  			if (!kvm_s2pte_readonly(pte))
>>  				kvm_set_s2pte_readonly(pte);
>>  		}
> 
> Setting the DBM bit is equivalent to marking the page writable. The
> actual writable pte bit (S2AP[1] or HAP[2] as we call them in Linux for
> legacy reasons) tells you whether the page has been dirtied but it is
> still writable if you set DBM. Doing this in stage2_wp_ptes()
> practically means that you no longer have read-only pages at S2. There
> are several good reasons why you don't want to break this. For example,
> the S2 pte may already be read-only for other reasons (CoW).

Thanks, your comments help to solve the first problem in the cover letter.

> I think you should only set the DBM bit if the pte was previously
> writable. In addition, any permission change to the S2 pte must take
> into account the DBM bit and clear it while transferring the dirty
> status to the underlying page. I'm not deeply familiar with all these
> callbacks into KVM but two such paths are kvm_unmap_hva_range() and the
> kvm_mmu_notifier_change_pte().

Yes, I agree.

>> @@ -1827,7 +1831,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>>  
>>  		ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
>>  	} else {
>> -		pte_t new_pte = kvm_pfn_pte(pfn, mem_type);
>> +		pte_t new_pte;
>> +
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +		if (kvm_hw_dbm_enabled() &&
>> +		    pgprot_val(mem_type) == pgprot_val(PAGE_S2)) {
>> +			mem_type = PAGE_S2_DBM;
>> +		}
>> +#endif
>> +		new_pte = kvm_pfn_pte(pfn, mem_type);
>>  
>>  		if (writable) {
>>  			new_pte = kvm_s2pte_mkwrite(new_pte);
> 
> That's wrong here. Basically for any fault you get, you just turn the S2
> page writable. The point of DBM is that you don't get write faults at
> all if you have a writable page. So, as I said above, only set the DBM
> bit if you stored a writable S2 pte (kvm_s2pte_mkwrite()).

Yeah, you are right. I will correct it in patch v1.

Thanks,
Keqian
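A minimal model of the scheme Catalin describes (set DBM only on ptes that were writable, and read dirty state back from the write-permission bit when write-protecting) might look like the following. The bit positions and helper names are illustrative assumptions for this sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stage-2 descriptor bits (not the kernel's definitions). */
#define S2_AP_W  (1ull << 7)   /* stands in for S2AP[1]: write permission */
#define S2_DBM   (1ull << 51)  /* Dirty Bit Modifier */

/*
 * Start dirty logging on one pte: only a previously writable pte gets
 * DBM; a read-only pte (e.g. CoW) is left strictly read-only.
 */
static uint64_t s2_start_dirty_log(uint64_t pte)
{
	if (pte & S2_AP_W) {
		pte |= S2_DBM;     /* let HW re-enable writes... */
		pte &= ~S2_AP_W;   /* ...and record them by setting S2AP[1] */
	}
	return pte;
}

/*
 * With DBM set, hardware marks a write by atomically setting S2AP[1];
 * a DBM pte with S2AP[1] set is therefore "dirty".
 */
static bool s2_test_and_clear_dirty(uint64_t *pte)
{
	bool dirty = (*pte & S2_DBM) && (*pte & S2_AP_W);

	*pte &= ~S2_AP_W;
	return dirty;
}
```

This also illustrates Catalin's second point: any later permission change must look at DBM, because a pte that appears read-only (S2AP[1] clear) is still hardware-writable while DBM is set.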
Re: [RFC] Use SMMU HTTU for DMA dirty page tracking
On Wed, May 27, 2020 at 08:40:47AM +0000, Tian, Kevin wrote:
> > From: Xiang Zheng
> > Sent: Wednesday, May 27, 2020 2:45 PM
> > 
> > On 2020/5/27 11:27, Tian, Kevin wrote:
> > >> From: Xiang Zheng
> > >> Sent: Monday, May 25, 2020 7:34 PM
> > >>
> > >> [+cc Kirti, Yan, Alex]
> > >>
> > >> On 2020/5/23 1:14, Jean-Philippe Brucker wrote:
> > >>> Hi,
> > >>>
> > >>> On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote:
> > >>>> Hi all,
> > >>>>
> > >>>> Is there any plan for enabling SMMU HTTU?
> > >>>
> > >>> Not outside of SVA, as far as I know.
> > >>>
> > >>>> I have seen the patch in the SVA series which adds support for HTTU:
> > >>>> https://www.spinics.net/lists/arm-kernel/msg798694.html
> > >>>>
> > >>>> HTTU reduces the number of access faults on the SMMU fault queue
> > >>>> (permission faults also benefit from it).
> > >>>>
> > >>>> Besides reducing the faults, HTTU also helps to track dirty pages for
> > >>>> device DMA. Is it feasible to utilize HTTU to get dirty pages on device
> > >>>> DMA during VFIO live migration?
> > >>>
> > >>> As you know there is a VFIO interface for this under discussion:
> > >>> https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankh...@nvidia.com/
> > >>> It doesn't implement an internal API to communicate with the IOMMU
> > >>> driver about dirty pages.
> > >
> > > We plan to add such API later, e.g. to utilize A/D bit in VT-d 2nd-level
> > > page tables (Rev 3.0).
> > 
> > Thank you, Kevin.
> > 
> > When will you send this patch series? Maybe (hopefully) we can also support
> > hardware-based dirty page tracking via common APIs based on your
> > patches. :)
> 
> Yan is working with Kirti on basic live migration support now. After that
> part is done, we will start working on A/D bit support. Yes, common APIs
> are definitely the goal here.
> 
> > >>>> If the SMMU can track dirty pages, devices are not required to implement
> > >>>> additional dirty page tracking to support VFIO live migration.
> > >>>
> > >>> It seems feasible, though tracking it in the device might be more
> > >>> efficient. I might have misunderstood but I think for live migration of
> > >>> the Intel NIC they trap guest accesses to the device and introspect its
> > >>> state to figure out which pages it is accessing.
> > >
> > > Does HTTU implement an A/D-like mechanism in the SMMU page tables, or just
> > > report dirty pages in a log buffer? Either way, tracking dirty pages on the
> > > IOMMU side is generic and doesn't require device-specific tweaks like in
> > > the Intel NIC.
> > 
> > Currently HTTU just implements an A/D-like mechanism in the SMMU page
> > tables. We certainly expect the SMMU to also implement a PML-like feature
> > so that we can avoid walking the whole page table to get the dirty pages.

There is no reporting of dirty pages in a log buffer. It might be possible to
do software logging based on PRI or Stall, but that requires special support
in the endpoint as well as the SMMU.

> Is there a link to an HTTU introduction?

I don't know any gentle introduction, but there are sections D5.4.11 "Hardware
management of the Access flag and dirty state" in the ARM Architecture
Reference Manual (DDI0487E), and section 3.13 "Translation table entries and
Access/Dirty flags" in the SMMU specification (IHI0070C). HTTU stands for
"Hardware Translation Table Update".

In short, when HTTU is enabled, the SMMU translation performs an atomic
read-modify-write on the leaf translation table descriptor, setting some bits
depending on the type of memory access. This can be enabled independently on
both stage-1 and stage-2 tables (equivalent to your 1st and 2nd page table
levels, I think).

Thanks,
Jean
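As a rough illustration of what consuming HTTU dirty state without a PML-like log implies, here is a sketch that sweeps stage-2 leaf descriptors for the dirty condition (valid, DBM set, and the write-permission bit set back by hardware). The bit positions and the helper name are assumptions made for the example, not SMMU driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative stage-2 leaf descriptor bits; see Arm ARM D5.4.11 for
 * the real layout (DBM is bit 51, S2AP[1] is bit 7).
 */
#define DESC_VALID  (1ull << 0)
#define DESC_S2AP_W (1ull << 7)
#define DESC_DBM    (1ull << 51)

/*
 * With HTTU dirty-state management enabled, the walker atomically sets
 * S2AP[1] on the first write through a DBM descriptor. Lacking a log,
 * finding dirty pages means sweeping the leaves for that combination.
 */
static size_t collect_dirty(const uint64_t *leaves, size_t n,
			    size_t *dirty_idx, size_t max)
{
	size_t found = 0;

	for (size_t i = 0; i < n && found < max; i++) {
		uint64_t d = leaves[i];

		if ((d & (DESC_VALID | DESC_DBM | DESC_S2AP_W)) ==
		    (DESC_VALID | DESC_DBM | DESC_S2AP_W))
			dirty_idx[found++] = i;
	}
	return found;
}
```

The full-table sweep is exactly the cost Xiang hopes a PML-like reporting feature would avoid.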
Re: [PATCH 08/26] KVM: arm64: Use TTL hint in when invalidating stage-2 translations
On 2020-05-13 10:06, Andrew Scull wrote:
> On Tue, May 12, 2020 at 01:04:31PM +0100, James Morse wrote:
> > Hi Andrew,
> > 
> > On 07/05/2020 16:13, Andrew Scull wrote:
> > > > @@ -176,7 +177,7 @@ static void clear_stage2_pud_entry(struct kvm_s2_mmu *mmu, pud_t *pud, phys_addr
> > > >  	pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0);
> > > >  	VM_BUG_ON(stage2_pud_huge(kvm, *pud));
> > > >  	stage2_pud_clear(kvm, pud);
> > > > -	kvm_tlb_flush_vmid_ipa(mmu, addr);
> > > > +	kvm_tlb_flush_vmid_ipa(mmu, addr, S2_NO_LEVEL_HINT);
> > > >  	stage2_pmd_free(kvm, pmd_table);
> > > >  	put_page(virt_to_page(pud));
> > > >  }
> > > > @@ -186,7 +187,7 @@ static void clear_stage2_pmd_entry(struct kvm_s2_mmu *mmu, pmd_t *pmd, phys_addr
> > > >  	pte_t *pte_table = pte_offset_kernel(pmd, 0);
> > > >  	VM_BUG_ON(pmd_thp_or_huge(*pmd));
> > > >  	pmd_clear(pmd);
> > > > -	kvm_tlb_flush_vmid_ipa(mmu, addr);
> > > > +	kvm_tlb_flush_vmid_ipa(mmu, addr, S2_NO_LEVEL_HINT);
> > > >  	free_page((unsigned long)pte_table);
> > > >  	put_page(virt_to_page(pmd));
> > > >  }
> > > 
> > > Going by the names, is it possible to give a better level hint for
> > > these cases?
> > 
> > There is no leaf entry being invalidated here. After clearing the range,
> > we found we'd emptied (and invalidated) a whole page of mappings:
> > |	if (stage2_pmd_table_empty(kvm, start_pmd))
> > |		clear_stage2_pud_entry(mmu, pud, start_addr);
> > 
> > Now we want to remove the link to the empty page so we can free it. We
> > are changing the structure of the tables, not what gets mapped.
> > 
> > I think this is why we need the un-hinted behaviour, to invalidate "any
> > level of the translation table walk required to translate the specified
> > IPA". Otherwise the hardware can look for a leaf at the indicated level,
> > find none, and do nothing.
> > 
> > This is sufficiently horrible, it's possible I've got it completely
> > wrong! (does it make sense?)
> 
> Ok. `addr` is an IPA, that IPA is now omitted from the map so doesn't
> appear in any entry of the table, least of all a leaf entry. That makes
> sense.
> 
> Is there a convention to distinguish IPA and PA similar to the
> distinction for VA, or does kvmarm just use phys_addr_t all round?
> 
> It seems like the TTL patches are fairly self contained; would it be
> easier to separate them out from these larger series?

They are. This whole series is a mix of unrelated patches anyway. Their only
goal is to make my life a bit easier in the distant future. I'll repost that
anyway, as I have made some cosmetic changes.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...
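For context, the TTL hint that the S2_NO_LEVEL_HINT call sites decline to provide is a 4-bit field in bits [47:44] of the TLBI operand (ARMv8.4-TTL): the upper two bits encode the translation granule and the lower two the level of the leaf being invalidated, with all-zeroes meaning "no hint, walk every level". Below is a hedged sketch of that encoding, assuming a 4KB granule and treating level 0 as "no hint" as the series does; the function name is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define S2_NO_LEVEL_HINT  0  /* caller doesn't know the leaf level */

/*
 * Encode an ARMv8.4-TTL hint into a TLBI VA/IPA operand. For the 4KB
 * granule, TTL[3:2] = 0b01 and TTL[1:0] = level; TTL = 0b0000 tells
 * the TLB to invalidate entries cached from any walk level.
 */
static uint64_t tlbi_encode_ttl_4k(uint64_t va_arg, int level)
{
	uint64_t ttl = 0;

	if (level >= 1 && level <= 3)
		ttl = (0x1ull << 2) | (uint64_t)level;  /* 4KB granule + level */

	return va_arg | (ttl << 44);
}
```

This is why the un-hinted form matters for the table-restructuring call sites above: with a level hint, hardware may only look for a leaf at that exact level, which no longer exists.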
Re: [PATCH 03/26] KVM: arm64: Factor out stage 2 page table data from struct kvm
Hi Marc,

On 5/27/20 9:41 AM, Marc Zyngier wrote:
> Hi Alex,
> 
> On 2020-05-12 17:53, Alexandru Elisei wrote:
>> Hi,
>>
>> On 5/12/20 12:17 PM, James Morse wrote:
>>> Hi Alex, Marc,
>>>
>>> (just on this last_vcpu_ran thing...)
>>>
>>> On 11/05/2020 17:38, Alexandru Elisei wrote:
>>>> On 4/22/20 1:00 PM, Marc Zyngier wrote:
>>>>> From: Christoffer Dall
>>>>>
>>>>> As we are about to reuse our stage 2 page table manipulation code for
>>>>> shadow stage 2 page tables in the context of nested virtualization, we
>>>>> are going to manage multiple stage 2 page tables for a single VM.
>>>>>
>>>>> This requires some pretty invasive changes to our data structures,
>>>>> which moves the vmid and pgd pointers into a separate structure and
>>>>> change pretty much all of our mmu code to operate on this structure
>>>>> instead.
>>>>>
>>>>> The new structure is called struct kvm_s2_mmu.
>>>>>
>>>>> There is no intended functional change by this patch alone.
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>>> index 7dd8fefa6aecd..664a5d92ae9b8 100644
>>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>>> @@ -63,19 +63,32 @@ struct kvm_vmid {
>>>>>  	u32 vmid;
>>>>>  };
>>>>>  
>>>>> -struct kvm_arch {
>>>>> +struct kvm_s2_mmu {
>>>>>  	struct kvm_vmid vmid;
>>>>>  
>>>>> -	/* stage2 entry level table */
>>>>> -	pgd_t *pgd;
>>>>> -	phys_addr_t pgd_phys;
>>>>> -
>>>>> -	/* VTCR_EL2 value for this VM */
>>>>> -	u64 vtcr;
>>>>> +	/*
>>>>> +	 * stage2 entry level table
>>>>> +	 *
>>>>> +	 * Two kvm_s2_mmu structures in the same VM can point to the same pgd
>>>>> +	 * here. This happens when running a non-VHE guest hypervisor which
>>>>> +	 * uses the canonical stage 2 page table for both vEL2 and for vEL1/0
>>>>> +	 * with vHCR_EL2.VM == 0.
>>>> It makes more sense to me to say that a non-VHE guest hypervisor will
>>>> use the canonical stage *1* page table when running at EL2
>>> Can KVM say anything about stage 1? It's totally under the guest's
>>> control even at vEL2...
>>
>> It just occurred to me that "canonical stage 2 page table" refers to the
>> L0 hypervisor stage 2, not to the L1 hypervisor stage 2. If you don't
>> mind my suggestion, perhaps the comment can be slightly improved to avoid
>> any confusion? Maybe something along the lines of "[..] This happens when
>> running a non-VHE guest hypervisor, in which case we use the canonical
>> stage 2 page table for both vEL2 and for vEL1/0 with vHCR_EL2.VM == 0".
> 
> If the confusion stems from the lack of guest stage-2, how about:
> 
> "This happens when running a guest using a translation regime that isn't
> affected by its own stage-2 translation, such as a non-VHE hypervisor
> running at vEL2, or for vEL1/EL0 with vHCR_EL2.VM == 0. In that case, we
> use the canonical stage-2 page tables."
> 
> instead? Does this lift the ambiguity?

Yes, that's perfect.

Thanks,

Alex

> 
> Thanks,
> 
> M.
Re: [PATCH] KVM: arm64: Allow in-atomic injection of SPIs
Hi Marc, On 2020/5/27 15:55, Marc Zyngier wrote: Hi Zenghui, On 2020-05-27 08:41, Zenghui Yu wrote: On 2020/5/27 0:11, Marc Zyngier wrote: On a system that uses SPIs to implement MSIs (as it would be the case on a GICv2 system exposing a GICv2m to its guests), we deny the possibility of injecting SPIs on the in-atomic fast-path. This results in a very large amount of context-switches (roughly equivalent to twice the interrupt rate) on the host, and suboptimal performance for the guest (as measured with a test workload involving a virtio interface backed by vhost-net). Given that GICv2 systems are usually on the low-end of the spectrum performance wise, they could do without the aggravation. We solved this for GICv3+ITS by having a translation cache. But SPIs do not need any extra infrastructure, and can be immediately injected in the virtual distributor as the locking is already heavy enough that we don't need to worry about anything. This halves the number of context switches for the same workload. Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-irqfd.c | 20 arch/arm64/kvm/vgic/vgic-its.c | 3 +-- 2 files changed, 17 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c index d8cdfea5cc96..11a9f81115ab 100644 --- a/arch/arm64/kvm/vgic/vgic-irqfd.c +++ b/arch/arm64/kvm/vgic/vgic-irqfd.c @@ -107,15 +107,27 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, int irq_source_id, int level, bool line_status) ... and you may also need to update the comment on top of it to reflect this change. /** * kvm_arch_set_irq_inatomic: fast-path for irqfd injection * * Currently only direct MSI injection is supported. */ As far as I can tell, it is still valid (at least from the guest's perspective). You could in practice use that to deal with level interrupts, but we only inject the rising edge on this path, never the falling edge. 
So effectively, this is limited to edge interrupts, which is mostly MSIs. Oops... I had wrongly mixed up MSI with the architecture-defined LPI, and was thinking that we should add something like "direct SPI injection is also supported now". Sorry. Unless you are thinking of something else which I would have missed? No, please ignore the noise. Thanks, Zenghui ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH 03/26] KVM: arm64: Factor out stage 2 page table data from struct kvm
Hi Alex, On 2020-05-12 17:53, Alexandru Elisei wrote: Hi, On 5/12/20 12:17 PM, James Morse wrote: Hi Alex, Marc, (just on this last_vcpu_ran thing...) On 11/05/2020 17:38, Alexandru Elisei wrote: On 4/22/20 1:00 PM, Marc Zyngier wrote: From: Christoffer Dall As we are about to reuse our stage 2 page table manipulation code for shadow stage 2 page tables in the context of nested virtualization, we are going to manage multiple stage 2 page tables for a single VM. This requires some pretty invasive changes to our data structures, which moves the vmid and pgd pointers into a separate structure and change pretty much all of our mmu code to operate on this structure instead. The new structure is called struct kvm_s2_mmu. There is no intended functional change by this patch alone. diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 7dd8fefa6aecd..664a5d92ae9b8 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -63,19 +63,32 @@ struct kvm_vmid { u32vmid; }; -struct kvm_arch { +struct kvm_s2_mmu { struct kvm_vmid vmid; - /* stage2 entry level table */ - pgd_t *pgd; - phys_addr_t pgd_phys; - - /* VTCR_EL2 value for this VM */ - u64vtcr; + /* +* stage2 entry level table +* + * Two kvm_s2_mmu structures in the same VM can point to the same pgd + * here. This happens when running a non-VHE guest hypervisor which + * uses the canonical stage 2 page table for both vEL2 and for vEL1/0 +* with vHCR_EL2.VM == 0. It makes more sense to me to say that a non-VHE guest hypervisor will use the canonical stage *1* page table when running at EL2 Can KVM say anything about stage1? Its totally under the the guests control even at vEL2... It just occurred to me that "canonical stage 2 page table" refers to the L0 hypervisor stage 2, not to the L1 hypervisor stage 2. If you don't mind my suggestion, perhaps the comment can be slightly improved to avoid any confusion? Maybe something along the lines of "[..] 
This happens when running a non-VHE guest hypervisor, in which case we use the canonical stage 2 page table for both vEL2 and for vEL1/0 with vHCR_EL2.VM == 0". If the confusion stems from the lack of guest stage-2, how about: "This happens when running a guest using a translation regime that isn't affected by its own stage-2 translation, such as a non-VHE hypervisor running at vEL2, or for vEL1/EL0 with vHCR_EL2.VM == 0. In that case, we use the canonical stage-2 page tables." instead? Does this lift the ambiguity? Thanks, M. -- Jazz is not dead. It just smells funny... ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
RE: [RFC] Use SMMU HTTU for DMA dirty page tracking
> From: Xiang Zheng > Sent: Wednesday, May 27, 2020 2:45 PM > > > On 2020/5/27 11:27, Tian, Kevin wrote: > >> From: Xiang Zheng > >> Sent: Monday, May 25, 2020 7:34 PM > >> > >> [+cc Kirti, Yan, Alex] > >> > >> On 2020/5/23 1:14, Jean-Philippe Brucker wrote: > >>> Hi, > >>> > >>> On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote: > Hi all, > > Is there any plan for enabling SMMU HTTU? > >>> > >>> Not outside of SVA, as far as I know. > >>> > >> > I have seen the patch locates in the SVA series patch, which adds > support for HTTU: > https://www.spinics.net/lists/arm-kernel/msg798694.html > > HTTU reduces the number of access faults on SMMU fault queue > (permission faults also benifit from it). > > Besides reducing the faults, HTTU also helps to track dirty pages for > device DMA. Is it feasible to utilize HTTU to get dirty pages on device > DMA during VFIO live migration? > >>> > >>> As you know there is a VFIO interface for this under discussion: > >>> https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email- > >> kwankh...@nvidia.com/ > >>> It doesn't implement an internal API to communicate with the IOMMU > >> driver > >>> about dirty pages. > > > > We plan to add such API later, e.g. to utilize A/D bit in VT-d 2nd-level > > page tables (Rev 3.0). > > > > Thank you, Kevin. > > When will you send this series patches? Maybe(Hope) we can also support > hardware-based dirty pages tracking via common APIs based on your > patches. :) Yan is working with Kirti on basic live migration support now. After that part is done, we will start working on A/D bit support. Yes, common APIs are definitely the goal here. > > >> > >>> > If SMMU can track dirty pages, devices are not required to implement > additional dirty pages tracking to support VFIO live migration. > >>> > >>> It seems feasible, though tracking it in the device might be more > >>> efficient. 
I might have misunderstood but I think for live migration of > >>> the Intel NIC they trap guest accesses to the device and introspect its > >>> state to figure out which pages it is accessing. > > > > Does HTTU implement A/D-like mechanism in SMMU page tables, or just > > report dirty pages in a log buffer? Either way tracking dirty pages in IOMMU > > side is generic thus doesn't require device-specific tweak like in Intel > > NIC. > > > > Currently HTTU just implement A/D-like mechanism in SMMU page tables. > We certainly > expect SMMU can also implement PML-like feature so that we can avoid > walking the > whole page table to get the dirty pages. Is there a link to HTTU introduction? > > By the way, I'm not sure whether HTTU or SLAD can help for mediated deivce. > A/D bit applies to mediated device on VT-d. Thanks Kevin ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] KVM: arm64: Allow in-atomic injection of SPIs
Hi Zenghui, On 2020-05-27 08:41, Zenghui Yu wrote: On 2020/5/27 0:11, Marc Zyngier wrote: On a system that uses SPIs to implement MSIs (as it would be the case on a GICv2 system exposing a GICv2m to its guests), we deny the possibility of injecting SPIs on the in-atomic fast-path. This results in a very large amount of context-switches (roughly equivalent to twice the interrupt rate) on the host, and suboptimal performance for the guest (as measured with a test workload involving a virtio interface backed by vhost-net). Given that GICv2 systems are usually on the low-end of the spectrum performance wise, they could do without the aggravation. We solved this for GICv3+ITS by having a translation cache. But SPIs do not need any extra infrastructure, and can be immediately injected in the virtual distributor as the locking is already heavy enough that we don't need to worry about anything. This halves the number of context switches for the same workload. Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-irqfd.c | 20 arch/arm64/kvm/vgic/vgic-its.c | 3 +-- 2 files changed, 17 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c index d8cdfea5cc96..11a9f81115ab 100644 --- a/arch/arm64/kvm/vgic/vgic-irqfd.c +++ b/arch/arm64/kvm/vgic/vgic-irqfd.c @@ -107,15 +107,27 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, int irq_source_id, int level, bool line_status) ... and you may also need to update the comment on top of it to reflect this change. /** * kvm_arch_set_irq_inatomic: fast-path for irqfd injection * * Currently only direct MSI injection is supported. */ As far as I can tell, it is still valid (at least from the guest's perspective). You could in practice use that to deal with level interrupts, but we only inject the rising edge on this path, never the falling edge. So effectively, this is limited to edge interrupts, which is mostly MSIs. 
Unless you are thinking of something else which I would have missed? Thanks, M. -- Jazz is not dead. It just smells funny... ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault
On 2020-05-27 03:39, Gavin Shan wrote: Hi Mark, [...] Can you run tests with a real workload? For example, a kernel build inside the VM? Yeah, I agree it's far from a realistic workload. However, it's the test case which was suggested when async page fault was proposed from day one, according to the following document. On the page#34, you can see the benchmark, which is similar to what we're doing. https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf My own question is whether this even makes any sense 10 years later. The HW has massively changed, and this adds a whole lot of complexity to both the hypervisor and the guest. It also plays very ugly games with the exception model, which doesn't give me the warm fuzzy feeling that it's going to be great. Ok. I will test with the workload to build kernel or another better one to represent the case. Thanks, M. -- Jazz is not dead. It just smells funny... ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH] KVM: arm64: Allow in-atomic injection of SPIs
On 2020/5/27 0:11, Marc Zyngier wrote: On a system that uses SPIs to implement MSIs (as it would be the case on a GICv2 system exposing a GICv2m to its guests), we deny the possibility of injecting SPIs on the in-atomic fast-path. This results in a very large amount of context-switches (roughly equivalent to twice the interrupt rate) on the host, and suboptimal performance for the guest (as measured with a test workload involving a virtio interface backed by vhost-net). Given that GICv2 systems are usually on the low-end of the spectrum performance wise, they could do without the aggravation. We solved this for GICv3+ITS by having a translation cache. But SPIs do not need any extra infrastructure, and can be immediately injected in the virtual distributor as the locking is already heavy enough that we don't need to worry about anything. This halves the number of context switches for the same workload. Signed-off-by: Marc Zyngier --- arch/arm64/kvm/vgic/vgic-irqfd.c | 20 arch/arm64/kvm/vgic/vgic-its.c | 3 +-- 2 files changed, 17 insertions(+), 6 deletions(-) diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c index d8cdfea5cc96..11a9f81115ab 100644 --- a/arch/arm64/kvm/vgic/vgic-irqfd.c +++ b/arch/arm64/kvm/vgic/vgic-irqfd.c @@ -107,15 +107,27 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e, struct kvm *kvm, int irq_source_id, int level, bool line_status) ... and you may also need to update the comment on top of it to reflect this change. /** * kvm_arch_set_irq_inatomic: fast-path for irqfd injection * * Currently only direct MSI injection is supported. 
*/ Thanks, Zenghui { - if (e->type == KVM_IRQ_ROUTING_MSI && vgic_has_its(kvm) && level) { + if (!level) + return -EWOULDBLOCK; + + switch (e->type) { + case KVM_IRQ_ROUTING_MSI: { struct kvm_msi msi; + if (!vgic_has_its(kvm)) + return -EINVAL; + kvm_populate_msi(e, &msi); - if (!vgic_its_inject_cached_translation(kvm, &msi)) - return 0; + return vgic_its_inject_cached_translation(kvm, &msi); } - return -EWOULDBLOCK; + case KVM_IRQ_ROUTING_IRQCHIP: + /* Injecting SPIs is always possible in atomic context */ + return vgic_irqfd_set_irq(e, kvm, irq_source_id, 1, line_status); + + default: + return -EWOULDBLOCK; + } } int kvm_vgic_setup_default_irq_routing(struct kvm *kvm) diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c index c012a52b19f5..40cbaca81333 100644 --- a/arch/arm64/kvm/vgic/vgic-its.c +++ b/arch/arm64/kvm/vgic/vgic-its.c @@ -757,9 +757,8 @@ int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi) db = (u64)msi->address_hi << 32 | msi->address_lo; irq = vgic_its_check_cache(kvm, db, msi->devid, msi->data); - if (!irq) - return -1; + return -EWOULDBLOCK; raw_spin_lock_irqsave(&irq->irq_lock, flags); irq->pending_latch = true; ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
On 2020-05-27 05:05, Gavin Shan wrote: Hi Mark, [...] +struct kvm_vcpu_pv_apf_data { + __u32 reason; + __u8pad[60]; + __u32 enabled; +}; What's all the padding for? The padding is ensure the @reason and @enabled in different cache line. @reason is shared by host/guest, while @enabled is almostly owned by guest. So you are assuming that a cache line is at most 64 bytes. It is actualy implementation defined, and you can probe for it by looking at the CTR_EL0 register. There are implementations ranging from 32 to 256 bytes in the wild, and let's not mention broken big-little implementations here. [...] +bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu *vcpu) +{ + u64 vbar, pc; + u32 val; + int ret; + + if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED)) + return false; + + if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu)) + return false; + + /* Pending page fault, which ins't acknowledged by guest */ + ret = kvm_async_pf_read_cache(vcpu, ); + if (ret || val) + return false; + + /* +* Events can't be injected through data abort because it's +* going to clobber ELR_EL1, which might not consued (or saved) +* by guest yet. +*/ + vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1); + pc = *vcpu_pc(vcpu); + if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range)) + return false; Ah, so that's when this `no_fault_inst_range` is for. As-is this is not sufficient, and we'll need t be extremely careful here. The vectors themselves typically only have a small amount of stub code, and the bulk of the non-reentrant exception entry work happens elsewhere, in a mixture of assembly and C code that isn't even virtually contiguous with either the vectors or itself. It's possible in theory that code in modules (or perhaps in eBPF JIT'd code) that isn't safe to take a fault from, so even having a contiguous range controlled by the kernel isn't ideal. How does this work on x86? Yeah, here we just provide a mechanism to forbid injecting data abort. 
The range is fed by guest through HVC call. So I think it's guest related issue. You had more comments about this in PATCH[9]. I will explain a bit more there. x86 basically relies on EFLAGS[IF] flag. The async page fault can be injected if it's on. Otherwise, it's forbidden. It's workable because exception is special interrupt to x86 if I'm correct. return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) && !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS)); I really wish this was relying on an architected exception delivery mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}). Trying to guess based on the PC won't fly. But these signals are pretty hard to multiplex with anything else. Like any form of non-architected exception injection, I don't see a good path forward unless we start considering something like SDEI. M. -- Jazz is not dead. It just smells funny... ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
On 2020-05-27 03:43, Gavin Shan wrote: Hi Mark, On 5/26/20 8:42 PM, Mark Rutland wrote: On Fri, May 08, 2020 at 01:29:13PM +1000, Gavin Shan wrote: Since kvm/arm32 was removed, this renames kvm_vcpu_get_hsr() to kvm_vcpu_get_esr() to make it a bit more self-explanatory, because the function returns ESR instead of HSR on aarch64. This shouldn't cause any functional changes. Signed-off-by: Gavin Shan I think that this would be a nice cleanup on its own, and could be taken independently of the rest of this series if it were rebased and sent as a single patch. Yeah, I'll see how PATCH[3,4,5] can be posted independently as part of the preparatory work, as you suggested in another reply. By the way, I assume the cleanup patches are good enough to target 5.8-rc1/rc2 if you agree. It's fine to base them on -rc1 or -rc2. They will not be merged before 5.9 though. Thanks, M. -- Jazz is not dead. It just smells funny... ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v4 6/7] KVM: MIPS: clean up redundant 'kvm_run' parameters
On 2020/4/27 13:40, Huacai Chen wrote: Reviewed-by: Huacai Chen On Mon, Apr 27, 2020 at 12:35 PM Tianjia Zhang wrote: In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu' structure. For historical reasons, many kvm-related function parameters retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This patch does a unified cleanup of these remaining redundant parameters. Signed-off-by: Tianjia Zhang --- arch/mips/include/asm/kvm_host.h | 28 +--- arch/mips/kvm/emulate.c | 59 ++-- arch/mips/kvm/mips.c | 11 ++- arch/mips/kvm/trap_emul.c| 114 ++- arch/mips/kvm/vz.c | 26 +++ 5 files changed, 87 insertions(+), 151 deletions(-) Hi Huacai, These two patches (6/7 and 7/7) should be merged separately through the MIPS architecture tree. At present, there seems to be no good way to merge the patches for all architectures at once. For this series, some architectures' patches have already been merged, while others still need an updated patch. Thanks and best, Tianjia ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH RFCv2 6/9] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
Hi Mark, On 5/26/20 8:58 PM, Mark Rutland wrote: On Fri, May 08, 2020 at 01:29:16PM +1000, Gavin Shan wrote: This renames user_mem_abort() to kvm_handle_user_mem_abort(), and then exports it. The function will be used in asynchronous page fault to populate a page table entry once the corresponding page is populated from the backing device (e.g. swap partition): * Parameter @fault_status is replaced by @esr. * The parameters are reordered based on their importance. It seems like multiple changes are going on here, and it would be clearer with separate patches. Passing the ESR rather than the extracted fault status seems fine, but for clarity it'd be nicer to do this in its own patch. Ok. I'll have a separate patch to do this. Why is it necessary to re-order the function parameters? Does that align with other function prototypes? There are no explicit reasons for it. I can restore the order to what we had previously if you like. What exactly is the `prefault` parameter meant to do? It doesn't do anything currently, so it'd be better to introduce it later when logic using it is introduced, or where callers will pass distinct values. Yes, fair enough. I will merge the logic with PATCH[7] then. Thanks, Mark. Thanks, Gavin This shouldn't cause any functional changes. 
Signed-off-by: Gavin Shan --- arch/arm64/include/asm/kvm_host.h | 4 virt/kvm/arm/mmu.c| 14 -- 2 files changed, 12 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 32c8a675e5a4..f77c706777ec 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -437,6 +437,10 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu, struct kvm_vcpu_events *events); #define KVM_ARCH_WANT_MMU_NOTIFIER +int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr, + struct kvm_memory_slot *memslot, + phys_addr_t fault_ipa, unsigned long hva, + bool prefault); int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start, unsigned long end); int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte); diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c index e462e0368fd9..95aaabb2b1fc 100644 --- a/virt/kvm/arm/mmu.c +++ b/virt/kvm/arm/mmu.c @@ -1656,12 +1656,12 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot, (hva & ~(map_size - 1)) + map_size <= uaddr_end; } -static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, - struct kvm_memory_slot *memslot, unsigned long hva, - unsigned long fault_status) +int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr, + struct kvm_memory_slot *memslot, + phys_addr_t fault_ipa, unsigned long hva, + bool prefault) { - int ret; - u32 esr = kvm_vcpu_get_esr(vcpu); + unsigned int fault_status = kvm_vcpu_trap_get_fault_type(esr); bool write_fault, writable, force_pte = false; bool exec_fault, needs_exec; unsigned long mmu_seq; @@ -1674,6 +1674,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, pgprot_t mem_type = PAGE_S2; bool logging_active = memslot_is_logging(memslot); unsigned long vma_pagesize, flags = 0; + int ret; write_fault = kvm_is_write_fault(esr); exec_fault = kvm_vcpu_trap_is_iabt(esr); @@ -1995,7 +1996,8 @@ int 
kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run) goto out_unlock; } - ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status); + ret = kvm_handle_user_mem_abort(vcpu, esr, memslot, + fault_ipa, hva, false); if (ret == 0) ret = 1; out: -- 2.23.0 ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
RE: [RFC] Use SMMU HTTU for DMA dirty page tracking
> -Original Message- > From: zhengxiang (A) > Sent: Monday, May 25, 2020 7:34 PM > To: Jean-Philippe Brucker > Cc: linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu; > Will Deacon; Wanghaibin (D); Zengtao (B); m...@kernel.org; James Morse; > julien.thierry.k...@gmail.com; Suzuki K Poulose; Wangzhou (B); > io...@lists.linux-foundation.org; Kirti Wankhede; Yan Zhao; > alex.william...@redhat.com > Subject: Re: [RFC] Use SMMU HTTU for DMA dirty page tracking > > [+cc Kirti, Yan, Alex] > > On 2020/5/23 1:14, Jean-Philippe Brucker wrote: > > Hi, > > > > On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote: > >> Hi all, > >> > >> Is there any plan for enabling SMMU HTTU? > > > > Not outside of SVA, as far as I know. > > > > >> I have seen the patch locates in the SVA series patch, which adds > >> support for HTTU: > >> https://www.spinics.net/lists/arm-kernel/msg798694.html > >> > >> HTTU reduces the number of access faults on SMMU fault queue > >> (permission faults also benifit from it). > >> > >> Besides reducing the faults, HTTU also helps to track dirty pages for > >> device DMA. Is it feasible to utilize HTTU to get dirty pages on device > >> DMA during VFIO live migration? > > > > As you know there is a VFIO interface for this under discussion: > > > https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankhe > d...@nvidia.com/ > > It doesn't implement an internal API to communicate with the IOMMU > driver > > about dirty pages. > > > > >> If SMMU can track dirty pages, devices are not required to implement > >> additional dirty pages tracking to support VFIO live migration. > > > > It seems feasible, though tracking it in the device might be more > > efficient. I might have misunderstood but I think for live migration of > > the Intel NIC they trap guest accesses to the device and introspect its > > state to figure out which pages it is accessing. 
> > > > With HTTU I suppose (without much knowledge about live migration) > that > > you'd need several new interfaces to the IOMMU drivers: > > > > * A way for VFIO to query HTTU support in the SMMU. There are some > > discussions about communicating more IOMMU capabilities through > VFIO but > > no implementation yet. When HTTU isn't supported the DIRTY_PAGES > bitmap > > would report all pages as they do now. > > > > * VFIO_IOMMU_DIRTY_PAGES_FLAG_START/STOP would clear the dirty > bit > > for all VFIO mappings (which is going to take some time). There is a > > walker in io-pgtable for iova_to_phys() which could be extended. I > > suppose it's also possible to atomically switch the HA and HD bits in > > context descriptors. > > Maybe we need not switch HA and HD bits, just turn on them all the time? I think we need START/STOP, start is called when migration is started and STOP called when migration finished. I think you mean that during the migration(between START and STOP), HA and HD functionality is always turned on. > > > > > * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP would query the > dirty bit for all > > VFIO mappings. > > We need a clear during the query which mean we have to reset the dirty status after a query. > > I think we need to consider the case of IOMMU dirty pages logging. We > want > to test Kirti's VFIO migration patches combined with SMMU HTTU, any > suggestions? @kirti @yan @alex As we know, you have worked on this feature for a long time, and it seems that everything is going very well now, thanks very much for your VFIO migration framework, it really helps a lot, and we want to start with SMMU HTTU enabled based on your jobs, it's kind of hardware dirty page tracking, and expected to bring us better performance, any suggestions? Thank you. Zengtao ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault
Hi Mark, On 5/26/20 10:34 PM, Mark Rutland wrote: On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote: There are two stages of fault pages and the stage one page fault is handled by guest itself. The guest is trapped to host when the page fault is caused by stage 2 page table, for example missing. The guest is suspended until the requested page is populated. To populate the requested page can be related to IO activities if the page was swapped out previously. In this case, the guest has to suspend for a few of milliseconds at least, regardless of the overall system load. There is no useful work done during the suspended period from guest's view. This is a bit difficult to read. How about: | When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that | is not mapped at Stage-2), the vCPU is suspended until the host has | handled the fault. It can take the host milliseconds or longer to | handle the fault as this may require IO, and when the system load is | low neither the host nor guest perform useful work during such | periods. Yes, It's much better. This adds asychornous page fault to improve the situation. A signal Nit: typo for `asynchronous` here, and there are a few other typos in the patch itself. It would be nice if you could run a spellcheck over that. Sure. (PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time to be populated. Guest might reschedule to another running process if possible. Otherwise, the vCPU is put into power-saving mode, which is actually to cause vCPU reschedule from host's view. A followup signal (PAGE_READY) is sent to guest once the requested page is populated. The suspended task is waken up or scheduled when guest receives the signal. With this mechanism, the vCPU won't be stuck when the requested page is being populated by host. It would probably be best to say 'notification' rather than 'signal' here, and say 'the guest is notified', etc. 
As above, it seems that this is per-vCPU, so it's probably better to say 'vCPU' rather than guest, to make it clear which context this applies to. Ok. There are more details highlighted as below. Note the implementation is similar to what x86 has to some extent: * A dedicated SMCCC ID is reserved to enable, disable or configure the functionality. The only 64-bits parameter is conveyed by two registers (w2/w1). Bits[63:56] is the bitmap used to specify the operated functionality like enabling/disabling/configuration. The bits[55:6] is the physical address of control block or external data abort injection disallowed region. Bit[5:0] are used to pass control flags. * Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page isn't ready. In the mean while, a work is started to populate the page asynchronously in background. The stage 2 page table entry is updated accordingly and another signal (PAGE_READY) is fired after the request page is populted. The signals is notified by injected data abort fault. * The signals are fired and consumed in sequential fashion. It means no more signals will be fired if there is pending one, awaiting the guest to consume. It's because the injected data abort faults have to be done in sequential fashion. 
Signed-off-by: Gavin Shan --- arch/arm64/include/asm/kvm_host.h | 43 arch/arm64/include/asm/kvm_para.h | 27 ++ arch/arm64/include/uapi/asm/Kbuild | 2 - arch/arm64/include/uapi/asm/kvm_para.h | 22 ++ arch/arm64/kvm/Kconfig | 1 + arch/arm64/kvm/Makefile| 2 + include/linux/arm-smccc.h | 6 + virt/kvm/arm/arm.c | 36 ++- virt/kvm/arm/async_pf.c| 335 + virt/kvm/arm/hypercalls.c | 8 + virt/kvm/arm/mmu.c | 29 ++- 11 files changed, 506 insertions(+), 5 deletions(-) create mode 100644 arch/arm64/include/asm/kvm_para.h create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h create mode 100644 virt/kvm/arm/async_pf.c diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index f77c706777ec..a207728d6f3f 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -250,6 +250,23 @@ struct vcpu_reset_state { boolreset; }; +#ifdef CONFIG_KVM_ASYNC_PF + +/* Should be a power of two number */ +#define ASYNC_PF_PER_VCPU 64 What exactly is this number? It's maximal number of async page faults pending on the specific vCPU. + +/* + * The association of gfn and token. The token will be sent to guest as + * page fault address. Also, the guest could be in aarch32 mode. So its + * length should be 32-bits. + */ The length of what should be 32-bit? The token? The guest sees the token as the fault address? How exactly is that exposed to the guest, is that via a
Re: [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
Hi Mark, On 5/26/20 8:42 PM, Mark Rutland wrote: On Fri, May 08, 2020 at 01:29:13PM +1000, Gavin Shan wrote: Since kvm/arm32 was removed, this renames kvm_vcpu_get_hsr() to kvm_vcpu_get_esr() to it a bit more self-explaining because the functions returns ESR instead of HSR on aarch64. This shouldn't cause any functional changes. Signed-off-by: Gavin Shan I think that this would be a nice cleanup on its own, and could be taken independently of the rest of this series if it were rebased and sent as a single patch. Yeah, I'll see how PATCH[3,4,5] can be posted independently as part of the preparatory work, which is suggested by you in another reply. By the way, I assume the cleanup patches are good enough to target 5.8.rc1/rc2 if you agree. Thanks, Gavin --- arch/arm64/include/asm/kvm_emulate.h | 36 +++- arch/arm64/kvm/handle_exit.c | 12 +- arch/arm64/kvm/hyp/switch.c | 2 +- arch/arm64/kvm/sys_regs.c| 6 ++--- virt/kvm/arm/hyp/aarch32.c | 2 +- virt/kvm/arm/hyp/vgic-v3-sr.c| 4 ++-- virt/kvm/arm/mmu.c | 6 ++--- 7 files changed, 35 insertions(+), 33 deletions(-) diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h index a30b4eec7cb4..bd1a69e7c104 100644 --- a/arch/arm64/include/asm/kvm_emulate.h +++ b/arch/arm64/include/asm/kvm_emulate.h @@ -265,14 +265,14 @@ static inline bool vcpu_mode_priv(const struct kvm_vcpu *vcpu) return mode != PSR_MODE_EL0t; } -static __always_inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu) +static __always_inline u32 kvm_vcpu_get_esr(const struct kvm_vcpu *vcpu) { return vcpu->arch.fault.esr_el2; } static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu) { - u32 esr = kvm_vcpu_get_hsr(vcpu); + u32 esr = kvm_vcpu_get_esr(vcpu); if (esr & ESR_ELx_CV) return (esr & ESR_ELx_COND_MASK) >> ESR_ELx_COND_SHIFT; @@ -297,64 +297,66 @@ static inline u64 kvm_vcpu_get_disr(const struct kvm_vcpu *vcpu) static inline u32 kvm_vcpu_hvc_get_imm(const struct kvm_vcpu *vcpu) { - return 
kvm_vcpu_get_hsr(vcpu) & ESR_ELx_xVC_IMM_MASK; + return kvm_vcpu_get_esr(vcpu) & ESR_ELx_xVC_IMM_MASK; } static __always_inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu) { - return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV); + return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_ISV); } static inline unsigned long kvm_vcpu_dabt_iss_nisv_sanitized(const struct kvm_vcpu *vcpu) { - return kvm_vcpu_get_hsr(vcpu) & (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC); + return kvm_vcpu_get_esr(vcpu) & + (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC); } static inline bool kvm_vcpu_dabt_issext(const struct kvm_vcpu *vcpu) { - return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SSE); + return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SSE); } static inline bool kvm_vcpu_dabt_issf(const struct kvm_vcpu *vcpu) { - return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SF); + return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SF); } static __always_inline int kvm_vcpu_dabt_get_rd(const struct kvm_vcpu *vcpu) { - return (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT; + return (kvm_vcpu_get_esr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT; } static __always_inline bool kvm_vcpu_dabt_iss1tw(const struct kvm_vcpu *vcpu) { - return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_S1PTW); + return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_S1PTW); } static __always_inline bool kvm_vcpu_dabt_iswrite(const struct kvm_vcpu *vcpu) { - return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WNR) || + return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_WNR) || kvm_vcpu_dabt_iss1tw(vcpu); /* AF/DBM update */ } static inline bool kvm_vcpu_dabt_is_cm(const struct kvm_vcpu *vcpu) { - return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_CM); + return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_CM); } static __always_inline unsigned int kvm_vcpu_dabt_get_as(const struct kvm_vcpu *vcpu) { - return 1 << ((kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SAS) >> ESR_ELx_SAS_SHIFT); + return 1 << ((kvm_vcpu_get_esr(vcpu) & ESR_ELx_SAS) >> +ESR_ELx_SAS_SHIFT); } /* This one is not specific to Data Abort 
*/ static __always_inline bool kvm_vcpu_trap_il_is32bit(const struct kvm_vcpu *vcpu) { - return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_IL); + return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_IL); } static __always_inline u8 kvm_vcpu_trap_get_class(const struct kvm_vcpu *vcpu) { - return ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu)); + return ESR_ELx_EC(kvm_vcpu_get_esr(vcpu)); } static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu) @@ -364,12 +366,12 @@ static inline bool
Re: [PATCH RFCv2 5/9] kvm/arm64: Replace hsr with esr
Hi Mark, On 5/26/20 8:45 PM, Mark Rutland wrote: On Fri, May 08, 2020 at 01:29:15PM +1000, Gavin Shan wrote: This replaces the variable names to make them self-explanatory: * @hsr to @esr * @hsr_ec to @ec * Use the kvm_vcpu_trap_get_class() helper where possible The tracepoint isn't changed accordingly because it's part of the ABI. Signed-off-by: Gavin Shan As with patch 3, I think this cleanup makes sense independent from the rest of the series, and I think it'd make sense to bundle all the patches renaming hsr -> esr, and send those as a preparatory series. Yes, PATCH[3/4/5] will be posted independently, as part of the preparatory work, as you suggested. Thanks, Gavin Thanks, Mark. --- arch/arm64/kvm/handle_exit.c | 28 ++-- arch/arm64/kvm/hyp/switch.c | 9 - arch/arm64/kvm/sys_regs.c | 30 +++--- 3 files changed, 33 insertions(+), 34 deletions(-) diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c index 00858db82a64..e3b3dcd5b811 100644 --- a/arch/arm64/kvm/handle_exit.c +++ b/arch/arm64/kvm/handle_exit.c @@ -123,13 +123,13 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct kvm_run *run) */ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run) { - u32 hsr = kvm_vcpu_get_esr(vcpu); + u32 esr = kvm_vcpu_get_esr(vcpu); int ret = 0; run->exit_reason = KVM_EXIT_DEBUG; - run->debug.arch.hsr = hsr; + run->debug.arch.hsr = esr; - switch (ESR_ELx_EC(hsr)) { + switch (kvm_vcpu_trap_get_class(esr)) { case ESR_ELx_EC_WATCHPT_LOW: run->debug.arch.far = vcpu->arch.fault.far_el2; /* fall through */ @@ -139,8 +139,8 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run) case ESR_ELx_EC_BRK64: break; default: - kvm_err("%s: un-handled case hsr: %#08x\n", - __func__, (unsigned int) hsr); + kvm_err("%s: un-handled case esr: %#08x\n", + __func__, (unsigned int)esr); ret = -1; break; } @@ -150,10 +150,10 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run) static int
kvm_handle_unknown_ec(struct kvm_vcpu *vcpu, struct kvm_run *run) { - u32 hsr = kvm_vcpu_get_esr(vcpu); + u32 esr = kvm_vcpu_get_esr(vcpu); - kvm_pr_unimpl("Unknown exception class: hsr: %#08x -- %s\n", - hsr, esr_get_class_string(hsr)); + kvm_pr_unimpl("Unknown exception class: esr: %#08x -- %s\n", + esr, esr_get_class_string(esr)); kvm_inject_undefined(vcpu); return 1; @@ -230,10 +230,10 @@ static exit_handle_fn arm_exit_handlers[] = { static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu) { - u32 hsr = kvm_vcpu_get_esr(vcpu); - u8 hsr_ec = ESR_ELx_EC(hsr); + u32 esr = kvm_vcpu_get_esr(vcpu); + u8 ec = kvm_vcpu_trap_get_class(esr); - return arm_exit_handlers[hsr_ec]; + return arm_exit_handlers[ec]; } /* @@ -273,15 +273,15 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run, { if (ARM_SERROR_PENDING(exception_index)) { u32 esr = kvm_vcpu_get_esr(vcpu); - u8 hsr_ec = ESR_ELx_EC(esr); + u8 ec = kvm_vcpu_trap_get_class(esr); /* * HVC/SMC already have an adjusted PC, which we need * to correct in order to return to after having * injected the SError. */ - if (hsr_ec == ESR_ELx_EC_HVC32 || hsr_ec == ESR_ELx_EC_HVC64 || - hsr_ec == ESR_ELx_EC_SMC32 || hsr_ec == ESR_ELx_EC_SMC64) { + if (ec == ESR_ELx_EC_HVC32 || ec == ESR_ELx_EC_HVC64 || + ec == ESR_ELx_EC_SMC32 || ec == ESR_ELx_EC_SMC64) { u32 adj = kvm_vcpu_trap_il_is32bit(esr) ? 
4 : 2; *vcpu_pc(vcpu) -= adj; } diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c index 369f22f49f3d..7bf4840bf90e 100644 --- a/arch/arm64/kvm/hyp/switch.c +++ b/arch/arm64/kvm/hyp/switch.c @@ -356,8 +356,8 @@ static bool __hyp_text __populate_fault_info(struct kvm_vcpu *vcpu) static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu) { u32 esr = kvm_vcpu_get_esr(vcpu); + u8 ec = kvm_vcpu_trap_get_class(esr); bool vhe, sve_guest, sve_host; - u8 hsr_ec; if (!system_supports_fpsimd()) return false; @@ -372,14 +372,13 @@ static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu) vhe = has_vhe(); } - hsr_ec = kvm_vcpu_trap_get_class(esr); - if (hsr_ec != ESR_ELx_EC_FP_ASIMD && - hsr_ec != ESR_ELx_EC_SVE) + if (ec != ESR_ELx_EC_FP_ASIMD && + ec !=
Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault
Hi Mark, On 5/26/20 11:09 PM, Mark Rutland wrote: At a high level I'm rather fearful of this series. I can see many ways that this can break, and I can also see that even if/when we get things into a working state, constant vigilance will be required for any changes to the entry code. I'm not keen on injecting non-architectural exceptions in this way, and I'm also not keen on how deep the PV hooks are injected currently (e.g. in the ret_to_user path). First of all, thank you for your time and your continued comments. Since the series is tagged as RFC, it's not a surprise that something is obviously broken. If I'm correct, you're talking about the added entry code and the injected PV hooks; please provide more details about your concerns so that I can figure out the solutions. Let me briefly explain why we need the injected PV hooks in ret_to_user: there are two fashions of wakeup, which I would call direct wakeup and delayed wakeup. The sleeping process is woken up directly when the PAGE_READY notification is received from the host; that is direct wakeup. However, there are some cases where direct wakeup can't be carried out, for example when the sleeper and the waker are the same process or the (CFS) runqueue has been locked by somebody else. In those cases, the wakeup is delayed until the idle process is running or until ret_to_user; that's how delayed wakeup works. I see a few patches have preparatory cleanup that I think would be worthwhile regardless of this series; if you could factor those out and send them on their own it would get that out of the way and make it easier to review the series itself. Similarly, there's some duplication of code from arch/x86 which I think can be factored out to virt/kvm instead as preparatory work. Yep, I agree there are several cleanup patches that can be posted separately and merged in advance.
I will do that, and thanks for the comments. About the code shared between arm64/x86, I need some time to investigate. Basically, I agree to do so. I also included Paolo here to get his opinion. There's no doubt these are all preparatory work, to make the review a bit easier as you said :) Generally, I also think that you need to spend some time on commit messages and/or documentation to better explain the concepts and expected usage. I had to reverse-engineer the series by reviewing it in its entirety before I had an idea as to how basic parts of it strung together, and a more thorough conceptual explanation would make it much easier to critique the approach rather than the individual patches. Yes, sure. I will do this in the future. Sorry that the reverse-engineering took so much of your time. In the next revision, I might put more information in the cover letter and commit log to explain how things are designed and working :) On Fri, May 08, 2020 at 01:29:10PM +1000, Gavin Shan wrote: Testing === The tests are carried out on the following machine. A guest with a single vCPU and 4GB memory is started. Also, the QEMU process is put into a memory cgroup (v1) whose memory limit is set to 2GB. In the guest, there are two threads, one memory bound and one CPU bound. The memory bound thread allocates all available memory, accesses it, and then frees it. The CPU bound thread simply executes blocks of "nop". I appreciate this is a microbenchmark, but that sounds far from realistic. Is there a specific real workload that it's expected to be representative of? Can you run tests with a real workload? For example, a kernel build inside the VM? Yeah, I agree it's far from a realistic workload. However, it's the test case that was suggested when async page fault was proposed from day one, according to the following document. On page 34, you can see the benchmark, which is similar to what we're doing. https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf Ok.
I will test with a kernel-build workload or another, better one to represent the case. The test is carried out 5 times consecutively and the average number (per minute) of executed blocks in the CPU bound thread is taken as the indicator of improvement. Vendor: GIGABYTE CPU: 224 x Cavium ThunderX2(R) CPU CN9975 v2.2 @ 2.0GHz Memory: 32GB Disk: Fusion-MPT SAS-3 (PCIe3.0 x8) Without-APF: 7029030180/minute = avg(7559625120 5962155840 7823208540 7629633480 6170527920) With-APF: 8286827472/minute = avg(8464584540 8177073360 8262723180 8095084020 8434672260) Outcome: +17.8% Another test case is to measure the time consumed by the application, but with the CPU-bound thread disabled. Without-APF: 40.3s = avg(40.6 39.3 39.2 41.6 41.2) With-APF: 40.8s = avg(40.6 41.1 40.9 41.0 40.7) Outcome: +1.2% So this is pure overhead in that case?
Re: [PATCH RFCv2 4/9] kvm/arm64: Detach ESR operator from vCPU struct
Hi Mark, On 5/26/20 8:51 PM, Mark Rutland wrote: On Fri, May 08, 2020 at 01:29:14PM +1000, Gavin Shan wrote: There is a set of inline functions defined in kvm_emulate.h. Those functions read the ESR from the vCPU fault information struct and then operate on it, so they're tied to the vCPU fault information and the vCPU struct, which limits their usage scope. This detaches these functions from the vCPU struct. With this, the caller has the flexibility to decide where the ESR is read from. It shouldn't cause any functional changes. Signed-off-by: Gavin Shan --- arch/arm64/include/asm/kvm_emulate.h | 83 +++- arch/arm64/kvm/handle_exit.c | 20 -- arch/arm64/kvm/hyp/switch.c | 24 --- arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c | 7 +- arch/arm64/kvm/inject_fault.c | 4 +- arch/arm64/kvm/sys_regs.c | 12 ++-- virt/kvm/arm/arm.c | 4 +- virt/kvm/arm/hyp/aarch32.c | 2 +- virt/kvm/arm/hyp/vgic-v3-sr.c | 5 +- virt/kvm/arm/mmio.c | 27 virt/kvm/arm/mmu.c | 22 --- 11 files changed, 112 insertions(+), 98 deletions(-) diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h index bd1a69e7c104..2873bf6dc85e 100644 --- a/arch/arm64/include/asm/kvm_emulate.h +++ b/arch/arm64/include/asm/kvm_emulate.h @@ -270,10 +270,8 @@ static __always_inline u32 kvm_vcpu_get_esr(const struct kvm_vcpu *vcpu) return vcpu->arch.fault.esr_el2; } -static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu) +static __always_inline int kvm_vcpu_get_condition(u32 esr) Given the `vcpu` argument has been removed, it's odd to keep `vcpu` in the name, rather than `esr`. e.g. this would make more sense as something like esr_get_condition(). ... and if we did something like that, we could move most of the extraction functions into <asm/esr.h>, and share them with non-KVM code. Otherwise, do you need to extract all of these for your use-case, or do you only need a few of the helpers? If you only need a few, it might be better to only factor those out for now, and keep the existing API in place with wrappers, e.g.
have: | esr_get_condition(u32 esr) { | ... | } | | kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu) | { | return esr_get_condition(kvm_vcpu_get_esr(vcpu)); | } Sure, I'll follow approach #1: move these helper functions to asm/esr.h, with "vcpu" dropped from their names. I don't think it makes sense to maintain two sets of helper functions for such simple logic. So the helper functions will be called where they should be, as below: esr_get_condition(u32 esr) { ... } bool __hyp_text kvm_condition_valid32(const struct kvm_vcpu *vcpu) { int cond = esr_get_condition(kvm_vcpu_get_esr(vcpu)); : } Thanks, Gavin ___ kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Re: [PATCH v4 3/7] KVM: PPC: Remove redundant kvm_run from vcpu_arch
On 2020/5/27 12:20, Paul Mackerras wrote: On Mon, Apr 27, 2020 at 12:35:10PM +0800, Tianjia Zhang wrote: The 'kvm_run' field already exists in the 'vcpu' structure; it is the same structure as the 'kvm_run' in 'vcpu_arch', which should therefore be deleted. Signed-off-by: Tianjia Zhang Thanks, patches 3 and 4 of this series applied to my kvm-ppc-next branch. Paul. Thanks for your suggestion; for 5/7, I will submit a new version of the patch. Thanks, Tianjia
Re: [PATCH RFCv2 9/9] arm64: Support async page fault
Hi Gavin, I definitely appreciate the work, but this is repeating most of the mistakes done in the x86 implementation. In particular: - the page ready signal can be done as an interrupt, rather than an exception. This is because "page ready" can be handled asynchronously, in contrast to "page not present" which must be done on the same instruction that triggers it. You can refer to the recent series from Vitaly Kuznetsov that switched "page ready" to an interrupt. - the page not present is reusing the memory abort exception, and there's really no reason to do so. I think it would be best if ARM could reserve one ESR exception code for the hypervisor. Mark, any ideas how to proceed here? - for x86 we're also thinking of initiating the page fault from the exception handler, rather than doing so from the hypervisor before injecting the exception. If ARM leads the way here, we would do our best to share code when x86 does the same. - do not bother with using KVM_ASYNC_PF_SEND_ALWAYS, it's a fringe case that adds a lot of complexity. Also, please include me on further iterations of the series. Thanks! Paolo On 08/05/20 05:29, Gavin Shan wrote: > This supports asynchronous page fault for the guest. The design is > similar to what x86 has: on receiving a PAGE_NOT_PRESENT signal from > the host, the current task is either rescheduled or put into power > saving mode. The task will be woken up when the PAGE_READY signal is > received. The PAGE_READY signal might be received in the context > of the suspended process, to be woken up. That means the suspended > process has to wake itself up, but that's not safe and is prone to cause > deadlock on the CPU runqueue lock. So the wakeup is delayed on returning > from kernel space to user space, or the idle process is picked for running. > > The signals are conveyed through the async page fault control block, > which was passed to the host on enabling the functionality.
On each page > fault, the control block is checked and switch to the async page fault > handling flow if any signals exist. > > The feature is put into the CONFIG_KVM_GUEST umbrella, which is added > by this patch. So we have inline functions implemented in kvm_para.h, > like other architectures do, to check if async page fault (one of the > KVM para-virtualized features) is available. Also, the kernel boot > parameter "no-kvmapf" can be specified to disable the feature. > > Signed-off-by: Gavin Shan > --- > arch/arm64/Kconfig | 11 + > arch/arm64/include/asm/exception.h | 3 + > arch/arm64/include/asm/kvm_para.h | 27 +- > arch/arm64/kernel/entry.S | 33 +++ > arch/arm64/kernel/process.c| 4 + > arch/arm64/mm/fault.c | 434 + > 6 files changed, 505 insertions(+), 7 deletions(-) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index 40fb05d96c60..2d5e5ee62d6d 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -1045,6 +1045,17 @@ config PARAVIRT > under a hypervisor, potentially improving performance significantly > over full virtualization. > > +config KVM_GUEST > + bool "KVM Guest Support" > + depends on PARAVIRT > + default y > + help > + This option enables various optimizations for running under the KVM > + hypervisor. Overhead for the kernel when not running inside KVM should > + be minimal. 
> + > + In case of doubt, say Y > + > config PARAVIRT_TIME_ACCOUNTING > bool "Paravirtual steal time accounting" > select PARAVIRT > diff --git a/arch/arm64/include/asm/exception.h > b/arch/arm64/include/asm/exception.h > index 7a6e81ca23a8..d878afa42746 100644 > --- a/arch/arm64/include/asm/exception.h > +++ b/arch/arm64/include/asm/exception.h > @@ -46,4 +46,7 @@ void bad_el0_sync(struct pt_regs *regs, int reason, > unsigned int esr); > void do_cp15instr(unsigned int esr, struct pt_regs *regs); > void do_el0_svc(struct pt_regs *regs); > void do_el0_svc_compat(struct pt_regs *regs); > +#ifdef CONFIG_KVM_GUEST > +void kvm_async_pf_delayed_wake(void); > +#endif > #endif /* __ASM_EXCEPTION_H */ > diff --git a/arch/arm64/include/asm/kvm_para.h > b/arch/arm64/include/asm/kvm_para.h > index 0ea481dd1c7a..b2f8ef243df7 100644 > --- a/arch/arm64/include/asm/kvm_para.h > +++ b/arch/arm64/include/asm/kvm_para.h > @@ -3,6 +3,20 @@ > #define _ASM_ARM_KVM_PARA_H > > #include > +#include > +#include > + > +#ifdef CONFIG_KVM_GUEST > +static inline int kvm_para_available(void) > +{ > + return 1; > +} > +#else > +static inline int kvm_para_available(void) > +{ > + return 0; > +} > +#endif /* CONFIG_KVM_GUEST */ > > static inline bool kvm_check_and_clear_guest_paused(void) > { > @@ -11,17 +25,16 @@ static inline bool kvm_check_and_clear_guest_paused(void) > > static inline unsigned int kvm_arch_para_features(void) > { > - return 0; > + unsigned int features = 0; > + > + if
Re: [RFC] Use SMMU HTTU for DMA dirty page tracking
On 2020/5/27 11:27, Tian, Kevin wrote: >> From: Xiang Zheng >> Sent: Monday, May 25, 2020 7:34 PM >> >> [+cc Kirti, Yan, Alex] >> >> On 2020/5/23 1:14, Jean-Philippe Brucker wrote: >>> Hi, >>> >>> On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote: Hi all, Is there any plan for enabling SMMU HTTU? >>> >>> Not outside of SVA, as far as I know. >>> >> I have seen the patch located in the SVA series, which adds support for HTTU: https://www.spinics.net/lists/arm-kernel/msg798694.html HTTU reduces the number of access faults on the SMMU fault queue (permission faults also benefit from it). Besides reducing the faults, HTTU also helps to track dirty pages for device DMA. Is it feasible to utilize HTTU to get dirty pages on device DMA during VFIO live migration? >>> >>> As you know there is a VFIO interface for this under discussion: >>> https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email- >> kwankh...@nvidia.com/ >>> It doesn't implement an internal API to communicate with the IOMMU >> driver >>> about dirty pages. > > We plan to add such an API later, e.g. to utilize the A/D bit in VT-d 2nd-level > page tables (Rev 3.0). > Thank you, Kevin. When will you send this series of patches? Maybe (hopefully) we can also support hardware-based dirty page tracking via common APIs based on your patches. :) >> >>> If the SMMU can track dirty pages, devices are not required to implement additional dirty page tracking to support VFIO live migration. >>> >>> It seems feasible, though tracking it in the device might be more >>> efficient. I might have misunderstood but I think for live migration of >>> the Intel NIC they trap guest accesses to the device and introspect its >>> state to figure out which pages it is accessing. > > Does HTTU implement an A/D-like mechanism in SMMU page tables, or just > report dirty pages in a log buffer? Either way, tracking dirty pages on the IOMMU > side is generic and thus doesn't require a device-specific tweak like in the Intel NIC.
> Currently HTTU just implements an A/D-like mechanism in SMMU page tables. We certainly expect SMMU to also implement a PML-like feature so that we can avoid walking the whole page table to get the dirty pages. By the way, I'm not sure whether HTTU or SLAD can help for mediated devices. -- Thanks, Xiang
RE: [RFC PATCH v12 07/11] psci: Add hypercall service for kvm ptp.
Hi Steven, > -Original Message- > From: Steven Price > Sent: Tuesday, May 26, 2020 7:02 PM > To: Jianyong Wu ; net...@vger.kernel.org; > yangbo...@nxp.com; john.stu...@linaro.org; t...@linutronix.de; > pbonz...@redhat.com; sean.j.christopher...@intel.com; m...@kernel.org; > richardcoch...@gmail.com; Mark Rutland ; > w...@kernel.org; Suzuki Poulose > Cc: linux-ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org; > kvmarm@lists.cs.columbia.edu; k...@vger.kernel.org; Steve Capper > ; Kaly Xin ; Justin He > ; Wei Chen ; nd > Subject: Re: [RFC PATCH v12 07/11] psci: Add hypercall service for kvm ptp. > > On 25/05/2020 03:11, Jianyong Wu wrote: > > Hi Steven, > > Hi Jianyong, > > [...]>>> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c > >>> index db6dce3d0e23..c964122f8dae 100644 > >>> --- a/virt/kvm/arm/hypercalls.c > >>> +++ b/virt/kvm/arm/hypercalls.c > >>> @@ -3,6 +3,7 @@ > >>> > >>>#include > >>>#include > >>> +#include > >>> > >>>#include > >>> > >>> @@ -11,6 +12,10 @@ > >>> > >>>int kvm_hvc_call_handler(struct kvm_vcpu *vcpu) > >>>{ > >>> +#ifdef CONFIG_ARM64_KVM_PTP_HOST > >>> + struct system_time_snapshot systime_snapshot; > >>> + u64 cycles; > >>> +#endif > >>> u32 func_id = smccc_get_function(vcpu); > >>> u32 val[4] = {SMCCC_RET_NOT_SUPPORTED}; > >>> u32 feature; > >>> @@ -70,7 +75,49 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu) > >>> break; > >>> case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID: > >>> val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES); > >>> + > >>> +#ifdef CONFIG_ARM64_KVM_PTP_HOST > >>> + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_KVM_PTP); #endif > >>> break; > >>> + > >>> +#ifdef CONFIG_ARM64_KVM_PTP_HOST > >>> + /* > >>> + * This serves virtual kvm_ptp. > >>> + * Four values will be passed back. 
> >>> + * reg0 stores high 32-bit host ktime; > >>> + * reg1 stores low 32-bit host ktime; > >>> + * reg2 stores high 32-bit difference of host cycles and cntvoff; > >>> + * reg3 stores low 32-bit difference of host cycles and cntvoff. > >>> + */ > >>> + case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID: > >>> + /* > >>> + * system time and counter value must be captured at the same > >>> + * time to keep consistency and precision. > >>> + */ > >>> + ktime_get_snapshot(&systime_snapshot); > >>> + if (systime_snapshot.cs_id != CSID_ARM_ARCH_COUNTER) > >>> + break; > >>> + val[0] = upper_32_bits(systime_snapshot.real); > >>> + val[1] = lower_32_bits(systime_snapshot.real); > >>> + /* > >>> + * which of virtual counter or physical counter being > >>> + * asked for is decided by the first argument. > >>> + */ > >>> + feature = smccc_get_arg1(vcpu); > >>> + switch (feature) { > >>> + case ARM_SMCCC_VENDOR_HYP_KVM_PTP_PHY_FUNC_ID: > >>> + cycles = systime_snapshot.cycles; > >>> + break; > >>> + default: > >> > >> There's something a bit odd here. > >> > >> ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID and > >> ARM_SMCCC_VENDOR_HYP_KVM_PTP_PHY_FUNC_ID look like they > should be > >> names of separate (top-level) functions, but actually the _PHY_ one > >> is a parameter for the first. If the intention is to have a parameter > >> then it would be better to pick a better name for the _PHY_ define > >> and not define it using ARM_SMCCC_CALL_VAL. > >> > > Yeah, _PHY_ doesn't have the same meaning as _PTP_FUNC_ID, so I think it > should be a different name. > > What about ARM_SMCCC_VENDOR_HYP_KVM_PTP_PHY_COUNTER? > > Personally I'd go with something much shorter, e.g. > ARM_PTP_PHY_COUNTER. > This is just an argument to an SMCCC call so there's no need for most of the > prefix, indeed if (for whatever reason) there was a non-SMCCC mechanism > added to do the same thing it would be reasonable to reuse the same values. > Ok, this shorter name is better.
> >> Second, the use of "default:" means that there's no possibility to > >> later extend this interface for more clocks if needed in the future. > >> > > I think we can add more clocks by adding more cases; this "default" means > we can pass no first arg to select the default clock. > > The problem with the 'default' is it means it's not possible to probe whether > the kernel supports any more clocks. If we used a different value (that the > kernel doesn't support) then we end up in the default case and have no idea > whether the clock value is the one we requested or not. > Yeah, that's more meaningful. We should return the exact value back to the user. > It's generally better when defining an ABI to explicitly return an error for > unknown parameters, that way a future user of the ABI can discover > whether the call did what was expected or not. > Ok. I will fix it. > >> Alternatively you could indeed implement as two top-level functions