Re: [PATCH 1/3] KVM: arm64: Stop writing aarch32's CSSELR into ACTLR

2020-05-27 Thread Sasha Levin
Hi

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.6.14, v5.4.42, v4.19.124, v4.14.181, 
v4.9.224, v4.4.224.

v5.6.14: Build OK!
v5.4.42: Build OK!
v4.19.124: Failed to apply! Possible dependencies:
f7f2b15c3d42 ("arm64: KVM: Expose sanitised cache type register to guest")

v4.14.181: Failed to apply! Possible dependencies:
005781be127f ("arm64: KVM: Move CPU ID reg trap setup off the world switch path")
93390c0a1b20 ("arm64: KVM: Hide unsupported AArch64 CPU features from guests")
f7f2b15c3d42 ("arm64: KVM: Expose sanitised cache type register to guest")

v4.9.224: Failed to apply! Possible dependencies:
005781be127f ("arm64: KVM: Move CPU ID reg trap setup off the world switch path")
016f98afd050 ("irqchip/gic-v3: Use nops macro for Cavium ThunderX erratum 23154")
0d449541c185 ("KVM: arm64: use common invariant sysreg definitions")
0e9884fe63c6 ("arm64: sysreg: subsume GICv3 sysreg definitions")
14ae7518dd55 ("arm64: sysreg: add register encodings used by KVM")
47863d41ecf8 ("arm64: sysreg: sort by encoding")
82e0191a1aa1 ("arm64: Support systems without FP/ASIMD")
851050a573e1 ("KVM: arm64: Use common sysreg definitions")
93390c0a1b20 ("arm64: KVM: Hide unsupported AArch64 CPU features from guests")
bca8f17f57bd ("arm64: Get rid of asm/opcodes.h")
c7a3c61fc606 ("arm64: sysreg: add performance monitor registers")
c9a3c58f01fb ("KVM: arm64: Add the EL1 physical timer access handler")
cd9e1927a525 ("arm64: Work around broken .inst when defective gas is detected")
f7f2b15c3d42 ("arm64: KVM: Expose sanitised cache type register to guest")

v4.4.224: Failed to apply! Possible dependencies:
005781be127f ("arm64: KVM: Move CPU ID reg trap setup off the world switch path")
06282fd2c2bf ("arm64: KVM: Implement vgic-v2 save/restore")
068a17a5805d ("arm64: mm: create new fine-grained mappings at boot")
072f0a633838 ("arm64: Introduce raw_{d,i}cache_line_size")
0a28714c53fd ("arm64: Use PoU cache instr for I/D coherency")
116c81f427ff ("arm64: Work around systems with mismatched cache line sizes")
1431af367e52 ("arm64: KVM: Implement timer save/restore")
157962f5a8f2 ("arm64: decouple early fixmap init from linear mapping")
1e48ef7fcc37 ("arm64: add support for building vmlinux as a relocatable PIE binary")
2a803c4db615 ("arm64: head.S: use memset to clear BSS")
57f4959bad0a ("arm64: kernel: Add support for User Access Override")
6d6ec20fcf28 ("arm64: KVM: Implement system register save/restore")
7b7293ae3dbd ("arm64: Fold proc-macros.S into assembler.h")
82869ac57b5d ("arm64: kernel: Add support for hibernate/suspend-to-disk")
82e0191a1aa1 ("arm64: Support systems without FP/ASIMD")
8eb992674c9e ("arm64: KVM: Implement debug save/restore")
910917bb7db0 ("arm64: KVM: Map the kernel RO section into HYP")
93390c0a1b20 ("arm64: KVM: Hide unsupported AArch64 CPU features from guests")
9e8e865bbe29 ("arm64: unify idmap removal")
a0bf9776cd0b ("arm64: kvm: deal with kernel symbols outside of linear mapping")
a7f8de168ace ("arm64: allow kernel Image to be loaded anywhere in physical memory")
ab893fb9f1b1 ("arm64: introduce KIMAGE_VADDR as the virtual base of the kernel region")
adc9b2dfd009 ("arm64: kernel: Rework finisher callback out of __cpu_suspend_enter()")
b3122023df93 ("arm64: Fix an enum typo in mm/dump.c")
b97b66c14b96 ("arm64: KVM: Implement guest entry")
be901e9b15cd ("arm64: KVM: Implement the core world switch")
c1a88e9124a4 ("arm64: kasan: avoid TLB conflicts")
c76a0a6695c6 ("arm64: KVM: Add a HYP-specific header file")
d5370f754875 ("arm64: prefetch: add alternative pattern for CPUs without a prefetcher")
f68d2b1b73cc ("arm64: KVM: Implement vgic-v3 save/restore")
f7f2b15c3d42 ("arm64: KVM: Expose sanitised cache type register to guest")
f80fb3a3d508 ("arm64: add support for kernel ASLR")
f9040773b7bb ("arm64: move kernel image to base of vmalloc area")
fd045f6cd98e ("arm64: add support for module PLTs")


NOTE: The patch will not be queued to stable trees until it is upstream.

How should we proceed with this patch?

-- 
Thanks
Sasha


Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault

2020-05-27 Thread Paolo Bonzini
On 27/05/20 09:48, Marc Zyngier wrote:
> 
> My own question is whether this even makes any sense 10 years later.
> The HW has massively changed, and this adds a whole lot of complexity
> to both the hypervisor and the guest.

It still makes sense, but indeed it's for different reasons.  One
example is host page cache sharing, where (parts of) the host page cache
are visible to the guest.  In this context, async page faults are used
for any kind of host page faults, not just paging out memory due to
overcommit.

But I agree that it is very, very important to design the exception model
first, as we're witnessing in x86 land the problems that come with a poor
design. Nothing major, but just pain all around.

Paolo

> It also plays very ugly games
> with the exception model, which doesn't give me the warm fuzzy feeling
> that it's going to be great.



Re: [PATCH 26/26] KVM: arm64: Parametrize exception entry with a target EL

2020-05-27 Thread Mark Rutland
On Wed, May 27, 2020 at 10:34:09AM +0100, Marc Zyngier wrote:
> Hi Mark,
> 
> On 2020-05-19 11:44, Mark Rutland wrote:
> > On Wed, Apr 22, 2020 at 01:00:50PM +0100, Marc Zyngier wrote:
> > > -static unsigned long get_except64_pstate(struct kvm_vcpu *vcpu)
> > > +static void enter_exception(struct kvm_vcpu *vcpu, unsigned long
> > > target_mode,
> > > + enum exception_type type)
> > 
> > Since this is all for an AArch64 target, could we keep `64` in the name,
> > e.g enter_exception64? That'd mirror the callers below.
> > 
> > >  {
> > > - unsigned long sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> > > - unsigned long old, new;
> > > + unsigned long sctlr, vbar, old, new, mode;
> > > + u64 exc_offset;
> > > +
> > > + mode = *vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT);
> > > +
> > > + if  (mode == target_mode)
> > > + exc_offset = CURRENT_EL_SP_ELx_VECTOR;
> > > + else if ((mode | 1) == target_mode)
> > > + exc_offset = CURRENT_EL_SP_EL0_VECTOR;
> > 
> > It would be nice if we could add a mnemonic for the `1` here, e.g.
> > PSR_MODE_SP0 or PSR_MODE_THREAD_BIT.
> 
> I've addressed both comments as follows:
> 
> diff --git a/arch/arm64/include/asm/ptrace.h
> b/arch/arm64/include/asm/ptrace.h
> index bf57308fcd63..953b6a1ce549 100644
> --- a/arch/arm64/include/asm/ptrace.h
> +++ b/arch/arm64/include/asm/ptrace.h
> @@ -35,6 +35,7 @@
>  #define GIC_PRIO_PSR_I_SET   (1 << 4)
> 
>  /* Additional SPSR bits not exposed in the UABI */
> +#define PSR_MODE_THREAD_BIT  (1 << 0)
>  #define PSR_IL_BIT   (1 << 20)
> 
>  /* AArch32-specific ptrace requests */
> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
> index 3dbcbc839b9c..ebfdfc27b2bd 100644
> --- a/arch/arm64/kvm/inject_fault.c
> +++ b/arch/arm64/kvm/inject_fault.c
> @@ -43,8 +43,8 @@ enum exception_type {
>   * Here we manipulate the fields in order of the AArch64 SPSR_ELx layout,
> from
>   * MSB to LSB.
>   */
> -static void enter_exception(struct kvm_vcpu *vcpu, unsigned long
> target_mode,
> - enum exception_type type)
> +static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long
> target_mode,
> +   enum exception_type type)
>  {
>   unsigned long sctlr, vbar, old, new, mode;
>   u64 exc_offset;
> @@ -53,7 +53,7 @@ static void enter_exception(struct kvm_vcpu *vcpu,
> unsigned long target_mode,
> 
>   if  (mode == target_mode)
>   exc_offset = CURRENT_EL_SP_ELx_VECTOR;
> - else if ((mode | 1) == target_mode)
> + else if ((mode | PSR_MODE_THREAD_BIT) == target_mode)
>   exc_offset = CURRENT_EL_SP_EL0_VECTOR;
>   else if (!(mode & PSR_MODE32_BIT))
>   exc_offset = LOWER_EL_AArch64_VECTOR;
> @@ -126,7 +126,7 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool
> is_iabt, unsigned long addr
>   bool is_aarch32 = vcpu_mode_is_32bit(vcpu);
>   u32 esr = 0;
> 
> - enter_exception(vcpu, PSR_MODE_EL1h, except_type_sync);
> + enter_exception64(vcpu, PSR_MODE_EL1h, except_type_sync);
> 
>   vcpu_write_sys_reg(vcpu, addr, FAR_EL1);
> 
> @@ -156,7 +156,7 @@ static void inject_undef64(struct kvm_vcpu *vcpu)
>  {
>   u32 esr = (ESR_ELx_EC_UNKNOWN << ESR_ELx_EC_SHIFT);
> 
> - enter_exception(vcpu, PSR_MODE_EL1h, except_type_sync);
> + enter_exception64(vcpu, PSR_MODE_EL1h, except_type_sync);
> 
>   /*
>* Build an unknown exception, depending on the instruction

Thanks; that all looks good to me, and my R-b stands!

Mark.
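
[Editor's note] For readers following the thread, the rewritten check in
enter_exception64() resolves as follows for target_mode == PSR_MODE_EL1h.
This is an illustration assembled from the diff above, not part of the patch:

static u64 example_exc_offset(unsigned long mode)
{
	const unsigned long target_mode = PSR_MODE_EL1h;

	if (mode == target_mode)				/* guest at EL1h */
		return CURRENT_EL_SP_ELx_VECTOR;
	if ((mode | PSR_MODE_THREAD_BIT) == target_mode)	/* guest at EL1t */
		return CURRENT_EL_SP_EL0_VECTOR;
	if (!(mode & PSR_MODE32_BIT))				/* guest at EL0t, AArch64 */
		return LOWER_EL_AArch64_VECTOR;

	return LOWER_EL_AArch32_VECTOR;				/* guest in AArch32 */
}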


Re: [PATCH 20/26] KVM: arm64: Move ELR_EL1 to the system register array

2020-05-27 Thread Marc Zyngier

On 2020-05-26 17:29, James Morse wrote:

Hi Marc,

On 22/04/2020 13:00, Marc Zyngier wrote:
As ELR_EL1 is a VNCR-capable register with ARMv8.4-NV, let's move it to
the sys_regs array and repaint the accessors. While we're at it, let's
kill the now useless accessors used only on the fault injection path.


Reviewed-by: James Morse 


A curiosity:

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 95977b80265ce..46949fce3e813 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -184,6 +184,8 @@ enum vcpu_sysreg {


Comment above the enum has some claims about the order, but it's
already out of order with __vcpu_read_sys_reg_from_cpu()...
(PAR_EL1 being the culprit)


This comment dates back from the original assembly implementation,
where I was paranoid about accessing the sys_regs array in strict
order to maximize the chances of the prefetcher doing the right thing.

As always with premature optimization, it was worthless, and moving
things to C forced us to do things differently anyway (not to mention
VHE...).

I'll delete the comment in a separate patch.

Thanks,

M.
--
Jazz is not dead. It just smells funny...


Re: [PATCH 19/26] KVM: arm64: Make struct kvm_regs userspace-only

2020-05-27 Thread Marc Zyngier

On 2020-05-26 17:29, James Morse wrote:

Hi Marc,

On 22/04/2020 13:00, Marc Zyngier wrote:

struct kvm_regs is used by userspace to indicate which register gets
accessed by the {GET,SET}_ONE_REG API. But as we're about to refactor
the layout of the in-kernel register structures, we need the kernel to
move away from it.

Let's make kvm_regs userspace only, and let the kernel map it to its own
internal representation.



diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
index 23ebe51410f06..9fec9231b63e2 100644
--- a/arch/arm64/kvm/guest.c
+++ b/arch/arm64/kvm/guest.c
@@ -102,6 +102,55 @@ static int core_reg_size_from_offset(const struct kvm_vcpu *vcpu, u64 off)

 	return size;
 }

+static void *core_reg_addr(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)

+{
+   u64 off = core_reg_offset_from_id(reg->id);
+
+   switch (off) {



+   default:
+   return NULL;


Doesn't this switch statement catch an out of range offset, and a
misaligned offset?

... We still test for those explicitly in the caller. Better safe than 
implicit?


Indeed, this is not supposed to happen at all. Maybe I should just fold
validate_core_offset() in there, and make this NULL value the error
case.




+   }
+}
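
[Editor's note] A rough sketch of what folding the validation into
core_reg_addr() could look like, with NULL as the single error value as
Marc suggests. The helper names are the ones visible in the quoted diff;
the exact checks are assumed, not taken from the posted series:

static void *core_reg_addr(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg)
{
	u64 off = core_reg_offset_from_id(reg->id);
	int size = core_reg_size_from_offset(vcpu, off);

	if (size < 0)				/* out-of-range offset */
		return NULL;

	if (KVM_REG_SIZE(reg->id) != size)	/* mismatched access size */
		return NULL;

	switch (off) {
	/* ... per-register cases as in the patch ... */
	default:
		return NULL;
	}
}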


With the reset thing reported by Zenghui and Zengtao on the previous
patch fixed:
Reviewed-by: James Morse 

(otherwise struct kvm_regs isn't userspace-only!)


Indeed!

Thanks,

M.
--
Jazz is not dead. It just smells funny...


Re: [PATCH 10/26] KVM: arm64: Refactor vcpu_{read,write}_sys_reg

2020-05-27 Thread Marc Zyngier

On 2020-05-26 17:28, James Morse wrote:

Hi Marc,

On 22/04/2020 13:00, Marc Zyngier wrote:

Extract the direct HW accessors for later reuse.



diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 51db934702b64..46f218982df8c 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c



+u64 vcpu_read_sys_reg(const struct kvm_vcpu *vcpu, int reg)
+{
+   u64 val = 0x8badf00d8badf00d;
+
+   if (!vcpu->arch.sysregs_loaded_on_cpu) {
+   goto memory_read;
}

-immediate_write:
+   if (__vcpu_read_sys_reg_from_cpu(reg, &val))
+   return val;
+
+memory_read:
+   return __vcpu_sys_reg(vcpu, reg);
+}



The goto here is a bit odd, is it just an artefact of how we got here?


That's because a lot of this changes when NV gets added to the mix,
see [1].


Is this easier on the eye?:
| u64 vcpu_read_sys_reg(const struct kvm_vcpu *vcpu, int reg)
| {
|   u64 val = 0x8badf00d8badf00d;
|
|   if (vcpu->arch.sysregs_loaded_on_cpu &&
|   __vcpu_read_sys_reg_from_cpu(reg, &val))
|   return val;
|
|   return __vcpu_sys_reg(vcpu, reg);
| }


Definitely. I don't mind reworking the NV branch so that the label
gets introduced there.


+void vcpu_write_sys_reg(struct kvm_vcpu *vcpu, u64 val, int reg)
+{
+   if (!vcpu->arch.sysregs_loaded_on_cpu)
+   goto memory_write;
+
+   if (__vcpu_write_sys_reg_to_cpu(val, reg))
+   return;
+
+memory_write:
 __vcpu_sys_reg(vcpu, reg) = val;
 }


Again I think it's clearer without the goto.
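
[Editor's note] The equivalent goto-free shape for the write path, in the
spirit of James's suggestion above (a sketch, not the applied patch):

void vcpu_write_sys_reg(struct kvm_vcpu *vcpu, u64 val, int reg)
{
	/* If the sysregs are loaded on the CPU, try the direct HW accessor */
	if (vcpu->arch.sysregs_loaded_on_cpu &&
	    __vcpu_write_sys_reg_to_cpu(val, reg))
		return;

	/* Otherwise fall back to the in-memory copy */
	__vcpu_sys_reg(vcpu, reg) = val;
}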


Regardless:
Reviewed-by: James Morse 


Thanks,

M.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/commit/?h=kvm-arm64/nv-5.7-rc1-WIP=11f3217d39a602cbfac7d08064c8b31afb57348e

--
Jazz is not dead. It just smells funny...


Re: [PATCH 26/26] KVM: arm64: Parametrize exception entry with a target EL

2020-05-27 Thread Marc Zyngier

Hi Mark,

On 2020-05-19 11:44, Mark Rutland wrote:

On Wed, Apr 22, 2020 at 01:00:50PM +0100, Marc Zyngier wrote:

We currently assume that an exception is delivered to EL1, always.
Once we emulate EL2, this no longer will be the case. To prepare
for this, add a target_mode parameter.

While we're at it, merge the computing of the target PC and PSTATE in
a single function that updates both PC and CPSR after saving their
previous values in the corresponding ELR/SPSR. This ensures that they
are updated in the correct order (a pretty common source of bugs...).

Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/inject_fault.c | 75 ++-
 1 file changed, 38 insertions(+), 37 deletions(-)

diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index d3ebf8bca4b89..3dbcbc839b9c3 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -26,28 +26,12 @@ enum exception_type {
except_type_serror  = 0x180,
 };

-static u64 get_except_vector(struct kvm_vcpu *vcpu, enum exception_type type)
-{
-   u64 exc_offset;
-
-   switch (*vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT)) {
-   case PSR_MODE_EL1t:
-   exc_offset = CURRENT_EL_SP_EL0_VECTOR;
-   break;
-   case PSR_MODE_EL1h:
-   exc_offset = CURRENT_EL_SP_ELx_VECTOR;
-   break;
-   case PSR_MODE_EL0t:
-   exc_offset = LOWER_EL_AArch64_VECTOR;
-   break;
-   default:
-   exc_offset = LOWER_EL_AArch32_VECTOR;
-   }
-
-   return vcpu_read_sys_reg(vcpu, VBAR_EL1) + exc_offset + type;
-}
-
 /*
+ * This performs the exception entry at a given EL (@target_mode), stashing PC
+ * and PSTATE into ELR and SPSR respectively, and compute the new PC/PSTATE.
+ * The EL passed to this function *must* be a non-secure, privileged mode with
+ * bit 0 being set (PSTATE.SP == 1).
+ *
  * When an exception is taken, most PSTATE fields are left unchanged in the
  * handler. However, some are explicitly overridden (e.g. M[4:0]). Luckily all
  * of the inherited bits have the same position in the AArch64/AArch32 SPSR_ELx
@@ -59,10 +43,35 @@ static u64 get_except_vector(struct kvm_vcpu *vcpu, enum exception_type type)
  * Here we manipulate the fields in order of the AArch64 SPSR_ELx layout, from
  * MSB to LSB.
  */
-static unsigned long get_except64_pstate(struct kvm_vcpu *vcpu)
+static void enter_exception(struct kvm_vcpu *vcpu, unsigned long target_mode,
+			    enum exception_type type)


Since this is all for an AArch64 target, could we keep `64` in the name,
e.g. enter_exception64? That'd mirror the callers below.


 {
-   unsigned long sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
-   unsigned long old, new;
+   unsigned long sctlr, vbar, old, new, mode;
+   u64 exc_offset;
+
+   mode = *vcpu_cpsr(vcpu) & (PSR_MODE_MASK | PSR_MODE32_BIT);
+
+   if  (mode == target_mode)
+   exc_offset = CURRENT_EL_SP_ELx_VECTOR;
+   else if ((mode | 1) == target_mode)
+   exc_offset = CURRENT_EL_SP_EL0_VECTOR;


It would be nice if we could add a mnemonic for the `1` here, e.g.
PSR_MODE_SP0 or PSR_MODE_THREAD_BIT.


I've addressed both comments as follows:

diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h
index bf57308fcd63..953b6a1ce549 100644
--- a/arch/arm64/include/asm/ptrace.h
+++ b/arch/arm64/include/asm/ptrace.h
@@ -35,6 +35,7 @@
 #define GIC_PRIO_PSR_I_SET (1 << 4)

 /* Additional SPSR bits not exposed in the UABI */
+#define PSR_MODE_THREAD_BIT(1 << 0)
 #define PSR_IL_BIT (1 << 20)

 /* AArch32-specific ptrace requests */
diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
index 3dbcbc839b9c..ebfdfc27b2bd 100644
--- a/arch/arm64/kvm/inject_fault.c
+++ b/arch/arm64/kvm/inject_fault.c
@@ -43,8 +43,8 @@ enum exception_type {
  * Here we manipulate the fields in order of the AArch64 SPSR_ELx layout, from
  * MSB to LSB.
  */
-static void enter_exception(struct kvm_vcpu *vcpu, unsigned long target_mode,
-			    enum exception_type type)
+static void enter_exception64(struct kvm_vcpu *vcpu, unsigned long target_mode,
+			      enum exception_type type)
 {
unsigned long sctlr, vbar, old, new, mode;
u64 exc_offset;
@@ -53,7 +53,7 @@ static void enter_exception(struct kvm_vcpu *vcpu, unsigned long target_mode,

if  (mode == target_mode)
exc_offset = CURRENT_EL_SP_ELx_VECTOR;
-   else if ((mode | 1) == target_mode)
+   else if ((mode | PSR_MODE_THREAD_BIT) == target_mode)
exc_offset = CURRENT_EL_SP_EL0_VECTOR;
else if (!(mode & PSR_MODE32_BIT))
exc_offset = LOWER_EL_AArch64_VECTOR;
@@ -126,7 +126,7 @@ static void inject_abt64(struct kvm_vcpu *vcpu, bool is_iabt, unsigned long addr

 

Re: [RFC PATCH 2/7] KVM: arm64: Set DBM bit of PTEs if hw DBM enabled

2020-05-27 Thread zhukeqian
Hi Catalin,

On 2020/5/26 19:49, Catalin Marinas wrote:
> On Mon, May 25, 2020 at 07:24:01PM +0800, Keqian Zhu wrote:
>> diff --git a/arch/arm64/include/asm/pgtable-prot.h 
>> b/arch/arm64/include/asm/pgtable-prot.h
>> index 1305e28225fc..f9910ba2afd8 100644
>> --- a/arch/arm64/include/asm/pgtable-prot.h
>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>> @@ -79,6 +79,7 @@ extern bool arm64_use_ng_mappings;
>>  })
>>  
>>  #define PAGE_S2 __pgprot(_PROT_DEFAULT | 
>> PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN)
>> +#define PAGE_S2_DBM __pgprot(_PROT_DEFAULT | 
>> PAGE_S2_MEMATTR(NORMAL) | PTE_S2_RDONLY | PAGE_S2_XN | PTE_DBM)
> 
> You don't need a new page permission (see below).
> 
>>  #define PAGE_S2_DEVICE  __pgprot(_PROT_DEFAULT | 
>> PAGE_S2_MEMATTR(DEVICE_nGnRE) | PTE_S2_RDONLY | PTE_S2_XN)
>>  
>>  #define PAGE_NONE   __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | 
>> PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index e3b9ee268823..dc97988eb2e0 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -1426,6 +1426,10 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t 
>> addr, phys_addr_t end)
>>  pte = pte_offset_kernel(pmd, addr);
>>  do {
>>  if (!pte_none(*pte)) {
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +if (kvm_hw_dbm_enabled() && !kvm_s2pte_dbm(pte))
>> +kvm_set_s2pte_dbm(pte);
>> +#endif
>>  if (!kvm_s2pte_readonly(pte))
>>  kvm_set_s2pte_readonly(pte);
>>  }
> 
> Setting the DBM bit is equivalent to marking the page writable. The
> actual writable pte bit (S2AP[1] or HAP[2] as we call them in Linux for
> legacy reasons) tells you whether the page has been dirtied but it is
> still writable if you set DBM. Doing this in stage2_wp_ptes()
> practically means that you no longer have read-only pages at S2. There
> are several good reasons why you don't want to break this. For example,
> the S2 pte may already be read-only for other reasons (CoW).
> 
Thanks, your comments help to solve the first problem in the cover letter.

> I think you should only set the DBM bit if the pte was previously
> writable. In addition, any permission change to the S2 pte must take
> into account the DBM bit and clear it while transferring the dirty
> status to the underlying page. I'm not deeply familiar with all these
> callbacks into KVM but two such paths are kvm_unmap_hva_range() and the
> kvm_mmu_notifier_change_pte().
Yes, I agree.
> 
> 
>> @@ -1827,7 +1831,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
>> phys_addr_t fault_ipa,
>>  
>>  ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, _pmd);
>>  } else {
>> -pte_t new_pte = kvm_pfn_pte(pfn, mem_type);
>> +pte_t new_pte;
>> +
>> +#ifdef CONFIG_ARM64_HW_AFDBM
>> +if (kvm_hw_dbm_enabled() &&
>> +pgprot_val(mem_type) == pgprot_val(PAGE_S2)) {
>> +mem_type = PAGE_S2_DBM;
>> +}
>> +#endif
>> +new_pte = kvm_pfn_pte(pfn, mem_type);
>>  
>>  if (writable) {
>>  new_pte = kvm_s2pte_mkwrite(new_pte);
> 
> That's wrong here. Basically for any fault you get, you just turn the S2
> page writable. The point of DBM is that you don't get write faults at
> all if you have a writable page. So, as I said above, only set the DBM
> bit if you stored a writable S2 pte (kvm_s2pte_mkwrite()).
Yeah, you are right. I will correct it in Patch v1.
> 

Thanks,
Keqian
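
[Editor's note] A minimal sketch of the direction Catalin describes: only tag
the stage-2 PTE with DBM when a writable mapping is being installed in the
first place. kvm_hw_dbm_enabled() and kvm_set_s2pte_dbm() are the helpers
from the RFC patch quoted above; the surrounding context is assumed:

	pte_t new_pte = kvm_pfn_pte(pfn, mem_type);

	if (writable) {
		new_pte = kvm_s2pte_mkwrite(new_pte);

		/* Writable mapping: let the hardware track dirtiness via DBM */
		if (kvm_hw_dbm_enabled())
			kvm_set_s2pte_dbm(&new_pte);
	}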



Re: [RFC] Use SMMU HTTU for DMA dirty page tracking

2020-05-27 Thread Jean-Philippe Brucker
On Wed, May 27, 2020 at 08:40:47AM +, Tian, Kevin wrote:
> > From: Xiang Zheng 
> > Sent: Wednesday, May 27, 2020 2:45 PM
> > 
> > 
> > On 2020/5/27 11:27, Tian, Kevin wrote:
> > >> From: Xiang Zheng
> > >> Sent: Monday, May 25, 2020 7:34 PM
> > >>
> > >> [+cc Kirti, Yan, Alex]
> > >>
> > >> On 2020/5/23 1:14, Jean-Philippe Brucker wrote:
> > >>> Hi,
> > >>>
> > >>> On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote:
> >  Hi all,
> > 
> >  Is there any plan for enabling SMMU HTTU?
> > >>>
> > >>> Not outside of SVA, as far as I know.
> > >>>
> > >>
> >  I have seen the patch locates in the SVA series patch, which adds
> >  support for HTTU:
> >  https://www.spinics.net/lists/arm-kernel/msg798694.html
> > 
> >  HTTU reduces the number of access faults on SMMU fault queue
> >  (permission faults also benefit from it).
> > 
> >  Besides reducing the faults, HTTU also helps to track dirty pages for
> >  device DMA. Is it feasible to utilize HTTU to get dirty pages on device
> >  DMA during VFIO live migration?
> > >>>
> > >>> As you know there is a VFIO interface for this under discussion:
> > >>> https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-
> > >> kwankh...@nvidia.com/
> > >>> It doesn't implement an internal API to communicate with the IOMMU
> > >> driver
> > >>> about dirty pages.
> > >
> > > We plan to add such API later, e.g. to utilize A/D bit in VT-d 2nd-level
> > > page tables (Rev 3.0).
> > >
> > 
> > Thank you, Kevin.
> > 
> > When will you send this series patches? Maybe(Hope) we can also support
> > hardware-based dirty pages tracking via common APIs based on your
> > patches. :)
> 
> Yan is working with Kirti on basic live migration support now. After that
> part is done, we will start working on A/D bit support. Yes, common APIs
> are definitely the goal here.
> 
> > 
> > >>
> > >>>
> >  If SMMU can track dirty pages, devices are not required to implement
> >  additional dirty pages tracking to support VFIO live migration.
> > >>>
> > >>> It seems feasible, though tracking it in the device might be more
> > >>> efficient. I might have misunderstood but I think for live migration of
> > >>> the Intel NIC they trap guest accesses to the device and introspect its
> > >>> state to figure out which pages it is accessing.
> > >
> > > Does HTTU implement A/D-like mechanism in SMMU page tables, or just
> > > report dirty pages in a log buffer? Either way tracking dirty pages in 
> > > IOMMU
> > > side is generic thus doesn't require device-specific tweak like in Intel 
> > > NIC.
> > >
> > 
> > Currently HTTU just implements an A/D-like mechanism in SMMU page tables.
> > We certainly expect SMMU can also implement a PML-like feature so that we
> > can avoid walking the whole page table to get the dirty pages.

There is no reporting of dirty pages in log buffer. It might be possible
to do software logging based on PRI or Stall, but that requires special
support in the endpoint as well as the SMMU.

> Is there a link to HTTU introduction?

I don't know any gentle introduction, but there are sections D5.4.11
"Hardware management of the Access flag and dirty state" in the ARM
Architecture Reference Manual (DDI0487E), and section 3.13 "Translation
table entries and Access/Dirty flags" in the SMMU specification
(IHI0070C). HTTU stands for "Hardware Translation Table Update".

In short, when HTTU is enabled, the SMMU translation performs an atomic
read-modify-write on the leaf translation table descriptor, setting some
bits depending on the type of memory access. This can be enabled
independently on both stage-1 and stage-2 tables (equivalent to your 1st
and 2nd page tables levels, I think).

Thanks,
Jean
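
[Editor's note] To make the mechanism concrete, a rough sketch of how dirty
state could be harvested from stage-2 leaf descriptors once HTTU-HD is
enabled. The helper names and macros below are illustrative, not the SMMUv3
driver's actual API (DBM is descriptor bit 51 and S2AP[1] is bit 7 in the
VMSAv8 stage-2 format):

#define S2_DESC_DBM	(1ULL << 51)	/* Dirty Bit Modifier */
#define S2_DESC_S2AP_W	(1ULL << 7)	/* S2AP[1]: write permission / dirty */

/*
 * With HTTU-HD, a leaf is installed "writable-clean" (DBM set, S2AP[1]
 * clear); the SMMU atomically sets S2AP[1] on the first DMA write.
 */
static bool s2_leaf_is_dirty(u64 desc)
{
	return (desc & S2_DESC_DBM) && (desc & S2_DESC_S2AP_W);
}

static u64 s2_leaf_clear_dirty(u64 desc)
{
	/* Back to writable-clean; the next DMA write re-dirties it */
	return desc & ~S2_DESC_S2AP_W;
}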


Re: [PATCH 08/26] KVM: arm64: Use TTL hint in when invalidating stage-2 translations

2020-05-27 Thread Marc Zyngier

On 2020-05-13 10:06, Andrew Scull wrote:

On Tue, May 12, 2020 at 01:04:31PM +0100, James Morse wrote:

Hi Andrew,

On 07/05/2020 16:13, Andrew Scull wrote:
>> @@ -176,7 +177,7 @@ static void clear_stage2_pud_entry(struct kvm_s2_mmu 
*mmu, pud_t *pud, phys_addr
>>pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0);
>>VM_BUG_ON(stage2_pud_huge(kvm, *pud));
>>stage2_pud_clear(kvm, pud);
>> -  kvm_tlb_flush_vmid_ipa(mmu, addr);
>> +  kvm_tlb_flush_vmid_ipa(mmu, addr, S2_NO_LEVEL_HINT);
>>stage2_pmd_free(kvm, pmd_table);
>>put_page(virt_to_page(pud));
>>  }
>> @@ -186,7 +187,7 @@ static void clear_stage2_pmd_entry(struct kvm_s2_mmu 
*mmu, pmd_t *pmd, phys_addr
>>pte_t *pte_table = pte_offset_kernel(pmd, 0);
>>VM_BUG_ON(pmd_thp_or_huge(*pmd));
>>pmd_clear(pmd);
>> -  kvm_tlb_flush_vmid_ipa(mmu, addr);
>> +  kvm_tlb_flush_vmid_ipa(mmu, addr, S2_NO_LEVEL_HINT);
>>free_page((unsigned long)pte_table);
>>put_page(virt_to_page(pmd));
>>  }
>
> Going by the names, is it possible to give a better level hint for these
> cases?

There is no leaf entry being invalidated here. After clearing the range, we
found we'd emptied (and invalidated) a whole page of mappings:
|   if (stage2_pmd_table_empty(kvm, start_pmd))
|   clear_stage2_pud_entry(mmu, pud, start_addr);

Now we want to remove the link to the empty page so we can free it. We are
changing the structure of the tables, not what gets mapped.

I think this is why we need the un-hinted behaviour, to invalidate "any level
of the translation table walk required to translate the specified IPA".
Otherwise the hardware can look for a leaf at the indicated level, find none,
and do nothing.

This is sufficiently horrible, it's possible I've got it completely wrong!
(does it make sense?)


Ok. `addr` is an IPA, that IPA is now omitted from the map so doesn't
appear in any entry of the table, least of all a leaf entry. That makes
sense.

Is there a convention to distinguish IPA and PA similar to the
distinction for VA or does kvmarm just use phys_addr_t all round?

It seems like the TTL patches are fairly self contained if it would be
easier to separate them out from these larger series?


They are. This whole series is a mix of unrelated patches anyway.
Their only goal is to make my life a bit easier in the distant
future.

I'll repost that anyway, as I have made some cosmetic changes.
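
[Editor's note] To make the distinction concrete, the two call patterns being
discussed look roughly like this. S2_NO_LEVEL_HINT comes from the series;
S2_PMD_LEVEL is only a stand-in name for illustration:

	/* Unmapping a leaf entry: its level is known, so it can be hinted */
	kvm_tlb_flush_vmid_ipa(mmu, addr, S2_PMD_LEVEL /* stand-in */);

	/*
	 * Removing a table entry (freeing an emptied page of mappings): no
	 * leaf is invalidated, so no level hint can be given -- the TLBI
	 * must cover any level of the walk.
	 */
	kvm_tlb_flush_vmid_ipa(mmu, addr, S2_NO_LEVEL_HINT);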

Thanks,

M.
--
Jazz is not dead. It just smells funny...


Re: [PATCH 03/26] KVM: arm64: Factor out stage 2 page table data from struct kvm

2020-05-27 Thread Alexandru Elisei
Hi Marc,

On 5/27/20 9:41 AM, Marc Zyngier wrote:
> Hi Alex,
>
> On 2020-05-12 17:53, Alexandru Elisei wrote:
>> Hi,
>>
>> On 5/12/20 12:17 PM, James Morse wrote:
>>> Hi Alex, Marc,
>>>
>>> (just on this last_vcpu_ran thing...)
>>>
>>> On 11/05/2020 17:38, Alexandru Elisei wrote:
 On 4/22/20 1:00 PM, Marc Zyngier wrote:
> From: Christoffer Dall 
>
> As we are about to reuse our stage 2 page table manipulation code for
> shadow stage 2 page tables in the context of nested virtualization, we
> are going to manage multiple stage 2 page tables for a single VM.
>
> This requires some pretty invasive changes to our data structures,
> which moves the vmid and pgd pointers into a separate structure and
> change pretty much all of our mmu code to operate on this structure
> instead.
>
> The new structure is called struct kvm_s2_mmu.
>
> There is no intended functional change by this patch alone.
> diff --git a/arch/arm64/include/asm/kvm_host.h
> b/arch/arm64/include/asm/kvm_host.h
> index 7dd8fefa6aecd..664a5d92ae9b8 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -63,19 +63,32 @@ struct kvm_vmid {
>  u32    vmid;
>  };
>
> -struct kvm_arch {
> +struct kvm_s2_mmu {
>  struct kvm_vmid vmid;
>
> -    /* stage2 entry level table */
> -    pgd_t *pgd;
> -    phys_addr_t pgd_phys;
> -
> -    /* VTCR_EL2 value for this VM */
> -    u64    vtcr;
> +    /*
> + * stage2 entry level table
> + *
> + * Two kvm_s2_mmu structures in the same VM can point to the same pgd
> + * here.  This happens when running a non-VHE guest hypervisor which
> + * uses the canonical stage 2 page table for both vEL2 and for vEL1/0
> + * with vHCR_EL2.VM == 0.
 It makes more sense to me to say that a non-VHE guest hypervisor will use the
 canonical stage *1* page table when running at EL2
>>> Can KVM say anything about stage 1? It's totally under the guest's control
>>> even at vEL2...
>>
>> It just occurred to me that "canonical stage 2 page table" refers to the L0
>> hypervisor stage 2, not to the L1 hypervisor stage 2. If you don't mind my
>> suggestion, perhaps the comment can be slightly improved to avoid any 
>> confusion?
>> Maybe something along the lines of "[..] This happens when running a
>> non-VHE guest
>> hypervisor, in which case we use the canonical stage 2 page table for both 
>> vEL2
>> and for vEL1/0 with vHCR_EL2.VM == 0".
>
> If the confusion stems from the lack of guest stage-2, how about:
>
> "This happens when running a guest using a translation regime that isn't
>  affected by its own stage-2 translation, such as a non-VHE hypervisor
>  running at vEL2, or for vEL1/EL0 with vHCR_EL2.VM == 0. In that case,
>  we use the canonical stage-2 page tables."
>
> instead? Does this lift the ambiguity?

Yes, that's perfect.

Thanks,
Alex
>
> Thanks,
>
>     M.


Re: [PATCH] KVM: arm64: Allow in-atomic injection of SPIs

2020-05-27 Thread Zenghui Yu

Hi Marc,

On 2020/5/27 15:55, Marc Zyngier wrote:

Hi Zenghui,

On 2020-05-27 08:41, Zenghui Yu wrote:

On 2020/5/27 0:11, Marc Zyngier wrote:

On a system that uses SPIs to implement MSIs (as it would be
the case on a GICv2 system exposing a GICv2m to its guests),
we deny the possibility of injecting SPIs on the in-atomic
fast-path.

This results in a very large amount of context-switches
(roughly equivalent to twice the interrupt rate) on the host,
and suboptimal performance for the guest (as measured with
a test workload involving a virtio interface backed by vhost-net).
Given that GICv2 systems are usually on the low-end of the spectrum
performance wise, they could do without the aggravation.

We solved this for GICv3+ITS by having a translation cache. But
SPIs do not need any extra infrastructure, and can be immediately
injected in the virtual distributor as the locking is already
heavy enough that we don't need to worry about anything.

This halves the number of context switches for the same workload.

Signed-off-by: Marc Zyngier 
---
  arch/arm64/kvm/vgic/vgic-irqfd.c | 20 
  arch/arm64/kvm/vgic/vgic-its.c   |  3 +--
  2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c 
b/arch/arm64/kvm/vgic/vgic-irqfd.c

index d8cdfea5cc96..11a9f81115ab 100644
--- a/arch/arm64/kvm/vgic/vgic-irqfd.c
+++ b/arch/arm64/kvm/vgic/vgic-irqfd.c
@@ -107,15 +107,27 @@ int kvm_arch_set_irq_inatomic(struct 
kvm_kernel_irq_routing_entry *e,

    struct kvm *kvm, int irq_source_id, int level,
    bool line_status)


... and you may also need to update the comment on top of it to
reflect this change.

/**
 * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
 *
 * Currently only direct MSI injection is supported.
 */


As far as I can tell, it is still valid (at least from the guest's
perspective). You could in practice use that to deal with level
interrupts, but we only inject the rising edge on this path, never
the falling edge. So effectively, this is limited to edge interrupts,
which is mostly MSIs.


Oops... I had wrongly mixed up MSI with the architecture-defined LPI, and
was thinking that we should add something like "direct SPI injection is
also supported now". Sorry.



Unless you are thinking of something else which I would have missed?


No, please ignore the noise.


Thanks,
Zenghui


Re: [PATCH 03/26] KVM: arm64: Factor out stage 2 page table data from struct kvm

2020-05-27 Thread Marc Zyngier

Hi Alex,

On 2020-05-12 17:53, Alexandru Elisei wrote:

Hi,

On 5/12/20 12:17 PM, James Morse wrote:

Hi Alex, Marc,

(just on this last_vcpu_ran thing...)

On 11/05/2020 17:38, Alexandru Elisei wrote:

On 4/22/20 1:00 PM, Marc Zyngier wrote:

From: Christoffer Dall 

As we are about to reuse our stage 2 page table manipulation code for
shadow stage 2 page tables in the context of nested virtualization, we
are going to manage multiple stage 2 page tables for a single VM.

This requires some pretty invasive changes to our data structures,
which moves the vmid and pgd pointers into a separate structure and
change pretty much all of our mmu code to operate on this structure
instead.

The new structure is called struct kvm_s2_mmu.

There is no intended functional change by this patch alone.
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 7dd8fefa6aecd..664a5d92ae9b8 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -63,19 +63,32 @@ struct kvm_vmid {
u32vmid;
 };

-struct kvm_arch {
+struct kvm_s2_mmu {
struct kvm_vmid vmid;

-   /* stage2 entry level table */
-   pgd_t *pgd;
-   phys_addr_t pgd_phys;
-
-   /* VTCR_EL2 value for this VM */
-   u64vtcr;
+   /*
+	 * stage2 entry level table
+	 *
+	 * Two kvm_s2_mmu structures in the same VM can point to the same pgd
+	 * here.  This happens when running a non-VHE guest hypervisor which
+	 * uses the canonical stage 2 page table for both vEL2 and for vEL1/0
+	 * with vHCR_EL2.VM == 0.
It makes more sense to me to say that a non-VHE guest hypervisor will use the
canonical stage *1* page table when running at EL2.
Can KVM say anything about stage 1? It's totally under the guest's control
even at vEL2...


It just occurred to me that "canonical stage 2 page table" refers to the L0
hypervisor stage 2, not to the L1 hypervisor stage 2. If you don't mind my
suggestion, perhaps the comment can be slightly improved to avoid any
confusion? Maybe something along the lines of "[..] This happens when
running a non-VHE guest hypervisor, in which case we use the canonical
stage 2 page table for both vEL2 and for vEL1/0 with vHCR_EL2.VM == 0".


If the confusion stems from the lack of guest stage-2, how about:

"This happens when running a guest using a translation regime that isn't
 affected by its own stage-2 translation, such as a non-VHE hypervisor
 running at vEL2, or for vEL1/EL0 with vHCR_EL2.VM == 0. In that case,
 we use the canonical stage-2 page tables."

instead? Does this lift the ambiguity?

Thanks,

M.
--
Jazz is not dead. It just smells funny...


RE: [RFC] Use SMMU HTTU for DMA dirty page tracking

2020-05-27 Thread Tian, Kevin
> From: Xiang Zheng 
> Sent: Wednesday, May 27, 2020 2:45 PM
> 
> 
> On 2020/5/27 11:27, Tian, Kevin wrote:
> >> From: Xiang Zheng
> >> Sent: Monday, May 25, 2020 7:34 PM
> >>
> >> [+cc Kirti, Yan, Alex]
> >>
> >> On 2020/5/23 1:14, Jean-Philippe Brucker wrote:
> >>> Hi,
> >>>
> >>> On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote:
>  Hi all,
> 
>  Is there any plan for enabling SMMU HTTU?
> >>>
> >>> Not outside of SVA, as far as I know.
> >>>
> >>
>  I have seen the patch locates in the SVA series patch, which adds
>  support for HTTU:
>  https://www.spinics.net/lists/arm-kernel/msg798694.html
> 
>  HTTU reduces the number of access faults on SMMU fault queue
>  (permission faults also benefit from it).
> 
>  Besides reducing the faults, HTTU also helps to track dirty pages for
>  device DMA. Is it feasible to utilize HTTU to get dirty pages on device
>  DMA during VFIO live migration?
> >>>
> >>> As you know there is a VFIO interface for this under discussion:
> >>> https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-
> >> kwankh...@nvidia.com/
> >>> It doesn't implement an internal API to communicate with the IOMMU
> >> driver
> >>> about dirty pages.
> >
> > We plan to add such API later, e.g. to utilize A/D bit in VT-d 2nd-level
> > page tables (Rev 3.0).
> >
> 
> Thank you, Kevin.
> 
> When will you send this series patches? Maybe(Hope) we can also support
> hardware-based dirty pages tracking via common APIs based on your
> patches. :)

Yan is working with Kirti on basic live migration support now. After that
part is done, we will start working on A/D bit support. Yes, common APIs
are definitely the goal here.

> 
> >>
> >>>
>  If SMMU can track dirty pages, devices are not required to implement
>  additional dirty pages tracking to support VFIO live migration.
> >>>
> >>> It seems feasible, though tracking it in the device might be more
> >>> efficient. I might have misunderstood but I think for live migration of
> >>> the Intel NIC they trap guest accesses to the device and introspect its
> >>> state to figure out which pages it is accessing.
> >
> > Does HTTU implement A/D-like mechanism in SMMU page tables, or just
> > report dirty pages in a log buffer? Either way tracking dirty pages in IOMMU
> > side is generic thus doesn't require device-specific tweak like in Intel 
> > NIC.
> >
> 
> Currently HTTU just implement A/D-like mechanism in SMMU page tables.
> We certainly
> expect SMMU can also implement PML-like feature so that we can avoid
> walking the
> whole page table to get the dirty pages.

Is there a link to HTTU introduction?

> 
> By the way, I'm not sure whether HTTU or SLAD can help for mediated device.
> 

A/D bit applies to mediated device on VT-d.

Thanks
Kevin


Re: [PATCH] KVM: arm64: Allow in-atomic injection of SPIs

2020-05-27 Thread Marc Zyngier

Hi Zenghui,

On 2020-05-27 08:41, Zenghui Yu wrote:

On 2020/5/27 0:11, Marc Zyngier wrote:

On a system that uses SPIs to implement MSIs (as it would be
the case on a GICv2 system exposing a GICv2m to its guests),
we deny the possibility of injecting SPIs on the in-atomic
fast-path.

This results in a very large amount of context-switches
(roughly equivalent to twice the interrupt rate) on the host,
and suboptimal performance for the guest (as measured with
a test workload involving a virtio interface backed by vhost-net).
Given that GICv2 systems are usually on the low-end of the spectrum
performance wise, they could do without the aggravation.

We solved this for GICv3+ITS by having a translation cache. But
SPIs do not need any extra infrastructure, and can be immediately
injected in the virtual distributor as the locking is already
heavy enough that we don't need to worry about anything.

This halves the number of context switches for the same workload.

Signed-off-by: Marc Zyngier 
---
  arch/arm64/kvm/vgic/vgic-irqfd.c | 20 
  arch/arm64/kvm/vgic/vgic-its.c   |  3 +--
  2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c 
b/arch/arm64/kvm/vgic/vgic-irqfd.c

index d8cdfea5cc96..11a9f81115ab 100644
--- a/arch/arm64/kvm/vgic/vgic-irqfd.c
+++ b/arch/arm64/kvm/vgic/vgic-irqfd.c
@@ -107,15 +107,27 @@ int kvm_arch_set_irq_inatomic(struct 
kvm_kernel_irq_routing_entry *e,

  struct kvm *kvm, int irq_source_id, int level,
  bool line_status)


... and you may also need to update the comment on top of it to
reflect this change.

/**
 * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
 *
 * Currently only direct MSI injection is supported.
 */


As far as I can tell, it is still valid (at least from the guest's
perspective). You could in practice use that to deal with level
interrupts, but we only inject the rising edge on this path, never
the falling edge. So effectively, this is limited to edge interrupts,
which is mostly MSIs.

Unless you are thinking of something else which I would have missed?

Thanks,

M.
--
Jazz is not dead. It just smells funny...


Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault

2020-05-27 Thread Marc Zyngier

On 2020-05-27 03:39, Gavin Shan wrote:

Hi Mark,


[...]


Can you run tests with a real workload? For example, a kernel build
inside the VM?



Yeah, I agree it's far from a realistic workload. However, it's the test case
which was suggested when async page fault was proposed from day one, according
to the following document. On page 34, you can see the benchmark, which is
similar to what we're doing.

https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf


My own question is whether this even makes any sense 10 years later.

The HW has massively changed, and this adds a whole lot of complexity
to both the hypervisor and the guest. It also plays very ugly games
with the exception model, which doesn't give me the warm fuzzy feeling
that it's going to be great.

Ok. I will test with a kernel build workload, or another better one, to
represent the case.


Thanks,

M.
--
Jazz is not dead. It just smells funny...


Re: [PATCH] KVM: arm64: Allow in-atomic injection of SPIs

2020-05-27 Thread Zenghui Yu

On 2020/5/27 0:11, Marc Zyngier wrote:

On a system that uses SPIs to implement MSIs (as it would be
the case on a GICv2 system exposing a GICv2m to its guests),
we deny the possibility of injecting SPIs on the in-atomic
fast-path.

This results in a very large amount of context-switches
(roughly equivalent to twice the interrupt rate) on the host,
and suboptimal performance for the guest (as measured with
a test workload involving a virtio interface backed by vhost-net).
Given that GICv2 systems are usually on the low-end of the spectrum
performance wise, they could do without the aggravation.

We solved this for GICv3+ITS by having a translation cache. But
SPIs do not need any extra infrastructure, and can be immediately
injected in the virtual distributor as the locking is already
heavy enough that we don't need to worry about anything.

This halves the number of context switches for the same workload.

Signed-off-by: Marc Zyngier 
---
  arch/arm64/kvm/vgic/vgic-irqfd.c | 20 
  arch/arm64/kvm/vgic/vgic-its.c   |  3 +--
  2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kvm/vgic/vgic-irqfd.c b/arch/arm64/kvm/vgic/vgic-irqfd.c
index d8cdfea5cc96..11a9f81115ab 100644
--- a/arch/arm64/kvm/vgic/vgic-irqfd.c
+++ b/arch/arm64/kvm/vgic/vgic-irqfd.c
@@ -107,15 +107,27 @@ int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
  struct kvm *kvm, int irq_source_id, int level,
  bool line_status)


... and you may also need to update the comment on top of it to
reflect this change.

/**
 * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
 *
 * Currently only direct MSI injection is supported.
 */


Thanks,
Zenghui


  {
-   if (e->type == KVM_IRQ_ROUTING_MSI && vgic_has_its(kvm) && level) {
+   if (!level)
+   return -EWOULDBLOCK;
+
+   switch (e->type) {
+   case KVM_IRQ_ROUTING_MSI: {
struct kvm_msi msi;
  
+		if (!vgic_has_its(kvm))

+   return -EINVAL;
+
 		kvm_populate_msi(e, &msi);
-		if (!vgic_its_inject_cached_translation(kvm, &msi))
-			return 0;
+		return vgic_its_inject_cached_translation(kvm, &msi);
}
  
-	return -EWOULDBLOCK;

+   case KVM_IRQ_ROUTING_IRQCHIP:
+   /* Injecting SPIs is always possible in atomic context */
+		return vgic_irqfd_set_irq(e, kvm, irq_source_id, 1, line_status);
+
+   default:
+   return -EWOULDBLOCK;
+   }
  }
  
  int kvm_vgic_setup_default_irq_routing(struct kvm *kvm)

diff --git a/arch/arm64/kvm/vgic/vgic-its.c b/arch/arm64/kvm/vgic/vgic-its.c
index c012a52b19f5..40cbaca81333 100644
--- a/arch/arm64/kvm/vgic/vgic-its.c
+++ b/arch/arm64/kvm/vgic/vgic-its.c
@@ -757,9 +757,8 @@ int vgic_its_inject_cached_translation(struct kvm *kvm, struct kvm_msi *msi)
  
  	db = (u64)msi->address_hi << 32 | msi->address_lo;

irq = vgic_its_check_cache(kvm, db, msi->devid, msi->data);
-
if (!irq)
-   return -1;
+   return -EWOULDBLOCK;
  
	raw_spin_lock_irqsave(&irq->irq_lock, flags);

irq->pending_latch = true;
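
[Editor's note] For context, the in-atomic path touched above is exercised
when userspace routes a GSI to an SPI with KVM_IRQ_ROUTING_IRQCHIP and then
signals it through an irqfd. A minimal userspace sketch of that setup,
illustrative only, with error handling omitted:

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void route_spi_irqfd(int vm_fd)
{
	char buf[sizeof(struct kvm_irq_routing) +
		 sizeof(struct kvm_irq_routing_entry)];
	struct kvm_irq_routing *routing = (void *)buf;
	struct kvm_irqfd irqfd = { .fd = eventfd(0, 0), .gsi = 0 };

	memset(buf, 0, sizeof(buf));
	routing->nr = 1;
	routing->entries[0] = (struct kvm_irq_routing_entry) {
		.gsi = 0,
		.type = KVM_IRQ_ROUTING_IRQCHIP,
		.u.irqchip = { .irqchip = 0, .pin = 0 },  /* SPI offset on the vGIC */
	};

	ioctl(vm_fd, KVM_SET_GSI_ROUTING, routing);
	ioctl(vm_fd, KVM_IRQFD, &irqfd);
	/* Writing to irqfd.fd now injects the SPI via the fast path */
}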




Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault

2020-05-27 Thread Marc Zyngier

On 2020-05-27 05:05, Gavin Shan wrote:

Hi Mark,



[...]


+struct kvm_vcpu_pv_apf_data {
+   __u32   reason;
+   __u8pad[60];
+   __u32   enabled;
+};


What's all the padding for?



The padding is to ensure that @reason and @enabled are in different cache
lines. @reason is shared by host/guest, while @enabled is almost exclusively
owned by the guest.


So you are assuming that a cache line is at most 64 bytes.
It is actually implementation defined, and you can probe for it
by looking at the CTR_EL0 register. There are implementations
ranging from 32 to 256 bytes in the wild, and let's not mention
broken big-little implementations here.
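
[Editor's note] A small sketch of the probing Marc refers to: CTR_EL0.DminLine
(bits [19:16]) encodes log2 of the smallest D-cache line in 4-byte words, so
the line size is not a constant 64 bytes. In kernel code this could look
roughly like the following (read_cpuid_cachetype() returns CTR_EL0 on arm64):

#include <asm/cputype.h>

static unsigned int dcache_line_bytes(void)
{
	u32 ctr = read_cpuid_cachetype();	/* CTR_EL0 */

	/* DminLine is log2(words); convert words (4 bytes) to bytes */
	return 4U << ((ctr >> 16) & 0xf);
}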

[...]

+bool kvm_arch_can_inject_async_page_not_present(struct kvm_vcpu 
*vcpu)

+{
+   u64 vbar, pc;
+   u32 val;
+   int ret;
+
+   if (!(vcpu->arch.apf.control_block & KVM_ASYNC_PF_ENABLED))
+   return false;
+
+   if (vcpu->arch.apf.send_user_only && vcpu_mode_priv(vcpu))
+   return false;
+
+   /* Pending page fault, which isn't acknowledged by the guest */
+   ret = kvm_async_pf_read_cache(vcpu, &val);
+   if (ret || val)
+   return false;
+
+   /*
+* Events can't be injected through data abort because it's
+* going to clobber ELR_EL1, which might not be consumed (or saved)
+* by the guest yet.
+*/
+   vbar = vcpu_read_sys_reg(vcpu, VBAR_EL1);
+   pc = *vcpu_pc(vcpu);
+   if (pc >= vbar && pc < (vbar + vcpu->arch.apf.no_fault_inst_range))
+   return false;


Ah, so that's when this `no_fault_inst_range` is for.

As-is this is not sufficient, and we'll need to be extremely careful here.

The vectors themselves typically only have a small amount of stub code,
and the bulk of the non-reentrant exception entry work happens elsewhere,
in a mixture of assembly and C code that isn't even virtually contiguous
with either the vectors or itself.

It's possible in theory that there is code in modules (or perhaps in eBPF
JIT'd code) that isn't safe to take a fault from, so even having a
contiguous range controlled by the kernel isn't ideal.

How does this work on x86?



Yeah, here we just provide a mechanism to forbid injecting the data abort.
The range is fed by the guest through an HVC call, so I think it's a
guest-related issue. You had more comments about this in PATCH[9]; I will
explain a bit more there.


x86 basically relies on the EFLAGS[IF] flag. The async page fault can be
injected if it's on; otherwise, it's forbidden. It's workable because an
exception is a special kind of interrupt on x86, if I'm correct.

	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
	       !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
		 (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));


I really wish this was relying on an architected exception delivery
mechanism that can be blocked by the guest itself (PSTATE.{I,F,A}).
Trying to guess based on the PC won't fly. But these signals are
pretty hard to multiplex with anything else. Like any form of
non-architected exception injection, I don't see a good path forward
unless we start considering something like SDEI.

M.
--
Jazz is not dead. It just smells funny...


Re: [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()

2020-05-27 Thread Marc Zyngier

On 2020-05-27 03:43, Gavin Shan wrote:

Hi Mark,

On 5/26/20 8:42 PM, Mark Rutland wrote:

On Fri, May 08, 2020 at 01:29:13PM +1000, Gavin Shan wrote:

Since kvm/arm32 was removed, this renames kvm_vcpu_get_hsr() to
kvm_vcpu_get_esr() to make it a bit more self-explanatory, because the
function returns ESR instead of HSR on aarch64. This shouldn't
cause any functional changes.

Signed-off-by: Gavin Shan 


I think that this would be a nice cleanup on its own, and could be taken
independently of the rest of this series if it were rebased and sent as
a single patch.



Yeah, I'll see how PATCH[3,4,5] can be posted independently
as part of the preparatory work, which is suggested by you
in another reply.

By the way, I assume the cleanup patches are good enough to
target 5.8.rc1/rc2 if you agree.


It's fine to base them on -rc1 or -rc2. They will not be merged
before 5.9 though.

Thanks,

M.
--
Jazz is not dead. It just smells funny...


Re: [PATCH v4 6/7] KVM: MIPS: clean up redundant 'kvm_run' parameters

2020-05-27 Thread Tianjia Zhang




On 2020/4/27 13:40, Huacai Chen wrote:

Reviewed-by: Huacai Chen 

On Mon, Apr 27, 2020 at 12:35 PM Tianjia Zhang
 wrote:


In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.

Signed-off-by: Tianjia Zhang 
---
  arch/mips/include/asm/kvm_host.h |  28 +---
  arch/mips/kvm/emulate.c  |  59 ++--
  arch/mips/kvm/mips.c |  11 ++-
  arch/mips/kvm/trap_emul.c| 114 ++-
  arch/mips/kvm/vz.c   |  26 +++
  5 files changed, 87 insertions(+), 151 deletions(-)



Hi Huacai,

These two patches (6/7 and 7/7) should be merged through the MIPS
architecture tree separately. At present, there seems to be no good way
to merge the whole cross-architecture series at once.

For this series of patches, some architectures have been merged, and some
still need an updated patch.


Thanks and best,
Tianjia


Re: [PATCH RFCv2 6/9] kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode

2020-05-27 Thread Gavin Shan

Hi Mark,

On 5/26/20 8:58 PM, Mark Rutland wrote:

On Fri, May 08, 2020 at 01:29:16PM +1000, Gavin Shan wrote:

This renames user_mem_abort() to kvm_handle_user_mem_abort(), and
then exports it. The function will be used in asynchronous page fault
to populate a page table entry once the corresponding page is populated
from the backing store (e.g. swap partition):

* Parameter @fault_status is replaced by @esr.
* The parameters are reordered based on their importance.


It seems like multiple changes are going on here, and it would be
clearer with separate patches.

Passing the ESR rather than the extracted fault status seems fine, but
for clarity it'd be nicer to do this in its own patch.



Ok. I'll have a separate patch to do this.


Why is it necessary to re-order the function parameters? Does that align
with other function prototypes?



There are no explicit reasons for it. I can restore the order to what we
had previously if you like.


What exactly is the `prefault` parameter meant to do? It doesn't do
anything currently, so it'd be better to introduce it later when logic
using it is introduced, or where callers will pass distinct values.



Yes, fair enough. I will merge the logic with PATCH[7] then.


Thanks,
Mark.



Thanks,
Gavin



This shouldn't cause any functional changes.

Signed-off-by: Gavin Shan 
---
  arch/arm64/include/asm/kvm_host.h |  4 
  virt/kvm/arm/mmu.c| 14 --
  2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 32c8a675e5a4..f77c706777ec 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -437,6 +437,10 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
  struct kvm_vcpu_events *events);
  
  #define KVM_ARCH_WANT_MMU_NOTIFIER

+int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
+ struct kvm_memory_slot *memslot,
+ phys_addr_t fault_ipa, unsigned long hva,
+ bool prefault);
  int kvm_unmap_hva_range(struct kvm *kvm,
unsigned long start, unsigned long end);
  int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index e462e0368fd9..95aaabb2b1fc 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -1656,12 +1656,12 @@ static bool fault_supports_stage2_huge_mapping(struct 
kvm_memory_slot *memslot,
   (hva & ~(map_size - 1)) + map_size <= uaddr_end;
  }
  
-static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,

- struct kvm_memory_slot *memslot, unsigned long hva,
- unsigned long fault_status)
+int kvm_handle_user_mem_abort(struct kvm_vcpu *vcpu, unsigned int esr,
+ struct kvm_memory_slot *memslot,
+ phys_addr_t fault_ipa, unsigned long hva,
+ bool prefault)
  {
-   int ret;
-   u32 esr = kvm_vcpu_get_esr(vcpu);
+   unsigned int fault_status = kvm_vcpu_trap_get_fault_type(esr);
bool write_fault, writable, force_pte = false;
bool exec_fault, needs_exec;
unsigned long mmu_seq;
@@ -1674,6 +1674,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
pgprot_t mem_type = PAGE_S2;
bool logging_active = memslot_is_logging(memslot);
unsigned long vma_pagesize, flags = 0;
+   int ret;
  
  	write_fault = kvm_is_write_fault(esr);

exec_fault = kvm_vcpu_trap_is_iabt(esr);
@@ -1995,7 +1996,8 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
goto out_unlock;
}
  
-	ret = user_mem_abort(vcpu, fault_ipa, memslot, hva, fault_status);

+   ret = kvm_handle_user_mem_abort(vcpu, esr, memslot,
+   fault_ipa, hva, false);
if (ret == 0)
ret = 1;
  out:
--
2.23.0







RE: [RFC] Use SMMU HTTU for DMA dirty page tracking

2020-05-27 Thread Zengtao (B)
> -Original Message-
> From: zhengxiang (A)
> Sent: Monday, May 25, 2020 7:34 PM
> To: Jean-Philippe Brucker
> Cc: linux-arm-ker...@lists.infradead.org; kvmarm@lists.cs.columbia.edu;
> Will Deacon; Wanghaibin (D); Zengtao (B); m...@kernel.org; James Morse;
> julien.thierry.k...@gmail.com; Suzuki K Poulose; Wangzhou (B);
> io...@lists.linux-foundation.org; Kirti Wankhede; Yan Zhao;
> alex.william...@redhat.com
> Subject: Re: [RFC] Use SMMU HTTU for DMA dirty page tracking
> 
> [+cc Kirti, Yan, Alex]
> 
> On 2020/5/23 1:14, Jean-Philippe Brucker wrote:
> > Hi,
> >
> > On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote:
> >> Hi all,
> >>
> >> Is there any plan for enabling SMMU HTTU?
> >
> > Not outside of SVA, as far as I know.
> >
> 
> >> I have seen the patch locates in the SVA series patch, which adds
> >> support for HTTU:
> >> https://www.spinics.net/lists/arm-kernel/msg798694.html
> >>
> >> HTTU reduces the number of access faults on SMMU fault queue
> >> (permission faults also benefit from it).
> >>
> >> Besides reducing the faults, HTTU also helps to track dirty pages for
> >> device DMA. Is it feasible to utilize HTTU to get dirty pages on device
> >> DMA during VFIO live migration?
> >
> > As you know there is a VFIO interface for this under discussion:
> >
> https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-kwankhe
> d...@nvidia.com/
> > It doesn't implement an internal API to communicate with the IOMMU
> driver
> > about dirty pages.
> 
> >
> >> If SMMU can track dirty pages, devices are not required to implement
> >> additional dirty pages tracking to support VFIO live migration.
> >
> > It seems feasible, though tracking it in the device might be more
> > efficient. I might have misunderstood but I think for live migration of
> > the Intel NIC they trap guest accesses to the device and introspect its
> > state to figure out which pages it is accessing.
> >
> > With HTTU I suppose (without much knowledge about live migration)
> that
> > you'd need several new interfaces to the IOMMU drivers:
> >
> > * A way for VFIO to query HTTU support in the SMMU. There are some
> >   discussions about communicating more IOMMU capabilities through
> VFIO but
> >   no implementation yet. When HTTU isn't supported the DIRTY_PAGES
> bitmap
> >   would report all pages as they do now.
> >
> > * VFIO_IOMMU_DIRTY_PAGES_FLAG_START/STOP would clear the dirty
> bit
> >   for all VFIO mappings (which is going to take some time). There is a
> >   walker in io-pgtable for iova_to_phys() which could be extended. I
> >   suppose it's also possible to atomically switch the HA and HD bits in
> >   context descriptors.
> 
> Maybe we don't need to switch the HA and HD bits, just turn them on all the time?

I think we still need START/STOP: START is called when migration begins and
STOP when migration finishes.

I think you mean that during the migration (between START and STOP), the HA
and HD functionality is always turned on.

> 
> >
> > * VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP would query the
> dirty bit for all
> >   VFIO mappings.
> >
We also need a clear as part of the query, which means the dirty status has
to be reset after each query; see the illustrative hook prototypes below.
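
Something along these lines is what we have in mind for the IOMMU-side
hooks; none of these functions exist today, the names and signatures are
purely illustrative:

   /*
    * Illustrative prototypes only.  START clears the HD-managed dirty
    * state for the domain, SYNC reports and clears it (read-and-clear,
    * as discussed above), STOP disables the tracking again.
    */
   int iommu_dirty_log_start(struct iommu_domain *domain);
   int iommu_dirty_log_stop(struct iommu_domain *domain);
   int iommu_dirty_log_sync(struct iommu_domain *domain,
                            unsigned long iova, size_t size,
                            unsigned long *bitmap);   /* one bit per page */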

> 
> I think we need to consider the case of IOMMU dirty pages logging. We
> want
> to test Kirti's VFIO migration patches combined with SMMU HTTU, any
> suggestions?

@kirti @yan @alex
As we know, you have been working on this feature for a long time and it
seems to be going very well now. Thanks very much for your VFIO migration
framework, it really helps a lot. We would like to build on your work and
enable SMMU HTTU on top of it, as a form of hardware dirty page tracking
that should bring better performance. Any suggestions?

Thank you.
Zengtao 
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH RFCv2 7/9] kvm/arm64: Support async page fault

2020-05-27 Thread Gavin Shan

Hi Mark,

On 5/26/20 10:34 PM, Mark Rutland wrote:

On Fri, May 08, 2020 at 01:29:17PM +1000, Gavin Shan wrote:

There are two stages of fault pages and the stage one page fault is
handled by guest itself. The guest is trapped to host when the page
fault is caused by stage 2 page table, for example missing. The guest
is suspended until the requested page is populated. To populate the
requested page can be related to IO activities if the page was swapped
out previously. In this case, the guest has to suspend for a few of
milliseconds at least, regardless of the overall system load. There
is no useful work done during the suspended period from guest's view.


This is a bit difficult to read. How about:

| When a vCPU triggers a Stage-2 fault (e.g. when accessing a page that
| is not mapped at Stage-2), the vCPU is suspended until the host has
| handled the fault. It can take the host milliseconds or longer to
| handle the fault as this may require IO, and when the system load is
| low neither the host nor guest perform useful work during such
| periods.



Yes, it's much better.



This adds asychornous page fault to improve the situation. A signal


Nit: typo for `asynchronous` here, and there are a few other typos in
the patch itself. It would be nice if you could run a spellcheck over
that.



Sure.


(PAGE_NOT_PRESENT) is sent to guest if the requested page needs some time
to be populated. Guest might reschedule to another running process if
possible. Otherwise, the vCPU is put into power-saving mode, which is
actually to cause vCPU reschedule from host's view. A followup signal
(PAGE_READY) is sent to guest once the requested page is populated.
The suspended task is waken up or scheduled when guest receives the
signal. With this mechanism, the vCPU won't be stuck when the requested
page is being populated by host.


It would probably be best to say 'notification' rather than 'signal'
here, and say 'the guest is notified', etc. As above, it seems that this
is per-vCPU, so it's probably better to say 'vCPU' rather than guest, to
make it clear which context this applies to.



Ok.



There are more details highlighted as below. Note the implementation is
similar to what x86 has to some extent:

* A dedicated SMCCC ID is reserved to enable, disable or configure
  the functionality. The only 64-bits parameter is conveyed by two
  registers (w2/w1). Bits[63:56] is the bitmap used to specify the
  operated functionality like enabling/disabling/configuration. The
  bits[55:6] is the physical address of control block or external
  data abort injection disallowed region. Bit[5:0] are used to pass
  control flags.

* Signal (PAGE_NOT_PRESENT) is sent to guest if the requested page
  isn't ready. In the mean while, a work is started to populate the
  page asynchronously in background. The stage 2 page table entry is
  updated accordingly and another signal (PAGE_READY) is fired after
  the request page is populted. The signals is notified by injected
  data abort fault.

* The signals are fired and consumed in sequential fashion. It means
  no more signals will be fired if there is pending one, awaiting the
  guest to consume. It's because the injected data abort faults have
  to be done in sequential fashion.

Signed-off-by: Gavin Shan 
---
  arch/arm64/include/asm/kvm_host.h  |  43 
  arch/arm64/include/asm/kvm_para.h  |  27 ++
  arch/arm64/include/uapi/asm/Kbuild |   2 -
  arch/arm64/include/uapi/asm/kvm_para.h |  22 ++
  arch/arm64/kvm/Kconfig |   1 +
  arch/arm64/kvm/Makefile|   2 +
  include/linux/arm-smccc.h  |   6 +
  virt/kvm/arm/arm.c |  36 ++-
  virt/kvm/arm/async_pf.c| 335 +
  virt/kvm/arm/hypercalls.c  |   8 +
  virt/kvm/arm/mmu.c |  29 ++-
  11 files changed, 506 insertions(+), 5 deletions(-)
  create mode 100644 arch/arm64/include/asm/kvm_para.h
  create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
  create mode 100644 virt/kvm/arm/async_pf.c

diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index f77c706777ec..a207728d6f3f 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -250,6 +250,23 @@ struct vcpu_reset_state {
boolreset;
  };
  
+#ifdef CONFIG_KVM_ASYNC_PF

+
+/* Should be a power of two number */
+#define ASYNC_PF_PER_VCPU  64


What exactly is this number?



It's the maximum number of async page faults that can be pending on a
specific vCPU.
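
Maybe a comment along these lines would make it self-explanatory in the
next revision, assuming it ends up being used the same way as on x86, i.e.
as the size of a power-of-two hash table of outstanding GFNs:

   /*
    * Maximum number of outstanding async page faults per vCPU.  Must be
    * a power of two because it is used as the size of the hash table
    * tracking the GFNs of in-flight faults (the hash is masked with
    * ASYNC_PF_PER_VCPU - 1), as in the x86 implementation.
    */
   #define ASYNC_PF_PER_VCPU  64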


+
+/*
+ * The association of gfn and token. The token will be sent to guest as
+ * page fault address. Also, the guest could be in aarch32 mode. So its
+ * length should be 32-bits.
+ */


The length of what should be 32-bit? The token?

The guest sees the token as the fault address? How exactly is that
exposed to the guest, is that via a 

Re: [PATCH RFCv2 3/9] kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()

2020-05-27 Thread Gavin Shan

Hi Mark,

On 5/26/20 8:42 PM, Mark Rutland wrote:

On Fri, May 08, 2020 at 01:29:13PM +1000, Gavin Shan wrote:

Since kvm/arm32 was removed, this renames kvm_vcpu_get_hsr() to
kvm_vcpu_get_esr() to it a bit more self-explaining because the
functions returns ESR instead of HSR on aarch64. This shouldn't
cause any functional changes.

Signed-off-by: Gavin Shan 


I think that this would be a nice cleanup on its own, and could be taken
independently of the rest of this series if it were rebased and sent as
a single patch.



Yeah, I'll see how PATCH[3,4,5] can be posted independently as part of the
preparatory work, as you suggested in another reply.

By the way, I assume the cleanup patches could target 5.8-rc1/rc2 if you
agree.

Thanks,
Gavin

---
  arch/arm64/include/asm/kvm_emulate.h | 36 +++-
  arch/arm64/kvm/handle_exit.c | 12 +-
  arch/arm64/kvm/hyp/switch.c  |  2 +-
  arch/arm64/kvm/sys_regs.c|  6 ++---
  virt/kvm/arm/hyp/aarch32.c   |  2 +-
  virt/kvm/arm/hyp/vgic-v3-sr.c|  4 ++--
  virt/kvm/arm/mmu.c   |  6 ++---
  7 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index a30b4eec7cb4..bd1a69e7c104 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -265,14 +265,14 @@ static inline bool vcpu_mode_priv(const struct kvm_vcpu 
*vcpu)
return mode != PSR_MODE_EL0t;
  }
  
-static __always_inline u32 kvm_vcpu_get_hsr(const struct kvm_vcpu *vcpu)

+static __always_inline u32 kvm_vcpu_get_esr(const struct kvm_vcpu *vcpu)
  {
return vcpu->arch.fault.esr_el2;
  }
  
  static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)

  {
-   u32 esr = kvm_vcpu_get_hsr(vcpu);
+   u32 esr = kvm_vcpu_get_esr(vcpu);
  
  	if (esr & ESR_ELx_CV)

return (esr & ESR_ELx_COND_MASK) >> ESR_ELx_COND_SHIFT;
@@ -297,64 +297,66 @@ static inline u64 kvm_vcpu_get_disr(const struct kvm_vcpu 
*vcpu)
  
  static inline u32 kvm_vcpu_hvc_get_imm(const struct kvm_vcpu *vcpu)

  {
-   return kvm_vcpu_get_hsr(vcpu) & ESR_ELx_xVC_IMM_MASK;
+   return kvm_vcpu_get_esr(vcpu) & ESR_ELx_xVC_IMM_MASK;
  }
  
  static __always_inline bool kvm_vcpu_dabt_isvalid(const struct kvm_vcpu *vcpu)

  {
-   return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_ISV);
+   return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_ISV);
  }
  
  static inline unsigned long kvm_vcpu_dabt_iss_nisv_sanitized(const struct kvm_vcpu *vcpu)

  {
-   return kvm_vcpu_get_hsr(vcpu) & (ESR_ELx_CM | ESR_ELx_WNR | 
ESR_ELx_FSC);
+   return kvm_vcpu_get_esr(vcpu) &
+  (ESR_ELx_CM | ESR_ELx_WNR | ESR_ELx_FSC);
  }
  
  static inline bool kvm_vcpu_dabt_issext(const struct kvm_vcpu *vcpu)

  {
-   return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SSE);
+   return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SSE);
  }
  
  static inline bool kvm_vcpu_dabt_issf(const struct kvm_vcpu *vcpu)

  {
-   return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SF);
+   return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_SF);
  }
  
  static __always_inline int kvm_vcpu_dabt_get_rd(const struct kvm_vcpu *vcpu)

  {
-   return (kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
+   return (kvm_vcpu_get_esr(vcpu) & ESR_ELx_SRT_MASK) >> ESR_ELx_SRT_SHIFT;
  }
  
  static __always_inline bool kvm_vcpu_dabt_iss1tw(const struct kvm_vcpu *vcpu)

  {
-   return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_S1PTW);
+   return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_S1PTW);
  }
  
  static __always_inline bool kvm_vcpu_dabt_iswrite(const struct kvm_vcpu *vcpu)

  {
-   return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_WNR) ||
+   return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_WNR) ||
kvm_vcpu_dabt_iss1tw(vcpu); /* AF/DBM update */
  }
  
  static inline bool kvm_vcpu_dabt_is_cm(const struct kvm_vcpu *vcpu)

  {
-   return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_CM);
+   return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_CM);
  }
  
  static __always_inline unsigned int kvm_vcpu_dabt_get_as(const struct kvm_vcpu *vcpu)

  {
-   return 1 << ((kvm_vcpu_get_hsr(vcpu) & ESR_ELx_SAS) >> 
ESR_ELx_SAS_SHIFT);
+   return 1 << ((kvm_vcpu_get_esr(vcpu) & ESR_ELx_SAS) >>
+ESR_ELx_SAS_SHIFT);
  }
  
  /* This one is not specific to Data Abort */

  static __always_inline bool kvm_vcpu_trap_il_is32bit(const struct kvm_vcpu 
*vcpu)
  {
-   return !!(kvm_vcpu_get_hsr(vcpu) & ESR_ELx_IL);
+   return !!(kvm_vcpu_get_esr(vcpu) & ESR_ELx_IL);
  }
  
  static __always_inline u8 kvm_vcpu_trap_get_class(const struct kvm_vcpu *vcpu)

  {
-   return ESR_ELx_EC(kvm_vcpu_get_hsr(vcpu));
+   return ESR_ELx_EC(kvm_vcpu_get_esr(vcpu));
  }
  
  static inline bool kvm_vcpu_trap_is_iabt(const struct kvm_vcpu *vcpu)

@@ -364,12 +366,12 @@ static inline bool 

Re: [PATCH RFCv2 5/9] kvm/arm64: Replace hsr with esr

2020-05-27 Thread Gavin Shan

Hi Mark,

On 5/26/20 8:45 PM, Mark Rutland wrote:

On Fri, May 08, 2020 at 01:29:15PM +1000, Gavin Shan wrote:

This replace the variable names to make them self-explaining. The
tracepoint isn't changed accordingly because they're part of ABI:

* @hsr to @esr
* @hsr_ec to @ec
* Use kvm_vcpu_trap_get_class() helper if possible

Signed-off-by: Gavin Shan 


As with patch 3, I think this cleanup makes sense independent from the
rest of the series, and I think it'd make sense to bundle all the
patches renaming hsr -> esr, and send those as a preparatory series.



Yes, PATCH[3/4/5] will be posted independently, as part of the
preparatory work, as you suggested.

Thanks,
Gavin


Thanks,
Mark.


---
  arch/arm64/kvm/handle_exit.c | 28 ++--
  arch/arm64/kvm/hyp/switch.c  |  9 -
  arch/arm64/kvm/sys_regs.c| 30 +++---
  3 files changed, 33 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 00858db82a64..e3b3dcd5b811 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -123,13 +123,13 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
   */
  static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, struct kvm_run *run)
  {
-   u32 hsr = kvm_vcpu_get_esr(vcpu);
+   u32 esr = kvm_vcpu_get_esr(vcpu);
int ret = 0;
  
  	run->exit_reason = KVM_EXIT_DEBUG;

-   run->debug.arch.hsr = hsr;
+   run->debug.arch.hsr = esr;
  
-	switch (ESR_ELx_EC(hsr)) {

+   switch (kvm_vcpu_trap_get_class(esr)) {
case ESR_ELx_EC_WATCHPT_LOW:
run->debug.arch.far = vcpu->arch.fault.far_el2;
/* fall through */
@@ -139,8 +139,8 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, 
struct kvm_run *run)
case ESR_ELx_EC_BRK64:
break;
default:
-   kvm_err("%s: un-handled case hsr: %#08x\n",
-   __func__, (unsigned int) hsr);
+   kvm_err("%s: un-handled case esr: %#08x\n",
+   __func__, (unsigned int)esr);
ret = -1;
break;
}
@@ -150,10 +150,10 @@ static int kvm_handle_guest_debug(struct kvm_vcpu *vcpu, 
struct kvm_run *run)
  
  static int kvm_handle_unknown_ec(struct kvm_vcpu *vcpu, struct kvm_run *run)

  {
-   u32 hsr = kvm_vcpu_get_esr(vcpu);
+   u32 esr = kvm_vcpu_get_esr(vcpu);
  
-	kvm_pr_unimpl("Unknown exception class: hsr: %#08x -- %s\n",

- hsr, esr_get_class_string(hsr));
+   kvm_pr_unimpl("Unknown exception class: esr: %#08x -- %s\n",
+ esr, esr_get_class_string(esr));
  
  	kvm_inject_undefined(vcpu);

return 1;
@@ -230,10 +230,10 @@ static exit_handle_fn arm_exit_handlers[] = {
  
  static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)

  {
-   u32 hsr = kvm_vcpu_get_esr(vcpu);
-   u8 hsr_ec = ESR_ELx_EC(hsr);
+   u32 esr = kvm_vcpu_get_esr(vcpu);
+   u8 ec = kvm_vcpu_trap_get_class(esr);
  
-	return arm_exit_handlers[hsr_ec];

+   return arm_exit_handlers[ec];
  }
  
  /*

@@ -273,15 +273,15 @@ int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run 
*run,
  {
if (ARM_SERROR_PENDING(exception_index)) {
u32 esr = kvm_vcpu_get_esr(vcpu);
-   u8 hsr_ec = ESR_ELx_EC(esr);
+   u8 ec = kvm_vcpu_trap_get_class(esr);
  
  		/*

 * HVC/SMC already have an adjusted PC, which we need
 * to correct in order to return to after having
 * injected the SError.
 */
-   if (hsr_ec == ESR_ELx_EC_HVC32 || hsr_ec == ESR_ELx_EC_HVC64 ||
-   hsr_ec == ESR_ELx_EC_SMC32 || hsr_ec == ESR_ELx_EC_SMC64) {
+   if (ec == ESR_ELx_EC_HVC32 || ec == ESR_ELx_EC_HVC64 ||
+   ec == ESR_ELx_EC_SMC32 || ec == ESR_ELx_EC_SMC64) {
u32 adj =  kvm_vcpu_trap_il_is32bit(esr) ? 4 : 2;
*vcpu_pc(vcpu) -= adj;
}
diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
index 369f22f49f3d..7bf4840bf90e 100644
--- a/arch/arm64/kvm/hyp/switch.c
+++ b/arch/arm64/kvm/hyp/switch.c
@@ -356,8 +356,8 @@ static bool __hyp_text __populate_fault_info(struct 
kvm_vcpu *vcpu)
  static bool __hyp_text __hyp_handle_fpsimd(struct kvm_vcpu *vcpu)
  {
u32 esr = kvm_vcpu_get_esr(vcpu);
+   u8 ec = kvm_vcpu_trap_get_class(esr);
bool vhe, sve_guest, sve_host;
-   u8 hsr_ec;
  
  	if (!system_supports_fpsimd())

return false;
@@ -372,14 +372,13 @@ static bool __hyp_text __hyp_handle_fpsimd(struct 
kvm_vcpu *vcpu)
vhe = has_vhe();
}
  
-	hsr_ec = kvm_vcpu_trap_get_class(esr);

-   if (hsr_ec != ESR_ELx_EC_FP_ASIMD &&
-   hsr_ec != ESR_ELx_EC_SVE)
+   if (ec != ESR_ELx_EC_FP_ASIMD &&
+   ec != 

Re: [PATCH RFCv2 0/9] kvm/arm64: Support Async Page Fault

2020-05-27 Thread Gavin Shan

Hi Mark,

On 5/26/20 11:09 PM, Mark Rutland wrote:

At a high-level I'm rather fearful of this series. I can see many ways
that this can break, and I can also see that even if/when we get things
into a working state, constant vigilance will be required for any
changes to the entry code.

I'm not keen on injecting non-architectural exceptions in this way, and
I'm also not keen on how deep the PV hooks are injected currently (e.g.
in the ret_to_user path).



First of all, thank you for your time and your continued comments. Since
the series is tagged as RFC, it's no surprise that some things are broken.
Could you please provide more details about your concerns? If I understand
correctly, you're talking about the added entry code and the injected PV
hooks; with more details I can work out solutions.

Let me briefly explain why we need the injected PV hooks in ret_to_user:
there are two forms of wakeup, which I would call direct wakeup and delayed
wakeup. With direct wakeup, the sleeping process is woken up immediately
when the PAGE_READY notification is received from the host. However, there
are cases where a direct wakeup can't be carried out, for example when the
sleeper and the waker are the same process, or when the (CFS) runqueue is
already locked by somebody else. In those cases the wakeup is delayed until
either the idle process runs or we reach ret_to_user; that is how delayed
wakeup works. A rough sketch of the ret_to_user check is included below.
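
As a rough illustration, the check on the return-to-user path boils down to
something like the following; only kvm_async_pf_delayed_wake() itself comes
from the series, the per-CPU flag and the wakeup helper are made-up names:

   /*
    * Sketch only: the PAGE_READY handler sets a cheap per-CPU flag when
    * it cannot wake the sleeper directly, and the flag is checked from
    * the idle loop and on ret_to_user.
    */
   void kvm_async_pf_delayed_wake(void)
   {
           if (!__this_cpu_read(apf_wakeup_pending))   /* assumed flag */
                   return;

           __this_cpu_write(apf_wakeup_pending, false);
           wake_up_pending_sleepers();                 /* placeholder */
   }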


I see a few patches have preparator cleanup that I think would be
worthwhile regardless of this series; if you could factor those out and
send them on their own it would get that out of the way and make it
easier to review the series itself. Similarly, there's some duplication
of code from arch/x86 which I think can be factored out to virt/kvm
instead as preparatory work.



Yep, I agree several cleanup patches can be posted separately and merged in
advance. I will do that, and thanks for the comments.

About the code shared between arm64 and x86, I need some time to
investigate. Basically, I agree with doing so; I've also included Paolo
here to get his opinion.

No doubt these are all preparatory steps that will make the review a bit
easier, as you said :)


Generally, I also think that you need to spend some time on commit
messages and/or documentation to better explain the concepts and
expected usage. I had to reverse-engineer the series by reviewing it in
entirety before I had an idea as to how basic parts of it strung
together, and a more thorough conceptual explanation would make it much
easier to critique the approach rather than the individual patches.



Yes, sure. I will do this in the future. Sorry for the effort it took you
to reverse-engineer the series. In the next revision, I will put more
information in the cover letter and commit logs to explain how things are
designed and how they work :)


On Fri, May 08, 2020 at 01:29:10PM +1000, Gavin Shan wrote:

Testing
===
The tests are carried on the following machine. A guest with single vCPU
and 4GB memory is started. Also, the QEMU process is put into memory cgroup
(v1) whose memory limit is set to 2GB. In the guest, there are two threads,
which are memory bound and CPU bound separately. The memory bound thread
allocates all available memory, accesses and them free them. The CPU bound
thread simply executes block of "nop".


I appreciate this is a microbenchmark, but that sounds far from
realistic.

Is there a specific real workload that this is expected to be
representative of?

Can you run tests with a real workload? For example, a kernel build
inside the VM?



Yeah, I agree it's far from a realistic workload. However, it's the test
case that was suggested when async page fault was first proposed, according
to the following document; on page 34 you can see the benchmark, which is
similar to what we're doing.

https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf

OK, I will also test with a kernel build inside the VM, or another more
representative workload.


The test is carried out 5 times
continuously and the average number (per minute) of executed blocks in the
CPU bound thread is taken as indicator of improvement.

Vendor: GIGABYTE   CPU: 224 x Cavium ThunderX2(R) CPU CN9975 v2.2 @ 2.0GHz
Memory: 32GB   Disk: Fusion-MPT SAS-3 (PCIe3.0 x8)

Without-APF: 7029030180/minute = avg(7559625120 5962155840 7823208540
 7629633480 6170527920)
With-APF:8286827472/minute = avg(8464584540 8177073360 8262723180
 8095084020 8434672260)
Outcome: +17.8%

Another test case is to measure the time consumed by the application, but
with the CPU-bound thread disabled.

Without-APF: 40.3s = avg(40.6 39.3 39.2 41.6 41.2)
With-APF:40.8s = avg(40.6 41.1 40.9 41.0 40.7)
Outcome: +1.2%


So this is pure overhead in that case?



Re: [PATCH RFCv2 4/9] kvm/arm64: Detach ESR operator from vCPU struct

2020-05-27 Thread Gavin Shan

Hi Mark,

On 5/26/20 8:51 PM, Mark Rutland wrote:

On Fri, May 08, 2020 at 01:29:14PM +1000, Gavin Shan wrote:

There are a set of inline functions defined in kvm_emulate.h. Those
functions reads ESR from vCPU fault information struct and then operate
on it. So it's tied with vCPU fault information and vCPU struct. It
limits their usage scope.

This detaches these functions from the vCPU struct. With this, the
caller has flexibility on where the ESR is read. It shouldn't cause
any functional changes.

Signed-off-by: Gavin Shan 
---
  arch/arm64/include/asm/kvm_emulate.h | 83 +++-
  arch/arm64/kvm/handle_exit.c | 20 --
  arch/arm64/kvm/hyp/switch.c  | 24 ---
  arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c |  7 +-
  arch/arm64/kvm/inject_fault.c|  4 +-
  arch/arm64/kvm/sys_regs.c| 12 ++--
  virt/kvm/arm/arm.c   |  4 +-
  virt/kvm/arm/hyp/aarch32.c   |  2 +-
  virt/kvm/arm/hyp/vgic-v3-sr.c|  5 +-
  virt/kvm/arm/mmio.c  | 27 
  virt/kvm/arm/mmu.c   | 22 ---
  11 files changed, 112 insertions(+), 98 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_emulate.h 
b/arch/arm64/include/asm/kvm_emulate.h
index bd1a69e7c104..2873bf6dc85e 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -270,10 +270,8 @@ static __always_inline u32 kvm_vcpu_get_esr(const struct 
kvm_vcpu *vcpu)
return vcpu->arch.fault.esr_el2;
  }
  
-static __always_inline int kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)

+static __always_inline int kvm_vcpu_get_condition(u32 esr)


Given the `vcpu` argument has been removed, it's odd to keep `vcpu` in the
name, rather than `esr`.

e.g. this would make more sense as something like esr_get_condition().

... and if we did something like that, we could move most of the
extraction functions into , and share them with non-KVM code.

Otherwise, do you need to extract all of these for your use-case, or do
you only need a few of the helpers? If you only need a few, it might be
better to only factor those out for now, and keep the existing API in
place with wrappers, e.g. have:

| esr_get_condition(u32 esr) {
|   ...
| }
|
| kvm_vcpu_get_condition(const struct kvm_vcpu *vcpu)
| {
|   return esr_get_condition(kvm_vcpu_get_esr(vcpu));
| }



Sure, I'll follow approach #1: move these helper functions to asm/esr.h and
drop "vcpu" from their names. I don't think it makes sense to maintain two
sets of helpers for such simple logic, so the helpers will be called where
needed, as below:

   esr_get_condition(u32 esr) { ... }

   bool __hyp_text kvm_condition_valid32(const struct kvm_vcpu *vcpu)
   {
           int cond = esr_get_condition(kvm_vcpu_get_esr(vcpu));
           :
   }
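
For completeness, the moved helper would look roughly like this in
asm/esr.h, with the same behaviour as the existing helper (returning -1
when ESR_ELx_CV is clear), just keyed off the raw ESR value:

   static inline int esr_get_condition(u32 esr)
   {
           if (esr & ESR_ELx_CV)
                   return (esr & ESR_ELx_COND_MASK) >> ESR_ELx_COND_SHIFT;

           return -1;
   }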

Thanks,
Gavin

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH v4 3/7] KVM: PPC: Remove redundant kvm_run from vcpu_arch

2020-05-27 Thread Tianjia Zhang




On 2020/5/27 12:20, Paul Mackerras wrote:

On Mon, Apr 27, 2020 at 12:35:10PM +0800, Tianjia Zhang wrote:

The 'kvm_run' field already exists in the 'vcpu' structure, which
is the same structure as the 'kvm_run' in the 'vcpu_arch' and
should be deleted.

Signed-off-by: Tianjia Zhang 


Thanks, patches 3 and 4 of this series applied to my kvm-ppc-next branch.

Paul.



Thanks for your suggestion; for 5/7, I will submit a new version of the
patch.

Thanks,
Tianjia
___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


Re: [PATCH RFCv2 9/9] arm64: Support async page fault

2020-05-27 Thread Paolo Bonzini
Hi Gavin,

I definitely appreciate the work, but this is repeating most of the
mistakes done in the x86 implementation.  In particular:

- the page ready signal can be done as an interrupt, rather than an
exception.  This is because "page ready" can be handled asynchronously,
in contrast to "page not present" which must be done on the same
instruction that triggers it.  You can refer to the recent series from
Vitaly Kuznetsov that switched "page ready" to an interrupt.

- the page not present is reusing the memory abort exception, and
there's really no reason to do so.  I think it would be best if ARM
could reserve one ESR exception code for the hypervisor.  Mark, any
ideas how to proceed here?

- for x86 we're also thinking of initiating the page fault from the
exception handler, rather than doing so from the hypervisor before
injecting the exception.  If ARM leads the way here, we would do our
best to share code when x86 does the same.

- do not bother with using KVM_ASYNC_PF_SEND_ALWAYS, it's a fringe case
that adds a lot of complexity.

Also, please include me on further iterations of the series.

Thanks!

Paolo

On 08/05/20 05:29, Gavin Shan wrote:
> This supports asynchronous page fault for the guest. The design is
> similar to what x86 has: on receiving a PAGE_NOT_PRESENT signal from
> the host, the current task is either rescheduled or put into power
> saving mode. The task will be waken up when PAGE_READY signal is
> received. The PAGE_READY signal might be received in the context
> of the suspended process, to be waken up. That means the suspended
> process has to wake up itself, but it's not safe and prone to cause
> dead-lock on CPU runqueue lock. So the wakeup is delayed on returning
> from kernel space to user space or idle process is picked for running.
> 
> The signals are conveyed through the async page fault control block,
> which was passed to host on enabling the functionality. On each page
> fault, the control block is checked and switch to the async page fault
> handling flow if any signals exist.
> 
> The feature is put into the CONFIG_KVM_GUEST umbrella, which is added
> by this patch. So we have inline functions implemented in kvm_para.h,
> like other architectures do, to check if async page fault (one of the
> KVM para-virtualized features) is available. Also, the kernel boot
> parameter "no-kvmapf" can be specified to disable the feature.
> 
> Signed-off-by: Gavin Shan 
> ---
>  arch/arm64/Kconfig |  11 +
>  arch/arm64/include/asm/exception.h |   3 +
>  arch/arm64/include/asm/kvm_para.h  |  27 +-
>  arch/arm64/kernel/entry.S  |  33 +++
>  arch/arm64/kernel/process.c|   4 +
>  arch/arm64/mm/fault.c  | 434 +
>  6 files changed, 505 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 40fb05d96c60..2d5e5ee62d6d 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1045,6 +1045,17 @@ config PARAVIRT
> under a hypervisor, potentially improving performance significantly
> over full virtualization.
>  
> +config KVM_GUEST
> + bool "KVM Guest Support"
> + depends on PARAVIRT
> + default y
> + help
> +   This option enables various optimizations for running under the KVM
> +   hypervisor. Overhead for the kernel when not running inside KVM should
> +   be minimal.
> +
> +   In case of doubt, say Y
> +
>  config PARAVIRT_TIME_ACCOUNTING
>   bool "Paravirtual steal time accounting"
>   select PARAVIRT
> diff --git a/arch/arm64/include/asm/exception.h 
> b/arch/arm64/include/asm/exception.h
> index 7a6e81ca23a8..d878afa42746 100644
> --- a/arch/arm64/include/asm/exception.h
> +++ b/arch/arm64/include/asm/exception.h
> @@ -46,4 +46,7 @@ void bad_el0_sync(struct pt_regs *regs, int reason, 
> unsigned int esr);
>  void do_cp15instr(unsigned int esr, struct pt_regs *regs);
>  void do_el0_svc(struct pt_regs *regs);
>  void do_el0_svc_compat(struct pt_regs *regs);
> +#ifdef CONFIG_KVM_GUEST
> +void kvm_async_pf_delayed_wake(void);
> +#endif
>  #endif   /* __ASM_EXCEPTION_H */
> diff --git a/arch/arm64/include/asm/kvm_para.h 
> b/arch/arm64/include/asm/kvm_para.h
> index 0ea481dd1c7a..b2f8ef243df7 100644
> --- a/arch/arm64/include/asm/kvm_para.h
> +++ b/arch/arm64/include/asm/kvm_para.h
> @@ -3,6 +3,20 @@
>  #define _ASM_ARM_KVM_PARA_H
>  
>  #include 
> +#include 
> +#include 
> +
> +#ifdef CONFIG_KVM_GUEST
> +static inline int kvm_para_available(void)
> +{
> + return 1;
> +}
> +#else
> +static inline int kvm_para_available(void)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_KVM_GUEST */
>  
>  static inline bool kvm_check_and_clear_guest_paused(void)
>  {
> @@ -11,17 +25,16 @@ static inline bool kvm_check_and_clear_guest_paused(void)
>  
>  static inline unsigned int kvm_arch_para_features(void)
>  {
> - return 0;
> + unsigned int features = 0;
> +
> + if 

Re: [RFC] Use SMMU HTTU for DMA dirty page tracking

2020-05-27 Thread Xiang Zheng


On 2020/5/27 11:27, Tian, Kevin wrote:
>> From: Xiang Zheng
>> Sent: Monday, May 25, 2020 7:34 PM
>>
>> [+cc Kirti, Yan, Alex]
>>
>> On 2020/5/23 1:14, Jean-Philippe Brucker wrote:
>>> Hi,
>>>
>>> On Tue, May 19, 2020 at 05:42:55PM +0800, Xiang Zheng wrote:
 Hi all,

 Is there any plan for enabling SMMU HTTU?
>>>
>>> Not outside of SVA, as far as I know.
>>>
>>
 I have seen the patch locates in the SVA series patch, which adds
 support for HTTU:
 https://www.spinics.net/lists/arm-kernel/msg798694.html

 HTTU reduces the number of access faults on SMMU fault queue
 (permission faults also benefit from it).

 Besides reducing the faults, HTTU also helps to track dirty pages for
 device DMA. Is it feasible to utilize HTTU to get dirty pages on device
 DMA during VFIO live migration?
>>>
>>> As you know there is a VFIO interface for this under discussion:
>>> https://lore.kernel.org/kvm/1589781397-28368-1-git-send-email-
>> kwankh...@nvidia.com/
>>> It doesn't implement an internal API to communicate with the IOMMU
>> driver
>>> about dirty pages.
> 
> We plan to add such API later, e.g. to utilize A/D bit in VT-d 2nd-level 
> page tables (Rev 3.0). 
> 

Thank you, Kevin.

When will you send that patch series? Hopefully we can also support
hardware-based dirty page tracking via common APIs built on top of your
patches. :)

>>
>>>
 If SMMU can track dirty pages, devices are not required to implement
 additional dirty pages tracking to support VFIO live migration.
>>>
>>> It seems feasible, though tracking it in the device might be more
>>> efficient. I might have misunderstood but I think for live migration of
>>> the Intel NIC they trap guest accesses to the device and introspect its
>>> state to figure out which pages it is accessing.
> 
> Does HTTU implement A/D-like mechanism in SMMU page tables, or just
> report dirty pages in a log buffer? Either way tracking dirty pages in IOMMU
> side is generic thus doesn't require device-specific tweak like in Intel NIC.
> 

Currently HTTU just implements an A/D-like mechanism in the SMMU page
tables. We certainly hope the SMMU will also implement a PML-like feature
so that we can avoid walking the whole page table to collect the dirty
pages; a simplified sketch of such a walk is below.
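
A heavily simplified sketch of such a walk; the pte helpers are
placeholders (real code would live next to the io-pgtable walker), and with
DBM a leaf counts as dirty once the hardware has made it writable:

   static void sync_dirty_ptes(u64 *ptep, int entries, unsigned long iova,
                               size_t pgsize, unsigned long *bitmap)
   {
           int i;

           for (i = 0; i < entries; i++, iova += pgsize) {
                   if (!iopte_is_leaf(ptep[i]) || !iopte_dirty(ptep[i]))
                           continue;

                   set_bit(iova / pgsize, bitmap);
                   iopte_mkclean(&ptep[i]);        /* read-and-clear */
           }
   }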

By the way, I'm not sure whether HTTU or SLAD can help with mediated
devices.

-- 
Thanks,
Xiang

___
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


RE: [RFC PATCH v12 07/11] psci: Add hypercall service for kvm ptp.

2020-05-27 Thread Jianyong Wu
Hi Steven,

> -Original Message-
> From: Steven Price 
> Sent: Tuesday, May 26, 2020 7:02 PM
> To: Jianyong Wu ; net...@vger.kernel.org;
> yangbo...@nxp.com; john.stu...@linaro.org; t...@linutronix.de;
> pbonz...@redhat.com; sean.j.christopher...@intel.com; m...@kernel.org;
> richardcoch...@gmail.com; Mark Rutland ;
> w...@kernel.org; Suzuki Poulose 
> Cc: linux-ker...@vger.kernel.org; linux-arm-ker...@lists.infradead.org;
> kvmarm@lists.cs.columbia.edu; k...@vger.kernel.org; Steve Capper
> ; Kaly Xin ; Justin He
> ; Wei Chen ; nd 
> Subject: Re: [RFC PATCH v12 07/11] psci: Add hypercall service for kvm ptp.
> 
> On 25/05/2020 03:11, Jianyong Wu wrote:
> > Hi Steven,
> 
> Hi Jianyong,
> 
> [...]>>> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
> >>> index db6dce3d0e23..c964122f8dae 100644
> >>> --- a/virt/kvm/arm/hypercalls.c
> >>> +++ b/virt/kvm/arm/hypercalls.c
> >>> @@ -3,6 +3,7 @@
> >>>
> >>>#include 
> >>>#include 
> >>> +#include 
> >>>
> >>>#include 
> >>>
> >>> @@ -11,6 +12,10 @@
> >>>
> >>>int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> >>>{
> >>> +#ifdef CONFIG_ARM64_KVM_PTP_HOST
> >>> + struct system_time_snapshot systime_snapshot;
> >>> + u64 cycles;
> >>> +#endif
> >>>   u32 func_id = smccc_get_function(vcpu);
> >>>   u32 val[4] = {SMCCC_RET_NOT_SUPPORTED};
> >>>   u32 feature;
> >>> @@ -70,7 +75,49 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> >>>   break;
> >>>   case ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID:
> >>>   val[0] = BIT(ARM_SMCCC_KVM_FUNC_FEATURES);
> >>> +
> >>> +#ifdef CONFIG_ARM64_KVM_PTP_HOST
> >>> + val[0] |= BIT(ARM_SMCCC_KVM_FUNC_KVM_PTP); #endif
> >>>   break;
> >>> +
> >>> +#ifdef CONFIG_ARM64_KVM_PTP_HOST
> >>> + /*
> >>> +  * This serves virtual kvm_ptp.
> >>> +  * Four values will be passed back.
> >>> +  * reg0 stores high 32-bit host ktime;
> >>> +  * reg1 stores low 32-bit host ktime;
> >>> +  * reg2 stores high 32-bit difference of host cycles and cntvoff;
> >>> +  * reg3 stores low 32-bit difference of host cycles and cntvoff.
> >>> +  */
> >>> + case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
> >>> + /*
> >>> +  * system time and counter value must captured in the same
> >>> +  * time to keep consistency and precision.
> >>> +  */
> >>> + ktime_get_snapshot(&systime_snapshot);
> >>> + if (systime_snapshot.cs_id != CSID_ARM_ARCH_COUNTER)
> >>> + break;
> >>> + val[0] = upper_32_bits(systime_snapshot.real);
> >>> + val[1] = lower_32_bits(systime_snapshot.real);
> >>> + /*
> >>> +  * which of virtual counter or physical counter being
> >>> +  * asked for is decided by the first argument.
> >>> +  */
> >>> + feature = smccc_get_arg1(vcpu);
> >>> + switch (feature) {
> >>> + case ARM_SMCCC_VENDOR_HYP_KVM_PTP_PHY_FUNC_ID:
> >>> + cycles = systime_snapshot.cycles;
> >>> + break;
> >>> + default:
> >>
> >> There's something a bit odd here.
> >>
> >> ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID and
> >> ARM_SMCCC_VENDOR_HYP_KVM_PTP_PHY_FUNC_ID look like they
> should be
> >> names of separate (top-level) functions, but actually the _PHY_ one
> >> is a parameter for the first. If the intention is to have a parameter
> >> then it would be better to pick a better name for the _PHY_ define
> >> and not define it using ARM_SMCCC_CALL_VAL.
> >>
> > Yeah, _PHY_ is not the same meaning with _PTP_FUNC_ID,  so I think it
> should be a different name.
> > What about ARM_SMCCC_VENDOR_HYP_KVM_PTP_PHY_COUNTER?
> 
> Personally I'd go with something much shorter, e.g.
> ARM_PTP_PHY_COUNTER.
> This is just an argument to an SMCCC call so there's no need for most of the
> prefix, indeed if (for whatever reason) there was a non-SMCCC mechanism
> added to do the same thing it would be reasonable to reuse the same values.
> 
OK, this shorter name is better.

> >> Second the use of "default:" means that there's no possibility to
> >> later extend this interface for more clocks if needed in the future.
> >>
> > I think we can add more clocks by adding more cases, this "default" means
> we can use no first arg to determine the default clock.
> 
> The problem with the 'default' is it means it's not possible to probe whether
> the kernel supports any more clocks. If we used a different value (that the
> kernel doesn't support) then we end up in the default case and have no idea
> whether the clock value is the one we requested or not.
> 
Yeah, that's more meaningful. We should return the exact value back to the
user.

> It's generally better when defining an ABI to explicitly return an error for
> unknown parameters, that way a future user of the ABI can discover
> whether the call did what was expected or not.
> 

OK, I will fix it. Roughly what I have in mind is sketched below.
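
(The counter-selection names below follow your ARM_PTP_PHY_COUNTER
suggestion, vcpu_cntvoff() is an assumed helper, and whether the error
should be SMCCC_RET_NOT_SUPPORTED or a dedicated invalid-parameter code is
still open.)

   case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
           /*
            * Validate the requested clock first so that an unknown value
            * fails explicitly instead of silently selecting a default.
            */
           feature = smccc_get_arg1(vcpu);
           if (feature != ARM_PTP_VIRT_COUNTER &&
               feature != ARM_PTP_PHY_COUNTER) {
                   val[0] = SMCCC_RET_NOT_SUPPORTED;
                   break;
           }

           ktime_get_snapshot(&systime_snapshot);
           if (systime_snapshot.cs_id != CSID_ARM_ARCH_COUNTER)
                   break;

           val[0] = upper_32_bits(systime_snapshot.real);
           val[1] = lower_32_bits(systime_snapshot.real);
           cycles = systime_snapshot.cycles;
           if (feature == ARM_PTP_VIRT_COUNTER)
                   cycles -= vcpu_cntvoff(vcpu);   /* assumed helper */
           val[2] = upper_32_bits(cycles);
           val[3] = lower_32_bits(cycles);
           break;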

> >> Alternatively you could indeed implement as two top-level functions