RE: [PATCH v4 3/7] KVM: arm/arm64: Fix the documentation

2015-10-08 Thread Pavel Fedin
 Hello!

> > --- a/Documentation/virtual/kvm/devices/arm-vgic.txt
> > +++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
> > @@ -44,28 +44,29 @@ Groups:
> >Attributes:
> >  The attr field of kvm_device_attr encodes two values:
> >  bits: | 63 ....  40 | 39 ..  32  |  31 ....  0 |
> > -values:   |  reserved   |   cpu id   |   offset   |
> > +values:   |  reserved   |  cpu idx   |   offset   |
> 
> why should this be changed to cpu idx?

 Because it's an index (from 0 to N - 1), and "cpu id" may mislead readers into
thinking it should be an MPIDR affinity value. In the register access function
we do "vcpu = kvm_get_vcpu(dev->kvm, cpuid);" (see here:
http://lxr.free-electrons.com/source/virt/kvm/arm/vgic-v2-emul.c#L664), and
kvm_get_vcpu just indexes the array:
http://lxr.free-electrons.com/source/include/linux/kvm_host.h#L427
 I decided to change this after
http://www.spinics.net/lists/kvm-arm/msg16359.html, where Andre clearly
mistook this ID for an affinity value.
 Before GICv3 nobody saw the difference, because we had only up to 16 CPUs,
with IDs from 0 to 15, i.e. corresponding to indexes.
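
 For reference, a minimal sketch of the lookup in question (simplified from the
kvm_host.h of that era; the point is that the argument is a plain array index,
not an affinity value):

--- cut ---
static inline struct kvm_vcpu *kvm_get_vcpu(struct kvm *kvm, int i)
{
	smp_rmb();		/* vcpu slot is filled before online_vcpus is bumped */
	return kvm->vcpus[i];	/* 'i' is an index from 0 to N - 1 */
}
--- cut ---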

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




Re: [PATCH v2] KVM: nVMX: expose VPID capability to L1

2015-10-08 Thread Wanpeng Li

On 9/29/15 6:39 PM, Paolo Bonzini wrote:

On 29/09/2015 04:55, Wanpeng Li wrote:

Expose VPID capability to L1.

Signed-off-by: Wanpeng Li 
---
v1 -> v2:
  * set only VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT

Thanks.  I've now checked your implementation more thoroughly against the
SDM, and there are a few things missing between this patch and the
one that emulates INVVPID:

- you're not setting bit 32 of the VMX_EPT_VPID_CAP MSR

- you were not checking against the supported types in the
implementation of the INVVPID instruction

- the memory operand must always be read even if it isn't needed (e.g.,
for type==global), similar to INVEPT

- for single-context invalidation you're not checking that VPID != 0,
though in practice that doesn't matter because we don't want to support
single-context invalidation

- you're always setting the MSR's bits to 1 even if !enable_vpid


I believe you mean always setting the MSR's bits to 1 when !enable_ept 
and enable_vpid.
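
For context, a hedged sketch of the intended MSR composition (mirroring the
macros used in patch 5/5 of the series; a sketch of the logic being discussed,
not the verbatim patch):

--- cut ---
u64 ept_vpid_caps = 0;

if (enable_ept)
	/* EPT capabilities live in bits 31:0 of IA32_VMX_EPT_VPID_CAP */
	ept_vpid_caps |= VMX_EPT_PAGE_WALK_4_BIT | VMX_EPTP_WB_BIT |
			 VMX_EPT_INVEPT_BIT | VMX_EPT_EXTENT_GLOBAL_BIT;
if (enable_vpid)
	/* VPID capabilities live in bits 63:32; bit 32 reports INVVPID */
	ept_vpid_caps |= (VMX_VPID_INVVPID_BIT |
			  VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT) << 32;
--- cut ---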




At this point it's better if you resend the whole nested VPID
implementation, i.e. the following five patches:

KVM: VMX: adjust interface to allocate/free_vpid
KVM: VMX: introduce __vmx_flush_tlb to handle specific vpid
KVM: nVMX: emulate the INVVPID instruction
KVM: nVMX: nested VPID emulation
KVM: nVMX: expose VPID capability to L1

with the above issues fixed.


I just sent out v2.


Please also send kvm-unit-tests patches
that test for the error cases.


I will do it later.

Regards,
Wanpeng Li


[PATCH v2 1/5] KVM: VMX: adjust interface to allocate/free_vpid

2015-10-08 Thread Wanpeng Li
Adjust allocate/free_vpid so that they can be reused for the nested vpid.

Reviewed-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6407674..3c9e2a4a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4155,29 +4155,28 @@ static int alloc_identity_pagetable(struct kvm *kvm)
return r;
 }
 
-static void allocate_vpid(struct vcpu_vmx *vmx)
+static int allocate_vpid(void)
 {
int vpid;
 
-   vmx->vpid = 0;
if (!enable_vpid)
-   return;
+   return 0;
	spin_lock(&vmx_vpid_lock);
vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
-   if (vpid < VMX_NR_VPIDS) {
-   vmx->vpid = vpid;
+   if (vpid < VMX_NR_VPIDS)
__set_bit(vpid, vmx_vpid_bitmap);
-   }
+   else
+   vpid = 0;
	spin_unlock(&vmx_vpid_lock);
+   return vpid;
 }
 
-static void free_vpid(struct vcpu_vmx *vmx)
+static void free_vpid(int vpid)
 {
-   if (!enable_vpid)
+   if (!enable_vpid || vpid == 0)
return;
	spin_lock(&vmx_vpid_lock);
-   if (vmx->vpid != 0)
-   __clear_bit(vmx->vpid, vmx_vpid_bitmap);
+   __clear_bit(vpid, vmx_vpid_bitmap);
	spin_unlock(&vmx_vpid_lock);
 }
 
@@ -8492,7 +8491,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 
if (enable_pml)
vmx_disable_pml(vmx);
-   free_vpid(vmx);
+   free_vpid(vmx->vpid);
leave_guest_mode(vcpu);
vmx_load_vmcs01(vcpu);
free_nested(vmx);
@@ -8511,7 +8510,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
if (!vmx)
return ERR_PTR(-ENOMEM);
 
-   allocate_vpid(vmx);
+   vmx->vpid = allocate_vpid();
 
	err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
if (err)
@@ -8587,7 +8586,7 @@ free_msrs:
 uninit_vcpu:
	kvm_vcpu_uninit(&vmx->vcpu);
 free_vcpu:
-   free_vpid(vmx);
+   free_vpid(vmx->vpid);
kmem_cache_free(kvm_vcpu_cache, vmx);
return ERR_PTR(err);
 }
-- 
1.9.1



[PATCH v2 2/5] KVM: VMX: introduce __vmx_flush_tlb to handle specific vpid

2015-10-08 Thread Wanpeng Li
Introduce __vmx_flush_tlb() to handle a specific vpid. It will be
used by later patches; note that the "all context" variant can
be mapped to vpid_sync_vcpu_single with vpid02 as the argument
(a nice side effect of the vpid02 design).

Reviewed-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3c9e2a4a..215db2b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1337,13 +1337,13 @@ static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
 __loaded_vmcs_clear, loaded_vmcs, 1);
 }
 
-static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
+static inline void vpid_sync_vcpu_single(int vpid)
 {
-   if (vmx->vpid == 0)
+   if (vpid == 0)
return;
 
if (cpu_has_vmx_invvpid_single())
-   __invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vmx->vpid, 0);
+   __invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vpid, 0);
 }
 
 static inline void vpid_sync_vcpu_global(void)
@@ -1352,10 +1352,10 @@ static inline void vpid_sync_vcpu_global(void)
__invvpid(VMX_VPID_EXTENT_ALL_CONTEXT, 0, 0);
 }
 
-static inline void vpid_sync_context(struct vcpu_vmx *vmx)
+static inline void vpid_sync_context(int vpid)
 {
if (cpu_has_vmx_invvpid_single())
-   vpid_sync_vcpu_single(vmx);
+   vpid_sync_vcpu_single(vpid);
else
vpid_sync_vcpu_global();
 }
@@ -3441,9 +3441,9 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
 
 #endif
 
-static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
+static inline void __vmx_flush_tlb(struct kvm_vcpu *vcpu, int vpid)
 {
-   vpid_sync_context(to_vmx(vcpu));
+   vpid_sync_context(vpid);
if (enable_ept) {
if (!VALID_PAGE(vcpu->arch.mmu.root_hpa))
return;
@@ -3451,6 +3451,11 @@ static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
}
 }
 
+static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
+{
+   __vmx_flush_tlb(vcpu, to_vmx(vcpu)->vpid);
+}
+
 static void vmx_decache_cr0_guest_bits(struct kvm_vcpu *vcpu)
 {
ulong cr0_guest_owned_bits = vcpu->arch.cr0_guest_owned_bits;
@@ -4784,7 +4789,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_fpu_activate(vcpu);
update_exception_bitmap(vcpu);
 
-   vpid_sync_context(vmx);
+   vpid_sync_context(vmx->vpid);
 }
 
 /*
-- 
1.9.1



[PATCH v2 0/5] KVM: nVMX: nested VPID emulation

2015-10-08 Thread Wanpeng Li
v1 -> v2:
 * set bit 32 of the VMX_EPT_VPID_CAP MSR
 * check against the supported types in the implementation of 
   the INVVPID instruction
 * the memory operand must always be read even if it isn't needed 
   (e.g., for type==global), similar to INVEPT
 * for single-context invalidation, check that VPID != 0, though in
   practice that doesn't matter because we don't want to support
   single-context invalidation
 * don't set the MSR's EPT-related bits if !enable_ept


VPID is used to tag address spaces and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes the VPID when switching
between L1 and L2.

This series advertises VPID to the L1 hypervisor, so that the address spaces of
L1 and L2 can be treated separately, avoiding a TLB flush when switching between
L1 and L2. For each nested vmentry, if vpid12 is changed, the shadow vpid is
reused with an invvpid.

Performance:

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
--------------------------------------------------------------------------------
Host   OS            2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                     ctxsw  ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw
------ ------------- ------ ------ ------ ------ ------ ------- -------
kernel Linux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.6     2.88000  nested VPID
kernel Linux 3.5.0-1 1.2600 1.4300 1.5600 12.7   12.9   3.49000 7.46000  vanilla


Wanpeng Li (5):
  KVM: VMX: adjust interface to allocate/free_vpid
  KVM: VMX: introduce __vmx_flush_tlb to handle specific vpid
  KVM: nVMX: emulate the INVVPID instruction
  KVM: nVMX: nested VPID emulation
  KVM: nVMX: expose VPID capability to L1

 arch/x86/include/asm/vmx.h |   3 +++
 arch/x86/kvm/vmx.c         | 164 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 129 insertions(+), 38 deletions(-)

-- 
1.9.1



Re: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Christoffer Dall
On Thu, Oct 08, 2015 at 12:10:35PM +0300, Pavel Fedin wrote:
>  Hello!
> 
[...]
> 
> > The architecture defines how to address a specific CPU, and that's using
> > the MPIDR, not inventing our own scheme, so that's what we should do.
> 
>  But doesn't the same apply to GICv2 then? It just so happened that we had
> Aff0 == idx, so it
> looks like nobody cared.

I don't think it's about caring, but we (I) didn't consider it when
designing that API.

>  My argument is to have the GICv3 API as similar to GICv2 as possible; IMHO that
> would make it easier to
> maintain the code, and perhaps allow some of it to be reused.

Plenty of things are broken about the VGICv2 implementation and API, so
my goal is not to have GICv3 be similar to GICv2, but to design a good
API.

> 
> > For example, I don't think you had the full 32 bits available to address
> > a CPU in your proposal, so if nothing else, that proposal had less
> > expressive power than what the architecture defines.
> 
>  My proposal was:
> 
> --- cut ---
>   KVM_DEV_ARM_VGIC_GRP_DIST_REGS
>   Attributes:
> The attr field of kvm_device_attr encodes two values:
> bits: |  63  | 62 .. 52 | 51 ..  32  |  31 ....  0 |
> values:   | size | reserved |  cpu idx   |  offset |
> 
> All distributor regs can be accessed as (rw, 32-bit)
> For GICv3 some registers are actually (rw, 64-bit) according to the
> specification. In order to perform a full 64-bit access, the 'size' bit should be
> set to 1. KVM_DEV_ARM_VGIC_64BIT flag value is provided for this purpose.
> --- cut ---
> 
>  Bit 63 was meant to specify access size (u64 or u32), and bits 62 - 52 were
> reserved just in order to match the ARM64_SYS_REG() macro, which uses these
> bits for its own purposes.
> 
>  But, since your proposal suggests that all GICv3 accesses are 64-bit wide, 
> and we use our own system
> register encoding (I don't see any serious problems with this), it is
> possible just to use bits
> 63...32 for vCPU index. So, maximum number of CPUs would be the same 
> 0x as in your proposal,
> and the API would be fully compatible with the GICv2 one (well, except access
> size), and we would use
> the same definitions and the same approach to handle both. This would bring 
> more consistency and
> less confusion to userspace developers who use the API.

I don't agree; using the same API with slightly different semantics
depending on which device you created is much more error prone than
having a clear separation between the two different APIs, IMHO.

> 
>  By the way, the rest of the KVM API (KVM_IRQ_LINE, for example) also uses a
> vCPU index.
> 
>  That's all my arguments for vCPU index instead of affinity value

I'm not convinced that we should be compatible with GICv2 at all.  (The
deeper argument here is that GICv2 had an architectural limitation to 8
CPUs so all the CPU addressing is simple in that case.  This is a
fundamental evolution from GICv2 to GICv3 so it is only natural that
there are substantial changes in this area.)

I'll let Marc or Peter chime in if they disagree.

>, and if you say "that doesn't
> count, we don't have to be compatible with GICv2", I'll accept this and
> proceed with the rewrite.
> 

Cool!

Thanks,
-Christoffer


RE: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Pavel Fedin
 Hello!

> +The mpidr encoding is based on the affinity information in the
> +architecture defined MPIDR, and the field is encoded as follows:
> +  | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
> +  |    Aff3    |    Aff2    |    Aff1    |    Aff0    |

 One concern about this... Does it already have "We are Bosses, we Decided it, 
It's not subject to
change, We do not care" status? Actually, the current approach of using an index
instead of Aff bits
works pretty well. It works fine with qemu too, because internally qemu also
maintains a CPU index,
which it uses for GICv2 accesses.
 Keeping the index allows more code to be reused, and provides better backwards
compatibility. So could we
do this?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




RE: [PATCH v3 00/16] KVM: arm64: GICv3 ITS emulation

2015-10-08 Thread Pavel Fedin
 Hello!

 Sorry for taking up your time, and thank you very much for the explanation.

> I'd appreciate if you could try to read and understand the architecture
> spec instead of randomly googling and quoting various bits of
> irrelevant information.

 I apologize for not having had time to read the whole spec from beginning to
end. I can only add that it's quite weird to have these important things in the
"Terminology" section; I would expect them to be in 6.1, for example. That was
the part I read, but I failed to find the exact answer:
--- cut ---
LPIs do not have an active state, and transition to the inactive state on being 
acknowledged by a PE
--- cut ---

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




[PATCH v2 5/5] KVM: nVMX: expose VPID capability to L1

2015-10-08 Thread Wanpeng Li
Expose VPID capability to L1. For nested guests, we don't do anything 
specific for single context invalidation. Hence, only advertise support 
for global context invalidation. The major benefit of nested VPID comes 
from having separate vpids when switching between L1 and L2, and also
when L2's vCPUs are not scheduled in/out on L1.

Reviewed-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 31d272e..22b4dc7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -442,7 +442,7 @@ struct nested_vmx {
u32 nested_vmx_true_entry_ctls_low;
u32 nested_vmx_misc_low;
u32 nested_vmx_misc_high;
-   u32 nested_vmx_ept_caps;
+   u64 nested_vmx_ept_vpid_caps;
 };
 
 #define POSTED_INTR_ON  0
@@ -2489,18 +2489,22 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
/* nested EPT: emulate EPT also to L1 */
vmx->nested.nested_vmx_secondary_ctls_high |=
SECONDARY_EXEC_ENABLE_EPT;
-   vmx->nested.nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT |
+   vmx->nested.nested_vmx_ept_vpid_caps = VMX_EPT_PAGE_WALK_4_BIT |
 VMX_EPTP_WB_BIT | VMX_EPT_2MB_PAGE_BIT |
 VMX_EPT_INVEPT_BIT;
-   vmx->nested.nested_vmx_ept_caps &= vmx_capability.ept;
+   vmx->nested.nested_vmx_ept_vpid_caps &= vmx_capability.ept;
/*
 * For nested guests, we don't do anything specific
 * for single context invalidation. Hence, only advertise
 * support for global context invalidation.
 */
-   vmx->nested.nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT;
+   vmx->nested.nested_vmx_ept_vpid_caps |= VMX_EPT_EXTENT_GLOBAL_BIT;
} else
-   vmx->nested.nested_vmx_ept_caps = 0;
+   vmx->nested.nested_vmx_ept_vpid_caps = 0;
+
+   if (enable_vpid)
+   vmx->nested.nested_vmx_ept_vpid_caps |= (VMX_VPID_INVVPID_BIT |
+   VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT) << 32;
 
if (enable_unrestricted_guest)
vmx->nested.nested_vmx_secondary_ctls_high |=
@@ -2616,8 +2620,7 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
vmx->nested.nested_vmx_secondary_ctls_high);
break;
case MSR_IA32_VMX_EPT_VPID_CAP:
-   /* Currently, no nested vpid support */
-   *pdata = vmx->nested.nested_vmx_ept_caps;
+   *pdata = vmx->nested.nested_vmx_ept_vpid_caps;
break;
default:
return 1;
@@ -7152,7 +7155,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
 
if (!(vmx->nested.nested_vmx_secondary_ctls_high &
  SECONDARY_EXEC_ENABLE_EPT) ||
-   !(vmx->nested.nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
+   !(vmx->nested.nested_vmx_ept_vpid_caps & VMX_EPT_INVEPT_BIT)) {
kvm_queue_exception(vcpu, UD_VECTOR);
return 1;
}
@@ -7168,7 +7171,7 @@ static int handle_invept(struct kvm_vcpu *vcpu)
vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf);
 
-   types = (vmx->nested.nested_vmx_ept_caps >> VMX_EPT_EXTENT_SHIFT) & 6;
+   types = (vmx->nested.nested_vmx_ept_vpid_caps >> VMX_EPT_EXTENT_SHIFT) & 6;
 
if (!(types & (1UL << type))) {
nested_vmx_failValid(vcpu,
@@ -7207,14 +7210,15 @@ static int handle_invept(struct kvm_vcpu *vcpu)
 static int handle_invvpid(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
-   u32 vmx_instruction_info;
+   u32 vmx_instruction_info, types;
unsigned long type;
gva_t gva;
struct x86_exception e;
int vpid;
 
if (!(vmx->nested.nested_vmx_secondary_ctls_high &
- SECONDARY_EXEC_ENABLE_VPID)) {
+ SECONDARY_EXEC_ENABLE_VPID) ||
+   !(vmx->nested.nested_vmx_ept_vpid_caps & (VMX_VPID_INVVPID_BIT << 32))) {
kvm_queue_exception(vcpu, UD_VECTOR);
return 1;
}
@@ -7225,6 +7229,14 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf);
 
+   types = (vmx->nested.nested_vmx_ept_vpid_caps >> VMX_VPID_EXTENT_SHIFT) & 0x7;
+
+   if (!(types & (1UL << type))) {
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   return 1;
+   }
+
/* according to the intel vmx instruction 

[PATCH v2 4/5] KVM: nVMX: nested VPID emulation

2015-10-08 Thread Wanpeng Li
VPID is used to tag address spaces and avoid a TLB flush. Currently L0 uses
the same VPID to run L1 and all its guests. KVM flushes the VPID when switching
between L1 and L2.

This patch advertises VPID to the L1 hypervisor, so that the address spaces of
L1 and L2 can be treated separately, avoiding a TLB flush when switching between
L1 and L2. For each nested vmentry, if vpid12 is changed, the shadow vpid is
reused with an invvpid.

Performance:

run lmbench on L2 w/ 3.5 kernel.

Context switching - times in microseconds - smaller is better
--------------------------------------------------------------------------------
Host   OS            2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                     ctxsw  ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw
------ ------------- ------ ------ ------ ------ ------ ------- -------
kernel Linux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.6     2.88000  nested VPID
kernel Linux 3.5.0-1 1.2600 1.4300 1.5600 12.7   12.9   3.49000 7.46000  vanilla

Reviewed-by: Jan Kiszka 
Reviewed-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 39 ++++++++++++++++++++++++++++++++-------
 1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 87d042a..31d272e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -424,6 +424,9 @@ struct nested_vmx {
/* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */
u64 vmcs01_debugctl;
 
+   u16 vpid02;
+   u16 last_vpid;
+
u32 nested_vmx_procbased_ctls_low;
u32 nested_vmx_procbased_ctls_high;
u32 nested_vmx_true_procbased_ctls_low;
@@ -1157,6 +1160,11 @@ static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
 }
 
+static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)
+{
+   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
+}
+
 static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
 {
return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
@@ -2471,6 +2479,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+   SECONDARY_EXEC_ENABLE_VPID |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
SECONDARY_EXEC_WBINVD_EXITING |
@@ -6680,6 +6689,7 @@ static void free_nested(struct vcpu_vmx *vmx)
return;
 
vmx->nested.vmxon = false;
+   free_vpid(vmx->nested.vpid02);
nested_release_vmcs12(vmx);
if (enable_shadow_vmcs)
free_vmcs(vmx->nested.current_shadow_vmcs);
@@ -7234,7 +7244,7 @@ static int handle_invvpid(struct kvm_vcpu *vcpu)
VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
return 1;
}
-   vmx_flush_tlb(vcpu);
+   __vmx_flush_tlb(vcpu, to_vmx(vcpu)->nested.vpid02);
nested_vmx_succeed(vcpu);
break;
default:
@@ -8610,8 +8620,10 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
goto free_vmcs;
}
 
-   if (nested)
+   if (nested) {
nested_vmx_setup_ctls_msrs(vmx);
+   vmx->nested.vpid02 = allocate_vpid();
+   }
 
vmx->nested.posted_intr_nv = -1;
vmx->nested.current_vmptr = -1ull;
@@ -8632,6 +8644,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
	return &vmx->vcpu;
 
 free_vmcs:
+   free_vpid(vmx->nested.vpid02);
free_loaded_vmcs(vmx->loaded_vmcs);
 free_msrs:
kfree(vmx->guest_msrs);
@@ -9493,12 +9506,24 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 
if (enable_vpid) {
/*
-* Trivially support vpid by letting L2s share their parent
-* L1's vpid. TODO: move to a more elaborate solution, giving
-* each L2 its own vpid and exposing the vpid feature to L1.
+* There is no direct mapping between vpid02 and vpid12, the
+* vpid02 is per-vCPU for L0 and reused while the value of
+* vpid12 is changed w/ one invvpid during nested vmentry.
+* The vpid12 is allocated by L1 for L2, so it will not
+* influence the global bitmap (for vpid01 and vpid02 allocation)
+* even if spawn a lot of nested vCPUs.
 */
-   vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
-   vmx_flush_tlb(vcpu);
+   if (nested_cpu_has_vpid(vmcs12) && vmx->nested.vpid02) {
+   
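
(The archive truncates the hunk here. Based on the cover letter's description --
"for each nested vmentry, if vpid12 is changed, reuse shadow vpid w/ an
invvpid" -- a sketch of how the branch plausibly continues; not the verbatim
patch text:)

--- cut ---
	vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
	if (vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
		/* vpid12 changed: invalidate the shadow vpid's stale mappings */
		vmx->nested.last_vpid = vmcs12->virtual_processor_id;
		__vmx_flush_tlb(vcpu, vmx->nested.vpid02);
	}
} else {
	vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
	vmx_flush_tlb(vcpu);
}
--- cut ---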

[PATCH v2 3/5] KVM: nVMX: emulate the INVVPID instruction

2015-10-08 Thread Wanpeng Li
Add the INVVPID instruction emulation.

Reviewed-by: Wincy Van 
Signed-off-by: Wanpeng Li 
---
 arch/x86/include/asm/vmx.h |  3 +++
 arch/x86/kvm/vmx.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 448b7ca..af5fdaf 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -397,8 +397,10 @@ enum vmcs_field {
 #define IDENTITY_PAGETABLE_PRIVATE_MEMSLOT (KVM_USER_MEM_SLOTS + 2)
 
 #define VMX_NR_VPIDS   (1 << 16)
+#define VMX_VPID_EXTENT_INDIVIDUAL_ADDR 0
 #define VMX_VPID_EXTENT_SINGLE_CONTEXT 1
 #define VMX_VPID_EXTENT_ALL_CONTEXT2
+#define VMX_VPID_EXTENT_SHIFT  40
 
 #define VMX_EPT_EXTENT_INDIVIDUAL_ADDR 0
 #define VMX_EPT_EXTENT_CONTEXT 1
@@ -416,6 +418,7 @@ enum vmcs_field {
 #define VMX_EPT_EXTENT_CONTEXT_BIT (1ull << 25)
 #define VMX_EPT_EXTENT_GLOBAL_BIT  (1ull << 26)
 
+#define VMX_VPID_INVVPID_BIT (1ull << 0) /* (32 - 32) */
 #define VMX_VPID_EXTENT_SINGLE_CONTEXT_BIT  (1ull << 9) /* (41 - 32) */
 #define VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT  (1ull << 10) /* (42 - 32) */
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 215db2b..87d042a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7196,7 +7196,54 @@ static int handle_invept(struct kvm_vcpu *vcpu)
 
 static int handle_invvpid(struct kvm_vcpu *vcpu)
 {
-   kvm_queue_exception(vcpu, UD_VECTOR);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   u32 vmx_instruction_info;
+   unsigned long type;
+   gva_t gva;
+   struct x86_exception e;
+   int vpid;
+
+   if (!(vmx->nested.nested_vmx_secondary_ctls_high &
+ SECONDARY_EXEC_ENABLE_VPID)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf);
+
+   /* according to the intel vmx instruction reference, the memory
+* operand is read even if it isn't needed (e.g., for type==global)
+*/
+   if (get_vmx_mem_address(vcpu, vmcs_readl(EXIT_QUALIFICATION),
+   vmx_instruction_info, false, &gva))
+   return 1;
+   if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, &vpid,
+   sizeof(u32), &e)) {
+   kvm_inject_page_fault(vcpu, &e);
+   return 1;
+   }
+
+   switch (type) {
+   case VMX_VPID_EXTENT_ALL_CONTEXT:
+   if (get_vmcs12(vcpu)->virtual_processor_id == 0) {
+   nested_vmx_failValid(vcpu,
+   VMXERR_INVALID_OPERAND_TO_INVEPT_INVVPID);
+   return 1;
+   }
+   vmx_flush_tlb(vcpu);
+   nested_vmx_succeed(vcpu);
+   break;
+   default:
+   /* Trap single context invalidation invvpid calls */
+   BUG_ON(1);
+   break;
+   }
+
+   skip_emulated_instruction(vcpu);
return 1;
 }
 
-- 
1.9.1



Re: [PATCH] KVM: arm/arm64: Fix memory leak if timer initialization fails

2015-10-08 Thread Christoffer Dall
On Tue, Oct 06, 2015 at 11:14:35AM +0300, Pavel Fedin wrote:
> Jump to the correct label and free kvm_host_cpu_state
> 
> Signed-off-by: Pavel Fedin 
> ---
>  arch/arm/kvm/arm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index dc017ad..78b2869 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -1080,7 +1080,7 @@ static int init_hyp_mode(void)
>*/
>   err = kvm_timer_hyp_init();
>   if (err)
> - goto out_free_mappings;
> + goto out_free_context;
>  
>  #ifndef CONFIG_HOTPLUG_CPU
>   free_boot_hyp_pgd();

Nice catch.

Applied, thanks.

-Christoffer


RE: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Pavel Fedin
 Hello!

> >  One concern about this... Does it already have "We are Bosses, we Decided
> > it, It's not subject to
> > change, We do not care" status?
> 
> That's a rather negative question.

 Sorry, I didn't want to offend anyone. I just wanted to say that I know that
you, as maintainers, have much more power than I do, and you can always say
"it's a political decision, we just want it and that's final"; if you choose to
do that, I'm perfectly OK with it, just say so.

> The architecture defines how to address a specific CPU, and that's using
> the MPIDR, not inventing our own scheme, so that's what we should do.

 But doesn't the same apply to GICv2 then? It just so happened that we had Aff0
== idx, so it
looks like nobody cared.
 My argument is to have the GICv3 API as similar to GICv2 as possible; IMHO that
would make it easier to
maintain the code, and perhaps allow some of it to be reused.

> For example, I don't think you had the full 32 bits available to address
> a CPU in your proposal, so if nothing else, that proposal had less
> expressive power than what the architecture defines.

 My proposal was:

--- cut ---
  KVM_DEV_ARM_VGIC_GRP_DIST_REGS
  Attributes:
The attr field of kvm_device_attr encodes two values:
bits: |  63  | 62 .. 52 | 51 ..  32  |  31 ....  0 |
values:   | size | reserved |  cpu idx   |  offset |

All distributor regs can be accessed as (rw, 32-bit)
For GICv3 some registers are actually (rw, 64-bit) according to the
specification. In order to perform a full 64-bit access, the 'size' bit should be
set to 1. KVM_DEV_ARM_VGIC_64BIT flag value is provided for this purpose.
--- cut ---

 Bit 63 was meant to specify access size (u64 or u32), and bits 62 - 52 were
reserved just in order to match the ARM64_SYS_REG() macro, which uses these
bits for its own purposes.

 But, since your proposal suggests that all GICv3 accesses are 64-bit wide, and 
we use our own system
register encoding (I don't see any serious problems with this), it is
just to use bits
63...32 for vCPU index. So, maximum number of CPUs would be the same 0x 
as in your proposal,
and the API would be fully compatible with the GICv2 one (well, except access
size), and we would use
the same definitions and the same approach to handle both. This would bring 
more consistency and
less confusion to userspace developers who use the API.

 By the way, the rest of the KVM API (KVM_IRQ_LINE, for example) also uses a
vCPU index.

 That's all my arguments for vCPU index instead of affinity value, and if you 
say "that doesn't
count, we don't have to be compatible with GICv2", I'll accept this and proceed
with the rewrite.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




Re: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Christoffer Dall
On Thu, Oct 08, 2015 at 10:17:09AM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > +The mpidr encoding is based on the affinity information in the
> > +architecture defined MPIDR, and the field is encoded as follows:
> > +  | 63  56 | 55  48 | 47  40 | 39  32 |
> > +  |Aff3|Aff2|Aff1|Aff0|
> 
>  One concern about this... Does it already have "We are Bosses, we Decided 
> it, It's not subject to
> change, We do not care" status? 

That's a rather negative question.

We spent a fair amount of time discussing this during SFO15 and arrived
at the conclusion that this was the way to go.

If there are good arguments for why this is not sufficient or causes
problems, then of course we'll make changes as required.

> Actually, the current approach of using an index instead of Aff bits
> works pretty well. It works fine with qemu too, because internally qemu also
> maintains a CPU index,
> which it uses for GICv2 accesses.
>  Keeping the index allows more code to be reused, and provides better backwards
> compatibility. So could we
> do this?
> 

The architecture defines how to address a specific CPU, and that's using
the MPIDR, not inventing our own scheme, so that's what we should do.

For example, I don't think you had the full 32 bits available to address
a CPU in your proposal, so if nothing else, that proposal had less
expressive power than what the architecture defines.

I also don't buy the argument that this saves a lot of code.

If you have something that deals with a cpu index, surely two simple
functions can allow the same amount of code reuse:

unsigned long mpidr_to_vcpu_idx(unsigned long mpidr);
unsigned long vcpu_idx_to_mpidr(unsigned long vcpu_idx);
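
For illustration, a minimal sketch of such helpers, assuming the kernel's
kvm_for_each_vcpu() iterator and the arm64 kvm_vcpu_get_mpidr_aff() accessor
(and taking a struct kvm argument that the prototypes above leave implicit):

--- cut ---
unsigned long mpidr_to_vcpu_idx(struct kvm *kvm, unsigned long mpidr)
{
	struct kvm_vcpu *vcpu;
	int i;

	kvm_for_each_vcpu(i, vcpu, kvm)
		if (kvm_vcpu_get_mpidr_aff(vcpu) == (mpidr & MPIDR_HWID_BITMASK))
			return i;
	return ~0UL;	/* no vcpu with that affinity value */
}

unsigned long vcpu_idx_to_mpidr(struct kvm *kvm, unsigned long vcpu_idx)
{
	return kvm_vcpu_get_mpidr_aff(kvm_get_vcpu(kvm, vcpu_idx));
}
--- cut ---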

-Christoffer


Re: [PATCH v4 3/7] KVM: arm/arm64: Fix the documentation

2015-10-08 Thread Christoffer Dall
On Thu, Oct 08, 2015 at 09:52:02AM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > > --- a/Documentation/virtual/kvm/devices/arm-vgic.txt
> > > +++ b/Documentation/virtual/kvm/devices/arm-vgic.txt
> > > @@ -44,28 +44,29 @@ Groups:
> > >Attributes:
> > >  The attr field of kvm_device_attr encodes two values:
> > >  bits: | 63 ....  40 | 39 ..  32  |  31 ....  0 |
> > > -values:   |  reserved   |   cpu id   |   offset   |
> > > +values:   |  reserved   |  cpu idx   |   offset   |
> > 
> > why should this be changed to cpu idx?
> 
>  Because it's an index (from 0 to N - 1), and "cpu id" may mislead readers
> into thinking it should be an MPIDR affinity value. In the register access
> function we do "vcpu = kvm_get_vcpu(dev->kvm, cpuid);" (see here:
> http://lxr.free-electrons.com/source/virt/kvm/arm/vgic-v2-emul.c#L664), and
> kvm_get_vcpu just indexes the array:
> http://lxr.free-electrons.com/source/include/linux/kvm_host.h#L427
>  I decided to change this after
> http://www.spinics.net/lists/kvm-arm/msg16359.html, where Andre clearly
> mistook this ID for an affinity value.
>  Before GICv3 nobody saw the difference, because we had only up to 16 CPUs,
> with IDs from 0 to 15, i.e. corresponding to indexes.
> 
ok, fair enough.  This kind of rationale is helpful to put in the commit
text though.

-Christoffer


RE: [PATCH 0/2] KVM: arm/arm64: Clean up some obsolete code

2015-10-08 Thread Pavel Fedin
 Hello!

> Yes, I am looking at merging this. From the discussion with Pavel I
> remember some things that I disagreed with, so I may propose a follow-up
> patch. I will give this a try tomorrow.

 I am replying to this thread because it relates to the whole changeset. After
the merge, some pieces are missing from the point of view of my live-migration
work in progress (see the patch below). Together with this, and with vITS v3
backported to v4.2.2:

Tested-by: Pavel Fedin 

---
From b08e9ba1da69f9cf5731c89a4ff39561cd16e6ea Mon Sep 17 00:00:00 2001
From: Pavel Fedin 
Date: Thu, 8 Oct 2015 14:43:23 +0300
Subject: [PATCH] Missing vITS pieces for live migration and safety

1. Split up vits_init() and perform allocations on PENDBASER access. Fixes a
   crash when setting LPI registers from userspace before any vCPU has
   been run. The bug is triggered by the reset routine in qemu.

2. Properly handle LPIs in vgic_unqueue_irqs(). Memory is no longer corrupted
   if an LPI happens to get in there.

Signed-off-by: Pavel Fedin 
---
 virt/kvm/arm/its-emul.c | 26 +++---
 virt/kvm/arm/its-emul.h |  1 +
 virt/kvm/arm/vgic-v3-emul.c | 11 ++-
 virt/kvm/arm/vgic.c | 15 +++
 4 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/virt/kvm/arm/its-emul.c b/virt/kvm/arm/its-emul.c
index b40a7fc..b1d61df 100644
--- a/virt/kvm/arm/its-emul.c
+++ b/virt/kvm/arm/its-emul.c
@@ -1117,7 +1117,9 @@ int vits_init(struct kvm *kvm)
 {
	struct vgic_dist *dist = &kvm->arch.vgic;
	struct vgic_its *its = &dist->its;
-   int ret;
+
+   if (dist->pendbaser)
+   return 0;
 
dist->pendbaser = kcalloc(dist->nr_cpus, sizeof(u64), GFP_KERNEL);
if (!dist->pendbaser)
@@ -1132,18 +1134,27 @@ int vits_init(struct kvm *kvm)
	INIT_LIST_HEAD(&its->device_list);
	INIT_LIST_HEAD(&its->collection_list);
 
-   ret = vgic_register_kvm_io_dev(kvm, dist->vgic_its_base,
-  KVM_VGIC_V3_ITS_SIZE, vgicv3_its_ranges,
-  -1, &its->iodev);
-   if (ret)
-   return ret;
-
its->enabled = false;
dist->msis_require_devid = true;
 
return 0;
 }
 
+int vits_map_resources(struct kvm *kvm)
+{
+   struct vgic_dist *dist = &kvm->arch.vgic;
+   struct vgic_its *its = &dist->its;
+   int ret;
+
+   ret = vits_init(kvm);
+   if (ret)
+   return ret;
+
+   return vgic_register_kvm_io_dev(kvm, dist->vgic_its_base,
+  KVM_VGIC_V3_ITS_SIZE, vgicv3_its_ranges,
+  -1, &its->iodev);
+}
+
 void vits_destroy(struct kvm *kvm)
 {
	struct vgic_dist *dist = &kvm->arch.vgic;
@@ -1182,6 +1193,7 @@ void vits_destroy(struct kvm *kvm)
kfree(its->buffer_page);
kfree(dist->pendbaser);
 
+   dist->pendbaser = NULL;
its->enabled = false;
	spin_unlock(&its->lock);
 }
diff --git a/virt/kvm/arm/its-emul.h b/virt/kvm/arm/its-emul.h
index 95e56a7..236f153 100644
--- a/virt/kvm/arm/its-emul.h
+++ b/virt/kvm/arm/its-emul.h
@@ -34,6 +34,7 @@
 
 void vgic_enable_lpis(struct kvm_vcpu *vcpu);
 int vits_init(struct kvm *kvm);
+int vits_map_resources(struct kvm *kvm);
 void vits_destroy(struct kvm *kvm);
 
 int vits_inject_msi(struct kvm *kvm, struct kvm_msi *msi);
diff --git a/virt/kvm/arm/vgic-v3-emul.c b/virt/kvm/arm/vgic-v3-emul.c
index 3e262f3..b340202 100644
--- a/virt/kvm/arm/vgic-v3-emul.c
+++ b/virt/kvm/arm/vgic-v3-emul.c
@@ -693,6 +693,15 @@ static bool handle_mmio_pendbaser_redist(struct kvm_vcpu *vcpu,
	struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
int mode = ACCESS_READ_VALUE;
 
+   /*
+* This makes sure that dist->pendbaser is allocated.
+* We don't use any locks here because the actual initialization will
+* happen either during register access from userspace, or upon first
+* run. Both situations are already single-threaded.
+*/
+   if (vits_init(vcpu->kvm))
+   return false;
+
/* Storing a value with LPIs already enabled is undefined */
mode |= dist->lpis_enabled ? ACCESS_WRITE_IGNORED : ACCESS_WRITE_VALUE;
vgic_reg64_access(mmio, offset,
@@ -880,7 +889,7 @@ static int vgic_v3_map_resources(struct kvm *kvm,
}
 
if (vgic_has_its(kvm)) {
-   ret = vits_init(kvm);
+   ret = vits_map_resources(kvm);
if (ret)
goto out_unregister;
}
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index b32baa0..8dbbb1a 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -711,6 +711,7 @@ bool vgic_handle_cfg_reg(u32 *reg, struct kvm_exit_mmio *mmio,
  */
 void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
 {
+   struct vgic_dist *dist = &vcpu->kvm->arch.vgic;
    struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
u64 elrsr = vgic_get_elrsr(vcpu);

Re: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Peter Maydell
On 8 October 2015 at 10:10, Pavel Fedin  wrote:
>  Sorry, didn't want to offend anyone. I just wanted to tell that i know
> that you, as maintainers, have much more power than i do, and you can
> always say "it's political decision, we just want it and that's final",
> and if you choose to do this, i'm perfectly OK with that, just say it.

This isn't intended to be a political decision; it's our joint
technical opinion on the best design for this API. Since we all
happened to be in one physical location the other week it was a good
opportunity for us to work through how we thought the API should
look from a technical perspective. At that point it seemed to us
to be clearer to write up the results of that discussion as a
single patch to the API documentation, rather than doing it by
(for instance) making lots of different review comments on your patches.
Christoffer's API documentation patch is a patch like any other and
it has to go through review. If there are parts where you don't
understand the rationale or you think we got something wrong you
should let us know.

thanks
-- PMM


RE: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Pavel Fedin
 Hello!

> Well, compatibility with GICv2 is the biggest mistake we made when
> designing the GICv3 architecture. And that's why our emulation doesn't
> give a damn about v2 compatibility.

 Ok, I see your arguments, and after that it becomes a matter of personal
taste. Three beat one, so I'll
go your way. :) I don't know if I'll be able to roll out v5 tomorrow, but I
think on Monday I'll
definitely do it.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




Re: [PATCH 0/2] KVM: arm/arm64: Clean up some obsolete code

2015-10-08 Thread Christoffer Dall
On Thu, Oct 08, 2015 at 12:15:06PM +0100, Andre Przywara wrote:
> Hi,
> 
> On 08/10/15 11:56, Marc Zyngier wrote:
> > On 08/10/15 11:14, Christoffer Dall wrote:
> >> Hi Pavel,
> >>
> >> On Fri, Oct 02, 2015 at 05:44:27PM +0300, Pavel Fedin wrote:
> >>> Current KVM code has lots of old redundancies, which can be cleaned up.
> >>> This patchset is actually a better alternative to
> >>> http://www.spinics.net/lists/arm-kernel/msg430726.html, which allows us to
> >>> keep piggy-backed LRs. The idea is based on the fact that our code also
> >>> maintains LR state in elrsr, and this information is enough to track LR
> >>> usage.
> >>>
> >>> This patchset is made against linux-next of 02.10.2015. Thanks to Andre
> >>> for pointing out some 4.3 specifics.
> >>>
> >> I'm not opposed to these changes; they clean up the data structures,
> >> which is definitely a good thing.
> >>
> >> I am a bit worried about how/if this is going to conflict with the ITS
> >> series and other patches in flight touching the vgic.
> >>
> >> Marc/Andre, any thoughts on this?
> > 
> > I don't mind the simplification (Andre was already removing the
> > piggybacking stuff as part of his ITS series). I'm a bit more cautious
> > about the sync_elrsr stuff, but that's mostly because I've only read the
> > patch in a superficial way.
> > 
> > But yes, this is probably going to clash, unless we make this part of an
> > existing series (/me looks at André... ;-)
> 
> Yes, I am looking at merging this. From the discussion with Pavel I
> remember some things that I disagreed with, so I may propose a follow-up
> patch. I will give this a try tomorrow.
> 
From a maintainer perspective I'd much prefer Andre take these patches
as part of his series and send everything together in one go.

Thanks,
-Christoffer


RE: [PATCH 0/2] KVM: arm/arm64: Clean up some obsolete code

2015-10-08 Thread Pavel Fedin
 Hello!

> I don't mind the simplification (Andre was already removing the
> piggybacking stuff as part of his ITS series). I'm a bit more cautious
> about the sync_elrsr stuff, but that's mostly because I've only read the
> patch in a superficial way.

 If you are really afraid of it, you can omit 2/2. The reason why I've done it
is exactly what I
said in the commit message - LR setting and ELRSR sync *always* go together.
 The main part of the cleanup is 1/2.

> But yes, this is probably going to clash, unless we make this part of an
> existing series (/me looks at André... ;-)

 It clashes a little bit; patch No 0012 needs small modifications, but I'd say
they are trivial. If
you want, I could rebase the whole thing on top of the current kvmarm.git myself.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




Re: [PATCH 0/2] KVM: arm/arm64: Clean up some obsolete code

2015-10-08 Thread Marc Zyngier
On 08/10/15 11:14, Christoffer Dall wrote:
> Hi Pavel,
> 
> On Fri, Oct 02, 2015 at 05:44:27PM +0300, Pavel Fedin wrote:
>> Current KVM code has lots of old redundancies, which can be cleaned up.
>> This patchset is actually a better alternative to
>> http://www.spinics.net/lists/arm-kernel/msg430726.html, which allows us to
>> keep piggy-backed LRs. The idea is based on the fact that our code also
>> maintains LR state in elrsr, and this information is enough to track LR
>> usage.
>>
>> This patchset is made against linux-next of 02.10.2015. Thanks to Andre
>> for pointing out some 4.3 specifics.
>>
> I'm not opposed to these changes; they clean up the data structures,
> which is definitely a good thing.
> 
> I am a bit worried about how/if this is going to conflict with the ITS
> series and other patches in flight touching the vgic.
> 
> Marc/Andre, any thoughts on this?

I don't mind the simplification (Andre was already removing the
piggybacking stuff as part of his ITS series). I'm a bit more cautious
about the sync_elrsr stuff, but that's mostly because I've only read the
patch in a superficial way.

But yes, this is probably going to clash, unless we make this part of an
existing series (/me looks at André... ;-)

M.
-- 
Jazz is not dead. It just smells funny...


RE: [PATCH 0/2] KVM: arm/arm64: Clean up some obsolete code

2015-10-08 Thread Pavel Fedin
 Hello!

> I am a bit worries about how/if this is going to conflict with the ITS
> series and other patches in flight touchignt he vgic.

 Just to note: of course I test them together. This works fine at least with
vITS v2, where it
replaces patch 0001 from the original series. I guess it should also work fine
with v3, replacing
patches 0001 and 0002. I am merging this at the moment.
 Yes, vgic_unqueue_lpi() falls out of use, but this is temporary, because it
will be very useful for
live migration.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




AIO requests may be disordered by Qemu-kvm iothread with disk cache=writethrough, Bug or Feature?

2015-10-08 Thread charlie.song
Dear KVM Developers: 
I am Xiang Song from UCloud. We have recently encountered a weird
phenomenon with the Qemu-KVM IOthread.
We recently tried to use Linux AIO from the guest OS and found that the IOthread
mechanism of Qemu-KVM will reorder I/O requests from the guest OS
even when the AIO write requests are issued from a single thread in order. This
does not happen on the host OS, however.
We are not sure whether this is a feature of the Qemu-KVM IOthread mechanism or
a bug.
 
The testbed is as follows (the guest disk device cache is configured to
writethrough):
CPU: Intel(R) Xeon(R) CPU E5-2650
QEMU version: 1.5.3
Host/Guest Kernel:  Both Linux 4.1.8 & Linux 2.6.32, OS type CentOS 6.5
Simplified Guest OS qemu cmd:  
/usr/libexec/qemu-kvm -machine rhel6.3.0,accel=kvm,usb=off -cpu kvm64 -smp 
8,sockets=8,cores=1,threads=1 
-drive 
file=/var/lib/libvirt/images/song-disk.img,if=none,id=drive-virtio-disk0,format=qcow2,serial=UCLOUD_DISK_VDA,cache=writethrough
 
-device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:22:d5:52,bus=pci.0,addr=0x4

The test code triggering this phenomenon works as follows: it uses the Linux AIO
API to issue concurrent async write requests to a file. During execution it
continuously writes data into the target test file. There are 'X' jobs in total,
and each job is assigned a job id JOB_ID which starts from 0. Each job writes
16 * 512 bytes of data into the target file at offset = JOB_ID * 512 (the data
is the repeated uint64_t JOB_ID).
There is only one thread handling the 'X' jobs one by one through the Linux AIO
(io_submit) command. While handling jobs, it continuously issues AIO requests
without waiting for AIO callbacks. When it finishes, the file should look like:
 [0...0][1...1][2...2][3...3]...[X-1...X-1]
Then we use a check program to test the resulting file: it reads the first
8 bytes (uint64_t) of each sector and prints them out. In normal cases,
its output is like:
  0 1 2 3 ... X-1

Exec output (with X=32):
In our guest OS, the output is abnormal: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 17 18 18 18 18 18 18 24 25 26 27 28 29 30 31.
It can be seen that job20~job24 are overwritten by job19.
In our host OS, the output is as expected: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31.


I can provide the example code if needed.
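
For reference, a minimal sketch of a reproducer along the lines described above
(a reconstruction, not our exact test code; the file name and constants are
placeholders):

--- cut ---
/* gcc -o aiotest aiotest.c -laio */
#include <libaio.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define X      32	/* number of jobs */
#define SECTOR 512
#define NSECT  16	/* each job writes 16 sectors */

int main(void)
{
	io_context_t ctx = 0;
	int fd = open("testfile.dat", O_RDWR | O_CREAT, 0644);
	uint64_t job;
	size_t i;

	io_setup(X, &ctx);
	for (job = 0; job < X; job++) {
		struct iocb *cb = malloc(sizeof(*cb));
		uint64_t *buf = malloc(NSECT * SECTOR);

		for (i = 0; i < NSECT * SECTOR / sizeof(*buf); i++)
			buf[i] = job;	/* the repeated uint64_t JOB_ID */
		/* overlapping writes: job N covers sectors N .. N+15 */
		io_prep_pwrite(cb, fd, buf, NSECT * SECTOR, job * SECTOR);
		io_submit(ctx, 1, &cb);	/* no waiting for AIO callbacks */
	}
	/* wait for all completions with io_getevents() before checking */
	close(fd);
	return 0;
}
--- cut ---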

Best regards, song

2015-10-08


charlie.song 
  


Re: [PATCH 0/2] KVM: arm/arm64: Clean up some obsolete code

2015-10-08 Thread Christoffer Dall
Hi Pavel,

On Fri, Oct 02, 2015 at 05:44:27PM +0300, Pavel Fedin wrote:
> Current KVM code has lots of old redundancies, which can be cleaned up.
> This patchset is actually a better alternative to
> http://www.spinics.net/lists/arm-kernel/msg430726.html, which allows us to
> keep piggy-backed LRs. The idea is based on the fact that our code also
> maintains LR state in elrsr, and this information is enough to track LR
> usage.
> 
> This patchset is made against linux-next of 02.10.2015. Thanks to Andre
> for pointing out some 4.3 specifics.
> 
I'm not opposed to these changes; they clean up the data structures,
which is definitely a good thing.

I am a bit worried about how/if this is going to conflict with the ITS
series and other patches in flight touching the vgic.

Marc/Andre, any thoughts on this?

-Christoffer


Re: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Marc Zyngier
On 08/10/15 10:28, Christoffer Dall wrote:
> On Thu, Oct 08, 2015 at 12:10:35PM +0300, Pavel Fedin wrote:
>>  Hello!
>>
> [...]
>>
>>> The architecture defines how to address a specific CPU, and that's using
>>> the MPIDR, not inventing our own scheme, so that's what we should do.
>>
>>  But doesn't the same apply to GICv2 then? It just so happened that we had
>> Aff0 == idx, so it
>> looks like nobody cared.
> 
> I don't think it's about caring, but we (I) didn't consider it when
> designing that API.
> 
>>  My argument is to have the GICv3 API as similar to GICv2 as possible; IMHO that
>> would make it easier to
>> maintain the code, and perhaps allow some of it to be reused.
> 
> Plenty of things are broken about the VGICv2 implementation and API, so
> my goal is not to have GICv3 be similar to GICv2, but to design a good
> API.
> 
>>
>>> For example, I don't think you had the full 32 bits available to address
>>> a CPU in your proposal, so if nothing else, that proposal had less
>>> expressive power than what the architecture defines.
>>
>>  My proposal was:
>>
>> --- cut ---
>>   KVM_DEV_ARM_VGIC_GRP_DIST_REGS
>>   Attributes:
>> The attr field of kvm_device_attr encodes two values:
>> bits: |  63  | 62 .. 52 | 51 ..  32  |  31 ....  0 |
>> values:   | size | reserved |  cpu idx   |  offset |
>>
>> All distributor regs can be accessed as (rw, 32-bit)
>> For GICv3 some registers are actually (rw, 64-bit) according to the
>> specification. In order to perform a full 64-bit access, the 'size' bit should be
>> set to 1. KVM_DEV_ARM_VGIC_64BIT flag value is provided for this purpose.
>> --- cut ---
>>
>>  Bit 63 was meant to specify access size (u64 or u32), and bits 62 - 52 were
>> reserved just in order to match the ARM64_SYS_REG() macro, which uses these
>> bits for its own purposes.
>>
>>  But, since your proposal suggests that all GICv3 accesses are 64-bit wide, 
>> and we use our own system
>> register encoding (I don't see any serious problems with this), it is
>> possible just to use bits
>> 63...32 for vCPU index. So, maximum number of CPUs would be the same 
>> 0x as in your proposal,
>> and the API would be fully compatible with the GICv2 one (well, except access
>> size), and we would use
>> the same definitions and the same approach to handle both. This would bring 
>> more consistency and
>> less confusion to userspace developers who use the API.
> 
> I don't agree; using the same API with slightly different semantics
> depending on which device you created is much more error prone than
> having a clear separation between the two different APIs, IMHO.
> 
>>
>>  By the way, the rest of the KVM API (KVM_IRQ_LINE, for example) also uses a
>> vCPU index.
>>
>>  That's all my arguments for vCPU index instead of affinity value
> 
> I'm not convinced that we should be compatible with GICv2 at all.  (The
> deeper argument here is that GICv2 had an architectural limitation to 8
> CPUs so all the CPU addressing is simple in that case.  This is a
> fundamental evolution from GICv2 to GICv3 so it is only natural that
> there are substantial changes in this area.)
> 
> I'll let Marc or Peter chime in if they disagree.

Well, compatibility with GICv2 is the biggest mistake we made when
designing the GICv3 architecture. And that's why our emulation doesn't
give a damn about v2 compatibility.

Designing the kernel API in terms of GICv2 is nothing short of
revolting, if you ask me. We've already shoe-horned GICv3 into GICv2 data
structures in KVM, and that's an absolute mess. I can't wait to simply
nuke the thing. Once v2 is out of the picture, everything is much simpler.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH 0/2] KVM: arm/arm64: Clean up some obsolete code

2015-10-08 Thread Andre Przywara
Hi,

On 08/10/15 11:56, Marc Zyngier wrote:
> On 08/10/15 11:14, Christoffer Dall wrote:
>> Hi Pavel,
>>
>> On Fri, Oct 02, 2015 at 05:44:27PM +0300, Pavel Fedin wrote:
>>> Current KVM code has lots of old redundancies, which can be cleaned up.
>>> This patchset is actually a better alternative to
>>> http://www.spinics.net/lists/arm-kernel/msg430726.html, which allows us to
>>> keep piggy-backed LRs. The idea is based on the fact that our code also
>>> maintains LR state in elrsr, and this information is enough to track LR
>>> usage.
>>>
>>> This patchset is made against linux-next of 02.10.2015. Thanks to Andre
>>> for pointing out some 4.3 specifics.
>>>
>> I'm not opposed to these changes; they clean up the data structures,
>> which is definitely a good thing.
>>
>> I am a bit worried about how/if this is going to conflict with the ITS
>> series and other patches in flight touching the vgic.
>>
>> Marc/Andre, any thoughts on this?
> 
> I don't mind the simplification (Andre was already removing the
> piggybacking stuff as part of his ITS series). I'm a bit more cautious
> about the sync_elrsr stuff, but that's mostly because I've only read the
> patch in a superficial way.
> 
> But yes, this is probably going to clash, unless we make this part of an
> existing series (/me looks at André... ;-)

Yes, I am looking at merging this. From the discussion with Pavel I
remember some things that I disagreed with, so I may propose a follow-up
patch. I will give this a try tomorrow.

Cheers,
Andre.


Re: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Christoffer Dall
On Thu, Oct 08, 2015 at 03:28:40PM +0300, Pavel Fedin wrote:
>  Hello!
> 
> > Well, compatibility with GICv2 is the biggest mistake we made when
> > designing the GICv3 architecture. And that's why our emulation doesn't
> > give a damn about v2 compatibility.
> 
>  Ok, I see your arguments, and after that it becomes a matter of personal
> taste. Three beat one, so I'll
> go your way. :) I don't know if I'll be able to roll out v5 tomorrow, but I
> think on Monday I'll
> definitely do it.
> 
There's no rush, I don't think this will make it into v4.4 anyhow, as
we're already on -rc4 and there's a bunch of other stuff in flight, and
I haven't configured a way to test these patches yet.

Speaking of which, does the QEMU side of this patch set require first
adding the GICv3 emulation for the data structures or is there a
stand-alone migration patch set somewhere?

-Christoffer


Re: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Marc Zyngier
On 08/10/15 13:45, Pavel Fedin wrote:
>  Hello!
> 
>> There's no rush, I don't think this will make it into v4.4 anyhow
> 
>  Did you mean 4.3 here?

No, that'd really be 4.4. 4.3 closed 4 weeks ago, and 4.4 is about
to open. This work is unlikely to make it before 4.5, TBH.

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Peter Maydell
On 8 October 2015 at 13:45, Pavel Fedin  wrote:

>> Speaking of which, does the QEMU side of this patch set require first
>> adding the GICv3 emulation for the data structures or is there a
>> stand-alone migration patch set somewhere?
>
>  I rolled it out a week ago:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg325331.html. I tried
> to get at least the data structure part reviewed/accepted, but Peter
> replied that he isn't interested before we have the kernel side.

More specifically, I wanted us to agree on the kernel API for
migration, because that significantly affects how the QEMU
code looks.

A quick look at your patch suggests you still have data
structures like

+struct gicv3_irq_state {
+    /* The enable bits are only banked for per-cpu interrupts.  */
+    unsigned long *enabled;
+    unsigned long *pending;
+    unsigned long *active;
+    unsigned long *level;
+    unsigned long *group;
+    bool edge_trigger; /* true: edge-triggered, false: level-triggered  */
+    uint32_t mask_size; /* Size of bitfields in long's, for migration */
+};

I think it's probably going to be better to have an array
of redistributor structures (one per CPU), and then keep
the state that in hardware is in each redistributor inside
those sub-structures. Similarly for state that in hardware
is inside the distributor, and inside each cpu interface.
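
A minimal sketch of that layout, with invented type and field names (not
taken from the actual QEMU patch), might look like:

#include <stdint.h>

/* Per-CPU redistributor state: SGI/PPI bits live here, one entry per CPU */
typedef struct GICv3RedistState {
    uint32_t enabled;      /* GICR_ISENABLER0 */
    uint32_t pending;      /* GICR_ISPENDR0 */
    uint32_t active;       /* GICR_ISACTIVER0 */
    uint32_t group;        /* GICR_IGROUPR0 */
    uint32_t edge_trigger; /* GICR_ICFGR1, one bit per PPI */
} GICv3RedistState;

/* Distributor state: one flat bitmap per property, covering only the SPIs */
typedef struct GICv3DistState {
    unsigned long *enabled;
    unsigned long *pending;
    unsigned long *active;
    unsigned long *level;
    unsigned long *group;
} GICv3DistState;

typedef struct GICv3State {
    GICv3DistState dist;
    GICv3RedistState *redist; /* array with one entry per CPU */
    uint32_t num_cpu;
} GICv3State;

That keeps migration state aligned with the hardware split, instead of
mixing banked and global bits in one structure.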

thanks
-- PMM


RE: [PATCH] KVM: arm/arm64: Add VGICv3 save/restore API documentation

2015-10-08 Thread Pavel Fedin
 Hello!

> There's no rush, I don't think this will make it into v4.4 anyhow

 Did you mean 4.3 here?

> Speaking of which, does the QEMU side of this patch set require first
> adding the GICv3 emulation for the data structures or is there a
> stand-alone migration patch set somewhere?

 I rolled it out a week ago:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg325331.html. I tried to
get at least the data structure part reviewed/accepted, but Peter replied
that he isn't interested before we have the kernel side.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia




Re: [PATCH 10/15] arm64: kvm: Fix {V}TCR_EL2_TG0 mask

2015-10-08 Thread Christoffer Dall
On Tue, Sep 15, 2015 at 04:41:19PM +0100, Suzuki K. Poulose wrote:
> From: "Suzuki K. Poulose" 
> 
> {V}TCR_EL2_TG0 is a 2-bit wide field, where:
> 
>  00 - 4K
>  01 - 64K
>  10 - 16K
> 
> But we use only 1 bit, which has worked well so far since
> we never cared about 16K. Fix it for 16K support.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Marc Zyngier 
> Cc: Christoffer Dall 
> Cc: kvm...@lists.cs.columbia.edu
> Acked-by: Mark Rutland 
> Signed-off-by: Suzuki K. Poulose 
> ---
>  arch/arm64/include/asm/kvm_arm.h |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h 
> b/arch/arm64/include/asm/kvm_arm.h
> index 7605e09..bdf139e 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -98,7 +98,7 @@
>  #define TCR_EL2_TBI  (1 << 20)
>  #define TCR_EL2_PS   (7 << 16)
>  #define TCR_EL2_PS_40B   (2 << 16)
> -#define TCR_EL2_TG0  (1 << 14)
> +#define TCR_EL2_TG0  (3 << 14)
>  #define TCR_EL2_SH0  (3 << 12)
>  #define TCR_EL2_ORGN0(3 << 10)
>  #define TCR_EL2_IRGN0(3 << 8)
> @@ -110,7 +110,7 @@
>  
>  /* VTCR_EL2 Registers bits */
>  #define VTCR_EL2_PS_MASK (7 << 16)
> -#define VTCR_EL2_TG0_MASK(1 << 14)
> +#define VTCR_EL2_TG0_MASK(3 << 14)
>  #define VTCR_EL2_TG0_4K  (0 << 14)
>  #define VTCR_EL2_TG0_64K (1 << 14)
>  #define VTCR_EL2_SH0_MASK(3 << 12)
> -- 
> 1.7.9.5
> 

Reviewed-by: Christoffer Dall 
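
With the mask widened to both bits, a follow-up adding 16K support would
presumably define the third encoding from the table above (10 - 16K) as
something like (hypothetical, not part of this patch):

#define VTCR_EL2_TG0_16K	(2 << 14)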


Re: [PATCH 03/15] arm64: Introduce helpers for page table levels

2015-10-08 Thread Christoffer Dall
On Wed, Oct 07, 2015 at 10:26:14AM +0100, Marc Zyngier wrote:
> On 07/10/15 09:26, Christoffer Dall wrote:
> > Hi Suzuki,
> > 
> > On Tue, Sep 15, 2015 at 04:41:12PM +0100, Suzuki K. Poulose wrote:
> >> From: "Suzuki K. Poulose" 
> >>
> >> Introduce helpers for finding the number of page table
> >> levels required for a given VA width, shift for a particular
> >> page table level.
> >>
> >> Convert the existing users to the new helpers. More users
> >> to follow.
> >>
> >> Cc: Ard Biesheuvel 
> >> Cc: Mark Rutland 
> >> Cc: Catalin Marinas 
> >> Cc: Will Deacon 
> >> Signed-off-by: Suzuki K. Poulose 
> >> Reviewed-by: Ard Biesheuvel 
> >> Tested-by: Ard Biesheuvel 
> >> ---
> >>  arch/arm64/include/asm/pgtable-hwdef.h |   15 ++++++++++++---
> >>  1 file changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h 
> >> b/arch/arm64/include/asm/pgtable-hwdef.h
> >> index 24154b0..ce18389 100644
> >> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> >> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> >> @@ -16,13 +16,21 @@
> >>  #ifndef __ASM_PGTABLE_HWDEF_H
> >>  #define __ASM_PGTABLE_HWDEF_H
> >>  
> >> +/*
> >> + * Number of page-table levels required to address 'va_bits' wide
> >> + * address, without section mapping
> >> + */
> >> +#define ARM64_HW_PGTABLE_LEVELS(va_bits) (((va_bits) - 4) / (PAGE_SHIFT - 3))
> > 
> > I don't understand the '(va_bits) - 4' here, can you explain it (and add a
> > comment to that effect) ?
> 
> I just had a chat with Catalin, who did shed some light on this.
> It all has to do with rounding up. What you would like to have here is:
> 
> > #define ARM64_HW_PGTABLE_LEVELS(va_bits) DIV_ROUND_UP(va_bits - PAGE_SHIFT, PAGE_SHIFT - 3)
> 
> where (va_bits - PAGE_SHIFT) is the total number of bits we deal
> with during a page table walk, and (PAGE_SHIFT - 3) is the number
> of bits we deal with per level.
> 
> The clue is in how DIV_ROUND_UP is written:
> 
> #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
> 
> which gives you Suzuki's magic formula.
> 
> I'd vote for the DIV_ROUND_UP(), which will make things a lot more readable.
> 
Thanks for the explanation, I vote for DIV_ROUND_UP too.
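
A quick userspace check of the equivalence, assuming 4K pages
(PAGE_SHIFT = 12):

#include <assert.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define PAGE_SHIFT		12
/* Suzuki's magic formula vs. the readable form */
#define LEVELS_MAGIC(va_bits)	(((va_bits) - 4) / (PAGE_SHIFT - 3))
#define LEVELS_CLEAR(va_bits)	DIV_ROUND_UP((va_bits) - PAGE_SHIFT, PAGE_SHIFT - 3)

int main(void)
{
	/* with 4K pages, 39-bit VAs need 3 levels and 48-bit VAs need 4 */
	assert(LEVELS_MAGIC(39) == 3 && LEVELS_CLEAR(39) == 3);
	assert(LEVELS_MAGIC(48) == 4 && LEVELS_CLEAR(48) == 4);
	return 0;
}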

You can stash this away for a cryptic interview question ;)

-Christoffer


Re: AIO requests may be disordered by Qemu-kvm iothread with disk cache=writethrough, Bug or Feature?

2015-10-08 Thread Fam Zheng
On Thu, 10/08 19:59, charlie.song wrote:
> Dear KVM Developers: 
> I am Xiang Song from UCloud. We have encountered a weird phenomenon with
> the Qemu-KVM iothread.
> We recently tried to use Linux AIO from the guest OS and found that the
> iothread mechanism of Qemu-KVM will reorder I/O requests from the guest OS
> even when the AIO write requests are issued from a single thread in order.
> This does not happen on the host OS, however.
> We are not sure whether this is a feature of the Qemu-KVM iothread
> mechanism or a bug.
>  
> The testbed is as follows (the guest disk device cache is configured to
> writethrough):
> CPU: Intel(R) Xeon(R) CPU E5-2650
> QEMU version: 1.5.3
> Host/Guest Kernel:  Both Linux 4.1.8 & Linux 2.6.32, OS type CentOS 6.5
> Simplified Guest OS qemu cmd:  
> /usr/libexec/qemu-kvm -machine rhel6.3.0,accel=kvm,usb=off -cpu kvm64 -smp 
> 8,sockets=8,cores=1,threads=1 
> -drive 
> file=/var/lib/libvirt/images/song-disk.img,if=none,id=drive-virtio-disk0,format=qcow2,serial=UCLOUD_DISK_VDA,cache=writethrough
>  
> -device 
> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:22:d5:52,bus=pci.0,addr=0x4

You mentioned iothread above but it's not in your command line?

> 
> The test code triggering this phenomenon works as follows: it uses the
> Linux AIO API to issue concurrent async write requests to a file. During
> execution it continuously writes data into the target test file. There are
> 'X' jobs in total, and each job is assigned a job id JOB_ID starting from 0.
> Each job writes 16 * 512 bytes of data into the target file at
> offset = JOB_ID * 512 (the data is the repeated uint64_t JOB_ID).
> There is only one thread, handling the 'X' jobs one by one through Linux
> AIO (io_submit). When handling jobs, it continuously issues AIO requests
> without waiting for AIO callbacks. When it finishes, the file should look
> like:
>  [0...0][1...1][2...2][3...3]...[X-1...X-1]
> Then we use a check program to test the resulting file; it reads the first
> 8 bytes (uint64_t) of each sector and prints them out. In normal cases its
> output is like:
>   0 1 2 3 ... X-1
> 
> Exec output (with X=32):
> In our guest OS, the output is abnormal: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
> 15 16 17 18 18 18 18 18 18 24 25 26 27 28 29 30 31.
> It can be seen that job20~job24 are overwritten by job19.
> In our host OS, the output is as expected, 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 
> 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31.

I'm not 100% sure, but I don't think the return of io_submit guarantees any
ordering; usually you need to wait for the callback to ensure that.

Fam

> 
> 
> I can provide the example code if needed.
> 
> Best regards, song
> 
> 2015-10-08
> 
> 
> charlie.song 
>   
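
A minimal sketch of the reported test, reconstructed from the description
above (file name and constants are illustrative, error handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdint.h>
#include <stdlib.h>

#define NJOBS  32
#define SECTOR 512

int main(void)
{
	io_context_t ctx = 0;
	struct iocb cbs[NJOBS], *cbp;
	int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	io_setup(NJOBS, &ctx);
	for (int i = 0; i < NJOBS; i++) {
		uint64_t *buf;

		/* O_DIRECT requires sector-aligned buffers */
		posix_memalign((void **)&buf, SECTOR, 16 * SECTOR);
		for (int j = 0; j < 16 * SECTOR / 8; j++)
			buf[j] = i;		/* repeated uint64_t JOB_ID */
		/* 16 sectors at offset i * SECTOR, so consecutive jobs overlap */
		io_prep_pwrite(&cbs[i], fd, buf, 16 * SECTOR, (long long)i * SECTOR);
		cbp = &cbs[i];
		io_submit(ctx, 1, &cbp);	/* no wait for completion */
	}
	/* No io_getevents() here: whether these overlapping writes complete
	 * in submission order is exactly the question under discussion. */
	return 0;
}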


Re: [PATCH 03/15] arm64: Introduce helpers for page table levels

2015-10-08 Thread Catalin Marinas
On Thu, Oct 08, 2015 at 06:22:34PM +0100, Suzuki K. Poulose wrote:
> On 08/10/15 15:45, Christoffer Dall wrote:
> >On Wed, Oct 07, 2015 at 10:26:14AM +0100, Marc Zyngier wrote:
> >>I just had a chat with Catalin, who did shed some light on this.
> >>It all has to do with rounding up. What you would like to have here is:
> >>
> >>#define ARM64_HW_PGTABLE_LEVELS(va_bits) DIV_ROUND_UP(va_bits - PAGE_SHIFT, PAGE_SHIFT - 3)
> >>
> >>where (va_bits - PAGE_SHIFT) is the total number of bits we deal
> >>with during a page table walk, and (PAGE_SHIFT - 3) is the number
> >>of bits we deal with per level.
> >>
> >>The clue is in how DIV_ROUND_UP is written:
> >>
> >>#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
> >>
> >>which gives you Suzuki's magic formula.
> >>
> >>I'd vote for the DIV_ROUND_UP(), which will make things a lot more readable.
> >>
> >Thanks for the explanation, I vote for DIV_ROUND_UP too.
> 
> Btw, DIV_ROUND_UP is defined in linux/kernel.h, and including that in the
> required headers breaks the build. I could add a local definition instead.

Or just keep the original magic formula and add the DIV_ROUND_UP one in
a comment.

-- 
Catalin
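
A sketch of that suggestion, assuming the follow-up keeps the open-coded
form:

/*
 * Number of page-table levels required to address 'va_bits' wide
 * address, without section mapping. Equivalent to
 * DIV_ROUND_UP(va_bits - PAGE_SHIFT, PAGE_SHIFT - 3), open-coded
 * here because linux/kernel.h cannot be included from this header.
 */
#define ARM64_HW_PGTABLE_LEVELS(va_bits) (((va_bits) - 4) / (PAGE_SHIFT - 3))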


Re: [PATCH 03/15] arm64: Introduce helpers for page table levels

2015-10-08 Thread Suzuki K. Poulose

On 08/10/15 15:45, Christoffer Dall wrote:
> On Wed, Oct 07, 2015 at 10:26:14AM +0100, Marc Zyngier wrote:
>> On 07/10/15 09:26, Christoffer Dall wrote:
>>> Hi Suzuki,
>>
>> I just had a chat with Catalin, who did shed some light on this.
>> It all has to do with rounding up. What you would like to have here is:
>>
>> #define ARM64_HW_PGTABLE_LEVELS(va_bits) DIV_ROUND_UP(va_bits - PAGE_SHIFT, PAGE_SHIFT - 3)
>>
>> where (va_bits - PAGE_SHIFT) is the total number of bits we deal
>> with during a page table walk, and (PAGE_SHIFT - 3) is the number
>> of bits we deal with per level.
>>
>> The clue is in how DIV_ROUND_UP is written:
>>
>> #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
>>
>> which gives you Suzuki's magic formula.
>>
>> I'd vote for the DIV_ROUND_UP(), which will make things a lot more readable.
>
> Thanks for the explanation, I vote for DIV_ROUND_UP too.

Btw, DIV_ROUND_UP is defined in linux/kernel.h, and including that in the
required headers breaks the build. I could add a local definition instead.

> You can stash this away for a cryptic interview question ;)

;)

Suzuki



Re: Re: AIO requests may be disordered by Qemu-kvm iothread with disk cache=writethrough, Bug or Feature?

2015-10-08 Thread Fam Zheng
On Fri, 10/09 11:25, charlie.song wrote:
> At 2015-10-08 23:37:02, "Fam Zheng"  wrote:
> >On Thu, 10/08 19:59, charlie.song wrote:
> >> Dear KVM Developers: 
> >> I am Xiang Song from UCloud. We have encountered a weird phenomenon
> >> with the Qemu-KVM iothread.
> >> We recently tried to use Linux AIO from the guest OS and found that the
> >> iothread mechanism of Qemu-KVM will reorder I/O requests from the guest
> >> OS even when the AIO write requests are issued from a single thread in
> >> order. This does not happen on the host OS, however.
> >> We are not sure whether this is a feature of the Qemu-KVM iothread
> >> mechanism or a bug.
> >>  
> >> The testbed is as follows (the guest disk device cache is configured to
> >> writethrough):
> >> CPU: Intel(R) Xeon(R) CPU E5-2650
> >> QEMU version: 1.5.3
> >> Host/Guest Kernel:  Both Linux 4.1.8 & Linux 2.6.32, OS type CentOS 6.5
> >> Simplified Guest OS qemu cmd:  
> >> /usr/libexec/qemu-kvm -machine rhel6.3.0,accel=kvm,usb=off -cpu kvm64 -smp 
> >> 8,sockets=8,cores=1,threads=1 
> >> -drive 
> >> file=/var/lib/libvirt/images/song-disk.img,if=none,id=drive-virtio-disk0,format=qcow2,serial=UCLOUD_DISK_VDA,cache=writethrough
> >>  
> >> -device 
> >> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:22:d5:52,bus=pci.0,addr=0x4
> >
> >You mentioned iothread above but it's not in your command line?
> I mean the thread pool mechanism used by qemu-kvm to accelerate I/O
> processing. It is used by paio_submit (block/raw-posix.c) by default,
> with pool->max_threads = 64 as far as I know (qemu-kvm version 1.5.3).

The thread pool parallelism may reorder non-overlapping requests, but it
shouldn't cause any reordering of overlapping requests like those in your
I/O pattern. QEMU ensures that.

Do you see this with aio=native?

Fam

> 
> >
> >> 
> >> The test code triggering this phenomenon works as follows: it uses the
> >> Linux AIO API to issue concurrent async write requests to a file.
> >> During execution it continuously writes data into the target test file.
> >> There are 'X' jobs in total, and each job is assigned a job id JOB_ID
> >> starting from 0. Each job writes 16 * 512 bytes of data into the target
> >> file at offset = JOB_ID * 512 (the data is the repeated uint64_t JOB_ID).
> >> There is only one thread, handling the 'X' jobs one by one through
> >> Linux AIO (io_submit). When handling jobs, it continuously issues AIO
> >> requests without waiting for AIO callbacks. When it finishes, the file
> >> should look like:
> >>  [0...0][1...1][2...2][3...3]...[X-1...X-1]
> >> Then we use a check program to test the resulting file; it reads the
> >> first 8 bytes (uint64_t) of each sector and prints them out. In normal
> >> cases its output is like:
> >>   0 1 2 3 ... X-1
> >> 
> >> Exec output (with X=32):
> >> In our guest OS, the output is abnormal: 0 1 2 3 4 5 6 7 8 9 10 11 12
> >> 13 14 15 16 17 18 18 18 18 18 18 24 25 26 27 28 29 30 31.
> >> It can be seen that job20~job24 are overwritten by job19.
> >> In our host OS, the output is as expected, 0 1 2 3 4 5 6 7 8 9 10 11 12 13 
> >> 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31.
> >
> >I'm not 100% sure, but I don't think the return of io_submit guarantees
> >any ordering; usually you need to wait for the callback to ensure that.
> Is there any proof or article about the ordering of io_submit requests?
> >
> >Fam
> >
> >> 
> >> 
> >> I can provide the example code if needed.
> >> 
> >> Best regards, song
> >> 
> >> 2015-10-08
> >> 
> >> 
> >> charlie.song 
> >>   


[PATCH] kvm: fix waitqueue_active without memory barrier in virt/kvm/async_pf.c

2015-10-08 Thread Kosuke Tatsukawa
async_pf_execute() seems to be missing a memory barrier which might
cause the waker to not notice the waiter and miss sending a wake_up as
in the following figure.

async_pf_execute                            kvm_vcpu_block

spin_lock(&vcpu->async_pf.lock);
if (waitqueue_active(&vcpu->wq))
/* The CPU might reorder the test for
   the waitqueue up here, before
   prior writes complete */
                                            prepare_to_wait(&vcpu->wq, &wait,
                                                TASK_INTERRUPTIBLE);
                                            /* if (kvm_vcpu_check_block(vcpu) < 0) */
                                            /*   if (kvm_arch_vcpu_runnable(vcpu)) { */
                                            ...
                                            return (vcpu->arch.mp_state ==
                                                KVM_MP_STATE_RUNNABLE &&
                                                !vcpu->arch.apf.halted)
                                                || !list_empty_careful(&vcpu->async_pf.done)
                                            ...
                                            return 0;
list_add_tail(&apf->link,
              &vcpu->async_pf.done);
spin_unlock(&vcpu->async_pf.lock);
                                            waited = true;
                                            schedule();


The attached patch adds the missing memory barrier.

I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c  (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).

Signed-off-by: Kosuke Tatsukawa 
---
 virt/kvm/async_pf.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 44660ae..303fc7f 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -94,6 +94,11 @@ static void async_pf_execute(struct work_struct *work)
 
 	trace_kvm_async_pf_completed(addr, gva);
 
+	/*
+	 * Memory barrier is required here to make sure change to
+	 * vcpu->async_pf.done is visible from other CPUs
+	 */
+	smp_mb();
 	if (waitqueue_active(&vcpu->wq))
 		wake_up_interruptible(&vcpu->wq);
 
-- 
1.7.1
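
The pattern being fixed is generic; a userspace analogue with C11 atomics
(illustrative only, not the kernel code) shows why the full barrier matters:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool done = false;    /* stands in for async_pf.done non-empty */
static atomic_bool waiting = false; /* stands in for waitqueue_active(&vcpu->wq) */

/* Waker side: without the fence, the load of 'waiting' may be reordered
 * before the store to 'done' -- the reordering smp_mb() forbids above. */
static bool waker_should_wake(void)
{
	atomic_store_explicit(&done, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the smp_mb() analogue */
	return atomic_load_explicit(&waiting, memory_order_relaxed);
}

/* Waiter side: register first, then re-check the condition before sleeping. */
static bool waiter_may_sleep(void)
{
	atomic_store_explicit(&waiting, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* barrier on the wait side */
	return !atomic_load_explicit(&done, memory_order_relaxed);
}

int main(void)
{
	/* Single-threaded smoke test: the waiter registers, the waker then
	 * completes work and must see it waiting. */
	waiter_may_sleep();
	return waker_should_wake() ? 0 : 1;
}

With both fences in place, at least one side must observe the other: either
the waker sees the waiter, or the waiter sees the completed work before it
sleeps, so the wakeup cannot be lost.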


Re: Re: AIO requests may be disordered by Qemu-kvm iothread with disk cache=writethrough, Bug or Feature?

2015-10-08 Thread charlie.song
At 2015-10-08 23:37:02, "Fam Zheng"  wrote:
>On Thu, 10/08 19:59, charlie.song wrote:
>> Dear KVM Developers: 
>> I am Xiang Song from UCloud. We have encountered a weird phenomenon with
>> the Qemu-KVM iothread.
>> We recently tried to use Linux AIO from the guest OS and found that the
>> iothread mechanism of Qemu-KVM will reorder I/O requests from the guest OS
>> even when the AIO write requests are issued from a single thread in order.
>> This does not happen on the host OS, however.
>> We are not sure whether this is a feature of the Qemu-KVM iothread
>> mechanism or a bug.
>>  
>> The testbed is as follows (the guest disk device cache is configured to
>> writethrough):
>> CPU: Intel(R) Xeon(R) CPU E5-2650
>> QEMU version: 1.5.3
>> Host/Guest Kernel:  Both Linux 4.1.8 & Linux 2.6.32, OS type CentOS 6.5
>> Simplified Guest OS qemu cmd:  
>> /usr/libexec/qemu-kvm -machine rhel6.3.0,accel=kvm,usb=off -cpu kvm64 -smp 
>> 8,sockets=8,cores=1,threads=1 
>> -drive 
>> file=/var/lib/libvirt/images/song-disk.img,if=none,id=drive-virtio-disk0,format=qcow2,serial=UCLOUD_DISK_VDA,cache=writethrough
>>  
>> -device 
>> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:22:d5:52,bus=pci.0,addr=0x4
>
>You mentioned iothread above but it's not in your command line?
I mean the thread pool mechanism used by qemu-kvm to accelerate I/O
processing. It is used by paio_submit (block/raw-posix.c) by default, with
pool->max_threads = 64 as far as I know (qemu-kvm version 1.5.3).

>
>> 
>> The test code triggering this phenomenon works as follows: it uses the
>> Linux AIO API to issue concurrent async write requests to a file. During
>> execution it continuously writes data into the target test file. There are
>> 'X' jobs in total, and each job is assigned a job id JOB_ID starting from
>> 0. Each job writes 16 * 512 bytes of data into the target file at
>> offset = JOB_ID * 512 (the data is the repeated uint64_t JOB_ID).
>> There is only one thread, handling the 'X' jobs one by one through Linux
>> AIO (io_submit). When handling jobs, it continuously issues AIO requests
>> without waiting for AIO callbacks. When it finishes, the file should look
>> like:
>>  [0...0][1...1][2...2][3...3]...[X-1...X-1]
>> Then we use a check program to test the resulting file; it reads the first
>> 8 bytes (uint64_t) of each sector and prints them out. In normal cases its
>> output is like:
>>   0 1 2 3 ... X-1
>> 
>> Exec output (with X=32):
>> In our guest OS, the output is abnormal: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
>> 15 16 17 18 18 18 18 18 18 24 25 26 27 28 29 30 31.
>> It can be seen that job20~job24 are overwritten by job19.
>> In our host OS, the output is as expected, 0 1 2 3 4 5 6 7 8 9 10 11 12 13 
>> 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31.
>
>I'm not 100% sure, but I don't think the return of io_submit guarantees any
>ordering; usually you need to wait for the callback to ensure that.
Is there any proof or article about the ordering of io_submit requests?
>
>Fam
>
>> 
>> 
>> I can provide the example code if needed.
>> 
>> Best regards, song
>> 
>> 2015-10-08
>> 
>> 
>> charlie.song 
>>   


Re: Re: Re: AIO requests may be disordered by Qemu-kvm iothread with disk cache=writethrough, Bug or Feature?

2015-10-08 Thread charlie.song
At 2015-10-09 12:33:20, "charlie.song"  wrote:
>At 2015-10-09 12:16:03, "Fam Zheng"  wrote:
>>On Fri, 10/09 11:25, charlie.song wrote:
>>> At 2015-10-08 23:37:02, "Fam Zheng"  wrote:
>>> >On Thu, 10/08 19:59, charlie.song wrote:
>>> >> Dear KVM Developers: 
>>> >> I am Xiang Song from UCloud. We have encountered a weird phenomenon
>>> >> with the Qemu-KVM iothread.
>>> >> We recently tried to use Linux AIO from the guest OS and found that
>>> >> the iothread mechanism of Qemu-KVM will reorder I/O requests from the
>>> >> guest OS even when the AIO write requests are issued from a single
>>> >> thread in order. This does not happen on the host OS, however.
>>> >> We are not sure whether this is a feature of the Qemu-KVM iothread
>>> >> mechanism or a bug.
>>> >>  
>>> >> The testbed is as follows (the guest disk device cache is configured
>>> >> to writethrough):
>>> >> CPU: Intel(R) Xeon(R) CPU E5-2650
>>> >> QEMU version: 1.5.3
>>> >> Host/Guest Kernel:  Both Linux 4.1.8 & Linux 2.6.32, OS type CentOS 6.5
>>> >> Simplified Guest OS qemu cmd:  
>>> >> /usr/libexec/qemu-kvm -machine rhel6.3.0,accel=kvm,usb=off -cpu kvm64 
>>> >> -smp 8,sockets=8,cores=1,threads=1 
>>> >> -drive 
>>> >> file=/var/lib/libvirt/images/song-disk.img,if=none,id=drive-virtio-disk0,format=qcow2,serial=UCLOUD_DISK_VDA,cache=writethrough
>>> >>  
>>> >> -device 
>>> >> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:22:d5:52,bus=pci.0,addr=0x4
>>> >
>>> >You mentioned iothread above but it's not in your command line?
>>> I mean the thread pool mechanism used by qemu-kvm to accelerate I/O
>>> processing. It is used by paio_submit (block/raw-posix.c) by default,
>>> with pool->max_threads = 64 as far as I know (qemu-kvm version 1.5.3).
>>
>>The thread pool parallelism may reorder non-overlapping requests, but it
>>shouldn't cause any reordering of overlapping requests like those in your
>>io pattern. QEMU ensures that.
>>
>>Do you see this with aio=native?
>We see this with both aio=threads and aio=native.
>Are you sure "QEMU ensures the ordering of overlapping requests"? 
>I can provide the example code if needed.
If I configure the guest disk IO with "cache=directsync,aio=native", this 
phenomenon disappears.
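
(For reference, that corresponds to changing the -drive options from the
original command line to something like:

-drive file=/var/lib/libvirt/images/song-disk.img,if=none,id=drive-virtio-disk0,format=qcow2,serial=UCLOUD_DISK_VDA,cache=directsync,aio=native

with the rest of the invocation unchanged.)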

>
>>
>>Fam
>>
>>> 
>>> >
>>> >> 
>>> >> The test code triggering this phenomenon works as follows: it uses
>>> >> the Linux AIO API to issue concurrent async write requests to a file.
>>> >> During execution it continuously writes data into the target test
>>> >> file. There are 'X' jobs in total, and each job is assigned a job id
>>> >> JOB_ID starting from 0. Each job writes 16 * 512 bytes of data into
>>> >> the target file at offset = JOB_ID * 512 (the data is the repeated
>>> >> uint64_t JOB_ID).
>>> >> There is only one thread, handling the 'X' jobs one by one through
>>> >> Linux AIO (io_submit). When handling jobs, it continuously issues AIO
>>> >> requests without waiting for AIO callbacks. When it finishes, the
>>> >> file should look like:
>>> >>  [0...0][1...1][2...2][3...3]...[X-1...X-1]
>>> >> Then we use a check program to test the resulting file; it reads the
>>> >> first 8 bytes (uint64_t) of each sector and prints them out. In
>>> >> normal cases its output is like:
>>> >>   0 1 2 3 ... X-1
>>> >> 
>>> >> Exec output (with X=32):
>>> >> In our guest OS, the output is abnormal: 0 1 2 3 4 5 6 7 8 9 10 11 12
>>> >> 13 14 15 16 17 18 18 18 18 18 18 24 25 26 27 28 29 30 31.
>>> >> It can be seen that job20~job24 are overwritten by job19.
>>> >> In our host OS, the output is as expected, 0 1 2 3 4 5 6 7 8 9 10 11 12 
>>> >> 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31.
>>> >
>>> >I'm not 100% sure, but I don't think the return of io_submit guarantees
>>> >any ordering; usually you need to wait for the callback to ensure that.
>>> Is there any proof or article about the ordering of io_submit requests?
>>> >
>>> >Fam
>>> >
>>> >> 
>>> >> 
>>> >> I can provide the example code if needed.
>>> >> 
>>> >> Best regards, song
>>> >> 
>>> >> 2015-10-08
>>> >> 
>>> >> 
>>> >> charlie.song 
>>> >>   


Re: Re: Re: AIO requests may be disordered by Qemu-kvm iothread with disk cache=writethrough, Bug or Feature?

2015-10-08 Thread charlie.song
At 2015-10-09 12:16:03, "Fam Zheng"  wrote:
>On Fri, 10/09 11:25, charlie.song wrote:
>> At 2015-10-08 23:37:02, "Fam Zheng"  wrote:
>> >On Thu, 10/08 19:59, charlie.song wrote:
>> >> Dear KVM Developers: 
>> >> I am Xiang Song from UCloud. We have encountered a weird phenomenon
>> >> with the Qemu-KVM iothread.
>> >> We recently tried to use Linux AIO from the guest OS and found that
>> >> the iothread mechanism of Qemu-KVM will reorder I/O requests from the
>> >> guest OS even when the AIO write requests are issued from a single
>> >> thread in order. This does not happen on the host OS, however.
>> >> We are not sure whether this is a feature of the Qemu-KVM iothread
>> >> mechanism or a bug.
>> >>  
>> >> The testbed is as follows (the guest disk device cache is configured to
>> >> writethrough):
>> >> CPU: Intel(R) Xeon(R) CPU E5-2650
>> >> QEMU version: 1.5.3
>> >> Host/Guest Kernel:  Both Linux 4.1.8 & Linux 2.6.32, OS type CentOS 6.5
>> >> Simplified Guest OS qemu cmd:  
>> >> /usr/libexec/qemu-kvm -machine rhel6.3.0,accel=kvm,usb=off -cpu kvm64 
>> >> -smp 8,sockets=8,cores=1,threads=1 
>> >> -drive 
>> >> file=/var/lib/libvirt/images/song-disk.img,if=none,id=drive-virtio-disk0,format=qcow2,serial=UCLOUD_DISK_VDA,cache=writethrough
>> >>  
>> >> -device 
>> >> virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:22:d5:52,bus=pci.0,addr=0x4
>> >
>> >You mentioned iothread above but it's not in your command line?
>> I mean the thread pool mechanism used by qemu-kvm to accelerate I/O
>> processing. It is used by paio_submit (block/raw-posix.c) by default,
>> with pool->max_threads = 64 as far as I know (qemu-kvm version 1.5.3).
>
>The thread pool parallelism may reorder non-overlapping requests, but it
>shouldn't cause any reordering of overlapping requests like those in your io
>pattern. QEMU ensures that.
>
>Do you see this with aio=native?
We see this with both aio=threads and aio=native.
Are you sure "QEMU ensures the ordering of overlapping requests"? 
I can provide the example code if needed.

>
>Fam
>
>> 
>> >
>> >> 
>> >> The test code triggering this phenomenon works as follows: it uses the
>> >> Linux AIO API to issue concurrent async write requests to a file.
>> >> During execution it continuously writes data into the target test file.
>> >> There are 'X' jobs in total, and each job is assigned a job id JOB_ID
>> >> starting from 0. Each job writes 16 * 512 bytes of data into the target
>> >> file at offset = JOB_ID * 512 (the data is the repeated uint64_t JOB_ID).
>> >> There is only one thread, handling the 'X' jobs one by one through
>> >> Linux AIO (io_submit). When handling jobs, it continuously issues AIO
>> >> requests without waiting for AIO callbacks. When it finishes, the file
>> >> should look like:
>> >>  [0...0][1...1][2...2][3...3]...[X-1...X-1]
>> >> Then we use a check program to test the resulting file; it reads the
>> >> first 8 bytes (uint64_t) of each sector and prints them out. In normal
>> >> cases its output is like:
>> >>   0 1 2 3 ... X-1
>> >> 
>> >> Exec output (with X=32):
>> >> In our guest OS, the output is abnormal: 0 1 2 3 4 5 6 7 8 9 10 11 12
>> >> 13 14 15 16 17 18 18 18 18 18 18 24 25 26 27 28 29 30 31.
>> >> It can be seen that job20~job24 are overwritten by job19.
>> >> In our host OS, the output is as expected, 0 1 2 3 4 5 6 7 8 9 10 11 12 
>> >> 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31.
>> >
>> >I'm not 100% sure, but I don't think the return of io_submit guarantees
>> >any ordering; usually you need to wait for the callback to ensure that.
>> Is there any proof or article about the ordering of io_submit requests?
>> >
>> >Fam
>> >
>> >> 
>> >> 
>> >> I can provide the example code if needed.
>> >> 
>> >> Best regards, song
>> >> 
>> >> 2015-10-08
>> >> 
>> >> 
>> >> charlie.song 
>> >>   


[PATCH v2 2/2] KVM: x86: fix edge EOI and IOAPIC reconfig race

2015-10-08 Thread Radim Krčmář
KVM uses eoi_exit_bitmap to track vectors that need an action on EOI.
The problem is that IOAPIC can be reconfigured while an interrupt with
old configuration is pending and eoi_exit_bitmap only remembers the
newest configuration;  thus EOI from the pending interrupt is not
recognized.

(Reconfiguration is not a problem for level interrupts, because IOAPIC
 sends interrupt with the new configuration.)

For an edge interrupt with ACK notifiers, like the i8254 timer, things can
happen in this order:
 1) IOAPIC injects a vector from i8254
 2) guest reconfigures that vector's VCPU and therefore eoi_exit_bitmap
    on the original VCPU gets cleared
 3) guest's handler for the vector does EOI
 4) KVM's EOI handler doesn't pass that vector to IOAPIC because it is
    not in that VCPU's eoi_exit_bitmap
 5) i8254 stops working

A simple solution is to set the IOAPIC vector in eoi_exit_bitmap if the
vector is in PIR/IRR/ISR.

This creates an unwanted situation if the vector is reused by a
non-IOAPIC source, but I think it is so rare that we don't want to make
the solution more sophisticated.  The simple solution also doesn't work
if we are reconfiguring the vector.  (Shouldn't happen in the wild and
I'd rather fix users of ACK notifiers instead of working around that.)

There are no races because ioapic injection and reconfig are locked.

Fixes: b053b2aef25d ("KVM: x86: Add EOI exit bitmap inference")
[Before b053b2aef25d, this bug happened only with APICv.]
Fixes: c7c9c56ca26f ("x86, apicv: add virtual interrupt delivery support")
Cc: 
Signed-off-by: Radim Krčmář 
---
 v2: moved sync_pir_to_irr out of kvm_ioapic_scan_entry [Paolo]

 arch/x86/kvm/ioapic.c | 4 +++-
 arch/x86/kvm/x86.c    | 4 +++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index 2dcda0f188ba..88d0a92d3f94 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -246,7 +246,9 @@ void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 		    kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index) ||
 		    index == RTC_GSI) {
 			if (kvm_apic_match_dest(vcpu, NULL, 0,
-				e->fields.dest_id, e->fields.dest_mode))
+				e->fields.dest_id, e->fields.dest_mode) ||
+			    (e->fields.trig_mode == IOAPIC_EDGE_TRIG &&
+			     kvm_apic_pending_eoi(vcpu, e->fields.vector)))
 				__set_bit(e->fields.vector,
 					(unsigned long *)eoi_exit_bitmap);
 		}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d2c9bb0d6d6..7ed88020d414 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6201,8 +6201,10 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 
 	if (irqchip_split(vcpu->kvm))
 		kvm_scan_ioapic_routes(vcpu, vcpu->arch.eoi_exit_bitmap);
-	else
+	else {
+		kvm_x86_ops->sync_pir_to_irr(vcpu);
 		kvm_ioapic_scan_entry(vcpu, vcpu->arch.eoi_exit_bitmap);
+	}
 	kvm_x86_ops->load_eoi_exitmap(vcpu);
 }
 
-- 
2.5.3



Re: [PATCH] KVM: x86: don't notify userspace IOAPIC on edge EOI

2015-10-08 Thread Radim Krčmář
2015-10-08 20:30+0200, Radim Krčmář:
> On real hardware, edge-triggered interrupts don't set a bit in TMR,
> which means that IOAPIC isn't notified on EOI.  Do the same here.
> 
> Staying in guest/kernel mode after edge EOI is what we want for most
> devices.  If some bugs could be nicely worked around with edge EOI
> notifications, we should invest in a better interface.
> 
> Signed-off-by: Radim Krčmář 
> ---
>  Completely untested.
>  
> diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
> @@ -389,13 +389,15 @@ void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu, u64 
> *eoi_exit_bitmap)
>   for (i = 0; i < nr_ioapic_pins; ++i) {
>   hlist_for_each_entry(entry, >map[i], link) {
>   u32 dest_id, dest_mode;
> + bool level;
>  
>   if (entry->type != KVM_IRQ_ROUTING_MSI)
>   continue;
>   dest_id = (entry->msi.address_lo >> 12) & 0xff;
>   dest_mode = (entry->msi.address_lo >> 2) & 0x1;
> - if (kvm_apic_match_dest(vcpu, NULL, 0, dest_id,
> - dest_mode)) {
> + level = entry->msi.data & MSI_DATA_TRIGGER_LEVEL;
> + if (level && kvm_apic_match_dest(vcpu, NULL, 0,

Describing that result is overkill -- I'll send v2 if the idea is
accepted.


[PATCH v2 0/2] KVM: x86: fix edge EOI and IOAPIC reconfig race

2015-10-08 Thread Radim Krčmář
v2:
 * rewritten [1/2] and
 * refactored [2/2], all thanks to Paolo's comments

This problem is not fixed for the split userspace part, as I think it
would be better to solve that by excluding edge interrupts from
eoi_exit_bitmap (see the next patch on the kvm list for discussion).


Radim Krčmář (2):
  kvm: x86: set KVM_REQ_EVENT when updating IRR
  KVM: x86: fix edge EOI and IOAPIC reconfig race

 arch/x86/kvm/ioapic.c | 4 +++-
 arch/x86/kvm/lapic.c  | 2 ++
 arch/x86/kvm/x86.c    | 4 +++-
 3 files changed, 8 insertions(+), 2 deletions(-)

-- 
2.5.3



[PATCH v2 1/2] kvm: x86: set KVM_REQ_EVENT when updating IRR

2015-10-08 Thread Radim Krčmář
After moving PIR to IRR, the interrupt needs to be delivered manually.

Reported-by: Paolo Bonzini 
Cc: 
Signed-off-by: Radim Krčmář 
---
 v2: completely rewritten

 arch/x86/kvm/lapic.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 944b38a56929..168b8759bd73 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -348,6 +348,8 @@ void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
 	struct kvm_lapic *apic = vcpu->arch.apic;
 
 	__kvm_apic_update_irr(pir, apic->regs);
+
+	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
 
-- 
2.5.3



[PATCH] KVM: x86: don't notify userspace IOAPIC on edge EOI

2015-10-08 Thread Radim Krčmář
On real hardware, edge-triggered interrupts don't set a bit in TMR,
which means that IOAPIC isn't notified on EOI.  Do the same here.

Staying in guest/kernel mode after edge EOI is what we want for most
devices.  If some bugs could be nicely worked around with edge EOI
notifications, we should invest in a better interface.

Signed-off-by: Radim Krčmář 
---
 Completely untested.
 
 arch/x86/kvm/irq_comm.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c
index c89228980230..8f4499c7ffc1 100644
--- a/arch/x86/kvm/irq_comm.c
+++ b/arch/x86/kvm/irq_comm.c
@@ -389,13 +389,15 @@ void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 	for (i = 0; i < nr_ioapic_pins; ++i) {
 		hlist_for_each_entry(entry, &table->map[i], link) {
 			u32 dest_id, dest_mode;
+			bool level;
 
 			if (entry->type != KVM_IRQ_ROUTING_MSI)
 				continue;
 			dest_id = (entry->msi.address_lo >> 12) & 0xff;
 			dest_mode = (entry->msi.address_lo >> 2) & 0x1;
-			if (kvm_apic_match_dest(vcpu, NULL, 0, dest_id,
-						dest_mode)) {
+			level = entry->msi.data & MSI_DATA_TRIGGER_LEVEL;
+			if (level && kvm_apic_match_dest(vcpu, NULL, 0,
+					dest_id, dest_mode)) {
 				u32 vector = entry->msi.data & 0xff;
 
 				__set_bit(vector,
-- 
2.5.3
