from:"Jan Kiszka"

Re: [PATCH 0/3] Infinite loops in microcode while running guests

2015-11-12 Thread Jan Kiszka

On 2015-11-11 14:12, Austin S Hemmelgarn wrote:
> On 2015-11-11 08:07, Paolo Bonzini wrote:
>>
>>
>> On 11/11/2015 13:47, Austin S Hemmelgarn wrote:

>>> I just finished running a couple of tests in a KVM instance running
>>> nested on a Xen HVM instance, and found no issues, so for the set as a
>>> whole:
>>>
>>> Tested-by: Austin S. Hemmelgarn 
>>>
>>> Now to hope the equivalent fix for Xen gets into the Gentoo repositories
>>> soon, as the issue propagates down through nested virtualization and
>>> ties up the CPU regardless (and in turn triggers the watchdog).
>>
>> Note that nested guests should _not_ lock up the outer (L0) hypervisor
>> if the outer hypervisor has the fix.  At least this is the case for KVM:
>> a fixed outer KVM can protect any vulnerable nested (L1) hypervisor from
>> malicious nested guests.  A vulnerable outer KVM is also protected if
>> the nested hypervisor has the workaround.
>>
> I already knew this, I just hadn't remembered that I hadn't updated Xen
> since before the XSA and patch for this had been posted (and it took me
> a while to remember this when I accidentally panicked Xen :))
> 

As I'm too lazy both to search and to write something myself: is there
already a test case for the issue(s) circulating?

Thanks,
Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/3] Infinite loops in microcode while running guests

2015-11-10 Thread Jan Kiszka

On 2015-11-10 13:22, Paolo Bonzini wrote:
> Yes, these can happen.  The issue is that benign exceptions are
> delivered serially, but two of them (#DB and #AC) can also happen
> during exception delivery itself.  The subsequent infinite stream
> of exceptions causes the processor to never exit guest mode.
> 
> Paolo
> 
> Eric Northup (1):
>   KVM: x86: work around infinite loop in microcode when #AC is delivered
> 
> Paolo Bonzini (2):
>   KVM: svm: unconditionally intercept #DB
>   KVM: x86: rename update_db_bp_intercept to update_bp_intercept
> 
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/include/uapi/asm/svm.h |  1 +
>  arch/x86/kvm/svm.c  | 22 +++---
>  arch/x86/kvm/vmx.c  |  7 +--
>  arch/x86/kvm/x86.c  |  2 +-
>  5 files changed, 19 insertions(+), 15 deletions(-)
> 

So this affects both Intel and AMD CPUs equally? Nice cross-vendor
"compatibility".

And it can only be triggered via #AC and #DB, or also other exceptions
(that KVM already happens to intercept)? You may guess why I'm asking...

Are any of the issues already documented in vendor errata?

Thanks,
Jan


Re: [PATCH v4 0/6] virtio core DMA API conversion

2015-11-10 Thread Jan Kiszka

On 2015-11-10 03:18, Andy Lutomirski wrote:
> On Mon, Nov 9, 2015 at 6:04 PM, Benjamin Herrenschmidt
>> I thus go back to my original statement, it's a LOT easier to handle if
>> the device itself is self describing, indicating whether it is set to
>> bypass a host iommu or not. For L1->L2, well, that wouldn't be the
>> first time qemu/VFIO plays tricks with the passed through device
>> configuration space...
> 
> Which leaves the special case of Xen, where even preexisting devices
> don't bypass the IOMMU.  Can we keep this specific to powerpc and
> sparc?  On x86, this problem is basically nonexistent, since the IOMMU
> is properly self-describing.
> 
> IOW, I think that on x86 we should assume that all virtio devices
> honor the IOMMU.

From the guest driver POV, that is OK because either there is no IOMMU
to program (the current situation with qemu), there is one that doesn't
need to be programmed (the current situation with qemu and iommu=on), or
there is (Xen) or will be (future qemu) one that requires it.

> 
>>
>> Note that the above can be solved via some kind of compromise: The
>> device self describes the ability to honor the iommu, along with the
>> property (or ACPI table entry) that indicates whether or not it does.
>>
>> IE. We could use the revision or ProgIf field of the config space for
>> example. Or something in virtio config. If it's an "old" device, we
>> know it always bypass. If it's a new device, we know it only bypasses
>> if the corresponding property is in. I still would have to sort out the
>> openbios case for mac among others but it's at least a workable
>> direction.
>>
>> BTW. Don't you have a similar problem on x86 that today qemu claims
>> that everything honors the iommu in ACPI ?
> 
> Only on a single experimental configuration, and that can apparently
> just be fixed going forward without any real problems being caused.

BTW, I once tried to describe the current situation on QEMU x86 with
IOMMU enabled via ACPI. While you can easily add IOMMU device exceptions
to the static tables, the fun starts when considering device hotplug for
virtio. Unless I missed some trick, ACPI doesn't seem to be
designed for that level of flexibility.

You would have to reserve a complete PCI bus, declare that one as not
being IOMMU-governed, and then only add new virtio devices to that bus.
Possible, but it comes with a lot of restrictions that existing
management software would have to be aware of as well.

Jan


Re: [PATCH] KVM: vmx: fix VPID is 0000H in non-root operation

2015-09-16 Thread Jan Kiszka

On 2015-09-16 13:31, Wanpeng Li wrote:
> Reference SDM 28.1:
> 
> The current VPID is 0000H in the following situations:
> — Outside VMX operation. (This includes operation in system-management 
>   mode under the default treatment of SMIs and SMM with VMX operation; 
>   see Section 34.14.)
> — In VMX root operation.
> — In VMX non-root operation when the “enable VPID” VM-execution control 
>   is 0.
> 
> The VPID should never be 0000H in non-root operation when the "enable VPID"
> VM-execution control is 1. However, commit (34a1cd60: 'kvm: x86: vmx:
> move some vmx setting from vmx_init() to hardware_setup()') removed the
> code which reserves 0000H for VMX root operation.
> 
> This patch fixes it by reintroducing the reservation of 0000H for VMX root operation.
> 
> Reported-by: Wincy Van 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9ff6a3f..a63b9ca 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6056,6 +6056,8 @@ static __init int hardware_setup(void)
>   memcpy(vmx_msr_bitmap_longmode_x2apic,
>   vmx_msr_bitmap_longmode, PAGE_SIZE);
>  
> + set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
> +
>   if (enable_apicv) {
>   for (msr = 0x800; msr <= 0x8ff; msr++)
>   vmx_disable_intercept_msr_read_x2apic(msr);
> 

Good point.

BTW, what will happen if allocate_vpid runs out of free slots and
returns 0? Will we always fail then...?
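
To make the role of the reserved bit concrete, here is a minimal
user-space sketch, not the kernel code. It assumes a first-fit search
over a plain bool array instead of the kernel bitmap, leaves out the
locking, and uses made-up names (NR_VPIDS, alloc_vpid) with a pool of
only 8 entries so that exhaustion is easy to reach:

```c
#include <stdbool.h>

#define NR_VPIDS 8  /* hypothetical; tiny pool so exhaustion is easy to reach */

/* Return the first free VPID and mark it used, or 0 if none is left. */
static int alloc_vpid(bool bitmap[NR_VPIDS])
{
	for (int vpid = 0; vpid < NR_VPIDS; vpid++) {
		if (!bitmap[vpid]) {
			bitmap[vpid] = true;
			return vpid;
		}
	}
	return 0;
}
```

With an all-clear array the first caller is handed VPID 0. With entry 0
pre-set, which is what the set_bit(0, vmx_vpid_bitmap) line in the patch
does for the real bitmap, the first caller gets 1 and a return value of
0 only shows up once the pool is exhausted.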

Jan


Re: [PATCH v4 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-16 Thread Jan Kiszka

On 2015-09-16 09:19, Wanpeng Li wrote:
> Enhance allocate/free_vpid to handle shadow vpid.
> 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 23 +++
>  1 file changed, 11 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9ff6a3f..c5222b8 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4155,29 +4155,28 @@ static int alloc_identity_pagetable(struct kvm *kvm)
>   return r;
>  }
>  
> -static void allocate_vpid(struct vcpu_vmx *vmx)
> +static int allocate_vpid(void)
>  {
>   int vpid;
>  
> - vmx->vpid = 0;
>   if (!enable_vpid)
> - return;
> + return 0;
>   spin_lock(&vmx_vpid_lock);
>   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
> - if (vpid < VMX_NR_VPIDS) {
> - vmx->vpid = vpid;
> + if (vpid < VMX_NR_VPIDS)
>   __set_bit(vpid, vmx_vpid_bitmap);
> - }
> + else
> + vpid = 0;
>   spin_unlock(&vmx_vpid_lock);
> + return vpid;
>  }
>  
> -static void free_vpid(struct vcpu_vmx *vmx)
> +static void free_vpid(int vpid)
>  {
>   if (!enable_vpid)

|| vpid == 0

Otherwise you clear bit zero and cause the next allocate_vpid to return 0,
this time taken from the bitmap.

Jan

>   return;
>   spin_lock(&vmx_vpid_lock);
> - if (vmx->vpid != 0)
> - __clear_bit(vmx->vpid, vmx_vpid_bitmap);
> + __clear_bit(vpid, vmx_vpid_bitmap);
>   spin_unlock(&vmx_vpid_lock);
>  }
>  
> @@ -8482,7 +8481,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
>  
>   if (enable_pml)
>   vmx_disable_pml(vmx);
> - free_vpid(vmx);
> + free_vpid(vmx->vpid);
>   leave_guest_mode(vcpu);
>   vmx_load_vmcs01(vcpu);
>   free_nested(vmx);
> @@ -8501,7 +8500,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
>   if (!vmx)
>   return ERR_PTR(-ENOMEM);
>  
> - allocate_vpid(vmx);
> + vmx->vpid = allocate_vpid();
>  
>   err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
>   if (err)
> @@ -8577,7 +8576,7 @@ free_msrs:
>  uninit_vcpu:
>   kvm_vcpu_uninit(&vmx->vcpu);
>  free_vcpu:
> - free_vpid(vmx);
> + free_vpid(vmx->vpid);
>   kmem_cache_free(kvm_vcpu_cache, vmx);
>   return ERR_PTR(err);
>  }
> 
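
The effect behind the "|| vpid == 0" remark on free_vpid can be
reproduced in user space. This is only a sketch under assumptions: a
bool array stands in for vmx_vpid_bitmap, locking and the enable_vpid
check are dropped, the pool has just 4 entries, and the two free
variants carry made-up names so they can sit side by side:

```c
#include <stdbool.h>

#define NR_VPIDS 4                             /* hypothetical, tiny pool */

static bool vpid_bitmap[NR_VPIDS] = { true };  /* entry 0 reserved for the host */

static int allocate_vpid(void)
{
	for (int vpid = 0; vpid < NR_VPIDS; vpid++) {
		if (!vpid_bitmap[vpid]) {
			vpid_bitmap[vpid] = true;
			return vpid;
		}
	}
	return 0;                              /* pool exhausted */
}

/* As posted in v4: clears the entry unconditionally. */
static void free_vpid_unguarded(int vpid)
{
	vpid_bitmap[vpid] = false;
}

/* With the suggested check: 0 is never handed back to the pool. */
static void free_vpid_guarded(int vpid)
{
	if (vpid == 0)
		return;
	vpid_bitmap[vpid] = false;
}
```

Once the pool is exhausted, allocate_vpid() returns 0. Passing that 0
to free_vpid_unguarded() clears the reserved entry, so the following
allocate_vpid() finds entry 0 free and hands it out as if it were a
regular VPID. free_vpid_guarded() leaves the reservation intact.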


Re: [PATCH v3 2/2] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Jan Kiszka

pu_has_vpid(vmcs12)) {
> + vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->nested.vpid02);
> + if (vmcs12->virtual_processor_id != vmx->nested.last_vpid) {
> + vmx->nested.last_vpid = vmcs12->virtual_processor_id;
> + vmx_flush_tlb(vcpu);
> + }
> + } else {
> + vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
> + vmx_flush_tlb(vcpu);
> + }
> +
>   }
>  
>   if (nested_cpu_has_ept(vmcs12)) {
> 

Looks good to me.

Reviewed-by: Jan Kiszka 

Jan


Re: [PATCH v3 1/2] KVM: nVMX: enhance allocate/free_vpid to handle shadow vpid

2015-09-15 Thread Jan Kiszka

On 2015-09-16 05:51, Wanpeng Li wrote:
> Enhance allocate/free_vpid to handle shadow vpid.
> 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 24 +++-
>  1 file changed, 11 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 9ff6a3f..4956081 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4155,29 +4155,27 @@ static int alloc_identity_pagetable(struct kvm *kvm)
>   return r;
>  }
>  
> -static void allocate_vpid(struct vcpu_vmx *vmx)
> +static int allocate_vpid(void)
>  {
> - int vpid;
> + int vpid = 0;

Initialization is now pointless with the current code.

>  
> - vmx->vpid = 0;
>   if (!enable_vpid)
> - return;
> + return 0;
>   spin_lock(&vmx_vpid_lock);
>   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
> - if (vpid < VMX_NR_VPIDS) {
> - vmx->vpid = vpid;
> + if (vpid < VMX_NR_VPIDS)
>   __set_bit(vpid, vmx_vpid_bitmap);
> - }
>   spin_unlock(&vmx_vpid_lock);
> + return vpid;

You should return 0 also if vpid == VMX_NR_VPIDS.

>  }
>  
> -static void free_vpid(struct vcpu_vmx *vmx)
> +static void free_vpid(int vpid)
>  {
>   if (!enable_vpid)

You could already test for vpid == 0 here...

>   return;
>   spin_lock(&vmx_vpid_lock);
> - if (vmx->vpid != 0)
> - __clear_bit(vmx->vpid, vmx_vpid_bitmap);
> + if (vpid != 0)

...then you could skip this.

> + __clear_bit(vpid, vmx_vpid_bitmap);
>   spin_unlock(&vmx_vpid_lock);
>  }
>  
> @@ -8482,7 +8480,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
>  
>   if (enable_pml)
>   vmx_disable_pml(vmx);
> - free_vpid(vmx);
> + free_vpid(vmx->vpid);
>   leave_guest_mode(vcpu);
>   vmx_load_vmcs01(vcpu);
>   free_nested(vmx);
> @@ -8501,7 +8499,7 @@ static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
>   if (!vmx)
>   return ERR_PTR(-ENOMEM);
>  
> - allocate_vpid(vmx);
> + vmx->vpid = allocate_vpid();
>  
>   err = kvm_vcpu_init(&vmx->vcpu, kvm, id);
>   if (err)
> @@ -8577,7 +8575,7 @@ free_msrs:
>  uninit_vcpu:
>   kvm_vcpu_uninit(&vmx->vcpu);
>  free_vcpu:
> - free_vpid(vmx);
> + free_vpid(vmx->vpid);
>   kmem_cache_free(kvm_vcpu_cache, vmx);
>   return ERR_PTR(err);
>  }
> 

Yes, this is what I had in mind.

Jan


Re: [Qemu-devel] [PATCH 1/2] target-i386: disable LINT0 after reset

2015-09-15 Thread Jan Kiszka

On 2015-09-15 23:19, Alex Williamson wrote:
> On Mon, 2015-04-13 at 02:32 +0300, Nadav Amit wrote:
>> Due to an old SeaBIOS bug, QEMU re-enables LINT0 after reset. This bug is
>> long gone and therefore this hack is no longer needed.  Since it violates
>> the specifications, it is removed.
>>
>> Signed-off-by: Nadav Amit 
>> ---
>>  hw/intc/apic_common.c | 9 -
>>  1 file changed, 9 deletions(-)
> 
> Please see bug: https://bugs.launchpad.net/qemu/+bug/1488363
> 
> Is this bug perhaps not as long gone as we thought, or is there
> something else going on here?  Thanks,

I would say, someone needs to check if the SeaBIOS line that is supposed
to enable LINT0 is actually executed on one of the broken systems and,
if not, why not.

Jan


Re: [PATCH] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Jan Kiszka

On 2015-09-16 04:36, Wanpeng Li wrote:
> On 9/16/15 1:32 AM, Jan Kiszka wrote:
>> On 2015-09-15 12:14, Wanpeng Li wrote:
>>> On 9/14/15 10:54 PM, Jan Kiszka wrote:
>>>> Last but not least: the guest can now easily exhaust the host's pool of
>>>> vpid by simply spawning plenty of VCPUs for L2, no? Is this acceptable
>>>> or should there be some limit?
>>> In v2 I reuse the value of vpid02 when vpid12 changes, w/ one invvpid,
>>> so the scenario which you pointed out can be avoided.
>> I cannot yet follow why there is no chance for L1 to consume all vpids
>> that the host manages in that single, global bitmap by simply spawning a
>> lot of nested VCPUs for some L2. What is enforcing L1 to call nested
>> vmclear - apparently the only way, besides destructing nested VCPUs, to
>> release such vpids again?
> 
> In v2, there is no direct mapping between vpid02 and vpid12. The vpid02
> is per-vCPU for L0 and is reused when the value of vpid12 changes, w/
> one invvpid during nested vmentry. The vpid12 is allocated by L1 for L2,
> so it will not influence the global bitmap (used for vpid01 and vpid02
> allocation) even if L1 spawns a lot of nested vCPUs.

Ah, I see, you limit allocation to one additional host-side vpid per
VCPU, for nesting. That looks better. That also means all vpids for L2
will be folded on that single vpid in hardware, right? So the major
benefit comes from having separate vpids when switching between L1 and
L2, in fact.
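
To spell out that folding, here is a minimal user-space sketch. It is
modelled on the vpid02/last_vpid handling that appears in the v3
prepare_vmcs02 hunk, but the struct, the function name and the flush
counter are made up for illustration, and a real flush is replaced by
incrementing that counter:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU nested state discussed above. */
struct nested_vpid_state {
	uint16_t vpid01;      /* host-side VPID used while running L1 */
	uint16_t vpid02;      /* the one extra host-side VPID used for any L2 */
	uint16_t last_vpid;   /* vpid12 value seen on the previous nested entry */
	unsigned int flushes; /* counts simulated TLB flushes */
};

/*
 * Pick the hardware VPID for a nested entry. vpid12 is whatever L1 wrote
 * into its vmcs12; l1_uses_vpid says whether L1 enabled the VPID control.
 */
static uint16_t nested_entry_vpid(struct nested_vpid_state *s,
				  bool l1_uses_vpid, uint16_t vpid12)
{
	if (!l1_uses_vpid) {
		/* L2 shares L1's tag, so flush on every entry. */
		s->flushes++;
		return s->vpid01;
	}
	if (vpid12 != s->last_vpid) {
		/* A different L2 address space now maps onto vpid02. */
		s->last_vpid = vpid12;
		s->flushes++;
	}
	return s->vpid02;
}
```

Every vpid12 value ends up on the same hardware tag (vpid02), so a
flush is still needed whenever L1 switches between L2 address spaces.
What is saved is the flush on each L1/L2 transition as long as vpid12
stays the same.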

Jan


Re: [PATCH] KVM: nVMX: nested VPID emulation

2015-09-15 Thread Jan Kiszka

On 2015-09-15 12:14, Wanpeng Li wrote:
> On 9/14/15 10:54 PM, Jan Kiszka wrote:
>> Last but not least: the guest can now easily exhaust the host's pool of
>> vpid by simply spawning plenty of VCPUs for L2, no? Is this acceptable
>> or should there be some limit?
> 
> In v2 I reuse the value of vpid02 when vpid12 changes, w/ one invvpid,
> so the scenario which you pointed out can be avoided.

I cannot yet follow why there is no chance for L1 to consume all vpids
that the host manages in that single, global bitmap by simply spawning a
lot of nested VCPUs for some L2. What is enforcing L1 to call nested
vmclear - apparently the only way, besides destructing nested VCPUs, to
release such vpids again?

Jan


Re: [PATCH] KVM: nVMX: nested VPID emulation

2015-09-14 Thread Jan Kiszka

On 2015-09-14 14:52, Wanpeng Li wrote:
> VPID is used to tag address space and avoid a TLB flush. Currently L0 uses
> the same VPID to run L1 and all its guests. KVM flushes VPID when switching
> between L1 and L2.
> 
> This patch advertises VPID to the L1 hypervisor, so that the address spaces of
> L1 and L2 can be treated separately and the TLB flush when switching between L1
> and L2 is avoided. This patch gets ~3x performance improvement for lmbench 8p/64k ctxsw.
> 
> Signed-off-by: Wanpeng Li 
> ---
>  arch/x86/kvm/vmx.c | 39 ---
>  1 file changed, 32 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index da1590e..06bc31e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1157,6 +1157,11 @@ static inline bool nested_cpu_has_virt_x2apic_mode(struct vmcs12 *vmcs12)
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE);
>  }
>  
> +static inline bool nested_cpu_has_vpid(struct vmcs12 *vmcs12)
> +{
> + return nested_cpu_has2(vmcs12, SECONDARY_EXEC_ENABLE_VPID);
> +}
> +
>  static inline bool nested_cpu_has_apic_reg_virt(struct vmcs12 *vmcs12)
>  {
>   return nested_cpu_has2(vmcs12, SECONDARY_EXEC_APIC_REGISTER_VIRT);
> @@ -2471,6 +2476,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
>   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
>   SECONDARY_EXEC_RDTSCP |
>   SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
> + SECONDARY_EXEC_ENABLE_VPID |
>   SECONDARY_EXEC_APIC_REGISTER_VIRT |
>   SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
>   SECONDARY_EXEC_WBINVD_EXITING |
> @@ -4160,7 +4166,7 @@ static void allocate_vpid(struct vcpu_vmx *vmx)
>   int vpid;
>  
>   vmx->vpid = 0;
> - if (!enable_vpid)
> + if (!enable_vpid || is_guest_mode(&vmx->vcpu))
>   return;
>   spin_lock(&vmx_vpid_lock);
>   vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
> @@ -6738,6 +6744,14 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
>   }
>   vmcs12 = kmap(page);
>   vmcs12->launch_state = 0;
> + if (enable_vpid) {
> + if (nested_cpu_has_vpid(vmcs12)) {
> + spin_lock(&vmx_vpid_lock);
> + if (vmcs12->virtual_processor_id != 0)
> + __clear_bit(vmcs12->virtual_processor_id, vmx_vpid_bitmap);
> + spin_unlock(&vmx_vpid_lock);

Maybe enhance free_vpid (and also allocate_vpid) to work generically and
let the caller decide where to take the vpid from or where to store it?

> + }
> + }
>   kunmap(page);
>   nested_release_page(page);
>  
> @@ -9189,6 +9203,7 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>  {
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>   u32 exec_control;
> + int vpid;
>  
>   vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
>   vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
> @@ -9438,13 +9453,21 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
>   else
>   vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>  
> +
>   if (enable_vpid) {
> - /*
> -  * Trivially support vpid by letting L2s share their parent
> -  * L1's vpid. TODO: move to a more elaborate solution, giving
> -  * each L2 its own vpid and exposing the vpid feature to L1.
> -  */
> - vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
> + if (nested_cpu_has_vpid(vmcs12)) {
> + if (vmcs12->virtual_processor_id == 0) {
> + spin_lock(&vmx_vpid_lock);
> + vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
> + if (vpid < VMX_NR_VPIDS)
> + __set_bit(vpid, vmx_vpid_bitmap);
> + spin_unlock(&vmx_vpid_lock);
> + vmcs_write16(VIRTUAL_PROCESSOR_ID, vpid);

It's a bit non-obvious that vpid == VMX_NR_VPIDS (no free vpids) will
lead to vpid == 0 when writing VIRTUAL_PROCESSOR_ID. You should leave at
least a comment. Or generalize allocate_vpid as that one is already
clearer in this regard.

> + } else
> + vmcs_write16(VIRTUAL_PROCESSOR_ID, vmcs12->virtual_processor_id);
> + } else
> + vmcs_write16(VIRTUAL_PROCESSOR_ID, vmx->vpid);
> +
>   vmx_flush_tlb(vcpu);
>   }
>  
> @@ -9973,6 +9996,8 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>   vmcs12_save_pending_event(vcpu, vmcs12);
>   }
>  
> + if (nested_cpu_has_vpid(vmcs12))
> + vmcs12->virtual_processor_id = vmcs_read16(VIRTUAL_PROCESSOR_ID);
>   /*
>
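
On the non-obvious vpid == VMX_NR_VPIDS case commented on above: the
mechanism is plain integer truncation. A minimal sketch, assuming that
VMX_NR_VPIDS is (1 << 16) and that the VMCS field write takes a 16-bit
value; vpid_field_value is a made-up helper, not a kernel function:

```c
#include <stdint.h>

#define VMX_NR_VPIDS (1 << 16)  /* assumed value: one past the largest 16-bit VPID */

/* Value that ends up in a 16-bit VMCS field when 'vpid' is written to it. */
static uint16_t vpid_field_value(int vpid)
{
	return (uint16_t)vpid;  /* keeps only the low 16 bits */
}
```

Under those assumptions, the "no free vpid" result of the bitmap search
is silently turned into 0 by the conversion, while every valid VPID
passes through unchanged. An explicit check, as in allocate_vpid, makes
that intent visible.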

Re: [PATCH 00/13] arm64: Virtualization Host Extension support

2015-08-26 Thread Jan Kiszka

On 2015-08-26 11:28, Antonios Motakis wrote:
> 
> 
> On 26-Aug-15 11:21, Jan Kiszka wrote:
>> On 2015-08-26 11:12, Antonios Motakis wrote:
>>> Hello Marc,
>>>
>>> On 08-Jul-15 18:19, Marc Zyngier wrote:
>>>> ARMv8.1 comes with the "Virtualization Host Extension" (VHE for
>>>> short), which enables simpler support of Type-2 hypervisors.
>>>>
>>>> This extension allows the kernel to directly run at EL2, and
>>>> significantly reduces the number of system registers shared between
>>>> host and guest, reducing the overhead of virtualization.
>>>>
>>>> In order to have the same kernel binary running on all versions of the
>>>> architecture, this series makes heavy use of runtime code patching.
>>>>
>>>> The first ten patches massage the KVM code to deal with VHE and enable
>>>> Linux to run at EL2.
>>>
>>> I am currently working on getting the Jailhouse hypervisor to work on 
>>> AArch64.
>>>
>>> I've been looking at your patches, trying to figure out the implications 
>>> for Jailhouse. It seems there are a few :)
>>>
>>> Jailhouse likes to be loaded by Linux into memory, and then to inject 
>>> itself at a higher level than Linux (demoting Linux into being the "root 
>>> cell"). This works on x86 and ARM (AArch32 and eventually AArch64 without 
>>> VHE). What this means in ARM, is that Jailhouse hooks into the HVC stub 
>>> exposed by Linux, and happily installs itself in EL2.
>>>
>>> With Linux running in EL2 though, that won't be as straightforward. It 
>>> looks like we can't just demote Linux to EL1 without breaking something. 
>>> Obviously it's OK for us that KVM won't work, but it looks like at least 
>>> the timer code will break horribly if we try to do something like that.
>>>
>>> Any comments on this? One work around would be to just remap the incoming 
>>> interrupt from the timer, so Linux never really realizes it's not running 
>>> in EL2 anymore. Then we would also have to deal with the intricacies of 
>>> removing and re-adding vCPUs to the Linux root cell, so we would have to 
>>> maintain the illusion of running in EL2 for each one of them.
>>
>> Without knowing any of the details, I would say there are two strategies
>> regarding this:
>>
>> - Disable KVM support in the Linux kernel - then we shouldn't boot into
>>   EL2 in the first place, should we?
> 
> We would have to ask the user to patch the kernel, to ignore VHE and keep all 
> the hyp stub magic that we rely on currently. It is an option of course.

Patch or reconfigure? CONFIG_KVM isn't mandatory for arm64, is it?

Jan

> 
>>
>> - Emulate what Linux is missing after take-over by Jailhouse (we do
>>   this on x86 with VT-d interrupt remapping which cannot be disabled
>>   anymore for Linux once it started with it, and we cannot boot without
>>   it when we want to use the x2APIC).
> 
> Essentially what I described above; let's call it nested virtualization 
> without the virtualization parts? :)
> 
>>
>> Jan
>>
> 


Re: [PATCH 00/13] arm64: Virtualization Host Extension support

2015-08-26 Thread Jan Kiszka

On 2015-08-26 11:12, Antonios Motakis wrote:
> Hello Marc,
> 
> On 08-Jul-15 18:19, Marc Zyngier wrote:
>> ARMv8.1 comes with the "Virtualization Host Extension" (VHE for
>> short), which enables simpler support of Type-2 hypervisors.
>>
>> This extension allows the kernel to directly run at EL2, and
>> significantly reduces the number of system registers shared between
>> host and guest, reducing the overhead of virtualization.
>>
>> In order to have the same kernel binary running on all versions of the
>> architecture, this series makes heavy use of runtime code patching.
>>
>> The first ten patches massage the KVM code to deal with VHE and enable
>> Linux to run at EL2.
> 
> I am currently working on getting the Jailhouse hypervisor to work on AArch64.
> 
> I've been looking at your patches, trying to figure out the implications for 
> Jailhouse. It seems there are a few :)
> 
> Jailhouse likes to be loaded by Linux into memory, and then to inject itself 
> at a higher level than Linux (demoting Linux into being the "root cell"). 
> This works on x86 and ARM (AArch32 and eventually AArch64 without VHE). What 
> this means in ARM, is that Jailhouse hooks into the HVC stub exposed by 
> Linux, and happily installs itself in EL2.
> 
> With Linux running in EL2 though, that won't be as straightforward. It looks 
> like we can't just demote Linux to EL1 without breaking something. Obviously 
> it's OK for us that KVM won't work, but it looks like at least the timer code 
> will break horribly if we try to do something like that.
> 
> Any comments on this? One work around would be to just remap the incoming 
> interrupt from the timer, so Linux never really realizes it's not running in 
> EL2 anymore. Then we would also have to deal with the intricacies of removing 
> and re-adding vCPUs to the Linux root cell, so we would have to maintain the 
> illusion of running in EL2 for each one of them.

Without knowing any of the details, I would say there are two strategies
regarding this:

- Disable KVM support in the Linux kernel - then we shouldn't boot into
  EL2 in the first place, should we?

- Emulate what Linux is missing after take-over by Jailhouse (we do
  this on x86 with VT-d interrupt remapping which cannot be disabled
  anymore for Linux once it started with it, and we cannot boot without
  it when we want to use the x2APIC).

Jan


Re: [PATCH v7 1/4] KVM: x86: Split the APIC from the rest of IRQCHIP.

2015-07-31 Thread Jan Kiszka

On 2015-07-30 23:19, Steve Rutherford wrote:
> On Thu, Jul 30, 2015 at 11:38:20AM +0200, Paolo Bonzini wrote:
>>
>>
>> On 30/07/2015 10:37, Steve Rutherford wrote:
>>> This looks a bit non-sensical, but is overprepared for the introduction
>>> of IOAPIC hotplug, which is two patches down the line. Changing it is fine,
>>> you'll just need to merge the very same change back.
>>
>> By "IOAPIC hotplug" you mean changing the number of reserved routes?  Is
>> it necessary?  You could just reserve a bunch of routes depending on the
>> maximum number of IOAPICs.
> Hmm. Yeah, I think that might be cleaner. Thinking about it, I'm a bit nervous
> about the idea of the number of reserved routes shrinking. We would have needed
> to trigger an IOAPIC scan if the number of reserved routes changed.
> 
> Jan might have an opinion here.

A static preallocation is likely fine, given reasonable room. I have no
idea about a good limit, though. To be safe, we could pull in someone
from Intel, maybe the guy who worked on the IOAPIC refactorings in the
kernel to enable hotplugging.

Jan


Re: [PATCH v5 3/4] KVM: x86: Add EOI exit bitmap inference

2015-07-29 Thread Jan Kiszka

On 2015-07-29 22:27, Steve Rutherford wrote:
> On Wed, Jul 29, 2015 at 02:38:09PM +0200, Paolo Bonzini wrote:
>>
>>
>> On 28/07/2015 01:17, Steve Rutherford wrote:
>>> diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
>>> index d8cc54b..f6ce112 100644
>>> --- a/arch/x86/kvm/ioapic.h
>>> +++ b/arch/x86/kvm/ioapic.h
>>> @@ -9,6 +9,7 @@ struct kvm;
>>>  struct kvm_vcpu;
>>>  
>>>  #define IOAPIC_NUM_PINS  KVM_IOAPIC_NUM_PINS
>>> +#define MAX_NR_RESERVED_IOAPIC_PINS 48
>>
>> Why is this needed?
> This constant is used to bound the number of IOAPIC pins that are
> reservable when enabling KVM_CAP_SPLIT_IRQCHIP. IIRC, x86 doesn't
> support more than 2 IOAPICs.  

Huh? Surely not. I've already seen boxes with at least three, and I
think you can even hot-plug them today via extension cards. Not saying
that QEMU supports that already, even without KVM, but we must not limit
ourselves in the kernel API.

So please remove such a static limit on how many IOAPICs userspace can
emulate or raise it to something sufficiently large that will last long
enough.
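
For reference, the arithmetic behind the proposed constant, as a small
sketch. It assumes 24 pins per emulated IOAPIC (the value I believe
KVM_IOAPIC_NUM_PINS has on x86); reserved_ioapic_pins is a made-up
helper, not something from the patch:

```c
#define IOAPIC_NUM_PINS 24  /* assumed pins per emulated IOAPIC */

/* Number of pins to reserve for a given number of user-space IOAPICs. */
static unsigned int reserved_ioapic_pins(unsigned int nr_ioapics)
{
	return nr_ioapics * IOAPIC_NUM_PINS;
}
```

Under that assumption, 48 covers exactly two IOAPICs, and a third one
already needs 72 pins, which is beyond the proposed maximum.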

Jan


Re: [PATCH] kvm/x86: add support for MONITOR_TRAP_FLAG

2015-07-09 Thread Jan Kiszka

On 2015-07-05 19:08, Mihai Donțu wrote:
> Allow a nested hypervisor to single step its guests.
> 
> Signed-off-by: Mihai Donțu 
> 
> ---
> 
> This patch applies on top of current linux-next.
> ---
>  arch/x86/include/asm/vmx.h  |  1 +
>  arch/x86/include/uapi/asm/vmx.h |  2 ++
>  arch/x86/kvm/vmx.c  | 10 +-
>  3 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index da772ed..9299ae5 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -47,6 +47,7 @@
>  #define CPU_BASED_MOV_DR_EXITING                0x00800000
>  #define CPU_BASED_UNCOND_IO_EXITING             0x01000000
>  #define CPU_BASED_USE_IO_BITMAPS                0x02000000
> +#define CPU_BASED_MONITOR_TRAP_FLAG             0x08000000
>  #define CPU_BASED_USE_MSR_BITMAPS               0x10000000
>  #define CPU_BASED_MONITOR_EXITING               0x20000000
>  #define CPU_BASED_PAUSE_EXITING                 0x40000000
> diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
> index 1fe9218..37fee27 100644
> --- a/arch/x86/include/uapi/asm/vmx.h
> +++ b/arch/x86/include/uapi/asm/vmx.h
> @@ -58,6 +58,7 @@
>  #define EXIT_REASON_INVALID_STATE   33
>  #define EXIT_REASON_MSR_LOAD_FAIL   34
>  #define EXIT_REASON_MWAIT_INSTRUCTION   36
> +#define EXIT_REASON_MONITOR_TRAP_FLAG   37
>  #define EXIT_REASON_MONITOR_INSTRUCTION 39
>  #define EXIT_REASON_PAUSE_INSTRUCTION   40
>  #define EXIT_REASON_MCE_DURING_VMENTRY  41
> @@ -106,6 +107,7 @@
>   { EXIT_REASON_MSR_READ,  "MSR_READ" }, \
>   { EXIT_REASON_MSR_WRITE, "MSR_WRITE" }, \
>   { EXIT_REASON_MWAIT_INSTRUCTION, "MWAIT_INSTRUCTION" }, \
> + { EXIT_REASON_MONITOR_TRAP_FLAG, "MONITOR_TRAP_FLAG" }, \
>   { EXIT_REASON_MONITOR_INSTRUCTION,   "MONITOR_INSTRUCTION" }, \
>   { EXIT_REASON_PAUSE_INSTRUCTION, "PAUSE_INSTRUCTION" }, \
>   { EXIT_REASON_MCE_DURING_VMENTRY,"MCE_DURING_VMENTRY" }, \
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index e856dd5..6d7c650 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2443,7 +2443,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
>   CPU_BASED_CR8_LOAD_EXITING | CPU_BASED_CR8_STORE_EXITING |
>  #endif
>   CPU_BASED_MOV_DR_EXITING | CPU_BASED_UNCOND_IO_EXITING |
> - CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_EXITING |
> + CPU_BASED_USE_IO_BITMAPS | CPU_BASED_MONITOR_TRAP_FLAG | CPU_BASED_MONITOR_EXITING |

Overlong line.

>   CPU_BASED_RDPMC_EXITING | CPU_BASED_RDTSC_EXITING |
>   CPU_BASED_PAUSE_EXITING | CPU_BASED_TPR_SHADOW |
>   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
> @@ -6246,6 +6246,11 @@ static int handle_mwait(struct kvm_vcpu *vcpu)
>   return handle_nop(vcpu);
>  }
>  
> +static int handle_monitor_trap(struct kvm_vcpu *vcpu)
> +{
> + return 1;
> +}
> +
>  static int handle_monitor(struct kvm_vcpu *vcpu)
>  {
>   printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n");
> @@ -7282,6 +7287,7 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>   [EXIT_REASON_EPT_MISCONFIG]   = handle_ept_misconfig,
>   [EXIT_REASON_PAUSE_INSTRUCTION]   = handle_pause,
>   [EXIT_REASON_MWAIT_INSTRUCTION]   = handle_mwait,
> + [EXIT_REASON_MONITOR_TRAP_FLAG]   = handle_monitor_trap,
>   [EXIT_REASON_MONITOR_INSTRUCTION] = handle_monitor,
>   [EXIT_REASON_INVEPT]  = handle_invept,
>   [EXIT_REASON_INVVPID] = handle_invvpid,
> @@ -7542,6 +7548,8 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
>   return true;
>   case EXIT_REASON_MWAIT_INSTRUCTION:
>   return nested_cpu_has(vmcs12, CPU_BASED_MWAIT_EXITING);
> + case EXIT_REASON_MONITOR_TRAP_FLAG:
> + return nested_cpu_has(vmcs12, CPU_BASED_MONITOR_TRAP_FLAG);
>   case EXIT_REASON_MONITOR_INSTRUCTION:
>   return nested_cpu_has(vmcs12, CPU_BASED_MONITOR_EXITING);
>   case EXIT_REASON_PAUSE_INSTRUCTION:
> 

Looks OK otherwise. If you fix up the style thing, you may add my

Reviewed-by: Jan Kiszka 

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [RFC PATCH] KVM: arm/arm64: Don't let userspace update CNTVOFF once guest is running

2015-07-09 Thread Jan Kiszka

On 2015-07-09 12:22, Christoffer Dall wrote:
> Hi Peter and Marc,
> 
> [cc'ing Paolo for his input on x86 timekeeping]
> 
> On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote:
>> On 8 July 2015 at 17:37, Marc Zyngier  wrote:
>>> On 08/07/15 17:06, Peter Maydell wrote:
>>>> I'd prefer it if somebody could investigate to see why QEMU
>>>> is actually doing this -- so far we just have speculation.
>>>
>>> I'd prefer that too, but so far people seem to be more comfortable
>>> waiting for the issue to fix itself. In the meantime, VMs are broken in
>>> weird and wonderful ways, and I don't think the current status-quo helps
>>> anyone.
>>
>> Putting in a patch which might not be the right fix isn't
>> necessarily a good plan either...
>>
>> Does has_run_once get cleared if we do a re-VCPU_INIT
>> of a CPU that's run before? (We need to allow rewriting
>> of guest state at that point so that "reset VM and
>> load migration state" behaves correctly.)
> 
> no, it does not, has_run_once is set the first time a VCPU is run and is
> currently *never* cleared.
> 
>>
>> I suspect Jan is right and we really need to distinguish
>> the KVM_PUT_*_STATE levels in ARM QEMU. This probably
>> implies some kind of whitelist/override mechanism, since
>> by and large we neither know nor want to know the
>> semantics for system registers, we leave that up to the
>> kernel.
>>
>> Q: if you have a running VM, and you pause it for
>> an hour, what should the CNTVCT register do? Presumably
>> it should not advance, but how do we arrange for that
>> to happen?
>>
> 
> I think the CNTVCT should not advance when the VM is not scheduled, so
> if we pause the VM or starve all the VCPUs for enough time, the guest
> should not see time progressing, since otherwise the guest scheduler
> cannot maintain fairness and you're bound to see spurious RCU stalls
> etc.
> 
> That's exactly why a guest can read both a virtual and physical counter
> and it is an area where you simply want some level of
> paravirtualization.  I haven't studied how/if Linux deals with this at
> all.
> 
> So I think adjusting CNTVOFF should be managed by the kernel for the
> pause/starvation scenario (which I think Avi once told me x86 does too -
> does anyone know the current state of the art?).
> 
> So the only situation where I think userspace should adjust the CNTVOFF
> value is for migration where we are talking about a brand new VM with
> has_run_once clear.
> 
> Thus, if we were designing this from scratch now, the API should
> be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the
> VM has run once, but it's too late for that as we would break userspace.
> The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF
> in the kernel side as well, and finally also fix QEMU so that it doesn't
> try to do the thing that future kernels will ignore.

Fixing QEMU to only write on KVM_PUT_FULL_STATE - yes, that should be
done, but I don't think the approach for the kernel is generally right.
The kernel should not do any policing on user space requests to change
the VCPU or VM state unless

 - security is affected
 - userspace lacks information, thus cannot decide correctly
 - legacy userspace has a bug, we can detect it and want to fix that up
   without affecting future userspace that has a reason to do it
   differently

Regarding CNTVOFF, the first two criteria do not apply for sure. Maybe
the last one, don't know. Just think of the hypothetical scenario that a
userspace VM debugger wants to inject certain register manipulations. If
you block this by some hidden VM state as proposed, that feature would
no longer be implementable easily.

Jan





Re: [RFC PATCH] KVM: arm/arm64: Don't let userspace update CNTVOFF once guest is running

2015-06-25 Thread Jan Kiszka

On 2015-06-25 11:25, Claudio Fontana wrote:
> On 25.06.2015 11:10, Peter Maydell wrote:
>> On 25 June 2015 at 09:59, Claudio Fontana  wrote:
>>> Once the VM is created, I think QEMU should not request kvm to
>>> change the virtual offset of the VM anymore: maybe an unexpected
>>> consequence of QEMU's target-arm/kvm64.c::kvm_arch_put_registers ?
>>
>> Hmm. In general we assume that we can:
>>  * stop the VM
>>  * read all the guest system registers
>>  * write those values back again
>>  * restart the VM
>>
>> if we need to. Is that what's happening here, or are we doing
>> something odder?
>>
>> -- PMM
>>
> 
> What I guess could be happening by looking at the code in linux
> 
> virt/kvm/arm/arch_timer.c::kvm_arm_timer_set_reg
> 
> is that QEMU tries to set the KVM_REG_ARM_TIMER_CNT register from
> exactly the previous value, but just because of the fact that the set
> function is called, cntvoff is updated, since the value provided by
> the user is apparently assumed to be _relative_ to the physical timer.
> 
> This is apparent to me in the code in that function which says:
> 
> case KVM_REG_ARM_TIMER_CNT: {
> /* ... */
> u64 cntvoff = kvm_phys_timer_read() - value;
> /* ... */
> }
> 
> And this is matched by the corresponding get function
> kvm_arm_timer_get_reg, where it says:
> 
> case KVM_REG_ARM_TIMER_CNT:
>return kvm_phys_timer_read() - vcpu->kvm->arch.timer.cntvoff;
> 
> The time difference between when the GET is issued by QEMU and when
> the PUT is issued then would account for the difference.

QEMU has the concept of write-back levels: KVM_PUT_RUNTIME_STATE,
KVM_PUT_RESET_STATE and KVM_PUT_FULL_STATE. I suspect this register is
just sorted into the wrong category, thus written as part of the
RUNTIME_STATE. We had such bug patterns during the x86 maturing phase as
well.
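
To illustrate the pattern, here is a minimal sketch, not QEMU's or the
kernel's actual code. The three level names are the ones QEMU uses;
everything else (struct toy_timer, toy_get_cnt, toy_set_cnt,
toy_put_registers and its required_level parameter) is hypothetical.
The get/set arithmetic mirrors the two kernel snippets quoted above:
a read returns the physical counter minus cntvoff, a write recomputes
cntvoff from the physical counter at the time of the write.

```c
#include <stdint.h>

/* Write-back levels, named as in QEMU; ordered runtime < reset < full. */
enum {
    KVM_PUT_RUNTIME_STATE = 1,
    KVM_PUT_RESET_STATE   = 2,
    KVM_PUT_FULL_STATE    = 3,
};

/* Hypothetical toy model of the kernel side of the virtual timer. */
struct toy_timer {
    uint64_t phys;    /* physical counter, advanced by the caller */
    uint64_t cntvoff; /* offset between physical and virtual counter */
};

/* Models the GET path: virtual count = physical count - offset. */
uint64_t toy_get_cnt(const struct toy_timer *t)
{
    return t->phys - t->cntvoff;
}

/* Models the SET path: the offset is recomputed relative to "now". */
void toy_set_cnt(struct toy_timer *t, uint64_t value)
{
    t->cntvoff = t->phys - value;
}

/*
 * Hypothetical userspace sync: the saved counter is written back only
 * if the current sync level reaches the level the register is sorted
 * into (required_level).
 */
void toy_put_registers(struct toy_timer *t, uint64_t saved_cnt,
                       int level, int required_level)
{
    if (level >= required_level)
        toy_set_cnt(t, saved_cnt);
}
```

With the register sorted into the runtime category, a GET followed by
a PUT some ticks later shifts cntvoff by exactly those ticks. Sorted
into the full-state category, a runtime-level sync leaves cntvoff
untouched.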

Jan




Re: [PATCH v5] i386: Introduce ARAT CPU feature

2015-06-22 Thread Jan Kiszka

On 2015-06-23 04:50, Wanpeng Li wrote:
> 
> 
> On 6/22/15 1:38 AM, Jan Kiszka wrote:
>> On 2015-06-18 22:21, Eduardo Habkost wrote:
>>> On Sun, Jun 07, 2015 at 11:15:08AM +0200, Jan Kiszka wrote:
>>>> From: Jan Kiszka 
>>>>
>>>> ARAT signals that the APIC timer does not stop in power saving states.
>>>> As our APICs are emulated, it's fine to expose this feature to guests,
>>>> at least when asking for KVM host features or with CPU types that
>>>> include the flag. The exact model number that introduced the feature is
>>>> not known, but reports can be found that it's at least available since
>>>> Sandy Bridge.
>>>>
>>>> Signed-off-by: Jan Kiszka 
>>> The code looks good now, but: what are the real consequences of
>>> enabling/disabling the flag? What exactly guests use it for?
>>>
>>> Isn't this going to make guests have additional expectations about the
>>> APIC timer that may be broken when live-migrating or pausing the VM?
>> ARAT only refers to stopping of the timer in certain power states (which
>> we do not even emulate IIRC). In that case, the OS is at risk of
>> sleeping forever, thus needs to look for a different wakeup source.
> 
> HPET will always be the default broadcast event device I think.

But it's unused (under Linux) if per-cpu clockevents are unaffected by
CLOCK_EVT_FEAT_C3STOP (x86-only "non-feature"), i.e. have ARAT set. And
other guests may have other strategies to deal with missing ARAT.

Again, the scenario for me was not a regular setup but some Jailhouse
boot of Linux where neither a HPET nor a PIT is available as broadcast
sources and Linux therefore refuses to switch to hires mode - in
contrast to running on real hardware.
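
The dependency can be sketched roughly as follows. This is a
simplified model, not the kernel's code: the struct and both
functions are hypothetical, and only the flag name
CLOCK_EVT_FEAT_C3STOP is taken from Linux (the bit value used here is
arbitrary). A per-CPU clockevent that may stop in deep power states
needs a broadcast device as backup; with ARAT the flag is not set and
no broadcast source is required.

```c
#include <stdbool.h>

/* Arbitrary bit for this sketch; stands in for CLOCK_EVT_FEAT_C3STOP. */
#define TOY_CLOCK_EVT_FEAT_C3STOP 0x1u

/* Hypothetical, stripped-down clockevent device. */
struct toy_clockevent {
    unsigned int features;
};

/* Without ARAT, the local APIC timer is marked as stopping in C3. */
unsigned int toy_lapic_features(bool cpu_has_arat)
{
    return cpu_has_arat ? 0u : TOY_CLOCK_EVT_FEAT_C3STOP;
}

/*
 * High-resolution mode is possible if the per-CPU device never stops,
 * or if a broadcast device (HPET, PIT, ...) can wake the CPU instead.
 */
bool toy_can_switch_to_highres(const struct toy_clockevent *percpu,
                               bool have_broadcast_device)
{
    if (!(percpu->features & TOY_CLOCK_EVT_FEAT_C3STOP))
        return true;
    return have_broadcast_device;
}
```

In the Jailhouse case, have_broadcast_device is false, so only the
ARAT path allows the switch.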

Jan




Re: [PATCH v5] i386: Introduce ARAT CPU feature

2015-06-21 Thread Jan Kiszka

On 2015-06-18 22:21, Eduardo Habkost wrote:
> On Sun, Jun 07, 2015 at 11:15:08AM +0200, Jan Kiszka wrote:
>> From: Jan Kiszka 
>>
>> ARAT signals that the APIC timer does not stop in power saving states.
>> As our APICs are emulated, it's fine to expose this feature to guests,
>> at least when asking for KVM host features or with CPU types that
>> include the flag. The exact model number that introduced the feature is
>> not known, but reports can be found that it's at least available since
>> Sandy Bridge.
>>
>> Signed-off-by: Jan Kiszka 
> 
> The code looks good now, but: what are the real consequences of
> enabling/disabling the flag? What exactly guests use it for?
> 
> Isn't this going to make guests have additional expectations about the
> APIC timer that may be broken when live-migrating or pausing the VM?

ARAT only refers to stopping of the timer in certain power states (which
we do not even emulate IIRC). In that case, the OS is at risk of
sleeping forever, thus needs to look for a different wakeup source.
Live-migration or VM pausing are external effects on all timers of the
guest, not only the APIC. However, none of them cause a wakeup miss -
provided the host decides to resume the guest eventually.
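
For reference, the flag itself is a single bit: CPUID leaf 6 reports
ARAT in EAX bit 2, which matches the position of "arat" in the patch's
cpuid_6_feature_name table. A minimal sketch of decoding it from an
already-read EAX value follows; the macro and function names are made
up for illustration, and the CPUID instruction itself is not executed
here.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.06H:EAX bit 2 = ARAT (name is local to this sketch). */
#define TOY_CPUID_6_EAX_ARAT (1u << 2)

/* Returns true if the given CPUID leaf 6 EAX value advertises ARAT. */
bool toy_has_arat(uint32_t cpuid_6_eax)
{
    return (cpuid_6_eax & TOY_CPUID_6_EAX_ARAT) != 0;
}
```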

Jan


vfio-pci + no-kvm-irqchip = oops

2015-06-11 Thread Jan Kiszka

Hi Alex,

just tried vfio-pci with user-space irqchip (qemu-system-x86_64 -device
vfio-pci,host=... -enable-kvm -no-kvm-irqchip). This ends up in the
following oops:

[   61.908453] BUG: unable to handle kernel NULL pointer dereference at 
0128
[   61.908462] IP: [] kvm_irq_map_gsi+0x7c/0xd7 [kvm]
[   61.908488] PGD 0 
[   61.908491] Oops:  [#1] PREEMPT SMP 
[   61.908496] Modules linked in: vfio_iommu_type1 vfio_pci vfio vfio_virqfd 
xt_tcpudp xt_pkttype xt_limit fuse af_packet snd_pcm_oss snd_mixer_oss snd_seq 
snd_seq_device ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 
ip6table_raw ipt_REJECT nf_reject_ipv4 iptable_raw iptable_filter 
ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast 
nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack 
ip6table_filter ip6_tables x_tables ipv6 dm_mod snd_hda_codec_generic vhost_net 
vhost tun kvm_intel snd_hda_intel kvm snd_hda_controller snd_hda_codec i2c_i801 
lpc_ich sg snd_hda_core snd_pcm mfd_core snd_timer snd evdev psmouse soundcore 
pcspkr serio_raw e1000 intel_agp button intel_gtt virtio_scsi fan thermal_sys 
ata_generic ahci libahci
[   61.908563] CPU: 2 PID: 5322 Comm: qemu-system-x86 Not tainted 
4.1.0-rc6-dbg+ #95
[   61.908568] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
[   61.908574] task: 880031fe6a10 ti: 88001746 task.ti: 
88001746
[   61.908578] RIP: 0010:[]  [] 
kvm_irq_map_gsi+0x7c/0xd7 [kvm]
[   61.908589] RSP: 0018:880017463c58  EFLAGS: 00010046
[   61.908592] RAX:  RBX: 880031f94000 RCX: 0081c000
[   61.908596] RDX: 0001 RSI: 880031f94388 RDI: 0046
[   61.908600] RBP: 880017463c78 R08: 821d0f38 R09: 
[   61.908603] R10: 880031f94c98 R11: 0246 R12: 880017463c98
[   61.908607] R13:  R14:  R15: 88001a95de00
[   61.908613] FS:  7f05e2c3aae0() GS:88003fd0() 
knlGS:
[   61.908618] CS:  0010 DS:  ES:  CR0: 80050033
[   61.908634] CR2: 0128 CR3: 1a8ce000 CR4: 001427a0
[   61.908641] DR0: 8278f3d8 DR1:  DR2: 
[   61.908646] DR3:  DR6: 0ff0 DR7: 0600
[   61.908651] Stack:
[   61.908654]  88001a95de00 880031f94238 880031f94388 
880031f94c60
[   61.908662]  880017463d78 a0145a74 880017463d08 
81089fcc
[   61.908669]  0001 0006d950 00020001 
82159f50
[   61.908676] Call Trace:
[   61.908696]  [] irqfd_update+0x2a/0xaf [kvm]
[   61.908727]  [] ? __lock_acquire+0xa1f/0x12d6
[   61.908739]  [] ? kvm_irqfd+0x486/0x5d7 [kvm]
[   61.908750]  [] kvm_irqfd+0x4cd/0x5d7 [kvm]
[   61.908761]  [] ? kvm_irqfd+0x486/0x5d7 [kvm]
[   61.908772]  [] kvm_vm_ioctl+0x35d/0x662 [kvm]
[   61.908783]  [] ? debug_smp_processor_id+0x17/0x19
[   61.908793]  [] do_vfs_ioctl+0x3bb/0x47a
[   61.908798]  [] ? __fget+0x5/0x186
[   61.908803]  [] ? __fget_light+0x65/0x75
[   61.908808]  [] ? __fd_install+0x9a/0xa6
[   61.908814]  [] SyS_ioctl+0x53/0x81
[   61.908825]  [] system_call_fastpath+0x12/0x76
[   61.908830] Code: 00 e8 73 ff f3 e0 85 c0 75 1f 48 c7 c2 ff 3d 18 a0 be 35 
00 00 00 48 c7 c7 28 3e 18 a0 c6 05 91 a1 04 00 01 e8 a6 0b f4 e0 31 c0 <45> 3b 
b5 28 01 00 00 73 49 4b 8b 94 f5 30 01 00 00 48 85 d2 74 
[   61.908875] RIP  [] kvm_irq_map_gsi+0x7c/0xd7 [kvm]
[   61.908887]  RSP 
[   61.908890] CR2: 0128

This test was in QEMU, i.e. nested, but the oops is reproducible on real
hw as well. And on older kernels, e.g. 3.18.

Known issue? Any idea what goes wrong?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

[PATCH v5] i386: Introduce ARAT CPU feature

2015-06-07 Thread Jan Kiszka

From: Jan Kiszka 

ARAT signals that the APIC timer does not stop in power saving states.
As our APICs are emulated, it's fine to expose this feature to guests,
at least when asking for KVM host features or with CPU types that
include the flag. The exact model number that introduced the feature is
not known, but reports can be found that it's at least available since
Sandy Bridge.

Signed-off-by: Jan Kiszka 
---

Changes in v5:
 - rebased over master

 include/hw/i386/pc.h |  7 ++-
 target-i386/cpu.c| 33 -
 target-i386/cpu.h|  3 +++
 target-i386/kvm.c|  2 ++
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index bec6de1..3b0b30f 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -294,7 +294,12 @@ int e820_get_num_entries(void);
 bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
 #define PC_COMPAT_2_3 \
-HW_COMPAT_2_3
+HW_COMPAT_2_3 \
+{\
+.driver   = TYPE_X86_CPU,\
+.property = "arat",\
+.value= "off",\
+},
 
 #define PC_COMPAT_2_2 \
 PC_COMPAT_2_3 \
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 99ad551..b5b9fc2 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -284,6 +284,17 @@ static const char *cpuid_xsave_feature_name[] = {
 NULL, NULL, NULL, NULL,
 };
 
+static const char *cpuid_6_feature_name[] = {
+NULL, NULL, "arat", NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+};
+
 #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
 #define PENTIUM_FEATURES (I486_FEATURES | CPUID_DE | CPUID_TSC | \
   CPUID_MSR | CPUID_MCE | CPUID_CX8 | CPUID_MMX | CPUID_APIC)
@@ -339,6 +350,7 @@ static const char *cpuid_xsave_feature_name[] = {
   CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
 #define TCG_APM_FEATURES 0
+#define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
 
 
 typedef struct FeatureWordInfo {
@@ -408,6 +420,11 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 .cpuid_reg = R_EAX,
 .tcg_features = 0,
 },
+[FEAT_6_EAX] = {
+.feat_names = cpuid_6_feature_name,
+.cpuid_eax = 6, .cpuid_reg = R_EAX,
+.tcg_features = TCG_6_EAX_FEATURES,
+},
 };
 
 typedef struct X86RegisterInfo32 {
@@ -1001,6 +1018,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
 .features[FEAT_8000_0001_ECX] =
 CPUID_EXT3_LAHF_LM,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Westmere E56xx/L56xx/X56xx (Nehalem-C)",
 },
@@ -1030,6 +1049,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_LAHF_LM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Intel Xeon E312xx (Sandy Bridge)",
 },
@@ -1062,6 +1083,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_LAHF_LM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Intel Xeon E3-12xx v2 (Ivy Bridge)",
 },
@@ -1096,6 +1119,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Intel Core Processor (Haswell, no TSX)",
 },{
@@ -1130,6 +1155,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_RTM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Intel Core Processor (Haswell)",
 },
@@ -1166,6 +1193,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_SMAP,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Intel Core Processor (Broadwell, no TSX)",
 },
@@ -1202,6 +1231,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_SMAP,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,

Re: [PATCH] KVM: x86: Allow ARAT CPU feature

2015-05-25 Thread Jan Kiszka

On 2015-05-26 03:37, Yong Wang wrote:
> On Mon, May 25, 2015 at 03:24:05PM +0200, Paolo Bonzini wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>>
>>
>> On 24/05/2015 17:22, Jan Kiszka wrote:
>>> From: Jan Kiszka 
>>>
>>> There is no reason to deny this feature to guests. We are
>>> emulating the APIC timer, thus are exposing it without stops in
>>> power-saving states.
>>>
>>> Signed-off-by: Jan Kiszka 
>>
>> Thanks, looks good.
>>
> 
> What's the motivation of exposing ARAT to guests?

First of all, another step towards feature correctness for real CPU
models. But I also have a setup where Linux only has APICs as
clockevents (Jailhouse non-root cells), thus has no broadcast source. In
that case it depends on ARAT to switch to highres mode.

Jan


[PATCH v4] i386: Introduce ARAT CPU feature

2015-05-25 Thread Jan Kiszka

From: Jan Kiszka 

ARAT signals that the APIC timer does not stop in power saving states.
As our APICs are emulated, it's fine to expose this feature to guests,
at least when asking for KVM host features or with CPU types that
include the flag. The exact model number that introduced the feature is
not known, but reports can be found that it's at least available since
Sandy Bridge.

Signed-off-by: Jan Kiszka 
---

Changes in v4:
 - followed suggestions by Eduardo, now using PC_COMPAT_2_3 define

 hw/i386/pc_piix.c|  4 
 hw/i386/pc_q35.c |  4 
 include/hw/i386/pc.h |  8 
 target-i386/cpu.c| 33 -
 target-i386/cpu.h|  3 +++
 target-i386/kvm.c|  2 ++
 6 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 212e263..b675d2c 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -543,6 +543,10 @@ static QEMUMachine pc_i440fx_machine_v2_3 = {
 PC_I440FX_2_3_MACHINE_OPTIONS,
 .name = "pc-i440fx-2.3",
 .init = pc_init_pci_2_3,
+.compat_props = (GlobalProperty[]) {
+PC_COMPAT_2_3,
+{ /* end of list */ }
+},
 };
 
 #define PC_I440FX_2_2_MACHINE_OPTIONS PC_I440FX_2_3_MACHINE_OPTIONS
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index e67f2de..38c3cf2 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -439,6 +439,10 @@ static QEMUMachine pc_q35_machine_v2_3 = {
 PC_Q35_2_3_MACHINE_OPTIONS,
 .name = "pc-q35-2.3",
 .init = pc_q35_init_2_3,
+.compat_props = (GlobalProperty[]) {
+PC_COMPAT_2_3,
+{ /* end of list */ }
+},
 };
 
 #define PC_Q35_2_2_MACHINE_OPTIONS PC_Q35_2_3_MACHINE_OPTIONS
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 1b35168..365af62 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -295,7 +295,15 @@ int e820_add_entry(uint64_t, uint64_t, uint32_t);
 int e820_get_num_entries(void);
 bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 
+#define PC_COMPAT_2_3 \
+{\
+.driver   = TYPE_X86_CPU,\
+.property = "arat",\
+.value= "off",\
+}
+
 #define PC_COMPAT_2_0 \
+PC_COMPAT_2_3, \
 HW_COMPAT_2_1, \
 {\
 .driver   = "virtio-scsi-pci",\
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index e38943e..c273d24 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -284,6 +284,17 @@ static const char *cpuid_xsave_feature_name[] = {
 NULL, NULL, NULL, NULL,
 };
 
+static const char *cpuid_6_feature_name[] = {
+NULL, NULL, "arat", NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+};
+
 #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
 #define PENTIUM_FEATURES (I486_FEATURES | CPUID_DE | CPUID_TSC | \
   CPUID_MSR | CPUID_MCE | CPUID_CX8 | CPUID_MMX | CPUID_APIC)
@@ -339,6 +350,7 @@ static const char *cpuid_xsave_feature_name[] = {
   CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
 #define TCG_APM_FEATURES 0
+#define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
 
 
 typedef struct FeatureWordInfo {
@@ -408,6 +420,11 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 .cpuid_reg = R_EAX,
 .tcg_features = 0,
 },
+[FEAT_6_EAX] = {
+.feat_names = cpuid_6_feature_name,
+.cpuid_eax = 6, .cpuid_reg = R_EAX,
+.tcg_features = TCG_6_EAX_FEATURES,
+},
 };
 
 typedef struct X86RegisterInfo32 {
@@ -1001,6 +1018,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
 .features[FEAT_8000_0001_ECX] =
 CPUID_EXT3_LAHF_LM,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Westmere E56xx/L56xx/X56xx (Nehalem-C)",
 },
@@ -1030,6 +1049,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_LAHF_LM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Intel Xeon E312xx (Sandy Bridge)",
 },
@@ -1062,6 +1083,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_LAHF_LM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Intel Xeon E3-12xx v2 (Ivy Bridge)",
 },
@@ -1096,6 +1119,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID,
 .features[FEAT_XSAVE] =
 CP

[PATCH v3] i386: Introduce ARAT CPU feature

2015-05-25 Thread Jan Kiszka

From: Jan Kiszka 

ARAT signals that the APIC timer does not stop in power saving states.
As our APICs are emulated, it's fine to expose this feature to guests,
at least when asking for KVM host features or with CPU types that
include the flag. The exact model number that introduced the feature is
not known, but reports can be found that it's at least available since
Sandy Bridge.

Signed-off-by: Jan Kiszka 
---

Changes in v3 (too quick...):
 - fix typo in cpu model name
 - also cover q35

 hw/i386/pc_piix.c | 10 ++
 hw/i386/pc_q35.c  | 10 ++
 target-i386/cpu.c | 33 -
 target-i386/cpu.h |  3 +++
 target-i386/kvm.c |  2 ++
 5 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 212e263..8a29af1 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -312,6 +312,16 @@ static void pc_init_pci(MachineState *machine)
 
 static void pc_compat_2_3(MachineState *machine)
 {
+x86_cpu_compat_set_features("Westmere", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("SandyBridge", FEAT_6_EAX, 0,
+CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("IvyBridge", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Haswell-noTSX", FEAT_6_EAX, 0,
+CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Haswell", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Broadwell-noTSX", FEAT_6_EAX, 0,
+CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Broadwell", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
 }
 
 static void pc_compat_2_2(MachineState *machine)
diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index e67f2de..ca736d4 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -291,6 +291,16 @@ static void pc_q35_init(MachineState *machine)
 
 static void pc_compat_2_3(MachineState *machine)
 {
+x86_cpu_compat_set_features("Westmere", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("SandyBridge", FEAT_6_EAX, 0,
+CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("IvyBridge", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Haswell-noTSX", FEAT_6_EAX, 0,
+CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Haswell", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Broadwell-noTSX", FEAT_6_EAX, 0,
+CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Broadwell", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
 }
 
 static void pc_compat_2_2(MachineState *machine)
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 3305e09..e435a08 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -284,6 +284,17 @@ static const char *cpuid_xsave_feature_name[] = {
 NULL, NULL, NULL, NULL,
 };
 
+static const char *cpuid_6_feature_name[] = {
+NULL, NULL, "arat", NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+};
+
 #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
 #define PENTIUM_FEATURES (I486_FEATURES | CPUID_DE | CPUID_TSC | \
   CPUID_MSR | CPUID_MCE | CPUID_CX8 | CPUID_MMX | CPUID_APIC)
@@ -339,6 +350,7 @@ static const char *cpuid_xsave_feature_name[] = {
   CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
 #define TCG_APM_FEATURES 0
+#define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
 
 
 typedef struct FeatureWordInfo {
@@ -408,6 +420,11 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 .cpuid_reg = R_EAX,
 .tcg_features = 0,
 },
+[FEAT_6_EAX] = {
+.feat_names = cpuid_6_feature_name,
+.cpuid_eax = 6, .cpuid_reg = R_EAX,
+.tcg_features = TCG_6_EAX_FEATURES,
+},
 };
 
 typedef struct X86RegisterInfo32 {
@@ -1001,6 +1018,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
 .features[FEAT_8000_0001_ECX] =
 CPUID_EXT3_LAHF_LM,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Westmere E56xx/L56xx/X56xx (Nehalem-C)",
 },
@@ -1030,6 +1049,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_LAHF_LM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x8000000A,
 .model_id = "Intel Xeon E312xx (Sandy Bridge)",
 },
@@ -1062,6 +1083,8 @@ static X86CPUDefinition builtin_x86_defs[] = {

[PATCH v2] i386: Introduce ARAT CPU feature

2015-05-25 Thread Jan Kiszka

From: Jan Kiszka 

ARAT signals that the APIC timer does not stop in power saving states.
As our APICs are emulated, it's fine to expose this feature to guests,
at least when asking for KVM host features or with CPU types that
include the flag. The exact model number that introduced the feature is
not known, but reports can be found that it's at least available since
Sandy Bridge.

Signed-off-by: Jan Kiszka 
---

Changes in v2:
 - remove feature from Intel CPU types in compat machines

 hw/i386/pc_piix.c | 10 ++
 target-i386/cpu.c | 33 -
 target-i386/cpu.h |  3 +++
 target-i386/kvm.c |  2 ++
 4 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c
index 212e263..83133fa 100644
--- a/hw/i386/pc_piix.c
+++ b/hw/i386/pc_piix.c
@@ -312,6 +312,16 @@ static void pc_init_pci(MachineState *machine)
 
 static void pc_compat_2_3(MachineState *machine)
 {
+x86_cpu_compat_set_features("Westmere", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("SandyBridge", FEAT_6_EAX, 0,
+CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("IvyBridge", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Haswell-noTSX", FEAT_6_EAX, 0,
+CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Haswell", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Broadwell-noTSC", FEAT_6_EAX, 0,
+CPUID_6_EAX_ARAT);
+x86_cpu_compat_set_features("Broadwell", FEAT_6_EAX, 0, CPUID_6_EAX_ARAT);
 }
 
 static void pc_compat_2_2(MachineState *machine)
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 3305e09..e435a08 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -284,6 +284,17 @@ static const char *cpuid_xsave_feature_name[] = {
 NULL, NULL, NULL, NULL,
 };
 
+static const char *cpuid_6_feature_name[] = {
+NULL, NULL, "arat", NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+};
+
 #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
 #define PENTIUM_FEATURES (I486_FEATURES | CPUID_DE | CPUID_TSC | \
   CPUID_MSR | CPUID_MCE | CPUID_CX8 | CPUID_MMX | CPUID_APIC)
@@ -339,6 +350,7 @@ static const char *cpuid_xsave_feature_name[] = {
   CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
 #define TCG_APM_FEATURES 0
+#define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
 
 
 typedef struct FeatureWordInfo {
@@ -408,6 +420,11 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 .cpuid_reg = R_EAX,
 .tcg_features = 0,
 },
+[FEAT_6_EAX] = {
+.feat_names = cpuid_6_feature_name,
+.cpuid_eax = 6, .cpuid_reg = R_EAX,
+.tcg_features = TCG_6_EAX_FEATURES,
+},
 };
 
 typedef struct X86RegisterInfo32 {
@@ -1001,6 +1018,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
 .features[FEAT_8000_0001_ECX] =
 CPUID_EXT3_LAHF_LM,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Westmere E56xx/L56xx/X56xx (Nehalem-C)",
 },
@@ -1030,6 +1049,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_LAHF_LM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Xeon E312xx (Sandy Bridge)",
 },
@@ -1062,6 +1083,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_LAHF_LM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Xeon E3-12xx v2 (Ivy Bridge)",
 },
@@ -1096,6 +1119,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Core Processor (Haswell, no TSX)",
 },{
@@ -1130,6 +1155,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_RTM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Core Processor (Haswell)",
 },
@@ -1166,6 +1193,8 @@ static X86CPUDefinition builtin_x86_d

Re: [PATCH v2] KVM: SVM: Sync g_pat with guest-written PAT value

2015-05-24 Thread Jan Kiszka

On 2015-04-21 14:21, Radim Krčmář wrote:
> 2015-04-21 13:09+0200, Paolo Bonzini:
>>
>>
>> On 20/04/2015 19:25, Jan Kiszka wrote:
>>> When hardware supports the g_pat VMCB field, we can use it for emulating
>>> the PAT configuration that the guest configures by writing to the
>>> corresponding MSR.
>>>
>>> Signed-off-by: Jan Kiszka 
>>
>> I'm not sure about this.  The problem is that, unlike Intel, AMD has no
>> way for the host to force its PAT value and ignore the guest's.  I'm
>> worried about potential performance problems in the guest.
> 
> We already set g_pat to 0x0007040600070406ULL in init_vmcb().
> This patch uses caching that the guest expects, which might improve
> performance as well.  I think it's a step in the right direction even if we
> somehow optimize cache coherent cases later.

This topic is still open - and the patch still applies.

Jan




[PATCH] KVM: x86: Allow ARAT CPU feature

2015-05-24 Thread Jan Kiszka

From: Jan Kiszka 

There is no reason to deny this feature to guests. We emulate the APIC
timer, so from the guest's point of view it never stops in power-saving
states.

Signed-off-by: Jan Kiszka 
---
 arch/x86/kvm/cpuid.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 59b69f6..6d84a9e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -411,6 +411,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		}
 		break;
 	}
+	case 6: /* Thermal management */
+		entry->eax = 0x4; /* allow ARAT */
+		entry->ebx = 0;
+		entry->ecx = 0;
+		entry->edx = 0;
+		break;
 	case 7: {
 		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
 		/* Mask ebx against host capability word 9 */
@@ -587,7 +593,6 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		break;
 	case 3: /* Processor serial number */
 	case 5: /* MONITOR/MWAIT */
-	case 6: /* Thermal management */
 	case 0xC0000002:
 	case 0xC0000003:
 	case 0xC0000004:
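
As a reading aid (not part of the patch): the value 0x4 written to EAX
above is simply bit 2 of CPUID leaf 6, which is the ARAT flag. A minimal
sketch of how that bit could be checked is shown here; the helper names
cpuid6_has_arat() and cpuid6_guest_eax() are made up for illustration and
exist neither in KVM nor in QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.06H:EAX bit 2 is ARAT ("Always Running APIC Timer"). */
#define CPUID_6_EAX_ARAT (1u << 2)

/* Hypothetical helper: does this leaf-6 EAX value advertise ARAT? */
bool cpuid6_has_arat(uint32_t eax)
{
    return (eax & CPUID_6_EAX_ARAT) != 0;
}

/*
 * Hypothetical helper: the leaf-6 EAX value a hypervisor reports when
 * ARAT is the only thermal/power feature it lets through, mirroring
 * the "entry->eax = 0x4" assignment in the patch.
 */
uint32_t cpuid6_guest_eax(void)
{
    uint32_t eax = CPUID_6_EAX_ARAT;

    assert(cpuid6_has_arat(eax));
    return eax;
}
```

All other bits of the leaf stay zero, so the guest sees no further
thermal or power management features.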




[PATCH] i386: Introduce ARAT CPU feature

2015-05-24 Thread Jan Kiszka

From: Jan Kiszka 

ARAT signals that the APIC timer does not stop in power saving states.
As our APICs are emulated, it's fine to expose this feature to guests,
at least when asking for KVM host features or with CPU types that
include the flag. The exact model number that introduced the feature is
not known, but reports indicate that it has been available at least
since Sandy Bridge.

Signed-off-by: Jan Kiszka 
---
 target-i386/cpu.c | 33 ++++++++++++++++++++++++++++++++-
 target-i386/cpu.h |  3 +++
 target-i386/kvm.c |  2 ++
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 3305e09..e435a08 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -284,6 +284,17 @@ static const char *cpuid_xsave_feature_name[] = {
 NULL, NULL, NULL, NULL,
 };
 
+static const char *cpuid_6_feature_name[] = {
+NULL, NULL, "arat", NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+NULL, NULL, NULL, NULL,
+};
+
 #define I486_FEATURES (CPUID_FP87 | CPUID_VME | CPUID_PSE)
 #define PENTIUM_FEATURES (I486_FEATURES | CPUID_DE | CPUID_TSC | \
   CPUID_MSR | CPUID_MCE | CPUID_CX8 | CPUID_MMX | CPUID_APIC)
@@ -339,6 +350,7 @@ static const char *cpuid_xsave_feature_name[] = {
   CPUID_7_0_EBX_ERMS, CPUID_7_0_EBX_INVPCID, CPUID_7_0_EBX_RTM,
   CPUID_7_0_EBX_RDSEED */
 #define TCG_APM_FEATURES 0
+#define TCG_6_EAX_FEATURES CPUID_6_EAX_ARAT
 
 
 typedef struct FeatureWordInfo {
@@ -408,6 +420,11 @@ static FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
 .cpuid_reg = R_EAX,
 .tcg_features = 0,
 },
+[FEAT_6_EAX] = {
+.feat_names = cpuid_6_feature_name,
+.cpuid_eax = 6, .cpuid_reg = R_EAX,
+.tcg_features = TCG_6_EAX_FEATURES,
+},
 };
 
 typedef struct X86RegisterInfo32 {
@@ -1001,6 +1018,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX,
 .features[FEAT_8000_0001_ECX] =
 CPUID_EXT3_LAHF_LM,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Westmere E56xx/L56xx/X56xx (Nehalem-C)",
 },
@@ -1030,6 +1049,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_LAHF_LM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Xeon E312xx (Sandy Bridge)",
 },
@@ -1062,6 +1083,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_EXT3_LAHF_LM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Xeon E3-12xx v2 (Ivy Bridge)",
 },
@@ -1096,6 +1119,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_BMI2 | CPUID_7_0_EBX_ERMS | CPUID_7_0_EBX_INVPCID,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Core Processor (Haswell, no TSX)",
 },{
@@ -1130,6 +1155,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_RTM,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Core Processor (Haswell)",
 },
@@ -1166,6 +1193,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_SMAP,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Core Processor (Broadwell, no TSX)",
 },
@@ -1202,6 +1231,8 @@ static X86CPUDefinition builtin_x86_defs[] = {
 CPUID_7_0_EBX_SMAP,
 .features[FEAT_XSAVE] =
 CPUID_XSAVE_XSAVEOPT,
+.features[FEAT_6_EAX] =
+CPUID_6_EAX_ARAT,
 .xlevel = 0x800A,
 .model_id = "Intel Core Processor (Broadwell)",
 },
@@ -2358,7 +2389,7 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         break;
     case 6:
         /* Thermal and Power Leaf */
-        *eax = 0;
+        *eax = env->features[FEAT_6_EAX];
         *ebx = 0;
         *ecx = 0;
         *edx = 0;
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 4ee12ca..800158e 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -412,6 +412,7 @@ typedef enum FeatureWord {
 FEAT_KVM,   /* CPUID[4000_0001].EAX (KVM_CPUID_FEATURES) */
 FEAT_S

Re: Announcing qboot, a minimal x86 firmware for QEMU

2015-05-21 Thread Jan Kiszka

On 2015-05-21 15:51, Paolo Bonzini wrote:
> Some of you may have heard about the "Clear Containers" initiative from
> Intel, which couples KVM with various kernel tricks to create extremely
> lightweight virtual machines.  The experimental Clear Containers setup
> requires only 18-20 MB to launch a virtual machine, and needs about 60
> ms to boot.
> 
> Now, as all of you probably know, "QEMU is great for running Windows or
> legacy Linux guests, but that flexibility comes at a hefty price. Not
> only does all of the emulation consume memory, it also requires some
> form of low-level firmware in the guest as well. All of this adds quite
> a bit to virtual-machine startup times (500 to 700 milliseconds is not
> unusual)".
> 
> Right?  In fact, it's for this reason that Clear Containers uses kvmtool
> instead of QEMU.
> 
> No, wrong!  In fact, reporting bad performance is pretty much the same
> as throwing down the gauntlet.
> 
> Enter qboot, a minimal x86 firmware that runs on QEMU and, together with
> a slimmed-down QEMU configuration, boots a virtual machine in 40
> milliseconds[2] on an Ivy Bridge Core i7 processor.
> 
> qboot is available at git://github.com/bonzini/qboot.git.  In all the
> glory of its 8KB of code, it brings together various existing open
> source components:
> 
> * a minimal (really minimal) 16-bit BIOS runtime based on kvmtool's own BIOS
> 
> * a couple hardware initialization routines written mostly from scratch
> but with good help from SeaBIOS source code
> 
> * a minimal 32-bit libc based on kvm-unit-tests
> 
> * the Linux loader from QEMU itself
> 
> The repository has more information on how to achieve fast boot times,
> and examples of using qboot.  Right now there is a limit of 8 MB for
> vmlinuz+initrd+cmdline, which however should be enough for initrd-less
> containers.
> 
> The first commit to qboot is more or less 24 hours old, so there is
> definitely more work to do, in particular to extract ACPI tables from
> QEMU and present them to the guest.  This is probably another day of
> work or so, and it will enable multiprocessor guests with little or no
> impact on the boot times.  SMBIOS information is also available from QEMU.
> 
> On the QEMU side, there is no support yet for persistent memory and the
> NFIT tables from ACPI 6.0.  Once that (and ACPI support) is added, qboot
> will automatically start using it.
> 
> Happy hacking!

Incidentally, I did something similar these days to get Linux booting in
Jailhouse non-root cells, i.e. without a BIOS and with almost no hardware
except memory, CPUs and PCI devices. Yes, it requires a bit of PV for
Linux, but really little. Not aiming for speed (yet), just for less
hypervisor work. Maybe there are some milliseconds to save when cutting
off more hardware in an analogous way...

PV pat^Whacks are here:
http://git.kiszka.org/?p=linux.git;a=shortlog;h=refs/heads/queues/jailhouse.
The boot loader is a combination of a Python script [1] (its result can
be saved and reused; it replaces ACPI) and very few lines of code [2][3].

Jan

[1]
https://github.com/siemens/jailhouse/blob/wip/linux-x86-inmate/tools/jailhouse-cell-linux
[2]
https://github.com/siemens/jailhouse/blob/wip/linux-x86-inmate/inmates/lib/x86/header.S
[3]
https://github.com/siemens/jailhouse/blob/wip/linux-x86-inmate/inmates/tools/x86/linux-loader.c

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [RFC PATCH 4/4] KVM: x86: Add support for local interrupt requests from userspace

2015-05-15 Thread Jan Kiszka

On 2015-05-14 00:41, Steve Rutherford wrote:
> On Wed, May 13, 2015 at 08:12:59AM +0200, Jan Kiszka wrote:
>> On 2015-05-13 03:47, Steve Rutherford wrote:
>>> In order to enable userspace PIC support, the userspace PIC needs to
>>> be able to inject local interrupt requests.
>>>
>>> This adds the ioctl KVM_REQUEST_LOCAL_INTERRUPT and kvm exit
>>> KVM_EXIT_GET_EXTINT.
>>>
>>> The vm ioctl KVM_REQUEST_LOCAL_INTERRUPT makes a KVM_REQ_EVENT request
>>> on the BSP, which causes the BSP to exit to userspace to fetch the
>>> vector of the underlying external interrupt, which the BSP then
>>> injects into the guest. This matches the PIC spec, and is necessary to
>>> boot Windows.
>>
>> The API name seems too generic, given the fairly specific use case. As
>> it only allows kicking the BSP, not any VCPU, that should be clarified.
>> Maybe call it KVM_REQUEST_PIC_INJECTION or so. Or allow userspace to
>> specify the target VCPU; then it's a bit more generic again.
>>
>> Actually, when looking at the MultiProcessor spec, you will find
>> multiple models for injecting PIC interrupts into CPU APICs. Only in
>> PIC Mode is the BSP the only possible target. In the other modes, all
>> APICs can receive PIC interrupts, and either the IOAPIC or the local
>> APIC's LINT0 configuration decides the effective target. We should
>> allow modelling all modes, based on userspace decisions.
>>
> 
> Supporting the other modes seems reasonable, but I'm not certain I have an 
> outlet for testing them for correctness. I'm not even certain which OSes use 
> the other modes. Unless there is a common OS that uses the other modes, and a 
> straightforward way for me to test the other modes, I'd rather just rename 
> the API to be less generic.

The OS has to configure what the hardware provides; I bet Windows does
so as well. This is a hardware property, thus something userspace (QEMU
& Co.) may want to configure.

> 
>>>
>>> Boots and passes the KVM unit tests on intel x86 with the
>>> PIC/PIT/IOAPIC in userspace (under a non-QEMU VMM). Boots and passes
>>> the KVM unit tests under normal conditions as well.
>>
>> Do you plan to provide a reference implementation for an open source
>> userspace VMM as well, once the kernel side is settled?
>>
> 
> Not at the moment (given that I'm not all that familiar with QEMU). I'm 
> definitely willing to help guide someone else through the process.

It would be fairly valuable to have this tested against a known, public
machine emulator so that we can validate all needs before setting the
new kernel ABI in stone.

I do have an interest in this API as well, having sat on IRQ remapping
hacks and their missing x2APIC support for too long, but I unfortunately
can't promise bandwidth either. :/

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [RFC PATCH 3/4] KVM: x86: Add EOI exit bitmap inference

2015-05-13 Thread Jan Kiszka

On 2015-05-13 15:04, Paolo Bonzini wrote:
> 
> 
> On 13/05/2015 12:25, Jan Kiszka wrote:
>>>>> But perhaps when enabling KVM_SPLIT_IRQCHIP we can use args[0] to pass
>>>>> the number of IOAPIC routes that will cause EOI exits?
>>>>
>>>> And you need to ensure that their routes can be found in the table
>>>> directly. Given IOAPIC hotplug, that may not be the first ones there...
>>>
>>> Can you reserve a bunch of GSIs at the beginning of the GSI space, and
>>> use rt->map[] to access them and build the EOI exit bitmap?
>>
>> Ideally, userspace could give the kernel a hint where to look. It has a
>> copy of the routing table and manages it, no?
> 
> Well, reserving space at the beginning of the table (and communicating
> the size via KVM_ENABLE_CAP) is a way to "give the kernel a hint where to
> look".  Perhaps not the best, but simple and supports hotplug to some
> extent.

But then we need at least a way to differentiate between IOAPIC and MSI
routes so that the loop can actually stop when hitting the first
non-IOAPIC entry. Right now, this is not possible. But even that would
be a kind of ugly interface.
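
To make the concern concrete, here is a small, purely hypothetical sketch
in plain C. It assumes each routing entry carried a type tag that
separates IOAPIC routes from MSI routes, which the routing table
discussed above does not provide; the names route_type, route and
count_ioapic_prefix() are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag; real routing entries have no such distinction. */
enum route_type { ROUTE_IOAPIC, ROUTE_MSI };

struct route {
    enum route_type type;
    uint8_t vector;
};

/*
 * Count the IOAPIC routes at the start of the table, stopping at the
 * first non-IOAPIC entry. IOAPIC routes that appear later (for example
 * after a hotplug) are not seen, which is the weakness of this scheme.
 */
size_t count_ioapic_prefix(const struct route *routes, size_t n)
{
    size_t i;

    assert(routes != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (routes[i].type != ROUTE_IOAPIC)
            break;
    }
    return i;
}
```

With such a tag the scan is bounded by the number of leading IOAPIC
entries rather than by the size of the whole table, but only as long as
userspace keeps all IOAPIC routes at the front.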

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [RFC PATCH 3/4] KVM: x86: Add EOI exit bitmap inference

2015-05-13 Thread Jan Kiszka

On 2015-05-13 11:24, Paolo Bonzini wrote:
> 
> 
> On 13/05/2015 10:10, Jan Kiszka wrote:
>>>>>> There can even be multiple IOAPICs (thanks to your patches overcoming
>>>>>> the single in-kernel instance).
>>>>
>>>> With multiple IOAPICs you have more than 24 GSIs per IOAPIC.  That means
>> I don't think that the number of pins per IOAPIC increases. At least not
>> in the devices I've seen so far.
> 
> Sorry, that was supposed to be "more than 24 GSIs for the IOAPICs".
> 
>>>> that the above loop is broken for multiple IOAPICs.
>> The worst case remains #IOAPIC * 24 iterations - if we have means to
>> stop after the IOAPIC entries, not iterating over all routes.
> 
> Yes.  Which is not too bad if VCPUs can process it in parallel.
> 
>>>> But perhaps when enabling KVM_SPLIT_IRQCHIP we can use args[0] to pass
>>>> the number of IOAPIC routes that will cause EOI exits?
>> And you need to ensure that their routes can be found in the table
>> directly. Given IOAPIC hotplug, that may not be the first ones there...
> 
> Can you reserve a bunch of GSIs at the beginning of the GSI space, and
> use rt->map[] to access them and build the EOI exit bitmap?

Ideally, userspace could give the kernel a hint where to look. It has a
copy of the routing table and manages it, no?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [RFC PATCH 3/4] KVM: x86: Add EOI exit bitmap inference

2015-05-13 Thread Jan Kiszka

On 2015-05-13 10:04, Paolo Bonzini wrote:
> 
> 
> On 13/05/2015 08:12, Jan Kiszka wrote:
>>> +void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
>>> +{
>>> +   struct kvm *kvm = vcpu->kvm;
>>> +   struct kvm_kernel_irq_routing_entry *entry;
>>> +   struct kvm_irq_routing_table *table;
>>> +   u32 i, nr_rt_entries;
>>> +
>>> +   mutex_lock(&kvm->irq_lock);
> 
> This only needs irq_srcu protection, not irq_lock, so the lookup cost
> becomes much smaller (all CPUs can proceed in parallel).
> 
> You would need to put an smp_mb here, to ensure that irq_routing is read
> after KVM_SCAN_IOAPIC is cleared.  You can introduce
> smp_mb__after_srcu_read_lock in order to elide it.
> 
> The matching memory barrier would be a smp_mb__before_atomic in
> kvm_make_scan_ioapic_request.
> 
>>> +   table = kvm->irq_routing;
>>> +   nr_rt_entries = min_t(u32, table->nr_rt_entries, IOAPIC_NUM_PINS);
>>> +   for (i = 0; i < nr_rt_entries; ++i) {
>>> +   hlist_for_each_entry(entry, &table->map[i], link) {
>>> +   u32 dest_id, dest_mode;
>>> +
>>> +   if (entry->type != KVM_IRQ_ROUTING_MSI)
>>> +   continue;
>>> +   dest_id = (entry->msi.address_lo >> 12) & 0xff;
>>> +   dest_mode = (entry->msi.address_lo >> 2) & 0x1;
>>> +   if (kvm_apic_match_dest(vcpu, NULL, 0, dest_id,
>>> +   dest_mode)) {
>>> +   u32 vector = entry->msi.data & 0xff;
>>> +
>>> +   __set_bit(vector,
>>> + (unsigned long *) eoi_exit_bitmap);
>>> +   }
>>> +   }
>>> +   }
>>> +   mutex_unlock(&kvm->irq_lock);
>>> +}
>>>
>>
>> This looks a bit frightening regarding the lookup costs. Do we really
>> have to run through the complete routing table to find the needed
>> information? There can be way more "real" MSI entries than IOAPIC pins.
> 
> It does at most IOAPIC_NUM_PINS iterations however.
> 
>> There can even be multiple IOAPICs (thanks to your patches overcoming
>> the single in-kernel instance).
> 
> With multiple IOAPICs you have more than 24 GSIs per IOAPIC.  That means

I don't think that the number of pins per IOAPIC increases. At least not
in the devices I've seen so far.

> that the above loop is broken for multiple IOAPICs.

The worst case remains #IOAPIC * 24 iterations - if we have means to
stop after the IOAPIC entries, not iterating over all routes.

> 
> But perhaps when enabling KVM_SPLIT_IRQCHIP we can use args[0] to pass
> the number of IOAPIC routes that will cause EOI exits?

And you need to ensure that their routes can be found in the table
directly. Given IOAPIC hotplug, that may not be the first ones there...
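
For readers following the quoted loop, the decoding it performs can be
sketched in userspace C. This is only an illustration under simplifying
assumptions: struct msi_route is a made-up stand-in for the kernel's
routing entry, and the destination match handles physical mode with an
exact APIC ID only, whereas the patch calls kvm_apic_match_dest().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, userspace-only stand-in for an MSI routing entry. */
struct msi_route {
    uint32_t address_lo;
    uint32_t data;
};

/* Mark a vector in a 256-bit EOI exit bitmap held as four 64-bit words. */
void eoi_bitmap_set(uint64_t bitmap[4], unsigned vector)
{
    assert(vector < 256);
    bitmap[vector / 64] |= (uint64_t)1 << (vector % 64);
}

/* Set the EOI exit bit for every route that targets the given APIC ID. */
void scan_routes(const struct msi_route *routes, size_t n,
                 unsigned apic_id, uint64_t bitmap[4])
{
    size_t i;

    for (i = 0; i < n; i++) {
        /* Same field extraction as in the quoted patch. */
        unsigned dest_id = (routes[i].address_lo >> 12) & 0xff;
        unsigned dest_mode = (routes[i].address_lo >> 2) & 0x1;
        unsigned vector = routes[i].data & 0xff;

        /* Simplification: physical mode, exact ID match only. */
        if (dest_mode == 0 && dest_id == apic_id)
            eoi_bitmap_set(bitmap, vector);
    }
}
```

The cost discussed in this thread is the outer loop: it grows with the
number of routes scanned, not with the number of IOAPIC pins, unless the
scan can be bounded somehow.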

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [RFC PATCH 4/4] KVM: x86: Add support for local interrupt requests from userspace

2015-05-12 Thread Jan Kiszka

On 2015-05-13 03:47, Steve Rutherford wrote:
> In order to enable userspace PIC support, the userspace PIC needs to
> be able to inject local interrupt requests.
> 
> This adds the ioctl KVM_REQUEST_LOCAL_INTERRUPT and kvm exit
> KVM_EXIT_GET_EXTINT.
> 
> The vm ioctl KVM_REQUEST_LOCAL_INTERRUPT makes a KVM_REQ_EVENT request
> on the BSP, which causes the BSP to exit to userspace to fetch the
> vector of the underlying external interrupt, which the BSP then
> injects into the guest. This matches the PIC spec, and is necessary to
> boot Windows.

The API name seems too generic, given the fairly specific use case. As
it only allows kicking the BSP, not any VCPU, that should be clarified.
Maybe call it KVM_REQUEST_PIC_INJECTION or so. Or allow userspace to
specify the target VCPU; then it's a bit more generic again.

Actually, when looking at the MultiProcessor spec, you will find
multiple models for injecting PIC interrupts into CPU APICs. Only in
PIC Mode is the BSP the only possible target. In the other modes, all
APICs can receive PIC interrupts, and either the IOAPIC or the local
APIC's LINT0 configuration decides the effective target. We should
allow modelling all modes, based on userspace decisions.
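
A rough sketch of that distinction, as a loose model rather than a
proposal for the actual ABI: the mode names follow the MultiProcessor
spec (PIC Mode, Virtual Wire Mode, Symmetric I/O Mode), while the struct
fields and the function may_receive_pic_intr() are invented here for
illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Interrupt modes named in the MultiProcessor spec. */
enum mp_int_mode {
    MP_PIC_MODE,
    MP_VIRTUAL_WIRE_MODE,
    MP_SYMMETRIC_IO_MODE,
};

/* Hypothetical per-VCPU view of the state relevant to PIC injection. */
struct vcpu_pic_state {
    bool is_bsp;               /* VCPU is the bootstrap processor */
    bool lint0_extint;         /* LINT0 unmasked, delivery mode ExtINT */
    bool ioapic_extint_target; /* IOAPIC routes ExtINT to this VCPU */
};

/* May this VCPU be the target of a PIC interrupt in the given mode? */
bool may_receive_pic_intr(enum mp_int_mode mode,
                          const struct vcpu_pic_state *vcpu)
{
    assert(vcpu != NULL);

    if (mode == MP_PIC_MODE)
        return vcpu->is_bsp;

    /* Other modes: LINT0 or the IOAPIC configuration decides. */
    return vcpu->lint0_extint || vcpu->ioapic_extint_target;
}
```

An interface that hard-codes the BSP covers only the first branch; one
that takes a target VCPU would let userspace model the second as well.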

> 
> Boots and passes the KVM unit tests on intel x86 with the
> PIC/PIT/IOAPIC in userspace (under a non-QEMU VMM). Boots and passes
> the KVM unit tests under normal conditions as well.

Do you plan to provide a reference implementation for an open source
userspace VMM as well, once the kernel side is settled?

> 
> SVM support and device assignment are untested with this feature
> enabled, but testing for both is in the works.
> 
> Compiles for ARM/x86/PPC.
> 
> Signed-off-by: Steve Rutherford 
> ---
>  Documentation/virtual/kvm/api.txt |  9 +++
>  arch/x86/include/asm/kvm_host.h   |  1 +
>  arch/x86/kvm/irq.c|  6 -
>  arch/x86/kvm/lapic.c  |  7 ++
>  arch/x86/kvm/lapic.h  |  2 ++
>  arch/x86/kvm/x86.c| 53 
> ---
>  include/uapi/linux/kvm.h  |  6 +
>  7 files changed, 80 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index dd92996..a650321 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2993,6 +2993,15 @@ the ioapic nor the pic in the kernel. Also, enables in kernel routing of
>  interrupt requests. Fails if VCPU has already been created, or if the irqchip is
>  already in the kernel.
>  
> +4.97 KVM_REQUEST_LOCAL_INTERRUPT
> +
> +Capability: KVM_CAP_SPLIT_IRQCHIP
> +Type: VM ioctl
> +Parameters: none
> +Returns: 0 on success, -1 on error.
> +
> +Informs the kernel that userspace has a pending external interrupt.
> +
>  
>  5. The kvm_run structure
>  
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b1978f1..602ea70 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -542,6 +542,7 @@ struct kvm_vcpu_arch {
>  
>   u64 eoi_exit_bitmaps[4];
>   int pending_ioapic_eoi;
> + bool pending_external;
>  };
>  
>  struct kvm_lpage_info {
> diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
> index 706e47a..487b5f5 100644
> --- a/arch/x86/kvm/irq.c
> +++ b/arch/x86/kvm/irq.c
> @@ -43,7 +43,11 @@ EXPORT_SYMBOL(kvm_cpu_has_pending_timer);
>   */
>  static int kvm_cpu_has_extint(struct kvm_vcpu *v)
>  {
> - if (kvm_apic_accept_pic_intr(v))
> + u8 accept = kvm_apic_accept_pic_intr(v);
> +
> + if (accept && irqchip_split(v->kvm))
> + return v->arch.pending_external;
> + else if (accept)
>   return pic_irqchip(v->kvm)->output; /* PIC */
>   else
>   return 0;
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 7533b87..9a021f7 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2089,3 +2089,10 @@ void kvm_lapic_init(void)
>   jump_label_rate_limit(&apic_hw_disabled, HZ);
>   jump_label_rate_limit(&apic_sw_disabled, HZ);
>  }
> +
> +void kvm_request_local_interrupt(struct kvm_vcpu *vcpu)
> +{
> + vcpu->arch.pending_external = true;
> + kvm_make_request(KVM_REQ_EVENT, vcpu);
> + kvm_vcpu_kick(vcpu);
> +}
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 71b150c..66bb780 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -63,6 +63,8 @@ int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
>   unsigned long *dest_map);
>  int kvm_apic_local_deliver(struct kvm_lapic *apic, int lvt_type);
>  
> +void kvm_request_local_interrupt(struct kvm_vcpu *vcpu);
> +
>  bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
>   struct kvm_lapic_irq *irq, int *r, unsigned long *dest_map);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6127f

Re: [RFC PATCH 3/4] KVM: x86: Add EOI exit bitmap inference

2015-05-12 Thread Jan Kiszka

On 2015-05-13 03:47, Steve Rutherford wrote:
> In order to support a userspace IOAPIC interacting with an in kernel
> APIC, the EOI exit bitmaps need to be configurable.
> 
> If the IOAPIC is in userspace (i.e. the irqchip has been split), the
> EOI exit bitmaps will be set whenever the GSI Routes are configured.
> In particular, for the low 24 MSI routes, the EOI Exit bit
> corresponding to the destination vector will be set for the
> destination VCPU.
> 
> The intention is for the userspace IOAPIC to use MSI routes [0,23] to
> inject interrupts into the guest.
> 
> This is a slight abuse of the notion of an MSI Route, given that MSIs
> classically bypass the IOAPIC. It might be worthwhile to add an
> additional route type to improve clarity.
> 
> Compile tested for Intel x86.
> 
> Signed-off-by: Steve Rutherford 
> ---
>  arch/x86/kvm/ioapic.c| 11 +++
>  arch/x86/kvm/ioapic.h|  1 +
>  arch/x86/kvm/lapic.c |  2 ++
>  arch/x86/kvm/x86.c   | 13 +++--
>  include/linux/kvm_host.h |  4 
>  virt/kvm/irqchip.c   | 32 
>  6 files changed, 61 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index 856f791..3323c86 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -672,3 +672,14 @@ int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state)
>   spin_unlock(&ioapic->lock);
>   return 0;
>  }
> +
> +void kvm_vcpu_request_scan_userspace_ioapic(struct kvm *kvm)
> +{
> + struct kvm_ioapic *ioapic = kvm->arch.vioapic;
> +
> + if (ioapic)
> + return;
> + if (!lapic_in_kernel(kvm))
> + return;
> + kvm_make_scan_ioapic_request(kvm);
> +}
> diff --git a/arch/x86/kvm/ioapic.h b/arch/x86/kvm/ioapic.h
> index ca0b0b4..b7af71b 100644
> --- a/arch/x86/kvm/ioapic.h
> +++ b/arch/x86/kvm/ioapic.h
> @@ -123,4 +123,5 @@ int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state);
>  void kvm_ioapic_scan_entry(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap,
>   u32 *tmr);
>  
> +void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
>  #endif
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 42fada6f..7533b87 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -211,6 +211,8 @@ out:
>  
>   if (!irqchip_split(kvm))
>   kvm_vcpu_request_scan_ioapic(kvm);
> + else
> + kvm_vcpu_request_scan_userspace_ioapic(kvm);
>  }
>  
>  static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cc27c35..6127fe7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6335,8 +6335,17 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   goto out;
>   }
>   }
> - if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
> - vcpu_scan_ioapic(vcpu);
> + if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu)) {
> + if (irqchip_split(vcpu->kvm)) {
> + memset(vcpu->arch.eoi_exit_bitmaps, 0, 32);
> + kvm_scan_ioapic_routes(
> + vcpu, vcpu->arch.eoi_exit_bitmaps);
> + kvm_x86_ops->load_eoi_exitmap(
> + vcpu, vcpu->arch.eoi_exit_bitmaps);
> +
> + } else
> + vcpu_scan_ioapic(vcpu);
> + }
>   if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
>   kvm_vcpu_reload_apic_access_page(vcpu);
>   }
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index cef20ad..678215a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -438,10 +438,14 @@ void vcpu_put(struct kvm_vcpu *vcpu);
>  
>  #ifdef __KVM_HAVE_IOAPIC
>  void kvm_vcpu_request_scan_ioapic(struct kvm *kvm);
> +void kvm_vcpu_request_scan_userspace_ioapic(struct kvm *kvm);
>  #else
>  static inline void kvm_vcpu_request_scan_ioapic(struct kvm *kvm)
>  {
>  }
> +static inline void kvm_vcpu_request_scan_userspace_ioapic(struct kvm *kvm)
> +{
> +}
>  #endif
>  
>  #ifdef CONFIG_HAVE_KVM_IRQFD
> diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
> index 8aaceed..8a253aa 100644
> --- a/virt/kvm/irqchip.c
> +++ b/virt/kvm/irqchip.c
> @@ -205,6 +205,8 @@ int kvm_set_irq_routing(struct kvm *kvm,
>  
>   synchronize_srcu_expedited(&kvm->irq_srcu);
>  
> + kvm_vcpu_request_scan_userspace_ioapic(kvm);
> +
>   new = old;
>   r = 0;
>  
> @@ -212,3 +214,33 @@ out:
>   kfree(new);
>   return r;
>  }
> +
> +void kvm_scan_ioapic_routes(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> +{
> + struct kvm *kvm = vcpu->kvm;
> + struct kvm_kernel_irq_routing_entry *entry;
> + struct kvm_irq_routing_table *table;
> +

Re: [PATCH] KVM: x86: fix initial PAT value

2015-05-12 Thread Jan Kiszka

On 2015-05-13 07:02, Wanpeng Li wrote:
> Hi Radim,
> On Mon, Apr 27, 2015 at 03:11:25PM +0200, Radim Krčmář wrote:
>> PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT.
> 
> Could you point out where this value is described in SDM? :)

11.12.4 (revision from September 2014)
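
As a quick aid for reading that value: the PAT MSR holds eight entries,
one per byte, with the memory type in the low three bits of each byte.
A minimal sketch of the decoding follows; the names pat_type, pat_entry()
and PAT_RESET_VALUE are chosen here for illustration and are not taken
from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Reset value quoted in this thread. */
#define PAT_RESET_VALUE 0x0007040600070406ULL

/* Memory type encodings used in PAT entries. */
enum pat_type {
    PAT_UC       = 0x00, /* uncacheable */
    PAT_WC       = 0x01, /* write combining */
    PAT_WT       = 0x04, /* write through */
    PAT_WP       = 0x05, /* write protected */
    PAT_WB       = 0x06, /* write back */
    PAT_UC_MINUS = 0x07, /* uncached, overridable by MTRRs */
};

/* Extract entry PA<index> (0..7) from a PAT MSR value. */
unsigned pat_entry(uint64_t pat, unsigned index)
{
    assert(index < 8);
    return (unsigned)((pat >> (index * 8)) & 0x7);
}
```

Applied to the reset value, entries PA0 to PA3 decode to WB, WT, UC- and
UC, and PA4 to PA7 repeat the same pattern.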

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

[ANNOUNCE] Jailhouse 0.5 released

2015-05-11 Thread Jan Kiszka

"Release often, release early" -- we did quite well on the latter but
there is room for improvement regarding the former. So let's do it:

After its first release 0.1, we are happy to announce the new version
0.5 of the Linux-based partitioning hypervisor Jailhouse. The project
made noteworthy progress over the past months which shall be underlined
with this version number jump. Some highlights of this release:

 - AMD64 support
 - ARMv7 support, running on several boards:
   - Banana Pi
   - NVIDIA Jetson TK1
   - Versatile Express
 - inter-cell communication foundations via ivshmem devices
 - improved isolation on x86
 - support for larger x86 machines

You can download the release from

https://github.com/siemens/jailhouse/archive/v0.5.tar.gz

then follow the README for first steps on recommended evaluation
platforms. Drop us a note on the mailing list if you run into trouble.
Jailhouse also improved in usability, but dealing with real hardware
still bears the risk that something requires fine-tuning and deeper
understanding.

Beyond this release, there are already several new features in our
incubator. Among them are:

- secure (measured) startup using TPM & Intel TXT [1]
- support for booting multiple Linux instances

While it always looked as if the latter would be easier to achieve on ARM,
and there is progress on that right now [2], enabling static Linux
partitions on x86 appeared way more complex. But recent work proved the
concerns wrong: We now have single-core Linux booting in Jailhouse
cells! It is driving assigned PCI devices without any relevant
hypervisor interference [3][4]. Consequently, running cyclictest over a
-rt kernel in a cell gives native latencies. We were also able to host a
simple DPDK workload this way. We even turned off interrupts in the DPDK
cell because the test was only polling - true, 100% CPU occupation.

Thanks to all our contributors for the steady work on Jailhouse, letting
it progress that well. Special credits also go to QEMU/KVM as an
incredibly valuable toolset for development and testing on x86 - hope we
will have this on ARM as well in the near future.

Jan

[1] http://thread.gmane.org/gmane.linux.jailhouse/2692
[2] http://thread.gmane.org/gmane.linux.jailhouse/3016
[3] http://thread.gmane.org/gmane.linux.jailhouse/3032
[4] http://thread.gmane.org/gmane.linux.jailhouse/2956

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

[PATCH kvm-unit-tests] x86: vmx: Remove bogus GUEST_RIP update from interrupt test

2015-05-04 Thread Jan Kiszka

When we get an EXTINT exit, the guest RIP already points to the
instruction after the one that sent it into HLT state. Moving
the RIP based on stale insn_len caused spurious L2 crashes.

Signed-off-by: Jan Kiszka 
---
 x86/vmx_tests.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index 4f8ace1..79552fd 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -1297,10 +1297,8 @@ static int interrupt_exit_handler(void)
 			asm volatile ("nop");
 			irq_disable();
 		}
-		if (vmx_get_test_stage() >= 2) {
+		if (vmx_get_test_stage() >= 2)
 			vmcs_write(GUEST_ACTV_STATE, ACTV_ACTIVE);
-			vmcs_write(GUEST_RIP, guest_rip + insn_len);
-		}
 		return VMX_TEST_RESUME;
 	default:
 		printf("Unknown exit reason, %d\n", reason);
-- 
2.1.4

[PATCH] KVM: nVMX: Fix host crash when loading MSRs with userspace irqchip

2015-05-03 Thread Jan Kiszka

vcpu->arch.apic is NULL when a userspace irqchip is active. But instead
of letting the test incorrectly depend on in-kernel irqchip mode,
open-code it to catch also userspace x2APICs.

Signed-off-by: Jan Kiszka 
---

Affects kernels since 3.19.

 arch/x86/kvm/vmx.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f7b6168..0ef4f96 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2170,8 +2170,7 @@ static void vmx_set_msr_bitmap(struct kvm_vcpu *vcpu)
 
 	if (is_guest_mode(vcpu))
 		msr_bitmap = vmx_msr_bitmap_nested;
-	else if (irqchip_in_kernel(vcpu->kvm) &&
-		apic_x2apic_mode(vcpu->arch.apic)) {
+	else if (vcpu->arch.apic_base & X2APIC_ENABLE) {
 		if (is_long_mode(vcpu))
 			msr_bitmap = vmx_msr_bitmap_longmode_x2apic;
 		else
@@ -8924,7 +8923,7 @@ static int nested_vmx_msr_check_common(struct kvm_vcpu *vcpu,
 				       struct vmx_msr_entry *e)
 {
 	/* x2APIC MSR accesses are not allowed */
-	if (apic_x2apic_mode(vcpu->arch.apic) && e->index >> 8 == 0x8)
+	if (vcpu->arch.apic_base & X2APIC_ENABLE && e->index >> 8 == 0x8)
 		return -EINVAL;
 	if (e->index == MSR_IA32_UCODE_WRITE || /* SDM Table 35-2 */
 	    e->index == MSR_IA32_UCODE_REV)
-- 
2.1.4
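
The open-coded test in the patch above can be modelled in plain C. This
is only a sketch under stated assumptions: the function names are
hypothetical, and the point is that the decision needs nothing but the
APIC base value and the MSR index, so no in-kernel APIC object has to
exist.

```c
#include <stdbool.h>
#include <stdint.h>

#define X2APIC_ENABLE	(1ULL << 10)	/* x2APIC enable bit in IA32_APIC_BASE */

/* True if the guest has switched its local APIC to x2APIC mode. */
static bool x2apic_enabled(uint64_t apic_base)
{
	return (apic_base & X2APIC_ENABLE) != 0;
}

/* x2APIC registers occupy the MSR range 0x800..0x8ff. */
static bool is_x2apic_msr(uint32_t index)
{
	return (index >> 8) == 0x8;
}

/* Hypothetical model of the check: reject x2APIC MSRs in x2APIC mode. */
static bool msr_load_entry_rejected(uint64_t apic_base, uint32_t index)
{
	return x2apic_enabled(apic_base) && is_x2apic_msr(index);
}
```

Because only apic_base is consulted, the same test works whether the
irqchip is emulated in the kernel or in userspace.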


Re: APIC_ID in apic_reg_write()

2015-04-29 Thread Jan Kiszka

On 2015-04-30 00:21, Bandan Das wrote:
> Jan Kiszka  writes:
> ...
>>>
>>> And I can verify on a SandyBridge and Haswell system that it's RO there too.
>>
>> So the APIC just ignores the writes, it doesn't throw #GP at least.
>>
>>>
>>> In fact, that was one of the reasons I had submitted a patch to remove
>>> verify_local_APIC() from x86/kernel/apic.c (4399c03c678) If I am wrong
>>> we need to revert at least the associated commit message :)
>>
>> Well, we can't remove APIC ID modification support from KVM, though,
>> because older CPU types we may want to emulate correctly had that
>> feature. But we may have to make it configurable to ensure accurate
>> behaviour.
> 
> IMO we should just mark it as read-only. 10.4.6 2nd para says -
> 
> "In MP systems, the local APIC ID is also used as a processor ID by the
> BIOS and the operating system. Some processors permit software to modify
> the APIC ID. However, the ability of software to modify the APIC ID is
> processor model specific. Because of this, operating system software should
> avoid writing to the local APIC ID register."
> 
> Not that marking it read-only has any huge benefit, but a r/w ID reg
> could be a source of bugs with misbehaving guests. Or at the least, a

The current code has been there for quite a while, accepting writes even
for CPU models that don't do this on real hw, and nothing apparently
broke - or do you know stories?

> printk_once warning message when userspace attempts to modify it. Moreover,
> we do make an exception with enabling x2apic for guests.

The situation is different with x2apic because we even have to raise #GP
in case the guest attempts a write. That's mandated by the spec.

> 
> Setting r/w permissions on a per-model basis is a little overkill, don't you think ?

If we want accurate behaviour, we should do this. If not, we should
probably leave the code alone to avoid surprises for preexisting
host/guest setups. Modern OSes do not care anyway, but special ones may
get unhappy.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
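
The three behaviours discussed in this thread can be put side by side
in a small model. This is a sketch only; the enum and function names
are hypothetical and do not correspond to KVM's lapic code. It assumes
what the thread states: older xAPICs accept ID writes, newer ones
silently ignore them, and x2APIC mode must raise #GP on a write.

```c
#include <stdbool.h>
#include <stdint.h>

enum apic_mode {
	XAPIC_ID_WRITABLE,	/* older parts: software may change the ID */
	XAPIC_ID_READ_ONLY,	/* newer parts: writes are ignored */
	X2APIC,			/* ID is read-only, writes must fault */
};

struct apic_model {
	enum apic_mode mode;
	uint32_t id;
};

/*
 * Hypothetical model of an APIC ID write.
 * Returns true if the write has to raise #GP.
 */
static bool apic_id_write(struct apic_model *apic, uint32_t val)
{
	switch (apic->mode) {
	case XAPIC_ID_WRITABLE:
		apic->id = val;
		return false;
	case XAPIC_ID_READ_ONLY:
		return false;		/* silently dropped, no fault */
	case X2APIC:
		return true;		/* #GP, ID unchanged */
	}
	return false;
}
```

A per-model setting would amount to choosing between the first two
cases when the vCPU is created.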

Re: APIC_ID in apic_reg_write()

2015-04-29 Thread Jan Kiszka

On 2015-04-29 20:54, Bandan Das wrote:
> Jan Kiszka  writes:
> 
>> On 2015-04-29 18:47, Bandan Das wrote:
>>>
>>> Why do we allow writes to APIC_ID ? On all _newer_ processors, it's
>>> read only. The spec doesn't explicitly mention it though, or at least
>>> I couldn't find it. Does userspace have a reason to modify it ?
>>
>> The APIC ID is read-only for x2APIC. It remains R/W for xAPIC.
> 
> Are you sure ? In 10.4 of the SDM, there is a Note that says
> "In processors based on Intel microarchitecture code name Nehalem the
> Local APIC ID Register is no longer Read/Write; it is Read Only."

Indeed - confusing, specifically as it is placed there without a direct
link to the corresponding row in the table. And it doesn't say "since"
or so.

> 
> And I can verify on a SandyBridge and Haswell system that it's RO there too.

So the APIC just ignores the writes, it doesn't throw #GP at least.

> 
> In fact, that was one of the reasons I had submitted a patch to remove
> verify_local_APIC() from x86/kernel/apic.c (4399c03c678) If I am wrong
> we need to revert at least the associated commit message :)

Well, we can't remove APIC ID modification support from KVM, though,
because older CPU types we may want to emulate correctly had that
feature. But we may have to make it configurable to ensure accurate
behaviour.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: APIC_ID in apic_reg_write()

2015-04-29 Thread Jan Kiszka

On 2015-04-29 18:47, Bandan Das wrote:
> 
> Why do we allow writes to APIC_ID ? On all _newer_ processors, it's
> read only. The spec doesn't explicitly mention it though, or at least
> I couldn't find it. Does userspace have a reason to modify it ?

The APIC ID is read-only for x2APIC. It remains R/W for xAPIC.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [PATCH] KVM: nVMX: Don't return error on nested bitmap memory allocation failure

2015-04-29 Thread Jan Kiszka

On 2015-04-29 14:55, Bandan Das wrote:
> Jan Kiszka  writes:
> 
>> On 2015-04-28 21:55, Bandan Das wrote:
>>>
>>> If get_free_page() fails for nested bitmap area, it's evident that
>>> we are gonna get screwed anyway but returning failure because we failed
>>> allocating memory for a nested structure seems like an unnecessary big
>>> hammer. Also, save the call for later; after we are done with other
>>> non-nested allocations.
>>
>> Frankly, I prefer failures over automatic degradations. And, as you
>> noted, the whole system will probably explode anyway if allocation of a
>> single page already fails. So what does this buy us?
> 
> Yeah... I hear you. Ok, let me put it this way - Assume that we can
> defer this allocation up until the point that the nested subsystem is
> actually used i.e L1 tries running a guest and we try to allocate this
> area. If get_free_page() failed in that case, would we still want to
> kill L1 too ? I guess no.

We could block the hypervisor thread on the allocation, just like it
would block on faults for swapped out pages or new ones that have to be
reclaimed from the page cache first.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [PATCH] KVM: nVMX: Don't return error on nested bitmap memory allocation failure

2015-04-29 Thread Jan Kiszka

On 2015-04-28 21:55, Bandan Das wrote:
> 
> If get_free_page() fails for nested bitmap area, it's evident that
> we are gonna get screwed anyway but returning failure because we failed
> allocating memory for a nested structure seems like an unnecessary big
> hammer. Also, save the call for later; after we are done with other
> non-nested allocations.

Frankly, I prefer failures over automatic degradations. And, as you
noted, the whole system will probably explode anyway if allocation of a
single page already fails. So what does this buy us?

What could make sense is making the allocation of the vmread/write
bitmap depend on enable_shadow_vmcs, and that again depend on nested.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
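
The dependency chain suggested above (bitmaps depend on
enable_shadow_vmcs, which in turn depends on nested) could look roughly
like this. It is a userspace sketch, not the vmx.c code: the struct,
the function name and the use of calloc() in place of get_free_page()
are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdlib.h>

#define BITMAP_PAGE_SIZE 4096

struct vmx_setup {
	bool nested;			/* models the module parameter */
	bool enable_shadow_vmcs;	/* only meaningful with nested */
	void *vmread_bitmap;
	void *vmwrite_bitmap;
};

/*
 * Hypothetical setup step: shadow VMCS is switched off without nested,
 * and the bitmaps are only allocated when shadow VMCS stays enabled.
 * Returns 0 on success, -1 if a required allocation failed.
 */
static int setup_shadow_bitmaps(struct vmx_setup *s)
{
	if (!s->nested)
		s->enable_shadow_vmcs = false;
	if (!s->enable_shadow_vmcs)
		return 0;

	/* calloc() stands in for get_free_page() in this sketch */
	s->vmread_bitmap = calloc(1, BITMAP_PAGE_SIZE);
	s->vmwrite_bitmap = calloc(1, BITMAP_PAGE_SIZE);
	if (!s->vmread_bitmap || !s->vmwrite_bitmap) {
		free(s->vmread_bitmap);
		free(s->vmwrite_bitmap);
		s->vmread_bitmap = s->vmwrite_bitmap = NULL;
		return -1;
	}
	return 0;
}
```

An allocation failure is still reported as an error here, so nothing
degrades silently; the allocation is merely skipped when the feature is
not in use.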

Re: Guest handling of IA32_DEBUGCTL MSR

2015-04-28 Thread Jan Kiszka

On 2015-04-28 13:43, Paolo Bonzini wrote:
> 
> 
> On 28/04/2015 13:42, Nadav Amit wrote:
>> It seems strange that the guest is allowed to set IA32_DEBUGCTL MSR for the
>> nested VM and get this value to the physical IA32_DEBUGCTL (see
>> prepare_vmcs02), while it cannot set IA32_DEBUGCTL for itself (see
>> kvm_set_msr_common).
>>
>> Am I missing something?
> 
> No, it makes no sense.

Are you sure that vmx is not allowing direct access to that MSR while in
guest mode? We do save/restore it on all Intel CPUs, see
setup_vmcs_config. Not sure about the AMD situation, though.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [v6] kvm/fpu: Enable fully eager restore kvm FPU

2015-04-23 Thread Jan Kiszka

On 2015-04-23 12:40, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 23:13, Liang Li wrote:
>> Remove lazy FPU logic and use eager FPU entirely. Eager FPU does
>> not have performance regression, and it can simplify the code.
>>
>> When compiling kernel on westmere, the performance of eager FPU
>> is about 0.4% faster than lazy FPU.
>>
>> Signed-off-by: Liang Li 
>> Signed-off-by: Xudong Hao 
> 
> A patch like this requires much more benchmarking than what you have done.
> 
> First, what guest did you use?  A modern Linux guest will hardly ever exit
> to userspace: the scheduler uses the TSC deadline timer, which is handled
> in the kernel; the clocksource uses the TSC; virtio-blk devices are kicked
> via ioeventfd.
> 
> What happens if you time a Windows guest (without any Hyper-V enlightenments),
> or if you use clocksource=acpi_pm?
> 
> Second, "0.4%" by itself may not be statistically significant.  How did
> you gather the result?  How many times did you run the benchmark?  Did
> the guest report any stolen time?
> 
> 
> And finally, even if the patch was indeed a performance improvement,
> there is much more that you can remove.  fpu_active is always 1, 
> vmx_fpu_activate only has one call site that can be simplified just to
> 
> vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
> vmcs_writel(CR0_GUEST_HOST_MASK, ~vcpu->arch.cr0_guest_owned_bits);
> 
> and so on.

And it would be good to know what the benchmarks look like on CPUs other
than the chosen Intel model, including older ones.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
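
To make the "is 0.4% significant?" question above concrete, a rough
sanity check over repeated runs might look like this. It is a sketch
and not a proper significance test: the two-standard-error threshold
and all function names are assumptions chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double mean(const double *x, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/* Unbiased sample variance; needs n >= 2. */
static double variance(const double *x, size_t n)
{
	double m = mean(x, n), acc = 0.0;

	for (size_t i = 0; i < n; i++)
		acc += (x[i] - m) * (x[i] - m);
	return acc / (double)(n - 1);
}

/*
 * Rough check: is the difference of the two means larger than about
 * two standard errors? Squares are compared to avoid needing sqrt().
 */
static bool difference_exceeds_noise(const double *a, size_t na,
				     const double *b, size_t nb)
{
	double diff = mean(a, na) - mean(b, nb);
	double se2 = variance(a, na) / (double)na +
		     variance(b, nb) / (double)nb;

	return diff * diff > 4.0 * se2;
}
```

Feeding it the per-run build times for lazy and eager FPU shows quickly
whether a sub-percent delta is distinguishable from run-to-run noise.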

Re: x-tier code injection for VMI

2015-04-21 Thread Jan Kiszka

On 2015-04-21 15:52, Jonas Jelten wrote:
> Hai *!
> 
> We [0] are developing x-tier [1], a VMI system that injects code into a
> kvm guest from the hypervisor.
> 
> Currently we're using kernel modules to be executed in the context of
> the VM. The execution is carefully separated from the target VM so the
> injection remains stealthy (as always, except for timing attacks).
> 
> Using this method, we could even redirect system calls from the
> hypervisor into a VM transparently[2]. Programs running on the host are
> obtaining their data from the guest stealthily that way :D
> 
> 
> What I want to ask the kvm folks:
> Would there be interest integrating the kernel components upstream?
> Mainly it would provide guest os-independent code injection.
> 
> All implementation is free software already [3][4], of course it needs a
> lot of polishing before going upstream ;)
> 
> The userspace part is a modified qemu [5], we're trying to move all the
> injection procedures into the kernel though. Work is in progress..

You may have to advertise your feature for a broader audience: What is
the added value, low level and from a higher perspective? Who may be
interested in it: research, real-world applications, and which kind?

Then, how invasive will the extension be, e.g. which performance impact
will it have for non-users, how much code is added to the kernel and how
many new interfaces (both very sensitive from maintenance and security
perspective)?

Have you already considered submitting a talk about it for the next KVM Forum
(http://events.linuxfoundation.org/events/kvm-forum/)?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [PATCH v2] KVM: SVM: Sync g_pat with guest-written PAT value

2015-04-21 Thread Jan Kiszka

On 2015-04-21 13:32, Paolo Bonzini wrote:
> Basically it's an optimization.  The guest can set the UC memory type on
> PCI BARs that are actually backed by RAM in QEMU, and then accesses to
> these BARs will be unnecessarily slow.  It would be particularly bad if,
> for example, access to ivshmem were slowed down because the guest PAT
> says the memory is uncacheable.

ivshmem is pv anyway - why shouldn't the guest driver take this room for
optimization into account and ask for a cached mapping?

Is that the only use case?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [PATCH v2] KVM: SVM: Sync g_pat with guest-written PAT value

2015-04-21 Thread Jan Kiszka

On 2015-04-21 13:09, Paolo Bonzini wrote:
> 
> 
> On 20/04/2015 19:25, Jan Kiszka wrote:
>> When hardware supports the g_pat VMCB field, we can use it for emulating
>> the PAT configuration that the guest configures by writing to the
>> corresponding MSR.
>>
>> Signed-off-by: Jan Kiszka 
> 
> I'm not sure about this.  The problem is that, unlike Intel, AMD has no
> way for the host to force its PAT value and ignore the guest's.  I'm
> worried about potential performance problems in the guest.

I think the guest needs to get what it requests - see my remark in
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/135271.

> 
> This is not as bad as on ARM, because the guest cannot disable the cache

You mean AMD, I guess.

> snooping protocol and thus cache coherency is guaranteed (see tables
> 7-10 and 15-20 in the AMD docs), but still I think I'd prefer having
> some knob (module parameter) to enable/disable gPAT.  It's okay to make
> it enabled by default.

I still don't get the scenario where we want to override the guest
settings. Maybe you can help out - would be valuable for the reasoning
in code or commit logs as well.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [RFC][PATCH] KVM: SVM: Sync g_pat with guest-written PAT value

2015-04-20 Thread Jan Kiszka

On 2015-04-20 20:33, Radim Krčmář wrote:
> 2015-04-20 19:45+0200, Jan Kiszka:
>> On 2015-04-20 19:37, Jan Kiszka wrote:
>>> On 2015-04-20 19:33, Radim Krčmář wrote:
>>>> 2015-04-20 19:21+0200, Jan Kiszka:
>>>>> On 2015-04-20 19:16, Radim Krčmář wrote:
>>>>>> 2015-04-20 18:14+0200, Radim Krčmář:
>>>>>>> Tested-by: Radim Krčmář 
>>>>>>
>>>>>> Uncached accesses were roughly 20x slower.
>>>>>> In case anyone wanted to reproduce, I used this as a kvm-unit-test:
>>>>>>
>>>>>> ---
>>>> | [code]
>>>>>
>>>>> Great, thanks. Will you push it to the unit tests? Could raise
>>>>> motivations to fix the !NPT/EPT case.
>>>>
>>>> It can't be included in `run_tests.sh`, because we intentionally ignore
>>>> PAT for normal RAM on VMX and the test does "fail" ...
>>>
>>> That ignoring is encoded into the EPT?
> 
> Yes, it's the VMX_EPT_IPAT_BIT.
> 
>> And do you also know why it is ignored on Intel? Side effects on the host?
> 
> I think it is an optimization exclusive to Intel.
> We know that the other side is not real hardware, which could avoid CPU
> caches when accessing memory, so there is little reason to slow the
> guest down.

If the guest pushes data for DMA into RAM, it may assume that it lands
there directly, without the need for explicit flushes, because it has
caching disabled - no?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [RFC][PATCH] KVM: SVM: Sync g_pat with guest-written PAT value

2015-04-20 Thread Jan Kiszka

On 2015-04-20 19:37, Jan Kiszka wrote:
> On 2015-04-20 19:33, Radim Krčmář wrote:
>> 2015-04-20 19:21+0200, Jan Kiszka:
>>> On 2015-04-20 19:16, Radim Krčmář wrote:
>>>> 2015-04-20 18:14+0200, Radim Krčmář:
>>>>> Tested-by: Radim Krčmář 
>>>>
>>>> Uncached accesses were roughly 20x slower.
>>>> In case anyone wanted to reproduce, I used this as a kvm-unit-test:
>>>>
>>>> ---
>> | [code]
>>>
>>> Great, thanks. Will you push it to the unit tests? Could raise
>>> motivations to fix the !NPT/EPT case.
>>
>> It can't be included in `run_tests.sh`, because we intentionally ignore
>> PAT for normal RAM on VMX and the test does "fail" ...
> 
> That ignoring is encoded into the EPT? Hmm... Maybe we can create a
> ivshmem device and use that as test target.

And do you also know why it is ignored on Intel? Side effects on the host?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [RFC][PATCH] KVM: SVM: Sync g_pat with guest-written PAT value

2015-04-20 Thread Jan Kiszka

On 2015-04-20 19:33, Radim Krčmář wrote:
> 2015-04-20 19:21+0200, Jan Kiszka:
>> On 2015-04-20 19:16, Radim Krčmář wrote:
>>> 2015-04-20 18:14+0200, Radim Krčmář:
>>>> Tested-by: Radim Krčmář 
>>>
>>> Uncached accesses were roughly 20x slower.
>>> In case anyone wanted to reproduce, I used this as a kvm-unit-test:
>>>
>>> ---
> | [code]
>>
>> Great, thanks. Will you push it to the unit tests? Could raise
>> motivations to fix the !NPT/EPT case.
> 
> It can't be included in `run_tests.sh`, because we intentionally ignore
> PAT for normal RAM on VMX and the test does "fail" ...

That ignoring is encoded into the EPT? Hmm... Maybe we can create a
ivshmem device and use that as test target.

> 
> I'll think how to make the test use fool-proof first, and also look how
> to fix the !NPT/EPT without affecting the case we care about too much.
> (And if we can do a similar trick with NPT.)
> 

OK.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: x2apic issues with Solaris and Xen guests

2015-04-20 Thread Jan Kiszka

On 2015-04-20 19:07, Stefan Hajnoczi wrote:
> I wonder whether the following two x2apic issues are related:
> 
> Solaris 10 U11 network doesn't work
> https://bugzilla.redhat.com/show_bug.cgi?id=1040500
> 
> kvm - fails to setup timer interrupt via io-apic
> (Thanks to Michael Tokarev for posting this link)
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=528077#68
> 
> It seems KVM's x2apic emulation works with regular Linux and Windows
> guests, but not necessarily with other OSes.

KVM's x2apic is kind of paravirtual - without VT-d interrupt remapping.
That may confuse the guest, though it should work. But Xen already
refuses to pick it according to the second report:

| (XEN) Not enabling x2APIC: depends on iommu_supports_eim.

> 
> Has anyone looked into this?

Not yet. Is there a handy reproduction guest image? Or maybe someone
would like to start with tracing what the guest and the host do.

Jan


[PATCH v2] KVM: SVM: Sync g_pat with guest-written PAT value

2015-04-20 Thread Jan Kiszka

When hardware supports the g_pat VMCB field, we can use it for emulating
the PAT configuration that the guest configures by writing to the
corresponding MSR.

Signed-off-by: Jan Kiszka 
---

Changes in v2:
 - add mark_dirty as found missing by Radim

 arch/x86/kvm/svm.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ce741b8..68fdddc 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3245,6 +3245,16 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_VM_IGNNE:
 		vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data);
 		break;
+	case MSR_IA32_CR_PAT:
+		if (npt_enabled) {
+			if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
+				return 1;
+			svm->vmcb->save.g_pat = data;
+			mark_dirty(svm->vmcb, VMCB_NPT);
+			vcpu->arch.pat = data;
+			break;
+		}
+		/* fall through */
 	default:
 		return kvm_set_msr_common(vcpu, msr);
 	}
-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
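
The patch above rejects bad values before storing them in g_pat. As a
sketch of what such a validity check amounts to for PAT, assuming the
architectural rule that each of the eight entries must hold one of the
defined memory types (0, 1, 4, 5, 6, 7): pat_value_valid is a
hypothetical stand-in, not the kernel's kvm_mtrr_valid().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the PAT part of the validity check. */
static bool pat_value_valid(uint64_t data)
{
	for (int i = 0; i < 8; i++) {
		unsigned int type = (data >> (8 * i)) & 0xff;

		/* Encodings 2, 3 and anything above 7 are reserved. */
		if (type > 7 || type == 2 || type == 3)
			return false;
	}
	return true;
}
```

Both values used by the unit test in this thread (0 and
0x0606060606060606) pass this check.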

Re: [RFC][PATCH] KVM: SVM: Sync g_pat with guest-written PAT value

2015-04-20 Thread Jan Kiszka

On 2015-04-20 19:16, Radim Krčmář wrote:
> 2015-04-20 18:14+0200, Radim Krčmář:
>> Tested-by: Radim Krčmář 
> 
> Uncached accesses were roughly 20x slower.
> In case anyone wanted to reproduce, I used this as a kvm-unit-test:
> 
> ---
> #include "processor.h"
> 
> #define NR_TOP_LOOPS 24
> #define NR_MEM_LOOPS 10
> #define MEM_ELEMENTS 1024
> 
> static volatile u64 pat_test_memory[MEM_ELEMENTS];
> 
> static void flush_tlb(void)
> {
>   write_cr3(read_cr3());
> }
> 
> static void set_pat(u64 val)
> {
>   wrmsr(0x277, val);
>   flush_tlb();
> 
> }
> 
> static u64 time_memory_accesses(void)
> {
>   u64 tsc_before = rdtsc();
> 
>   for (unsigned loop = 0; loop < NR_MEM_LOOPS; loop++)
>       for (unsigned i = 0; i < MEM_ELEMENTS; i++)
>           pat_test_memory[i]++;
> 
>   return rdtsc() - tsc_before;
> }
> 
> int main(int argc, char **argv)
> {
>   unsigned error = 0;
> 
>   for (unsigned loop = 0; loop < NR_TOP_LOOPS; loop++) {
>       u64 time_uc, time_wb;
> 
>       set_pat(0);
>       time_uc = time_memory_accesses();
> 
>       set_pat(0x0606060606060606ULL);
>       time_wb = time_memory_accesses();
> 
>       if (time_uc < time_wb * 4)
>           error++;
> 
>       printf("%02d uc: %10lld wb: %8lld\n", loop, time_uc, time_wb);
>   }
> 
>   report("guest PAT", !error);
> 
>   return report_summary();
> }
> 

Great, thanks. Will you push it to the unit tests? Could raise
motivations to fix the !NPT/EPT case.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: KVM: How does is PAT emulation supposed to work?

2015-04-17 Thread Jan Kiszka

On 2015-04-17 18:43, Radim Krčmář wrote:
> 2015-04-13 07:16+0200, Jan Kiszka:
>> PS: If someone has a good idea for a simple test case on machines
>> without IOMMU (like my current boxes), thus without a chance to use
>> device pass-through to stress guest PAT settings, I would be all ears.
> 
> Not a good one:  KVM sets VMX_EPT_IPAT_BIT for RAM unless
> kvm_arch_has_noncoherent_dma().  You can comment the line in
> vmx_get_mt_mask(), or call kvm_arch_register_noncoherent_dma(),
> for guest PAT to work on normal memory.

That's for VMX (where I do have IOMMUs), but I would need something for
AMD. :)

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
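
The VMX behaviour Radim describes (IPAT set for RAM unless there is
noncoherent DMA) can be modelled as below. This is a simplified sketch,
not the vmx_get_mt_mask() source: the function name is hypothetical and
the MMIO branch is an assumption added for completeness. The bit
positions follow the EPT entry layout (memory type in bits 5:3, ignore
PAT in bit 6).

```c
#include <stdbool.h>
#include <stdint.h>

#define MTRR_TYPE_UNCACHABLE	0
#define MTRR_TYPE_WRBACK	6
#define VMX_EPT_MT_EPTE_SHIFT	3		/* EPT entry bits 5:3 */
#define VMX_EPT_IPAT_BIT	(1ULL << 6)	/* "ignore PAT" */

/*
 * Simplified model of the memory-type choice described above.
 * guest_type is whatever type the guest's own settings would yield.
 */
static uint64_t ept_mt_mask(bool is_mmio, bool noncoherent_dma,
			    unsigned int guest_type)
{
	if (is_mmio)
		return (uint64_t)MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
	if (noncoherent_dma)
		return (uint64_t)guest_type << VMX_EPT_MT_EPTE_SHIFT;
	/* plain RAM: force write-back and make the CPU ignore guest PAT */
	return ((uint64_t)MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
	       VMX_EPT_IPAT_BIT;
}
```

Only the noncoherent-DMA case lets the guest's choice reach the
hardware, which is why a PAT test on ordinary RAM sees no slowdown on
VMX.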

Re: [PATCH] vt-x: Preserve host CR4.MCE value while in guest mode.

2015-04-16 Thread Jan Kiszka

On 2015-04-16 18:41, Benjamin Serebrin wrote:
> The host's decision to enable machine check exceptions should remain
> in force during non-root mode.  KVM was writing 0 to cr4 on VCPU reset
> and passed a slightly-modified 0 to the vmcs.guest_cr4 value.
> 
> Tested: Built.
> On earlier version, tested by injecting machine check while a guest is 
> spinning.
> Before the change, if guest CR4.MCE==0, then the machine check is
> escalated to Catastrophic Error (CATERR) and the machine dies.
> If guest CR4.MCE==1, then the machine check causes VMEXIT and is
> handled normally by host Linux. After the change, injecting a machine
> check causes normal Linux machine check handling.
> 
> Signed-off-by: Ben Serebrin 
> ---
>  arch/x86/kvm/vmx.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index f5e8dce..f7b6168 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -3622,8 +3622,16 @@ static void vmx_set_cr3(struct kvm_vcpu *vcpu,
> unsigned long cr3)
> 
>  static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  {
> - unsigned long hw_cr4 = cr4 | (to_vmx(vcpu)->rmode.vm86_active ?
> -KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);
> + /*
> + * Pass through host's Machine Check Enable value to hw_cr4, which
> + * is in force while we are in guest mode.  Do not let guests control
> + * this bit, even if host CR4.MCE == 0.
> + */
> + unsigned long hw_cr4 =
> + (cr4_read_shadow() & X86_CR4_MCE) |
> + (cr4 & ~X86_CR4_MCE) |
> + (to_vmx(vcpu)->rmode.vm86_active ?
> + KVM_RMODE_VM_CR4_ALWAYS_ON : KVM_PMODE_VM_CR4_ALWAYS_ON);

You lost most of your whitespaces - in the webmailer? ;)

Jan

> 
>   if (cr4 & X86_CR4_VMXE) {
>   /*
> 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
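
The bit manipulation in the quoted patch, written out as a standalone
function. This is a sketch with a hypothetical function name; the
always-on bits are passed in as a parameter rather than assuming the
values of KVM's RMODE/PMODE constants.

```c
#define X86_CR4_MCE	(1UL << 6)	/* Machine Check Enable, CR4 bit 6 */

/*
 * Hypothetical model of the computation in the patch: MCE is taken
 * from the host, every other bit from the guest's requested value,
 * plus the bits that are always forced on.
 */
static unsigned long hw_cr4_for_guest(unsigned long host_cr4,
				      unsigned long guest_cr4,
				      unsigned long always_on)
{
	return (host_cr4 & X86_CR4_MCE) |
	       (guest_cr4 & ~X86_CR4_MCE) |
	       always_on;
}
```

With this, a guest that clears or sets CR4.MCE has no effect on the
value that is in force while it runs.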

Re: SVM: vmload/vmsave-free VM exits?

2015-04-14 Thread Jan Kiszka

On 2015-04-14 08:39, Valentine Sinitsyn wrote:
> Hi all,
> 
> On 13.04.2015 22:41, Avi Kivity wrote:
>> On 04/13/2015 08:35 PM, Jan Kiszka wrote:
>>> On 2015-04-13 19:29, Avi Kivity wrote:
>>>> On 04/13/2015 10:01 AM, Jan Kiszka wrote:
>>>>> On 2015-04-07 07:43, Jan Kiszka wrote:
>>>>>> On 2015-04-05 19:12, Valentine Sinitsyn wrote:
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> On 05.04.2015 13:31, Jan Kiszka wrote:
>>>>>>>> studying the VM exit logic of Jailhouse, I was wondering when AMD's
>>>>>>>> vmload/vmsave can be avoided. Jailhouse as well as KVM currently use
>>>>>>>> these instructions unconditionally. However, I think both only need
>>>>>>>> GS.base, i.e. the per-cpu base address, to be saved and restored if
>>>>>>>> no user space exit or no CPU migration is involved (both is always
>>>>>>>> true for Jailhouse). Xen avoids vmload/vmsave on lightweight exits
>>>>>>>> but it also still uses rsp-based per-cpu variables.
>>>>>>>>
>>>>>>>> So the question boils down to what is generally faster:
>>>>>>>>
>>>>>>>> A) vmload
>>>>>>>>   vmrun
>>>>>>>>   vmsave
>>>>>>>>
>>>>>>>> B) wrmsrl(MSR_GS_BASE, guest_gs_base)
>>>>>>>>   vmrun
>>>>>>>>   rdmsrl(MSR_GS_BASE, guest_gs_base)
>>>>>>>>
>>>>>>>> Of course, KVM also has to take into account that heavyweight exits
>>>>>>>> still require vmload/vmsave, thus become more expensive with B)
>>>>>>>> due to
>>>>>>>> the additional MSR accesses.
>>>>>>>>
>>>>>>>> Any thoughts or results of previous experiments?
>>>>>>> That's a good question, I also thought about it when I was
>>>>>>> finalizing
>>>>>>> Jailhouse AMD port. I tried "lightweight exits" with apic-demo
>>>>>>> but it
>>>>>>> didn't seem to affect the latency in any noticeable way. That's
>>>>>>> why I
>>>>>>> decided not to push the patch (in fact, I was even unable to find it
>>>>>>> now).
>>>>>>>
>>>>>>> Note however that how AMD chips store host state during VM
>>>>>>> switches are
>>>>>>> implementation-specific. I did my quick experiments on one CPU
>>>>>>> only, so
>>>>>>> your mileage may vary.
>>>>>>>
>>>>>>> Regarding your question, I feel B will be faster anyways but again
>>>>>>> I'm
>>>>>>> afraid that the gain could be within statistical error of the
>>>>>>> experiment.
>>>>>> It is, at least 160 cycles with hot caches on an AMD A6-5200 APU,
>>>>>> more
>>>>>> towards 600 if they are colder (added some usleep to each loop in the
>>>>>> test).
>>>>>>
>>>>>> I've tested via vmmcall from guest userspace under Jailhouse. KVM
>>>>>> should
>>>>>> be adjustable in a similar way. Attached the benchmark, patch will
>>>>>> be in
>>>>>> the Jailhouse next branch soon. We need to check more CPU types,
>>>>>> though.
>>>>> Avi, I found some preparatory patches of yours from 2010 [1]. Do you
>>>>> happen to remember if it was never completed for a technical reason?
>>>> IIRC, I came to the conclusion that it was impossible.  Something about
>>>> TR.size not receiving a reasonable value.  Let me see.
>>> To my understanding, TR doesn't play a role until we leave ring 0 again.
>>> Or what could make the CPU look for any of the fields in the 64-bit TSS
>>> before that?
>>
>> Exceptions that utilize the IST.  I found a writeup [17] that describes
>> this, but I think it's even more impossible than that writeup implies.
> Pardon my slowness, but how does it affect Jailhouse running on AMD? For
> NMI, we do #VMEXIT, but we can disable IST (I'm not sure it's enabled
> already, in fact). Double faults don't cause #VMEXIT, so there is no
> VMLOAD/VMSAVE issue. I'm not sure about MCE, but for now they are sort
> of flawed in Jailhouse anyways IIRC.
> 
> What am I missing here?

Nothing. As I said in the other branch of this thread, Jailhouse is not
affected as it doesn't use the IST. Only KVM is because Linux - in host
mode - requires it for the cases Avi mentioned.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: SVM: vmload/vmsave-free VM exits?

2015-04-13 Thread Jan Kiszka

On 2015-04-13 20:07, Avi Kivity wrote:
> On 04/13/2015 08:57 PM, Jan Kiszka wrote:
>> On 2015-04-13 19:48, Avi Kivity wrote:
>>> I think that Xen does (or did) something along the lines of disabling
>>> IST usage (by playing with the descriptors in the IDT) and then
>>> re-enabling them when exiting to userspace.
>> So we would reuse that active stack for the current IST users until
>> then.
> 
> Yes.
> 
>> But I bet there are subtle details that prevent a simple switch at
>> IDT level. Hmm, no low-hanging fruit it seems...
> 
> 
> For sure. It's not insurmountable, but fairly hard.
> 
>>>
>>>> [17] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/26712/
>> That thread proposed the complete IST removal. But, given that we still
>> have it 7 years later,
> 
> Well, it's not as if a crack team of kernel hackers was laboring night
> and day to remove it, but...
> 
>>   I suppose that was not very welcome in general.
> 
> Simply removing it is impossible, or an NMI happening immediately after
> SYSCALL will hit user-provided %rsp.
> 
>> Thanks,
>> Jan
>>
>> PS: For the Jailhouse readers: we don't use IST.
>>
> 
> You don't have userspace, yes?  Only guests?

Exactly. The day someone adds userspace, I guess I'll have to create a
new hypervisor.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: SVM: vmload/vmsave-free VM exits?

2015-04-13 Thread Jan Kiszka

On 2015-04-13 19:48, Avi Kivity wrote:
> On 04/13/2015 08:41 PM, Avi Kivity wrote:
>> On 04/13/2015 08:35 PM, Jan Kiszka wrote:
>>> On 2015-04-13 19:29, Avi Kivity wrote:
>>>> On 04/13/2015 10:01 AM, Jan Kiszka wrote:
>>>>> On 2015-04-07 07:43, Jan Kiszka wrote:
>>>>>> On 2015-04-05 19:12, Valentine Sinitsyn wrote:
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> On 05.04.2015 13:31, Jan Kiszka wrote:
>>>>>>>> studying the VM exit logic of Jailhouse, I was wondering when AMD's
>>>>>>>> vmload/vmsave can be avoided. Jailhouse as well as KVM currently use
>>>>>>>> these instructions unconditionally. However, I think both only need
>>>>>>>> GS.base, i.e. the per-cpu base address, to be saved and restored if no
>>>>>>>> user space exit or no CPU migration is involved (both is always true for
>>>>>>>> Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also
>>>>>>>> still uses rsp-based per-cpu variables.
>>>>>>>>
>>>>>>>> So the question boils down to what is generally faster:
>>>>>>>>
>>>>>>>> A) vmload
>>>>>>>>    vmrun
>>>>>>>>    vmsave
>>>>>>>>
>>>>>>>> B) wrmsrl(MSR_GS_BASE, guest_gs_base)
>>>>>>>>    vmrun
>>>>>>>>    rdmsrl(MSR_GS_BASE, guest_gs_base)
>>>>>>>>
>>>>>>>> Of course, KVM also has to take into account that heavyweight exits
>>>>>>>> still require vmload/vmsave, thus become more expensive with B) due to
>>>>>>>> the additional MSR accesses.
>>>>>>>>
>>>>>>>> Any thoughts or results of previous experiments?
>>>>>>> That's a good question, I also thought about it when I was finalizing
>>>>>>> Jailhouse AMD port. I tried "lightweight exits" with apic-demo but it
>>>>>>> didn't seem to affect the latency in any noticeable way. That's why I
>>>>>>> decided not to push the patch (in fact, I was even unable to find it now).
>>>>>>>
>>>>>>> Note however that how AMD chips store host state during VM switches are
>>>>>>> implementation-specific. I did my quick experiments on one CPU only, so
>>>>>>> your mileage may vary.
>>>>>>>
>>>>>>> Regarding your question, I feel B will be faster anyways but again I'm
>>>>>>> afraid that the gain could be within statistical error of the experiment.
>>>>>> It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more
>>>>>> towards 600 if they are colder (added some usleep to each loop in the test).
>>>>>>
>>>>>> I've tested via vmmcall from guest userspace under Jailhouse. KVM should
>>>>>> be adjustable in a similar way. Attached the benchmark, patch will be in
>>>>>> the Jailhouse next branch soon. We need to check more CPU types, though.
>>>>> Avi, I found some preparatory patches of yours from 2010 [1]. Do you
>>>>> happen to remember if it was never completed for a technical reason?
>>>> IIRC, I came to the conclusion that it was impossible. Something about
>>>> TR.size not receiving a reasonable value.  Let me see.
>>> To my understanding, TR doesn't play a role until we leave ring 0 again.
>>> Or what could make the CPU look for any of the fields in the 64-bit TSS
>>> before that?
>>
>> Exceptions that utilize the IST.  I found a writeup [17] that
>> describes this, but I think it's even more impossible than that
>> writeup implies.
>>
> 
> I think that Xen does (or did) something along the lines of disabling
> IST usage (by playing with the descriptors in the IDT) and then
> re-enabling them when exiting to userspace.

So we would reuse that active stack for the current IST users until
then. But I bet there are subtle details that prevent a simple switch at
IDT level. Hmm, no low-hanging fruit it seems...
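To make "a switch at IDT level" concrete: the IST index lives in the low three bits of the fifth byte of each 64-bit IDT gate, and an index of 0 means no IST stack is selected for that vector. The sketch below only shows that field manipulation. The struct and helper names are invented for illustration, and this is not the code Xen uses.

```c
#include <stdint.h>

/* 64-bit IDT gate descriptor layout (16 bytes) as defined by the architecture. */
struct idt_gate64 {
	uint16_t offset_low;
	uint16_t selector;
	uint8_t  ist;         /* bits 0-2: IST index, bits 3-7: reserved */
	uint8_t  type_attr;
	uint16_t offset_mid;
	uint32_t offset_high;
	uint32_t reserved;
};

/* Hypothetical helper: select IST stack 1-7, or 0 for no IST stack. */
static void gate_set_ist(struct idt_gate64 *gate, unsigned int ist_index)
{
	gate->ist = (uint8_t)((gate->ist & ~0x7u) | (ist_index & 0x7u));
}

/* Hypothetical helper: read back the IST index of a gate. */
static unsigned int gate_get_ist(const struct idt_gate64 *gate)
{
	return gate->ist & 0x7u;
}
```

Flipping the index is the easy part. The subtle details are everything around it, e.g. which stack the handler ends up on when the vector fires between the flip and the next exit to userspace.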

> 
> 
>> [17] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/26712/

That thread proposed the complete IST removal. But, given that we still
have it 7 years later, I suppose that was not very welcome in general.

Thanks,
Jan

PS: For the Jailhouse readers: we don't use IST.

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: SVM: vmload/vmsave-free VM exits?

2015-04-13 Thread Jan Kiszka

On 2015-04-13 19:29, Avi Kivity wrote:
> On 04/13/2015 10:01 AM, Jan Kiszka wrote:
>> On 2015-04-07 07:43, Jan Kiszka wrote:
>>> On 2015-04-05 19:12, Valentine Sinitsyn wrote:
>>>> Hi Jan,
>>>>
>>>> On 05.04.2015 13:31, Jan Kiszka wrote:
>>>>> studying the VM exit logic of Jailhouse, I was wondering when AMD's
>>>>> vmload/vmsave can be avoided. Jailhouse as well as KVM currently use
>>>>> these instructions unconditionally. However, I think both only need
>>>>> GS.base, i.e. the per-cpu base address, to be saved and restored if no
>>>>> user space exit or no CPU migration is involved (both is always true for
>>>>> Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also
>>>>> still uses rsp-based per-cpu variables.
>>>>>
>>>>> So the question boils down to what is generally faster:
>>>>>
>>>>> A) vmload
>>>>>  vmrun
>>>>>  vmsave
>>>>>
>>>>> B) wrmsrl(MSR_GS_BASE, guest_gs_base)
>>>>>  vmrun
>>>>>  rdmsrl(MSR_GS_BASE, guest_gs_base)
>>>>>
>>>>> Of course, KVM also has to take into account that heavyweight exits
>>>>> still require vmload/vmsave, thus become more expensive with B) due to
>>>>> the additional MSR accesses.
>>>>>
>>>>> Any thoughts or results of previous experiments?
>>>> That's a good question, I also thought about it when I was finalizing
>>>> Jailhouse AMD port. I tried "lightweight exits" with apic-demo but it
>>>> didn't seem to affect the latency in any noticeable way. That's why I
>>>> decided not to push the patch (in fact, I was even unable to find it
>>>> now).
>>>>
>>>> Note however that how AMD chips store host state during VM switches are
>>>> implementation-specific. I did my quick experiments on one CPU only, so
>>>> your mileage may vary.
>>>>
>>>> Regarding your question, I feel B will be faster anyways but again I'm
>>>> afraid that the gain could be within statistical error of the
>>>> experiment.
>>> It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more
>>> towards 600 if they are colder (added some usleep to each loop in the
>>> test).
>>>
>>> I've tested via vmmcall from guest userspace under Jailhouse. KVM should
>>> be adjustable in a similar way. Attached the benchmark, patch will be in
>>> the Jailhouse next branch soon. We need to check more CPU types, though.
>> Avi, I found some preparatory patches of yours from 2010 [1]. Do you
>> happen to remember if it was never completed for a technical reason?
> 
> IIRC, I came to the conclusion that it was impossible.  Something about
> TR.size not receiving a reasonable value.  Let me see.

To my understanding, TR doesn't play a role until we leave ring 0 again.
Or what could make the CPU look for any of the fields in the 64-bit TSS
before that?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: SVM: vmload/vmsave-free VM exits?

2015-04-13 Thread Jan Kiszka

On 2015-04-07 07:43, Jan Kiszka wrote:
> On 2015-04-05 19:12, Valentine Sinitsyn wrote:
>> Hi Jan,
>>
>> On 05.04.2015 13:31, Jan Kiszka wrote:
>>> studying the VM exit logic of Jailhouse, I was wondering when AMD's
>>> vmload/vmsave can be avoided. Jailhouse as well as KVM currently use
>>> these instructions unconditionally. However, I think both only need
>>> GS.base, i.e. the per-cpu base address, to be saved and restored if no
>>> user space exit or no CPU migration is involved (both is always true for
>>> Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also
>>> still uses rsp-based per-cpu variables.
>>>
>>> So the question boils down to what is generally faster:
>>>
>>> A) vmload
>>> vmrun
>>> vmsave
>>>
>>> B) wrmsrl(MSR_GS_BASE, guest_gs_base)
>>> vmrun
>>> rdmsrl(MSR_GS_BASE, guest_gs_base)
>>>
>>> Of course, KVM also has to take into account that heavyweight exits
>>> still require vmload/vmsave, thus become more expensive with B) due to
>>> the additional MSR accesses.
>>>
>>> Any thoughts or results of previous experiments?
>> That's a good question, I also thought about it when I was finalizing
>> Jailhouse AMD port. I tried "lightweight exits" with apic-demo but it
>> didn't seem to affect the latency in any noticeable way. That's why I
>> decided not to push the patch (in fact, I was even unable to find it now).
>>
>> Note however that how AMD chips store host state during VM switches are
>> implementation-specific. I did my quick experiments on one CPU only, so
>> your mileage may vary.
>>
>> Regarding your question, I feel B will be faster anyways but again I'm
>> afraid that the gain could be within statistical error of the experiment.
> 
> It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more
> towards 600 if they are colder (added some usleep to each loop in the test).
> 
> I've tested via vmmcall from guest userspace under Jailhouse. KVM should
> be adjustable in a similar way. Attached the benchmark, patch will be in
> the Jailhouse next branch soon. We need to check more CPU types, though.

Avi, I found some preparatory patches of yours from 2010 [1]. Do you
happen to remember if it was never completed for a technical reason?

Joel, can you comment on the benefit of variant B) for the various AMD
CPUs? Is it always positive?

Thanks,
Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61455

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

[RFC][PATCH] KVM: SVM: Sync g_pat with guest-written PAT value

2015-04-12 Thread Jan Kiszka

When hardware supports the g_pat VMCB field, we can use it for emulating
the PAT configuration that the guest configures by writing to the
corresponding MSR.

Signed-off-by: Jan Kiszka 
---

RFC because it is only compile-tested.

 arch/x86/kvm/svm.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ce741b8..9439c6c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3245,6 +3245,15 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_VM_IGNNE:
 		vcpu_unimpl(vcpu, "unimplemented wrmsr: 0x%x data 0x%llx\n", ecx, data);
 		break;
+	case MSR_IA32_CR_PAT:
+		if (npt_enabled) {
+			if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
+				return 1;
+			svm->vmcb->save.g_pat = data;
+			vcpu->arch.pat = data;
+			break;
+		}
+		/* fall through */
 	default:
 		return kvm_set_msr_common(vcpu, msr);
 	}
-- 
2.1.4
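The patch relies on kvm_mtrr_valid() to reject bad PAT values before they reach g_pat. As a rough illustration of what such a check has to enforce (this is a standalone sketch with a made-up name, not the KVM function), each of the eight byte-wide PAT entries must hold one of the defined memory type encodings:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validity check for a PAT MSR value: each of the eight
 * byte-wide entries must encode one of the architecturally defined
 * memory types 0 (UC), 1 (WC), 4 (WT), 5 (WP), 6 (WB) or 7 (UC-).
 */
static bool pat_value_is_valid(uint64_t pat)
{
	for (int i = 0; i < 8; i++) {
		uint8_t entry = (pat >> (i * 8)) & 0xff;

		if (entry > 7 || entry == 2 || entry == 3)
			return false;
	}
	return true;
}
```

The power-on PAT value 0x0007040600070406 passes this check, while a value with a reserved type (2 or 3) in any entry, or with any of the upper five bits of an entry set, does not.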

KVM: How is PAT emulation supposed to work?

2015-04-12 Thread Jan Kiszka

Hi all,

while digging into the PAT topic for Jailhouse, I also wondered how KVM
deals with it. And I still don't fully understand it - or there is a bug:

KVM intercepts all guest writes to the PAT MSR and instead keeps the
guest value in vcpu->arch.pat. But, besides returning that value back on
read accesses, arch.pat has no other purpose.

On Intel, we only seem to have proper emulation - through hardware -
when VMX supports PAT switching (see vmx_set_msr). On AMD, the situation
is even worse as the g_pat save field is not updated at all on PAT
writes. That seems to be low-hanging fruit for bringing svm to the same
support level as vmx.

Or am I missing something?

Jan

PS: If someone has a good idea for a simple test case on machines
without IOMMU (like my current boxes), thus without a chance to use
device pass-through to stress guest PAT settings, I would be all ears.
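For anyone who wants to look at the value a guest wrote (i.e. what ends up in vcpu->arch.pat), here is a small decoding sketch. The helper names are made up for this mail. The layout, eight byte-wide entries PA0..PA7 with the memory type in the low three bits, is the architectural one.

```c
#include <stdint.h>

/* Name of a PAT memory type encoding; reserved encodings yield "reserved". */
static const char *pat_type_name(uint8_t type)
{
	switch (type) {
	case 0: return "UC";
	case 1: return "WC";
	case 4: return "WT";
	case 5: return "WP";
	case 6: return "WB";
	case 7: return "UC-";
	default: return "reserved";
	}
}

/* Extract the memory type of entry PA0..PA7 from a PAT MSR value. */
static uint8_t pat_entry(uint64_t pat, unsigned int index)
{
	return (pat >> (index * 8)) & 0x7;
}
```

Applied to the power-on value 0x0007040600070406, entries PA0..PA3 decode to WB, WT, UC- and UC, and PA4..PA7 repeat that pattern.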

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: x86: Question regarding the reset value of LINT0

2015-04-08 Thread Jan Kiszka

On 2015-04-08 19:40, Nadav Amit wrote:
> Jan Kiszka  wrote:
> 
>> On 2015-04-08 18:59, Nadav Amit wrote:
>>> Jan Kiszka  wrote:
>>>
>>>> On 2015-04-08 18:40, Nadav Amit wrote:
>>>>> Hi,
>>>>>
>>>>> I would appreciate if someone explains the reason for enabling LINT0 during
>>>>> APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local
>>>>> Vector Table” that says all LVT registers are reset to 0x1.
>>>>>
>>>>> In kvm_lapic_reset, I see:
>>>>>
>>>>>   apic_set_reg(apic, APIC_LVT0,
>>>>>   SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
>>>>>
>>>>> Which is actually pretty similar to QEMU’s apic_reset_common:
>>>>>
>>>>>   if (bsp) {
>>>>>   /*
>>>>>* LINT0 delivery mode on CPU #0 is set to ExtInt at initialization
>>>>>* time typically by BIOS, so PIC interrupt can be delivered to the
>>>>>* processor when local APIC is enabled.
>>>>>*/
>>>>>   s->lvt[APIC_LVT_LINT0] = 0x700;
>>>>>   }
>>>>>
>>>>> Yet, in both cases, I miss the point - if it is typically done by the BIOS,
>>>>> why does QEMU or KVM enable it?
>>>>>
>>>>> BTW: KVM seems to run fine without it, and I think setting it causes me
>>>>> problems in certain cases.
>>>>
>>>> I suspect it has some historic BIOS backgrounds. Already tried to find
>>>> more information in the git logs of both code bases? Or something that
>>>> indicates of SeaBIOS or BochsBIOS once didn't do this initialization?
>>> Thanks. I found no indication of such thing.
>>>
>>> QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says:
>>>
>>>Don't route PIC interrupts through the local APIC if the local APIC
>>>config says so. By Ari Kivity.
>>>
>>> Maybe Avi Kivity knows this guy.
>>
>> ths? That should have been Thiemo Seufer (IIRC), but he just committed
>> the code back then (and is no longer with us, sadly).
> Oh… I am sorry - I didn’t know about that.. (I tried to make an unfunny joke
> about Avi knowing “Ari”).

Ah. No problem. My brain apparently fixed that typo up unnoticed.

> 
>> But if that commit went in without any BIOS changes around it, QEMU
>> simply had to do the job of the latter to keep things working.
> So should I leave it as is? Can I at least disable in KVM during INIT (and
> leave it as is for RESET)?

No, I don't think there is a need to leave this inaccurate for QEMU if
our included BIOS gets it right. I don't know what the backward
bug-compatibility of KVM is, though. Maybe you can identify since when
our BIOS is fine so that we can discuss time frames.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: x86: Question regarding the reset value of LINT0

2015-04-08 Thread Jan Kiszka

On 2015-04-08 18:59, Nadav Amit wrote:
> Jan Kiszka  wrote:
> 
>> On 2015-04-08 18:40, Nadav Amit wrote:
>>> Hi,
>>>
>>> I would appreciate if someone explains the reason for enabling LINT0 during
>>> APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local
>>> Vector Table” that says all LVT registers are reset to 0x1.
>>>
>>> In kvm_lapic_reset, I see:
>>>
>>> apic_set_reg(apic, APIC_LVT0,
>>> SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
>>>
>>> Which is actually pretty similar to QEMU’s apic_reset_common:
>>>
>>>if (bsp) {
>>>/*
>>> * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization
>>> * time typically by BIOS, so PIC interrupt can be delivered to the
>>> * processor when local APIC is enabled.
>>> */
>>>s->lvt[APIC_LVT_LINT0] = 0x700;
>>>}
>>>
>>> Yet, in both cases, I miss the point - if it is typically done by the BIOS,
>>> why does QEMU or KVM enable it?
>>>
>>> BTW: KVM seems to run fine without it, and I think setting it causes me
>>> problems in certain cases.
>>
>> I suspect it has some historic BIOS backgrounds. Already tried to find
>> more information in the git logs of both code bases? Or something that
>> indicates of SeaBIOS or BochsBIOS once didn't do this initialization?
> Thanks. I found no indication of such thing.
> 
> QEMU’s commit message (0e21e12bb311c4c1095d0269dc2ef81196ccb60a) says:
> 
> Don't route PIC interrupts through the local APIC if the local APIC
> config says so. By Ari Kivity.
> 
> Maybe Avi Kivity knows this guy.

ths? That should have been Thiemo Seufer (IIRC), but he just committed
the code back then (and is no longer with us, sadly).

But if that commit went in without any BIOS changes around it, QEMU
simply had to do the job of the latter to keep things working.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: x86: Question regarding the reset value of LINT0

2015-04-08 Thread Jan Kiszka

On 2015-04-08 18:40, Nadav Amit wrote:
> Hi,
> 
> I would appreciate if someone explains the reason for enabling LINT0 during
> APIC reset. This does not correspond with Intel SDM Figure 10-8: “Local
> Vector Table” that says all LVT registers are reset to 0x1.
> 
> In kvm_lapic_reset, I see:
> 
>   apic_set_reg(apic, APIC_LVT0,
>   SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
> 
> Which is actually pretty similar to QEMU’s apic_reset_common:
> 
> if (bsp) {
> /*
>  * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization
>  * time typically by BIOS, so PIC interrupt can be delivered to the
>  * processor when local APIC is enabled.
>  */
> s->lvt[APIC_LVT_LINT0] = 0x700;
> }
> 
> Yet, in both cases, I miss the point - if it is typically done by the BIOS,
> why does QEMU or KVM enable it?
> 
> BTW: KVM seems to run fine without it, and I think setting it causes me
> problems in certain cases.

I suspect it has some historic BIOS background. Have you already tried to
find more information in the git logs of both code bases? Or something that
indicates whether SeaBIOS or BochsBIOS once didn't do this initialization?

Jan
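For reference, this is how the two values under discussion decode. The macro and function names below are invented for this sketch. The bit positions are the architectural ones: delivery mode in bits 8-10, mask in bit 16, and ExtINT encoded as 111b.

```c
#include <stdbool.h>
#include <stdint.h>

#define LVT_DELIVERY_MODE_SHIFT  8
#define LVT_DELIVERY_MODE_MASK   (0x7u << LVT_DELIVERY_MODE_SHIFT)
#define LVT_MASKED               (1u << 16)
#define DELIVERY_MODE_EXTINT     0x7u

/* Return the delivery mode field (bits 8-10) of an LVT register value. */
static unsigned int lvt_delivery_mode(uint32_t lvt)
{
	return (lvt & LVT_DELIVERY_MODE_MASK) >> LVT_DELIVERY_MODE_SHIFT;
}

/* Return true if the mask bit (bit 16) of an LVT register value is set. */
static bool lvt_is_masked(uint32_t lvt)
{
	return (lvt & LVT_MASKED) != 0;
}
```

So 0x700 is an unmasked ExtINT entry, which lets PIC interrupts through right after reset, while an entry with only bit 16 set (0x10000) is masked and delivers nothing until software programs it.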

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: SVM: vmload/vmsave-free VM exits?

2015-04-06 Thread Jan Kiszka

On 2015-04-07 08:29, Valentine Sinitsyn wrote:
> On 07.04.2015 11:23, Jan Kiszka wrote:
>> On 2015-04-07 08:19, Valentine Sinitsyn wrote:
>>> On 07.04.2015 11:13, Jan Kiszka wrote:
>>>>>> It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more
>>>>>> towards 600 if they are colder (added some usleep to each loop in the
>>>>>> test).
>>>>> Great, thanks. Could you post absolute numbers, i.e how long do A and B
>>>>> take on your CPU?
>>>>
>>>> A is around 1910 cycles, B about 1750.
>>> It's with hot caches I guess? Not bad anyways, it's a pity I didn't
>>> observe this and didn't include this optimization from the day one.
>>
>> Yes, that is with the unmodified benchmark I sent. When I add, say
>> usleep(1000) to that loop body, the cycles jumped to 4k (IIRC).
>>
>> BTW, this is the Jailhouse patch:
>> https://github.com/siemens/jailhouse/commit/dbf2fe479ac07a677462dfa87e008e37a4e72858
>>
> I guess, it's getting off-topic here, but wouldn't it be cleaner to
> simply use wrmsr and rdmsr instead of vmload and vmsave in svm-vmexit.S?
> This would require less changes and will keep all entry/exit setup code
> in one place.

It's a tradeoff between assembly lines and C statements. My feeling is
that it's easier done in C, but you can prove me wrong.

Jan

Re: SVM: vmload/vmsave-free VM exits?

2015-04-06 Thread Jan Kiszka

On 2015-04-07 08:19, Valentine Sinitsyn wrote:
> On 07.04.2015 11:13, Jan Kiszka wrote:
>>>> It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more
>>>> towards 600 if they are colder (added some usleep to each loop in the
>>>> test).
>>> Great, thanks. Could you post absolute numbers, i.e how long do A and B
>>> take on your CPU?
>>
>> A is around 1910 cycles, B about 1750.
> It's with hot caches I guess? Not bad anyways, it's a pity I didn't
> observe this and didn't include this optimization from the day one.

Yes, that is with the unmodified benchmark I sent. When I added, say,
usleep(1000) to that loop body, the cycles jumped to 4k (IIRC).

BTW, this is the Jailhouse patch:
https://github.com/siemens/jailhouse/commit/dbf2fe479ac07a677462dfa87e008e37a4e72858

Jan


Re: SVM: vmload/vmsave-free VM exits?

2015-04-06 Thread Jan Kiszka

On 2015-04-07 08:10, Valentine Sinitsyn wrote:
> Hi Jan,
> 
> On 07.04.2015 10:43, Jan Kiszka wrote:
>> On 2015-04-05 19:12, Valentine Sinitsyn wrote:
>>> Hi Jan,
>>>
>>> On 05.04.2015 13:31, Jan Kiszka wrote:
>>>> studying the VM exit logic of Jailhouse, I was wondering when AMD's
>>>> vmload/vmsave can be avoided. Jailhouse as well as KVM currently use
>>>> these instructions unconditionally. However, I think both only need
>>>> GS.base, i.e. the per-cpu base address, to be saved and restored if no
>>>> user space exit or no CPU migration is involved (both is always true for
>>>> Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also
>>>> still uses rsp-based per-cpu variables.
>>>>
>>>> So the question boils down to what is generally faster:
>>>>
>>>> A) vmload
>>>>  vmrun
>>>>  vmsave
>>>>
>>>> B) wrmsrl(MSR_GS_BASE, guest_gs_base)
>>>>  vmrun
>>>>  rdmsrl(MSR_GS_BASE, guest_gs_base)
>>>>
>>>> Of course, KVM also has to take into account that heavyweight exits
>>>> still require vmload/vmsave, thus become more expensive with B) due to
>>>> the additional MSR accesses.
>>>>
>>>> Any thoughts or results of previous experiments?
>>> That's a good question, I also thought about it when I was finalizing
>>> Jailhouse AMD port. I tried "lightweight exits" with apic-demo but it
>>> didn't seem to affect the latency in any noticeable way. That's why I
>>> decided not to push the patch (in fact, I was even unable to find it
>>> now).
>>>
>>> Note however that how AMD chips store host state during VM switches are
>>> implementation-specific. I did my quick experiments on one CPU only, so
>>> your mileage may vary.
>>>
>>> Regarding your question, I feel B will be faster anyways but again I'm
>>> afraid that the gain could be within statistical error of the
>>> experiment.
>>
>> It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more
>> towards 600 if they are colder (added some usleep to each loop in the
>> test).
> Great, thanks. Could you post absolute numbers, i.e how long do A and B
> take on your CPU?

A is around 1910 cycles, B about 1750.

Jan
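To put these two numbers in perspective, here is the arithmetic as a small sketch. The cycle counts are the ones above. The function names and the heavy_penalty parameter are hypothetical, the latter standing in for the extra MSR accesses a heavyweight exit would pay under B, which nobody has measured yet.

```c
/* Cycle counts reported above for an AMD A6-5200 APU with hot caches. */
#define CYCLES_VARIANT_A  1910ULL  /* vmload / vmrun / vmsave */
#define CYCLES_VARIANT_B  1750ULL  /* wrmsr GS.base / vmrun / rdmsr GS.base */

/* Cycles saved per lightweight exit by switching from A to B. */
static unsigned long long cycles_saved(void)
{
	return CYCLES_VARIANT_A - CYCLES_VARIANT_B;
}

/* Saving in tenths of a percent of variant A, integer math only. */
static unsigned long long saving_permille(void)
{
	return cycles_saved() * 1000 / CYCLES_VARIANT_A;
}

/*
 * Net gain in cycles per 1000 exits when `heavy` of them are heavyweight
 * and each of those pays `heavy_penalty` extra cycles under variant B.
 * heavy_penalty is a hypothetical input, not a measured value.
 */
static long long net_gain_per_1000(unsigned int heavy, long long heavy_penalty)
{
	long long light = 1000 - (long long)heavy;

	return light * (long long)cycles_saved() - (long long)heavy * heavy_penalty;
}
```

That is 160 cycles or roughly 8% per exit. For Jailhouse, which has no heavyweight exits, the gain is unconditional. For KVM the sign depends on the exit mix and on the real penalty.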


Re: SVM: vmload/vmsave-free VM exits?

2015-04-06 Thread Jan Kiszka

On 2015-04-05 19:12, Valentine Sinitsyn wrote:
> Hi Jan,
> 
> On 05.04.2015 13:31, Jan Kiszka wrote:
>> studying the VM exit logic of Jailhouse, I was wondering when AMD's
>> vmload/vmsave can be avoided. Jailhouse as well as KVM currently use
>> these instructions unconditionally. However, I think both only need
>> GS.base, i.e. the per-cpu base address, to be saved and restored if no
>> user space exit or no CPU migration is involved (both is always true for
>> Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also
>> still uses rsp-based per-cpu variables.
>>
>> So the question boils down to what is generally faster:
>>
>> A) vmload
>> vmrun
>> vmsave
>>
>> B) wrmsrl(MSR_GS_BASE, guest_gs_base)
>> vmrun
>> rdmsrl(MSR_GS_BASE, guest_gs_base)
>>
>> Of course, KVM also has to take into account that heavyweight exits
>> still require vmload/vmsave, thus become more expensive with B) due to
>> the additional MSR accesses.
>>
>> Any thoughts or results of previous experiments?
> That's a good question, I also thought about it when I was finalizing
> Jailhouse AMD port. I tried "lightweight exits" with apic-demo but it
> didn't seem to affect the latency in any noticeable way. That's why I
> decided not to push the patch (in fact, I was even unable to find it now).
> 
> Note however that how AMD chips store host state during VM switches are
> implementation-specific. I did my quick experiments on one CPU only, so
> your mileage may vary.
> 
> Regarding your question, I feel B will be faster anyways but again I'm
> afraid that the gain could be within statistical error of the experiment.

It is, at least 160 cycles with hot caches on an AMD A6-5200 APU, more
towards 600 if they are colder (added some usleep to each loop in the test).

I've tested via vmmcall from guest userspace under Jailhouse. KVM should
be adjustable in a similar way. Attached the benchmark, patch will be in
the Jailhouse next branch soon. We need to check more CPU types, though.

Jan

/*
 * VM exit benchmark using a hypercall
 *
 * Copyright (c) Siemens AG, 2015
 *
 * Authors:
 *  Jan Kiszka 
 *
 * This work is licensed under the terms of the GNU GPL, version 2.  See
 * the COPYING file in the top-level directory.
 */

#ifndef __x86_64__
#error only x86-64 supported
#endif

#include <stdbool.h>
#include <stdio.h>

#define LOOPS			100

#define X86_FEATURE_VMX		(1UL << 5)

static inline unsigned long cpuid_ecx(void)
{
	unsigned long val;
	unsigned int eax = 1;

	/* cpuid overwrites eax, so it has to be an in/out operand */
	asm volatile("cpuid" : "+a" (eax), "=c" (val) : : "ebx", "edx");
	return val;
}

static inline __attribute__((always_inline)) unsigned long long read_tsc(void)
{
	unsigned long long hi, lo;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return (hi << 32) | lo;
}

int main(int argc, char *argv[])
{
	bool use_vmcall = !!(cpuid_ecx() & X86_FEATURE_VMX);
	unsigned long long start, sum = 0;
	unsigned int n;

	for (n = 0; n < LOOPS; n++) {
		if (use_vmcall) {
			start = read_tsc();
			asm volatile("vmcall" : : "a" (-1));
			sum += read_tsc() - start;
		} else {
			start = read_tsc();
			asm volatile("vmmcall" : : "a" (-1));
			sum += read_tsc() - start;
		}
	}
	printf("Null hypercall: %llu cycles\n", sum / LOOPS);

	return 0;
}
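The colder-cache numbers came from adding a usleep to the loop. A generalized version of that loop might look like the sketch below. average_cycles() and counted_nop() are names made up here, and the operation under test is passed in as a function pointer, so the harness can be tried outside a hypervisor as well.

```c
#define _DEFAULT_SOURCE  /* for usleep() with strict -std= settings */

#ifndef __x86_64__
#error only x86-64 supported
#endif

#include <unistd.h>

static inline unsigned long long read_tsc(void)
{
	unsigned int hi, lo;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

/* Example operation: does nothing except count how often it was called. */
static unsigned int op_calls;

static void counted_nop(void)
{
	op_calls++;
}

/*
 * Hypothetical harness: time `op` over `loops` iterations and return the
 * average cycle count. A non-zero `sleep_us` lets the caches cool down
 * between iterations, like the usleep variant mentioned above.
 */
static unsigned long long average_cycles(void (*op)(void), unsigned int loops,
					 unsigned int sleep_us)
{
	unsigned long long start, sum = 0;
	unsigned int n;

	for (n = 0; n < loops; n++) {
		if (sleep_us)
			usleep(sleep_us);
		start = read_tsc();
		op();
		sum += read_tsc() - start;
	}
	return loops ? sum / loops : 0;
}
```

Calling average_cycles(counted_nop, 100, 0) gives the hot-cache baseline of the harness itself, and passing 1000 as sleep_us gives the colder variant. For the real measurement, op would wrap the vmcall or vmmcall instruction.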

SVM: vmload/vmsave-free VM exits?

2015-04-05 Thread Jan Kiszka

Hi,

studying the VM exit logic of Jailhouse, I was wondering when AMD's
vmload/vmsave can be avoided. Jailhouse as well as KVM currently use
these instructions unconditionally. However, I think both only need
GS.base, i.e. the per-cpu base address, to be saved and restored if no
user space exit or no CPU migration is involved (both are always true for
Jailhouse). Xen avoids vmload/vmsave on lightweight exits but it also
still uses rsp-based per-cpu variables.

So the question boils down to what is generally faster:

A) vmload
   vmrun
   vmsave

B) wrmsrl(MSR_GS_BASE, guest_gs_base)
   vmrun
   rdmsrl(MSR_GS_BASE, guest_gs_base)

Of course, KVM also has to take into account that heavyweight exits
still require vmload/vmsave, thus become more expensive with B) due to
the additional MSR accesses.

Any thoughts or results of previous experiments?

Jan

[PATCH v3] KVM: nVMX: Add support for rdtscp

2015-03-23 Thread Jan Kiszka

From: Jan Kiszka 

If the guest CPU is supposed to support rdtscp and the host has rdtscp
enabled in the secondary execution controls, we can also expose this
feature to L1. Just extend nested_vmx_exit_handled to properly route
EXIT_REASON_RDTSCP.

Signed-off-by: Jan Kiszka 
---

Changes in v3:
 - avoid needlessly touching vmx->nested if nested is off

 arch/x86/include/uapi/asm/vmx.h | 1 +
 arch/x86/kvm/vmx.c  | 9 +++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index c5f1a1d..1fe9218 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -67,6 +67,7 @@
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_INVEPT  50
+#define EXIT_REASON_RDTSCP  51
 #define EXIT_REASON_PREEMPTION_TIMER52
 #define EXIT_REASON_INVVPID 53
 #define EXIT_REASON_WBINVD  54
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 50c675b..fdd9f8b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2467,6 +2467,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_secondary_ctls_low = 0;
vmx->nested.nested_vmx_secondary_ctls_high &=
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+   SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
@@ -7510,7 +7511,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
return nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING);
case EXIT_REASON_RDPMC:
return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);
-   case EXIT_REASON_RDTSC:
+   case EXIT_REASON_RDTSC: case EXIT_REASON_RDTSCP:
return nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
@@ -8517,6 +8518,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
exec_control);
}
}
+   if (nested && !vmx->rdtscp_enabled)
+   vmx->nested.nested_vmx_secondary_ctls_high &=
+   ~SECONDARY_EXEC_RDTSCP;
}
 
/* Exposing INVPCID only when PCID is exposed */
@@ -9146,8 +9150,9 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
exec_control &= ~SECONDARY_EXEC_RDTSCP;
/* Take the following fields only from vmcs12 */
exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+ SECONDARY_EXEC_RDTSCP |
  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
-  SECONDARY_EXEC_APIC_REGISTER_VIRT);
+ SECONDARY_EXEC_APIC_REGISTER_VIRT);
if (nested_cpu_has(vmcs12,
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
exec_control |= vmcs12->secondary_vm_exec_control;
-- 
2.1.4
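
Editorial sketch of the routing change in nested_vmx_exit_handled, reduced to
a standalone model. The struct vmcs12_sketch and exit_goes_to_l1 names are
hypothetical and exist only for illustration; the numeric constants are the
ones the patch and the existing vmx.h use. The point it tries to show is that
an RDTSCP exit is handed to L1 under the same condition as an RDTSC exit,
namely when L1 set CPU_BASED_RDTSC_EXITING in vmcs12.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Exit reasons and control bit as used by the patch above. */
#define EXIT_REASON_RDTSC        16
#define EXIT_REASON_RDTSCP       51
#define CPU_BASED_RDTSC_EXITING  0x00001000u

/* Hypothetical, stripped-down stand-in for struct vmcs12. */
struct vmcs12_sketch {
	uint32_t cpu_based_vm_exec_control;
};

/*
 * Returns true if the exit should be reflected to L1, false if L0 keeps it.
 * Only the two TSC-related exit reasons are modeled here.
 */
static bool exit_goes_to_l1(const struct vmcs12_sketch *vmcs12,
			    uint32_t exit_reason)
{
	assert(vmcs12 != NULL);

	switch (exit_reason) {
	case EXIT_REASON_RDTSC:
	case EXIT_REASON_RDTSCP:
		/* Both instructions share the RDTSC-exiting control. */
		return (vmcs12->cpu_based_vm_exec_control &
			CPU_BASED_RDTSC_EXITING) != 0;
	default:
		/* Everything else is outside the scope of this sketch. */
		return false;
	}
}
```

In the real code the decision is one case label added to a much larger switch;
the model above only isolates that one branch.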




Re: [PATCH v2] KVM: nVMX: Add support for rdtscp

2015-03-23 Thread Jan Kiszka

On 2015-03-23 18:01, Bandan Das wrote:
> Jan Kiszka  writes:
> ...
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2467,6 +2467,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
>> *vmx)
>>  vmx->nested.nested_vmx_secondary_ctls_low = 0;
>>  vmx->nested.nested_vmx_secondary_ctls_high &=
>>  SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
>> +SECONDARY_EXEC_RDTSCP |
>>  SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
>>  SECONDARY_EXEC_APIC_REGISTER_VIRT |
>>  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
>> @@ -7510,7 +7511,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu 
>> *vcpu)
>>  return nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING);
>>  case EXIT_REASON_RDPMC:
>>  return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);
>> -case EXIT_REASON_RDTSC:
>> +case EXIT_REASON_RDTSC: case EXIT_REASON_RDTSCP:
>>  return nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
>>  case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
>>  case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
>> @@ -8517,6 +8518,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
>>  exec_control);
>>  }
>>  }
>> +if (!vmx->rdtscp_enabled)
>> +vmx->nested.nested_vmx_secondary_ctls_high &=
>> +~SECONDARY_EXEC_RDTSCP;
> No need to do this if nested is not enabled ? Or just
> a "if (nested)" in the prior if else loop should be enough I think.

I can add this - but this is far away from being a hotpath. What would
be the benefit?

Thanks,
Jan





[PATCH v2] KVM: nVMX: Add support for rdtscp

2015-03-23 Thread Jan Kiszka

From: Jan Kiszka 

If the guest CPU is supposed to support rdtscp and the host has rdtscp
enabled in the secondary execution controls, we can also expose this
feature to L1. Just extend nested_vmx_exit_handled to properly route
EXIT_REASON_RDTSCP.

Signed-off-by: Jan Kiszka 
---

Changes in v2 (thinko in test scenario...):
 - respect L1's setting of SECONDARY_EXEC_RDTSCP

 arch/x86/include/uapi/asm/vmx.h | 1 +
 arch/x86/kvm/vmx.c  | 9 +++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index c5f1a1d..1fe9218 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -67,6 +67,7 @@
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_INVEPT  50
+#define EXIT_REASON_RDTSCP  51
 #define EXIT_REASON_PREEMPTION_TIMER52
 #define EXIT_REASON_INVVPID 53
 #define EXIT_REASON_WBINVD  54
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 50c675b..45e0a6b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2467,6 +2467,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_secondary_ctls_low = 0;
vmx->nested.nested_vmx_secondary_ctls_high &=
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+   SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
@@ -7510,7 +7511,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
return nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING);
case EXIT_REASON_RDPMC:
return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);
-   case EXIT_REASON_RDTSC:
+   case EXIT_REASON_RDTSC: case EXIT_REASON_RDTSCP:
return nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
@@ -8517,6 +8518,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
exec_control);
}
}
+   if (!vmx->rdtscp_enabled)
+   vmx->nested.nested_vmx_secondary_ctls_high &=
+   ~SECONDARY_EXEC_RDTSCP;
}
 
/* Exposing INVPCID only when PCID is exposed */
@@ -9146,8 +9150,9 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct 
vmcs12 *vmcs12)
exec_control &= ~SECONDARY_EXEC_RDTSCP;
/* Take the following fields only from vmcs12 */
exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+ SECONDARY_EXEC_RDTSCP |
  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
-  SECONDARY_EXEC_APIC_REGISTER_VIRT);
+ SECONDARY_EXEC_APIC_REGISTER_VIRT);
if (nested_cpu_has(vmcs12,
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
exec_control |= vmcs12->secondary_vm_exec_control;
-- 
2.1.4




[PATCH] KVM: nVMX: Add support for rdtscp

2015-03-23 Thread Jan Kiszka

From: Jan Kiszka 

If the guest CPU is supposed to support rdtscp and the host has rdtscp
enabled in the secondary execution controls, we can also expose this
feature to L1. Just extend nested_vmx_exit_handled to properly route
EXIT_REASON_RDTSCP.

Signed-off-by: Jan Kiszka 
---
 arch/x86/include/uapi/asm/vmx.h | 1 +
 arch/x86/kvm/vmx.c  | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index c5f1a1d..1fe9218 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -67,6 +67,7 @@
 #define EXIT_REASON_EPT_VIOLATION   48
 #define EXIT_REASON_EPT_MISCONFIG   49
 #define EXIT_REASON_INVEPT  50
+#define EXIT_REASON_RDTSCP  51
 #define EXIT_REASON_PREEMPTION_TIMER52
 #define EXIT_REASON_INVVPID 53
 #define EXIT_REASON_WBINVD  54
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 50c675b..7875e9b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2467,6 +2467,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
*vmx)
vmx->nested.nested_vmx_secondary_ctls_low = 0;
vmx->nested.nested_vmx_secondary_ctls_high &=
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
+   SECONDARY_EXEC_RDTSCP |
SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
SECONDARY_EXEC_APIC_REGISTER_VIRT |
SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
@@ -7510,7 +7511,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
return nested_cpu_has(vmcs12, CPU_BASED_INVLPG_EXITING);
case EXIT_REASON_RDPMC:
return nested_cpu_has(vmcs12, CPU_BASED_RDPMC_EXITING);
-   case EXIT_REASON_RDTSC:
+   case EXIT_REASON_RDTSC: case EXIT_REASON_RDTSCP:
return nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
case EXIT_REASON_VMCALL: case EXIT_REASON_VMCLEAR:
case EXIT_REASON_VMLAUNCH: case EXIT_REASON_VMPTRLD:
@@ -8517,6 +8518,9 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
exec_control);
}
}
+   if (!vmx->rdtscp_enabled)
+   vmx->nested.nested_vmx_secondary_ctls_high &=
+   ~SECONDARY_EXEC_RDTSCP;
}
 
/* Exposing INVPCID only when PCID is exposed */
-- 
2.1.4




Re: [PATCH v3 0/2] kvm: x86: Implement handling of RH=1 for MSI delivery in KVM

2015-03-17 Thread Jan Kiszka

On 2015-03-17 02:30, James Sullivan wrote:
> Changes Since v1:
> * Reworked patches into two commits:
> 1) [Patch v2 1/2] Extended struct kvm_lapic_irq with bool
> msi_redir_hint
> * Initialize msi_redir_hint = true in kvm_set_msi_irq when RH=1
> * Initialize msi_redir_hint = false otherwise
> * Added value of msi_redir_hint to debug message dump of IRQ in
> apic_send_ipi
> 2) [Patch v2 2/2] Deliver to only lowest prio CPU if msi_redir_hint 
>is true 
> * Move kvm_is_dm_lowest_prio() -> lapic.h, rename to
> kvm_lowest_prio_delivery, set condition to
> (APIC_DM_LOWPRI || msi_redir_hint)
> * Change check in kvm_irq_delivery_to_apic_fast() for
> APIC_DM_LOWPRI or msi_redir_hint to a check for
> kvm_is_dm_lowest_prio() 
> Changes since v2:
> * Extend Patch 1/2 ("kvm: x86: Extended struct kvm_lapic_irq with 
> msi_redir_hint for MSI delivery") with older patch to set the value
> of dest_mode in kvm_set_msi_irq() to be APIC_DEST_LOGICAL only when 
> RH=1/DM=1, and APIC_DEST_PHYSICAL otherwise. 
> (<5502fedb.3030...@gmail.com>)
> This was done to decouple the patch dependency and to collect all
> efforts to implement RH bit handling into one submission.
> * Patch formatting
> 
> This series of patches extends the KVM interrupt delivery mechanism
> to correctly account for the MSI Redirection Hint bit. The RH bit is 
> used in logical destination mode to indicate that the delivery of the
> interrupt shall only be to the lowest priority candidate LAPIC.
> 
> Currently, there is no handling of the MSI RH bit in the KVM interrupt
> delivery mechanism. This patch implements the following logic:
> 
> * DM=0, RH=*  : Physical destination mode. Interrupt is delivered to
> the LAPIC with the matching APIC ID. (Subject to
> the usual restrictions, i.e. no broadcast dest)
> * DM=1, RH=0  : Logical destination mode without redirection. Interrupt
> is delivered to all LAPICs in the logical group 
> specified by the IRQ's destination map and delivery
> mode.
> * DM=1, RH=1  : Logical destination mode with redirection. Interrupt
> is delivered only to the lowest priority LAPIC in the 
> logical group specified by the dest map and the
> delivery mode. Delivery semantics are otherwise
> specified by the delivery_mode of the IRQ, which
> is unchanged.
> 
> In other words, the RH bit is ignored in physical destination mode, and
> when it is set in logical destination mode causes delivery to only apply
> to the lowest priority processor in the logical group. The IA32 manual
> is in slight contradiction with itself on this matter, but this patch
> agrees with this interpretation of the RH bit:
> 
> https://software.intel.com/en-us/forums/topic/23
> 
> This patch has passed some rudimentary tests using an SMP QEMU guest and
> virtio sourced MSIs, but I haven't done experiments with passing through 
> PCI hardware (intend to start working on this).

Once the kernel side is settled, would you mind looking into equivalent
support for QEMU's APIC as well? Just to keep it easy to switch between
both modes without too many functional restrictions.

Thanks!
Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
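
Editorial sketch of the DM/RH table from the quoted cover letter, written as
a standalone classifier. The enum and function names are hypothetical. The
bit positions (DM is bit 2 and RH is bit 3 of the lower MSI address dword,
with 0xFEE in bits 31:20) follow the usual x86 MSI address layout; the
mapping to delivery behaviour follows the cover letter's table, not
necessarily the final merged code.

```c
#include <assert.h>
#include <stdint.h>

#define MSI_ADDR_DM_BIT  (1u << 2)	/* destination mode: 1 = logical */
#define MSI_ADDR_RH_BIT  (1u << 3)	/* redirection hint */

enum msi_delivery {
	MSI_PHYSICAL,		/* DM=0, RH=*: LAPIC with matching APIC ID */
	MSI_LOGICAL_ALL,	/* DM=1, RH=0: all LAPICs in the logical group */
	MSI_LOGICAL_LOWEST,	/* DM=1, RH=1: lowest-priority LAPIC only */
};

/* Classify an MSI by the lower 32 bits of its message address. */
static enum msi_delivery classify_msi(uint32_t address_lo)
{
	/* x86 MSI addresses live in the 0xFEExxxxx window. */
	assert((address_lo >> 20) == 0xFEE);

	if (!(address_lo & MSI_ADDR_DM_BIT))
		return MSI_PHYSICAL;	/* RH is ignored in physical mode */

	return (address_lo & MSI_ADDR_RH_BIT) ? MSI_LOGICAL_LOWEST
					      : MSI_LOGICAL_ALL;
}
```

For example, an address of 0xFEE0000C (DM=1, RH=1) classifies as
MSI_LOGICAL_LOWEST, while 0xFEE00008 (DM=0, RH=1) stays MSI_PHYSICAL.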

Re: KVM emulation failure with recent kernel and QEMU Seabios

2015-03-12 Thread Jan Kiszka

On 2015-03-12 at 09:11, Gerd Hoffmann wrote:
> On Thu, 2015-03-12 at 09:09 +0100, Jan Kiszka wrote:
>> Hi,
>>
>> apparently since the latest QEMU updates I'm getting this once in a
>> while:
> 
> http://www.seabios.org/pipermail/seabios/2015-March/008897.html

OK... So we are waiting on a stable release to pull in new binaries with
this fix? A matter of days?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

KVM emulation failure with recent kernel and QEMU Seabios

2015-03-12 Thread Jan Kiszka

Hi,

apparently since the latest QEMU updates I'm getting this once in a
while:

KVM internal error. Suberror: 1
emulation failure
EAX= EBX= ECX= EDX=000fd2bc
ESI= EDI= EBP= ESP=
EIP=000fd2c5 EFL=00010007 [-PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010   00c09300 DPL=0 DS   [-WA]
CS =0008   00c09b00 DPL=0 CS32 [-RA]
SS =0010   00c09300 DPL=0 DS   [-WA]
DS =0010   00c09300 DPL=0 DS   [-WA]
FS =0010   00c09300 DPL=0 DS   [-WA]
GS =0010   00c09300 DPL=0 DS   [-WA]
LDT=   8200 DPL=0 LDT
TR =   8b00 DPL=0 TSS32-busy
GDT= 000f6a80 0037
IDT= 000f6abe 
CR0=6011 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=
Code=66 ba bc d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb 3f 00 <72> f3 8b 25 
00 ff fb 3f e8 44 66 ff ff c7 05 04 ff fb 3f 00 00 00 00 f4 eb fd fa fc 66 b8
KVM internal error. Suberror: 1
emulation failure
EAX= EBX= ECX= EDX=000fd2bc
ESI= EDI= EBP= ESP=
EIP=000fd2bc EFL=00010007 [-PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010   00c09300 DPL=0 DS   [-WA]
CS =0008   00c09b00 DPL=0 CS32 [-RA]
SS =0010   00c09300 DPL=0 DS   [-WA]
DS =0010   00c09300 DPL=0 DS   [-WA]
FS =0010   00c09300 DPL=0 DS   [-WA]
GS =0010   00c09300 DPL=0 DS   [-WA]
LDT=   8200 DPL=0 LDT
TR =   8b00 DPL=0 TSS32-busy
GDT= 000f6a80 0037
IDT= 000f6abe 
CR0=6011 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=
Code=0a 00 e8 a0 64 ff ff 0f aa 66 ba bc d2 0f 00 e9 a2 fe f3 90 <f0> 0f ba 2d 
04 ff fb 3f 00 72 f3 8b 25 00 ff fb 3f e8 44 66 ff ff c7 05 04 ff fb 3f 00 00
KVM internal error. Suberror: 1
emulation failure
EAX= EBX= ECX= EDX=000fd2bc
ESI= EDI= EBP= ESP=
EIP=000fd2c5 EFL=00010007 [-PC] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010   00c09300 DPL=0 DS   [-WA]
CS =0008   00c09b00 DPL=0 CS32 [-RA]
SS =0010   00c09300 DPL=0 DS   [-WA]
DS =0010   00c09300 DPL=0 DS   [-WA]
FS =0010   00c09300 DPL=0 DS   [-WA]
GS =0010   00c09300 DPL=0 DS   [-WA]
LDT=   8200 DPL=0 LDT
TR =   8b00 DPL=0 TSS32-busy
GDT= 000f6a80 0037
IDT= 000f6abe 
CR0=6011 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=
Code=66 ba bc d2 0f 00 e9 a2 fe f3 90 f0 0f ba 2d 04 ff fb 3f 00 <72> f3 8b 25 
00 ff fb 3f e8 44 66 ff ff c7 05 04 ff fb 3f 00 00 00 00 f4 eb fd fa fc 66 b8

The command line to trigger it:

qemu-system-x86_64 -m 1G -enable-kvm -s -cpu kvm64 -smp 4 -no-kvm-irqchip

The issue did not yet show up when using in-kernel irqchips or when
doing "git checkout 11d39a13 pc-bios", i.e. reverting the recent BIOS
updates.

I'm on QEMU master (with 04f56432 reverted) and either kernel 4.0.0-rc3
or kvm.git next.

Jan

[PATCH kvm-unit-test] x86: vmx: Check #UD triggering of vmmcall

2015-03-09 Thread Jan Kiszka

KVM tends to patch and emulate vmmcall on Intel. But that must not
happen for L2.

Signed-off-by: Jan Kiszka 
---

If the recently posted fixes for KVM are applied, this test passes.

 x86/vmx_tests.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index 41a9a82..4f8ace1 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -11,6 +11,7 @@
 #include "fwcfg.h"
 #include "isr.h"
 #include "apic.h"
+#include "types.h"
 
 u64 ia32_pat;
 u64 ia32_efer;
@@ -1493,6 +1494,44 @@ static int msr_switch_exit_handler()
return VMX_TEST_EXIT;
 }
 
+static int vmmcall_init(struct vmcs *vmcs  )
+{
+   vmcs_write(EXC_BITMAP, 1 << UD_VECTOR);
+   return VMX_TEST_START;
+}
+
+static void vmmcall_main(void)
+{
+   asm volatile(
+   "mov $0xABCD, %%rax\n\t"
+   "vmmcall\n\t"
+   ::: "rax");
+
+   report("VMMCALL", 0);
+}
+
+static int vmmcall_exit_handler()
+{
+   ulong reason;
+
+   reason = vmcs_read(EXI_REASON);
+   switch (reason) {
+   case VMX_VMCALL:
+   printf("here\n");
+   report("VMMCALL triggers #UD", 0);
+   break;
+   case VMX_EXC_NMI:
+   report("VMMCALL triggers #UD",
+  (vmcs_read(EXI_INTR_INFO) & 0xff) == UD_VECTOR);
+   break;
+   default:
+   printf("Unknown exit reason, %d\n", reason);
+   print_vmexit_info();
+   }
+
+   return VMX_TEST_VMEXIT;
+}
+
 /* name/init/guest_main/exit_handler/syscall_handler/guest_regs */
 struct vmx_test vmx_tests[] = {
{ "null", NULL, basic_guest_main, basic_exit_handler, NULL, {0} },
@@ -1516,5 +1555,6 @@ struct vmx_test vmx_tests[] = {
NULL, {0} },
{ "MSR switch", msr_switch_init, msr_switch_main,
msr_switch_exit_handler, NULL, {0} },
+   { "vmmcall", vmmcall_init, vmmcall_main, vmmcall_exit_handler, NULL, 
{0} },
{ NULL, NULL, NULL, NULL, NULL, {0} },
 };
-- 
2.1.4

[PATCH] KVM: nVMX: Do not emulate #UD while in guest mode

2015-03-09 Thread Jan Kiszka

While in L2, leave all #UD to L2 and do not try to emulate it. If L1 is
interested in doing this, it reports its interest via the exception
bitmap, and we never get into handle_exception of L0 anyway.

Signed-off-by: Jan Kiszka 
---

Noticed while wondering where the vmmcall of a misconfigured L2 went on
an Intel box: to nowhere. This bug caused a spurious fixup, and the
emulator bug did not even let it trigger a vmcall vmexit.

 arch/x86/kvm/vmx.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f7b20b4..fa0627c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5065,6 +5065,10 @@ static int handle_exception(struct kvm_vcpu *vcpu)
}
 
if (is_invalid_opcode(intr_info)) {
+   if (is_guest_mode(vcpu)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
er = emulate_instruction(vcpu, EMULTYPE_TRAP_UD);
if (er != EMULATE_DONE)
kvm_queue_exception(vcpu, UD_VECTOR);
-- 
2.1.4
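
Editorial sketch of where a #UD raised by L2 ends up once this patch is
applied, as a standalone model. The enum and the route_ud function are
hypothetical; in KVM the first branch is decided by the nested exit logic
before L0's handle_exception is reached, so folding both decisions into one
function is a simplification made here for readability.

```c
#include <stdbool.h>
#include <stdint.h>

#define UD_VECTOR 6

enum ud_route {
	UD_REFLECT_TO_L1,	/* L1 asked for #UD via its exception bitmap */
	UD_REQUEUE_FOR_L2,	/* L0 re-queues #UD for L2, no emulation */
	UD_TRY_EMULATION,	/* not nested: L0 may emulate/patch as before */
};

/* Decide what happens to a #UD intercepted while a guest was running. */
static enum ud_route route_ud(bool in_guest_mode, uint32_t l1_exception_bitmap)
{
	if (in_guest_mode) {
		if (l1_exception_bitmap & (1u << UD_VECTOR))
			return UD_REFLECT_TO_L1;
		/* The patch: never emulate on behalf of L2. */
		return UD_REQUEUE_FOR_L2;
	}
	return UD_TRY_EMULATION;
}
```

Under this model, a vmmcall executed by L2 on an Intel host without L1
intercepting #UD yields UD_REQUEUE_FOR_L2 rather than a hypercall fixup.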

[PATCH] KVM: x86: Fix re-execution of patched vmmcall

2015-03-09 Thread Jan Kiszka

For a very long time (since 2b3d2a20), the path handling a vmmcall
instruction of the guest on an Intel host only applied the patch but no
longer handled the hypercall. The reverse case, vmcall on AMD hosts, is
fine. As both em_vmcall and em_vmmcall actually have to do the same, we
can fix the issue by consolidating both into the same handler.

Signed-off-by: Jan Kiszka 
---
 arch/x86/kvm/emulate.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 106c015..c941abe 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3323,7 +3323,7 @@ static int em_clts(struct x86_emulate_ctxt *ctxt)
return X86EMUL_CONTINUE;
 }
 
-static int em_vmcall(struct x86_emulate_ctxt *ctxt)
+static int em_hypercall(struct x86_emulate_ctxt *ctxt)
 {
int rc = ctxt->ops->fix_hypercall(ctxt);
 
@@ -3395,17 +3395,6 @@ static int em_lgdt(struct x86_emulate_ctxt *ctxt)
return em_lgdt_lidt(ctxt, true);
 }
 
-static int em_vmmcall(struct x86_emulate_ctxt *ctxt)
-{
-   int rc;
-
-   rc = ctxt->ops->fix_hypercall(ctxt);
-
-   /* Disable writeback. */
-   ctxt->dst.type = OP_NONE;
-   return rc;
-}
-
 static int em_lidt(struct x86_emulate_ctxt *ctxt)
 {
return em_lgdt_lidt(ctxt, false);
@@ -3769,7 +3758,7 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt)
 
 static const struct opcode group7_rm0[] = {
N,
-   I(SrcNone | Priv | EmulateOnUD, em_vmcall),
+   I(SrcNone | Priv | EmulateOnUD, em_hypercall),
N, N, N, N, N, N,
 };
 
@@ -3781,7 +3770,7 @@ static const struct opcode group7_rm1[] = {
 
 static const struct opcode group7_rm3[] = {
DIP(SrcNone | Prot | Priv,  vmrun,  check_svme_pa),
-   II(SrcNone  | Prot | EmulateOnUD,   em_vmmcall, vmmcall),
+   II(SrcNone  | Prot | EmulateOnUD,   em_hypercall,   vmmcall),
DIP(SrcNone | Prot | Priv,  vmload, check_svme_pa),
DIP(SrcNone | Prot | Priv,  vmsave, check_svme_pa),
DIP(SrcNone | Prot | Priv,  stgi,   check_svme),
-- 
2.1.4
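
Editorial sketch of the idea behind the consolidated handler: whichever of the
two hypercall instructions the guest used, rewrite it to the host-native one
and then treat it as a hypercall. The function name and the boolean return
convention are hypothetical and do not mirror the emulator's fix_hypercall
callback; the opcode bytes (0f 01 c1 for vmcall, 0f 01 d9 for vmmcall) are the
architectural encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define HYPERCALL_INSN_LEN 3

static const uint8_t vmcall_insn[HYPERCALL_INSN_LEN]  = { 0x0f, 0x01, 0xc1 };
static const uint8_t vmmcall_insn[HYPERCALL_INSN_LEN] = { 0x0f, 0x01, 0xd9 };

/*
 * Patch the 3-byte instruction at insn to the host-native hypercall.
 * Returns true if insn was a hypercall instruction (and thus should also be
 * handled as one), false if the bytes were left untouched.
 */
static bool fix_and_handle_hypercall(uint8_t *insn, bool host_is_intel)
{
	const uint8_t *native = host_is_intel ? vmcall_insn : vmmcall_insn;

	assert(insn != NULL);

	if (memcmp(insn, vmcall_insn, HYPERCALL_INSN_LEN) != 0 &&
	    memcmp(insn, vmmcall_insn, HYPERCALL_INSN_LEN) != 0)
		return false;

	memcpy(insn, native, HYPERCALL_INSN_LEN);
	return true;
}
```

The bug described above corresponds to a variant of this function that
patches the bytes for vmmcall but then reports "nothing to handle"; with one
shared handler that asymmetry cannot occur.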

[PATCH kvm-unit-test] x86/emulator: Fix inline assembler warning

2015-03-09 Thread Jan Kiszka

Code compiles to the same binary, but now with one warning less.

Signed-off-by: Jan Kiszka 
---
 x86/emulator.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index 0964e6a..e5c1c6b 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -838,7 +838,7 @@ static void test_jmp_noncanonical(uint64_t *mem)
 
exceptions = 0;
handle_exception(GP_VECTOR, advance_rip_by_3_and_note_exception);
-   asm volatile ("jmp %0" : : "m"(*mem));
+   asm volatile ("jmp *%0" : : "m"(*mem));
report("jump to non-canonical address", exceptions == 1);
handle_exception(GP_VECTOR, 0);
 }
-- 
2.1.4

Re: [PATCH] KVM: nVMX: mask unrestricted_guest if disabled on L0

2015-02-24 Thread Jan Kiszka

On 2015-02-24 17:30, Radim Krčmář wrote:
> 2015-02-23 19:05+0100, Kashyap Chamarthy:
>> Tested with the _correct_ Kernel[1] (that has Radim's patch) now --
>> applied it on both L0 and L1.
>>
>> Result: Same as before -- Booting L2 causes L1 to reboot. However, the
> >> stack trace from `dmesg` on L0 took a slightly different path than
>> before -- it's using MSR handling:
> 
> Thanks, the problem was deeper ... L1 enabled unrestricted mode while L0
> had it disabled.  L1 could then vmrun a L2 state that L0 would have to
> emulate, but that doesn't work.  There are at least these solutions:
> 
>  1) don't expose unrestricted_guest when L0 doesn't have it

Reminds me of a patch called "KVM: nVMX: Disable unrestricted mode if
ept=0" by Bandan. I thought that would have caught it - apparently not.

>  2) fix unrestricted mode emulation code
>  3) handle the failure without killing L1
> 
> I'd do just (1) -- emulating unrestricted mode is a loss.

Agreed.

Jan

> 
> I have done initial testing and at least qemu-sanity-check works now:
> 
> ---8<---
> If EPT was enabled, unrestricted_guest was allowed in L1 regardless of
> L0.  L1 triple faulted when running L2 guest that required emulation.
> 
> Another side effect was 'WARN_ON_ONCE(vmx->nested.nested_run_pending)'
> in L0's dmesg:
>   WARNING: CPU: 0 PID: 0 at arch/x86/kvm/vmx.c:9190 
> nested_vmx_vmexit+0x96e/0xb00 [kvm_intel] ()
> 
> Prevent this scenario by masking SECONDARY_EXEC_UNRESTRICTED_GUEST when
> the host doesn't have it enabled.
> 
> Fixes: 78051e3b7e35 ("KVM: nVMX: Disable unrestricted mode if ept=0")
> Signed-off-by: Radim Krčmář 
> ---
>  arch/x86/kvm/vmx.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index f7b20b417a3a..dbabea21357b 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2476,8 +2476,7 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
> *vmx)
>   if (enable_ept) {
>   /* nested EPT: emulate EPT also to L1 */
>   vmx->nested.nested_vmx_secondary_ctls_high |=
> - SECONDARY_EXEC_ENABLE_EPT |
> - SECONDARY_EXEC_UNRESTRICTED_GUEST;
> + SECONDARY_EXEC_ENABLE_EPT;
>   vmx->nested.nested_vmx_ept_caps = VMX_EPT_PAGE_WALK_4_BIT |
>VMX_EPTP_WB_BIT | VMX_EPT_2MB_PAGE_BIT |
>VMX_EPT_INVEPT_BIT;
> @@ -2491,6 +2490,10 @@ static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx 
> *vmx)
>   } else
>   vmx->nested.nested_vmx_ept_caps = 0;
>  
> + if (enable_unrestricted_guest)
> + vmx->nested.nested_vmx_secondary_ctls_high |=
> + SECONDARY_EXEC_UNRESTRICTED_GUEST;
> +
>   /* miscellaneous data */
>   rdmsr(MSR_IA32_VMX_MISC,
>   vmx->nested.nested_vmx_misc_low,
> 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: [nVMX] With 3.20.0-0.rc0.git5.1 on L0, booting L2 guest results in L1 rebooting

2015-02-17 Thread Jan Kiszka

On 2015-02-17 19:00, Bandan Das wrote:
> Kashyap Chamarthy  writes:
> ..
>>>
>>> Does enable_apicv make a difference?
>>
>> Actually, I did perform a test (on Paolo's suggestion on IRC) with
>> enable_apicv=0 on physical host, and it didn't make any difference:
>>
>> $ cat /proc/cmdline 
>> BOOT_IMAGE=/vmlinuz-3.20.0-0.rc0.git5.1.fc23.x86_64 
>> root=/dev/mapper/fedora--server_dell--per910--02-root ro 
>> console=ttyS1,115200n81 rd.lvm.lv=fedora-server_dell-per910-02/swap 
>> rd.lvm.lv=fedora-server_dell-per910-02/root LANG=en_US.UTF-8 enable_apicv=0
> 
> I am not sure if this works ? enable_apicv is a kvm_intel module parameter

Good point. Has to be kvm_intel.enable_apicv=0 (if the module is built in).

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
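
Editorial sketch of a quick check for the mix-up discussed above: whether a
kernel command line carries the module-prefixed form of a parameter or only
the bare name. The helper cmdline_has_token is hypothetical and does plain
whitespace-delimited token matching; it makes no claim about how the kernel
itself parses parameters.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if param appears in cmdline as a complete token, i.e.
 * delimited by the string boundaries, spaces, or a trailing newline.
 */
static bool cmdline_has_token(const char *cmdline, const char *param)
{
	const char *p;
	size_t len;

	assert(cmdline != NULL && param != NULL);

	len = strlen(param);
	if (len == 0)
		return false;

	for (p = cmdline; (p = strstr(p, param)) != NULL; p++) {
		bool starts = (p == cmdline) || p[-1] == ' ';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return true;
	}
	return false;
}
```

Applied to the quoted /proc/cmdline, a lookup for "kvm_intel.enable_apicv=0"
would fail while one for "enable_apicv=0" would succeed, which matches the
observation that the setting never reached the kvm_intel module.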

Re: [nVMX] With 3.20.0-0.rc0.git5.1 on L0, booting L2 guest results in L1 rebooting

2015-02-16 Thread Jan Kiszka

   9183   u32 exit_intr_info,
>9184   unsigned long exit_qualification)
>9185 {
>9186 struct vcpu_vmx *vmx = to_vmx(vcpu);
>9187 struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>9188 
>9189 /* trying to cancel vmlaunch/vmresume is a bug */
>9190 WARN_ON_ONCE(vmx->nested.nested_run_pending);
>9191 
>9192 leave_guest_mode(vcpu);
>9193 prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
>9194exit_qualification);
>9195 
>9196 vmx_load_vmcs01(vcpu);
>9197 
>9198 if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
>9199 && nested_exit_intr_ack_set(vcpu)) {
>9200 int irq = kvm_cpu_get_interrupt(vcpu);
>9201         WARN_ON(irq < 0);
>9202 vmcs12->vm_exit_intr_info = irq |
>9203 INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
>9204 }
> 
> 
> - The above line 9190 was introduced in this commit:
> 
>   $ git log -S'WARN_ON_ONCE(vmx->nested.nested_run_pending)' \
>   -- ./arch/x86/kvm/vmx.c
>   commit 5f3d5799974b89100268ba813cec8db7bd0693fb
>   Author: Jan Kiszka 
>   Date:   Sun Apr 14 12:12:46 2013 +0200
>   
>   KVM: nVMX: Rework event injection and recovery
>   
>   The basic idea is to always transfer the pending event injection on
>   vmexit into the architectural state of the VCPU and then drop it from
>   there if it turns out that we left L2 to enter L1, i.e. if we enter
>   prepare_vmcs12.
>   
>   vmcs12_save_pending_events takes care to transfer pending L0 events into
>   the queue of L1. That is mandatory as L1 may decide to switch the guest
>   state completely, invalidating or preserving the pending events for
>   later injection (including on a different node, once we support
>   migration).
>   
>   This concept is based on the rule that a pending vmlaunch/vmresume is
>   not canceled. Otherwise, we would risk to lose injected events or leak
>   them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the
>   entry of nested_vmx_vmexit.
>   
>   Signed-off-by: Jan Kiszka 
>   Signed-off-by: Gleb Natapov 
> 
> 
> - `dmesg`, `dmidecode`, `x86info -a` details of L0 and L1 here
> 
> 
> https://kashyapc.fedorapeople.org/virt/Info-L0-Intel-Xeon-and-L1-nVMX-test/
> 

Does enable_apicv make a difference?

Is this a regression caused by the commit, or do you only see it with
very recent kvm.git?

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux

Re: vexpress: Framebuffer broken with KVM enabled

2015-02-16 Thread Jan Kiszka

On 2015-02-16 10:34, Alexander Spyridakis wrote:
> On Mon, Feb 16, 2015 at 2:43 PM, Jan Kiszka  wrote:
>> Hi,
>>
>> next issue related to KVM/QEMU on the TK1: The guest image I'm running
>> gives proper framebuffer output when in emulation mode. Once KVM is
>> enabled, the screen is - at best - only initially updated. Sometimes I
>> see the famous tux images and a bit of the console texts, but usually it
>> stays black. Explanations?
> 
> Hello Jan,
> 
> If you want to force rendering, you can do something similar with the
> following hack:
> https://github.com/virtualopensystems/qemu/commit/64dd1b3e3a2353433edb9c63d00271f515bd06fb
> 
> Of course expect performance to not be up to par.

Yep, confirmed - both that it works and that it's slow (better not try
this with SDL, exported via X...).

Thanks,
Jan





Re: vexpress: Framebuffer broken with KVM enabled

2015-02-16 Thread Jan Kiszka

On 2015-02-16 10:20, Anup Patel wrote:
> On Mon, Feb 16, 2015 at 2:43 PM, Jan Kiszka  wrote:
>> Hi,
>>
>> next issue related to KVM/QEMU on the TK1: The guest image I'm running
>> gives proper framebuffer output when in emulation mode. Once KVM is
>> enabled, the screen is - at best - only initially updated. Sometimes I
>> see the famous tux images and a bit of the console texts, but usually it
>> stays black. Explanations?
> 
> The QEMU accesses Guest Video RAM (or any portion of Guest RAM) as
> cacheable user space memory. The Guest Kernel might access Guest Video
> RAM as non-cacheable to maintain coherency with video device. If this is
> the case then all updates by Guest kernel to Guest Video RAM will not
> be visible to QEMU.

On x86, we manage such RAM as a coalesced MMIO region, sync'ing it
periodically or on specific register accesses into the video card model.
I suppose there is nothing like this for the pl111 yet, right?

Jan





vexpress: Framebuffer broken with KVM enabled

2015-02-16 Thread Jan Kiszka

Hi,

next issue related to KVM/QEMU on the TK1: The guest image I'm running
gives proper framebuffer output when in emulation mode. Once KVM is
enabled, the screen is - at best - only initially updated. Sometimes I
see the famous tux images and a bit of the console texts, but usually it
stays black. Explanations?

Jan

Re: arm: warning at virt/kvm/arm/vgic.c:1468

2015-02-16 Thread Jan Kiszka

On 2015-02-16 09:57, Marc Zyngier wrote:
> On 15/02/15 19:03, Jan Kiszka wrote:
>> On 2015-02-15 19:01, Jan Kiszka wrote:
>>> On 2015-02-15 16:30, Marc Zyngier wrote:
>>>> On Sun, Feb 15 2015 at  3:07:50 pm GMT, Jan Kiszka
>>>>  wrote:
>>>>> On 2015-02-15 15:59, Marc Zyngier wrote:
>>>>>> On Sun, Feb 15 2015 at  2:40:40 pm GMT, Jan Kiszka
>>>>>>  wrote:
>>>>>>> On 2015-02-15 14:37, Marc Zyngier wrote:
>>>>>>>> On Sun, Feb 15 2015 at 8:53:30 am GMT, Jan Kiszka
>>>>>>>>  wrote:
>>>>>>>>> I'm now throwing trace_printk at my broken KVM. Already
>>>>>>>>> found out that I get ARM_EXCEPTION_IRQ every few 10 µs.
>>>>>>>>> Not seeing any irq_* traces, though. Weird.
>>>>>>>>
>>>>>>>> This very much looks like a screaming interrupt. At such
>>>>>>>> a rate, no wonder your VM make much progress. Can you
>>>>>>>> find out which interrupt is screaming like this? Looking
>>>>>>>> at GICC_HPPIR should help, but you'll have to map the CPU
>>>>>>>> interface in HYP before being able to access it there.
>>>>>>>
>>>>>>> OK... let me figure this out. I had this suspect as well -
>>>>>>> the host gets a VM exit for each injected guest IRQ?
>>>>>>
>>>>>> Not exactly. There is a VM exit for each physical interrupt
>>>>>> that fires while the guest is running. Injecting an interrupt
>>>>>> also causes a VM exit, as we force the vcpu to reload its
>>>>>> context.
>>>>>
>>>>> Ah, GICC != GICV - you are referring to host-side pending IRQs.
>>>>> Any hints on how to get access to that register would
>>>>> accelerate the analysis (ARM KVM code is still new to me).
>>>>
>>>> Map the GICC region in HYP using create_hyp_io_mapping (see
>>>> vgic_v2_probe for an example of how we map GICH), and stash the
>>>> read of GICC_HPPIR before leaving HYP mode (and before saving the
>>>> guest timer).
>>
>>> Hacked on it until it started to work. The result delivered
>>> initially are 0x002 or 0x01e. Then, when the guest gets stuck, I
>>> have 0x01b most of the time (a few 0x01e arrive when there is a
>>> real host irq). The virtual timer on speed?
>>
>>> Wait, there is also early printk for ARM, but it was off in my
>>> guest! Turning it on confirms we have some problems here:
>>
>>> Architected timer frequency not available Division by zero in
>>> kernel.
>>
>>> When in emulation mode, I get:
>>
>>> Architected cp15 timer(s) running at 62.50MHz (virt).
>>
>>> Digging deeper.
>>
>> U-Boot didn't initialize CNTFRQ on cores 1..3. Fixing this, the guest
>> passes early boot reliably, now hangs much later (RCU stalls are
>> detected by the guest).
> 
> Right, that explains a lot of things. Can you describe a bit more what
> you're seeing now?

Sorry, should have updated this thread:

http://thread.gmane.org/gmane.comp.emulators.kvm.arm.devel/17

This issue is no longer KVM-related.

What might be KVM-related, or also a QEMU issue, is broken framebuffer
support once KVM is enabled in QEMU. Not yet reported, will do soon on
qemu-devel.

Jan





vexpress: Horribly slow MMC emulation on ARM host

2015-02-15 Thread Jan Kiszka

Hi,

this basically concludes my problems of getting KVM running on the
Jetson TK1 board with QEMU: all fine now, provided I switch from

  qemu-system-arm -machine vexpress-a15 -sd disk.img ...

to

  qemu-system-arm -machine vexpress-a15 \
      -drive file=disk.img,if=none,id=disk \
      -device virtio-blk-device,drive=disk ...

This applies to both emulated and KVM accelerated mode. If I run the
same image (and guest kernel) emulated on my x86 box, there is still a
difference between both disk modes, but it's not that excessive. On ARM
the system requires minutes to boot from MMC - if it doesn't run into
timeouts earlier. It's seconds with virtio.

Known problem?

Jan




Re: arm: warning at virt/kvm/arm/vgic.c:1468

2015-02-15 Thread Jan Kiszka

On 2015-02-15 19:01, Jan Kiszka wrote:
> On 2015-02-15 16:30, Marc Zyngier wrote:
>> On Sun, Feb 15 2015 at  3:07:50 pm GMT, Jan Kiszka
>>  wrote:
>>> On 2015-02-15 15:59, Marc Zyngier wrote:
>>>> On Sun, Feb 15 2015 at  2:40:40 pm GMT, Jan Kiszka
>>>>  wrote:
>>>>> On 2015-02-15 14:37, Marc Zyngier wrote:
>>>>>> On Sun, Feb 15 2015 at 8:53:30 am GMT, Jan Kiszka 
>>>>>>  wrote:
>>>>>>> I'm now throwing trace_printk at my broken KVM. Already
>>>>>>> found out that I get ARM_EXCEPTION_IRQ every few 10 µs.
>>>>>>> Not seeing any irq_* traces, though. Weird.
>>>>>> 
>>>>>> This very much looks like a screaming interrupt. At such
>>>>>> a rate, no wonder your VM make much progress. Can you
>>>>>> find out which interrupt is screaming like this? Looking
>>>>>> at GICC_HPPIR should help, but you'll have to map the CPU
>>>>>> interface in HYP before being able to access it there.
>>>>> 
>>>>> OK... let me figure this out. I had this suspect as well -
>>>>> the host gets a VM exit for each injected guest IRQ?
>>>> 
>>>> Not exactly. There is a VM exit for each physical interrupt
>>>> that fires while the guest is running. Injecting an interrupt
>>>> also causes a VM exit, as we force the vcpu to reload its
>>>> context.
>>> 
>>> Ah, GICC != GICV - you are referring to host-side pending IRQs.
>>> Any hints on how to get access to that register would
>>> accelerate the analysis (ARM KVM code is still new to me).
>> 
>> Map the GICC region in HYP using create_hyp_io_mapping (see 
>> vgic_v2_probe for an example of how we map GICH), and stash the
>> read of GICC_HPPIR before leaving HYP mode (and before saving the
>> guest timer).
> 
> Hacked on it until it started to work. The result delivered
> initially are 0x002 or 0x01e. Then, when the guest gets stuck, I
> have 0x01b most of the time (a few 0x01e arrive when there is a
> real host irq). The virtual timer on speed?
> 
> Wait, there is also early printk for ARM, but it was off in my
> guest! Turning it on confirms we have some problems here:
> 
> Architected timer frequency not available Division by zero in
> kernel.
> 
> When in emulation mode, I get:
> 
> Architected cp15 timer(s) running at 62.50MHz (virt).
> 
> Digging deeper.

U-Boot didn't initialize CNTFRQ on cores 1..3. Fixing this, the guest
passes early boot reliably, now hangs much later (RCU stalls are
detected by the guest).
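
For illustration, a small Python sketch of why a CNTFRQ that reads as zero
is fatal for the guest: converting timer ticks to nanoseconds has to divide
by the frequency. This is not the kernel's actual code path; the function
name is hypothetical, and only the 62.5 MHz figure comes from the log quoted
above.

```python
NSEC_PER_SEC = 1_000_000_000


def ns_per_tick(cntfrq_hz):
    """Nanoseconds per timer tick for a given CNTFRQ value in Hz."""
    if cntfrq_hz == 0:
        # An uninitialized CNTFRQ leaves no usable divisor.
        raise ValueError("timer frequency not available (CNTFRQ is 0)")
    return NSEC_PER_SEC / cntfrq_hz


# Frequency reported in emulation mode: 62.5 MHz gives 16 ns per tick.
print(ns_per_tick(62_500_000))
```

Without the explicit check, the same division would simply fail on a zero
frequency, which is consistent with the early printk output.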

Jan


Re: arm: warning at virt/kvm/arm/vgic.c:1468

2015-02-15 Thread Jan Kiszka

On 2015-02-15 16:30, Marc Zyngier wrote:
> On Sun, Feb 15 2015 at  3:07:50 pm GMT, Jan Kiszka  wrote:
>> On 2015-02-15 15:59, Marc Zyngier wrote:
>>> On Sun, Feb 15 2015 at  2:40:40 pm GMT, Jan Kiszka  
>>> wrote:
>>>> On 2015-02-15 14:37, Marc Zyngier wrote:
>>>>> On Sun, Feb 15 2015 at 8:53:30 am GMT, Jan Kiszka
>>>>>  wrote:
>>>>>> I'm now throwing trace_printk at my broken KVM. Already found out that I
>>>>>> get ARM_EXCEPTION_IRQ every few 10 µs. Not seeing any irq_* traces,
>>>>>> though. Weird.
>>>>>
>>>>> This very much looks like a screaming interrupt. At such a rate, no
>>>>> wonder your VM make much progress. Can you find out which interrupt is
>>>>> screaming like this? Looking at GICC_HPPIR should help, but you'll have
>>>>> to map the CPU interface in HYP before being able to access it there.
>>>>
>>>> OK... let me figure this out. I had this suspect as well - the host gets
>>>> a VM exit for each injected guest IRQ?
>>>
>>> Not exactly. There is a VM exit for each physical interrupt that fires
>>> while the guest is running. Injecting an interrupt also causes a VM
>>> exit, as we force the vcpu to reload its context.
>>
>> Ah, GICC != GICV - you are referring to host-side pending IRQs. Any
>> hints on how to get access to that register would accelerate the
>> analysis (ARM KVM code is still new to me).
> 
> Map the GICC region in HYP using create_hyp_io_mapping (see
> vgic_v2_probe for an example of how we map GICH), and stash the read of
> GICC_HPPIR before leaving HYP mode (and before saving the guest timer).

Hacked on it until it started to work. The results delivered initially
are 0x002 or 0x01e. Then, when the guest gets stuck, I have 0x01b most
of the time (a few 0x01e arrive when there is a real host irq). The
virtual timer on speed?
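
To read those values, here is a minimal Python sketch of decoding
GICC_HPPIR. It assumes the GICv2 layout (pending interrupt ID in bits [9:0],
CPU ID in bits [12:10]) and the usual Cortex-A15 PPI numbering. The function
name and the lookup table are made up for illustration and are not taken
from the kernel.

```python
# Assumed PPI numbering (typical for Cortex-A15 based SoCs).
PPI_NAMES = {
    25: "virtual maintenance interrupt",
    26: "hypervisor timer",
    27: "virtual timer",
    29: "secure physical timer",
    30: "non-secure physical timer",
}


def decode_hppir(value):
    """Split a GICC_HPPIR value into (intid, cpuid, kind, name)."""
    intid = value & 0x3FF          # PENDINTID, bits [9:0]
    cpuid = (value >> 10) & 0x7    # CPUID, bits [12:10], meaningful for SGIs
    if intid < 16:
        kind = "SGI"
    elif intid < 32:
        kind = "PPI"
    elif intid < 1020:
        kind = "SPI"
    else:
        kind = "special"
    return intid, cpuid, kind, PPI_NAMES.get(intid, "")


for raw in (0x002, 0x01E, 0x01B):
    print(hex(raw), decode_hppir(raw))
```

Under these assumptions 0x01b is ID 27 (the virtual timer PPI), 0x01e is
ID 30 (the non-secure physical timer) and 0x002 is SGI 2, which fits the
virtual-timer suspicion.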

Wait, there is also early printk for ARM, but it was off in my guest!
Turning it on confirms we have some problems here:

  Architected timer frequency not available
  Division by zero in kernel.

When in emulation mode, I get:

  Architected cp15 timer(s) running at 62.50MHz (virt).

Digging deeper.

Jan





Re: arm: warning at virt/kvm/arm/vgic.c:1468

2015-02-15 Thread Jan Kiszka

On 2015-02-15 16:59, Christoffer Dall wrote:
> On Sun, Feb 15, 2015 at 04:35:14PM +0100, Jan Kiszka wrote:
>> On 2015-02-15 16:30, Marc Zyngier wrote:
>>> On Sun, Feb 15 2015 at  3:07:50 pm GMT, Jan Kiszka  
>>> wrote:
>>>> On 2015-02-15 15:59, Marc Zyngier wrote:
>>>>> On Sun, Feb 15 2015 at  2:40:40 pm GMT, Jan Kiszka  
>>>>> wrote:
>>>>>> On 2015-02-15 14:37, Marc Zyngier wrote:
>>>>>>> On Sun, Feb 15 2015 at 8:53:30 am GMT, Jan Kiszka
>>>>>>>  wrote:
>>>>>>>> I'm now throwing trace_printk at my broken KVM. Already found out that I
>>>>>>>> get ARM_EXCEPTION_IRQ every few 10 µs. Not seeing any irq_* traces,
>>>>>>>> though. Weird.
>>>>>>>
>>>>>>> This very much looks like a screaming interrupt. At such a rate, no
>>>>>>> wonder your VM make much progress. Can you find out which interrupt is
>>>>>>> screaming like this? Looking at GICC_HPPIR should help, but you'll have
>>>>>>> to map the CPU interface in HYP before being able to access it there.
>>>>>>
>>>>>> OK... let me figure this out. I had this suspect as well - the host gets
>>>>>> a VM exit for each injected guest IRQ?
>>>>>
>>>>> Not exactly. There is a VM exit for each physical interrupt that fires
>>>>> while the guest is running. Injecting an interrupt also causes a VM
>>>>> exit, as we force the vcpu to reload its context.
>>>>
>>>> Ah, GICC != GICV - you are referring to host-side pending IRQs. Any
>>>> hints on how to get access to that register would accelerate the
>>>> analysis (ARM KVM code is still new to me).
>>>
>>> Map the GICC region in HYP using create_hyp_io_mapping (see
>>> vgic_v2_probe for an example of how we map GICH), and stash the read of
>>> GICC_HPPIR before leaving HYP mode (and before saving the guest timer).
>>
>> OK.
>>
>>>
>>> BTW, when you look at /proc/interrupts on the host, don't you see an
>>> interrupt that's a bit too eager to fire?
>>
>> No - but that makes sense given that we do not enter any interrupt
>> handler according to ftrace, thus there can't be any counter incrementation.
>>
>>>
>>>>>> BTW, I also tried with in-kernel GIC disabled (in the kernel config),
>>>>>> but I guess that's pointless. Linux seems to be stuck on a
>>>>>> non-functional architectural timer then, right?
>>>>>
>>>>> Yes. Useful for bringup, but nothing more.
>>>>
>>>> Maybe we should perform a feature check and issue a warning from QEMU?
>>>
>>> I'd assume this is already in place (but I almost never run QEMU, so I
>>> could be wrong here).
>>
>> Nope, QEMU starts up fine, just lets the guest starve while waiting for
>> jiffies to increase.
>>
> 
> you should be able to turn the in-kernel irqchip off with a QEMU
> command-line option and the that should prevent the kernel from adding
> an arch-timer.  This would only work on the vexpress guest model though,
> since the virt-board doesn't provide an emulated timer as a replacement.

I'm running vexpress, but I only tried legacy -no-kvm-irqchip so far
which was refused. -machine vexpress-a15,kernel_irqchip=off has an
effect: host practically locks up, dmesg - when I'm still able to start
on a different console - gives endless "Unexpected interrupt 19 on vcpu
ecd39670". Well, a different smell, but still very fishy.

Jan





Re: arm: warning at virt/kvm/arm/vgic.c:1468

2015-02-15 Thread Jan Kiszka

On 2015-02-15 16:30, Marc Zyngier wrote:
> On Sun, Feb 15 2015 at  3:07:50 pm GMT, Jan Kiszka  wrote:
>> On 2015-02-15 15:59, Marc Zyngier wrote:
>>> On Sun, Feb 15 2015 at  2:40:40 pm GMT, Jan Kiszka  
>>> wrote:
>>>> On 2015-02-15 14:37, Marc Zyngier wrote:
>>>>> On Sun, Feb 15 2015 at 8:53:30 am GMT, Jan Kiszka
>>>>>  wrote:
>>>>>> I'm now throwing trace_printk at my broken KVM. Already found out that I
>>>>>> get ARM_EXCEPTION_IRQ every few 10 µs. Not seeing any irq_* traces,
>>>>>> though. Weird.
>>>>>
>>>>> This very much looks like a screaming interrupt. At such a rate, no
>>>>> wonder your VM make much progress. Can you find out which interrupt is
>>>>> screaming like this? Looking at GICC_HPPIR should help, but you'll have
>>>>> to map the CPU interface in HYP before being able to access it there.
>>>>
>>>> OK... let me figure this out. I had this suspect as well - the host gets
>>>> a VM exit for each injected guest IRQ?
>>>
>>> Not exactly. There is a VM exit for each physical interrupt that fires
>>> while the guest is running. Injecting an interrupt also causes a VM
>>> exit, as we force the vcpu to reload its context.
>>
>> Ah, GICC != GICV - you are referring to host-side pending IRQs. Any
>> hints on how to get access to that register would accelerate the
>> analysis (ARM KVM code is still new to me).
> 
> Map the GICC region in HYP using create_hyp_io_mapping (see
> vgic_v2_probe for an example of how we map GICH), and stash the read of
> GICC_HPPIR before leaving HYP mode (and before saving the guest timer).

OK.

> 
> BTW, when you look at /proc/interrupts on the host, don't you see an
> interrupt that's a bit too eager to fire?

No - but that makes sense given that we do not enter any interrupt
handler according to ftrace, thus there can't be any counter incrementation.

> 
>>>> BTW, I also tried with in-kernel GIC disabled (in the kernel config),
>>>> but I guess that's pointless. Linux seems to be stuck on a
>>>> non-functional architectural timer then, right?
>>>
>>> Yes. Useful for bringup, but nothing more.
>>
>> Maybe we should perform a feature check and issue a warning from QEMU?
> 
> I'd assume this is already in place (but I almost never run QEMU, so I
> could be wrong here).

Nope, QEMU starts up fine, just lets the guest starve while waiting for
jiffies to increase.

> 
>>> I still wonder if the 4+1 design on the K1 is not playing tricks behind
>>> our back. Having talked to Ian Campbell earlier this week, he also can't
>>> manage to run guests in Xen on this platform, so there's something
>>> rather fishy here.
>>
>> Interesting. The announcements of his PSCI patches [1] sounded more
>> promising. Maybe it was only referring to getting the hypervisor itself
>> running...
> 
> This is my understanding so far.
> 
>> To my current (still limited understanding) of that platform would say
>> that this little core is parked after power-up of the main APs. And as
>> we do not power them down, there is no reason to perform a cluster
>> switch or anything similarly nasty, no?
> 
> I can't see why this would happen, but I've learned not to assume
> anything when it come to braindead creativity on the HW side...

True.

Jan





Re: arm: warning at virt/kvm/arm/vgic.c:1468

2015-02-15 Thread Jan Kiszka

On 2015-02-15 15:59, Marc Zyngier wrote:
> On Sun, Feb 15 2015 at  2:40:40 pm GMT, Jan Kiszka  wrote:
>> On 2015-02-15 14:37, Marc Zyngier wrote:
>>> On Sun, Feb 15 2015 at  8:53:30 am GMT, Jan Kiszka  
>>> wrote:
>>>> I'm now throwing trace_printk at my broken KVM. Already found out that I
>>>> get ARM_EXCEPTION_IRQ every few 10 µs. Not seeing any irq_* traces,
>>>> though. Weird.
>>>
>>> This very much looks like a screaming interrupt. At such a rate, no
>>> wonder your VM make much progress. Can you find out which interrupt is
>>> screaming like this? Looking at GICC_HPPIR should help, but you'll have
>>> to map the CPU interface in HYP before being able to access it there.
>>
>> OK... let me figure this out. I had this suspect as well - the host gets
>> a VM exit for each injected guest IRQ?
> 
> Not exactly. There is a VM exit for each physical interrupt that fires
> while the guest is running. Injecting an interrupt also causes a VM
> exit, as we force the vcpu to reload its context.

Ah, GICC != GICV - you are referring to host-side pending IRQs. Any
hints on how to get access to that register would accelerate the
analysis (ARM KVM code is still new to me).

> 
>> BTW, I also tried with in-kernel GIC disabled (in the kernel config),
>> but I guess that's pointless. Linux seems to be stuck on a
>> non-functional architectural timer then, right?
> 
> Yes. Useful for bringup, but nothing more.

Maybe we should perform a feature check and issue a warning from QEMU?

> 
>>>
>>> Do you have an form of power-management on this system?
>>
>> Just killed every config that has PM for FREQ in its name, but that
>> makes no difference.
> 
> I still wonder if the 4+1 design on the K1 is not playing tricks behind
> our back. Having talked to Ian Campbell earlier this week, he also can't
> manage to run guests in Xen on this platform, so there's something
> rather fishy here.

Interesting. The announcements of his PSCI patches [1] sounded more
promising. Maybe it was only referring to getting the hypervisor itself
running...

My current (still limited) understanding of that platform is that this
little core is parked after power-up of the main APs. And as we do not
power them down, there is no reason to perform a cluster switch or
anything similarly nasty, no?

Jan

PS: For those with such a board in reach, newer U-Boot patches are
available at [2] now.

[1] http://permalink.gmane.org/gmane.comp.boot-loaders.u-boot/208034
[2] https://github.com/siemens/u-boot/commits/jetson-tk1-v2




Re: arm: warning at virt/kvm/arm/vgic.c:1468

2015-02-15 Thread Jan Kiszka

On 2015-02-15 14:37, Marc Zyngier wrote:
> On Sun, Feb 15 2015 at  8:53:30 am GMT, Jan Kiszka  wrote:
>> I'm now throwing trace_printk at my broken KVM. Already found out that I
>> get ARM_EXCEPTION_IRQ every few 10 µs. Not seeing any irq_* traces,
>> though. Weird.
> 
> This very much looks like a screaming interrupt. At such a rate, no
> wonder your VM make much progress. Can you find out which interrupt is
> screaming like this? Looking at GICC_HPPIR should help, but you'll have
> to map the CPU interface in HYP before being able to access it there.

OK... let me figure this out. I had this suspect as well - the host gets
a VM exit for each injected guest IRQ?

BTW, I also tried with in-kernel GIC disabled (in the kernel config),
but I guess that's pointless. Linux seems to be stuck on a
non-functional architectural timer then, right?

> 
> Do you have an form of power-management on this system?

Just killed every config option that has PM or FREQ in its name, but that
makes no difference.

Jan


Re: arm: warning at virt/kvm/arm/vgic.c:1468

2015-02-15 Thread Jan Kiszka

On 2015-02-13 07:53, Alex Bennée wrote:
> 
> Alex Bennée  writes:
> 
>> Christoffer Dall  writes:
> 
>>> On Sun, Feb 08, 2015 at 08:48:09AM +0100, Jan Kiszka wrote:
> 
>>>> BTW, KVM tracing support on ARM seems like it requires some care. E.g.:
>>>> kvm_exit does not report an exit reason. The in-kernel vgic also seems
>>>> to lack instrumentation. Unfortunate. Tracing is usually the first stop
>>>> when KVM is stuck on a guest.
>>>
>>> I know, the exit reason is on my todo list, and Alex B is sitting on
>>> trace patches for the gic.  Coming soon to a git repo near your.
>>
>> For the impatient the raw patches are in:
>>
>> git.linaro.org/people/alex.bennee/linux.git
>> migration/v3.19-rc7-improve-tracing
> 
> OK try tracing/kvm-exit-entry for something cleaner.

Doesn't build for ARM (vcpu_sys_reg is ARM64-only so far).

But the values traced seem useful. Wei Huang's patch in kvm.git queue
traces the exception class, but unfortunately nothing else. When would
we need that class? Do we need it at all?

In any case, please add symbolic printing of the magic values whenever
possible, just like on x86.

I'm now throwing trace_printk at my broken KVM. Already found out that I
get ARM_EXCEPTION_IRQ every few 10 µs. Not seeing any irq_* traces,
though. Weird.

Thanks,
Jan


Re: arm: warning at virt/kvm/arm/vgic.c:1468

2015-02-12 Thread Jan Kiszka

Hi Christoffer,

On 2015-02-13 05:46, Christoffer Dall wrote:
> Hi Jan,
> 
> On Sun, Feb 08, 2015 at 08:48:09AM +0100, Jan Kiszka wrote:
>> Hi,
>>
>> after fixing the VM_BUG_ON, my QEMU guest on the Jetson TK1 generally
>> refuses to boot. Once in a while it does, but quickly gets stuck again.
>> In one case I found this in the kernel log (never happened again so
>> far):
>>
>> [  762.022874] WARNING: CPU: 1 PID: 972 at ../arch/arm/kvm/../../../virt/kvm/arm/vgic.c:1468 kvm_vgic_sync_hwstate+0x314/0x344()
>> [  762.022884] Modules linked in:
>> [  762.022902] CPU: 1 PID: 972 Comm: qemu-system-arm Not tainted 3.19.0-rc7-00221-gfd7a168-dirty #13
>> [  762.022911] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
>> [  762.022937] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
>> [  762.022958] [] (show_stack) from [] (dump_stack+0x98/0xd8)
>> [  762.022976] [] (dump_stack) from [] (warn_slowpath_common+0x80/0xb0)
>> [  762.022991] [] (warn_slowpath_common) from [] (warn_slowpath_null+0x1c/0x24)
>> [  762.023007] [] (warn_slowpath_null) from [] (kvm_vgic_sync_hwstate+0x314/0x344)
>> [  762.023024] [] (kvm_vgic_sync_hwstate) from [] (kvm_arch_vcpu_ioctl_run+0x210/0x400)
>> [  762.023041] [] (kvm_arch_vcpu_ioctl_run) from [] (kvm_vcpu_ioctl+0x2e4/0x6ec)
>> [  762.023059] [] (kvm_vcpu_ioctl) from [] (do_vfs_ioctl+0x40c/0x600)
>> [  762.023076] [] (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
>> [  762.023091] [] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x34)
> 
> so this means your guest caused a maintenance interrupt and the bit is
> set in the GICH_EISR for the LR in question but the link register state
> is not 0, which is in direct violation of the GIC spec.  H.
> 
> You're not doing any IRQ forwarding stuff or device passthrough here are
> you?

No, just boring emulation. The command line is

qemu-system-arm -machine vexpress-a15 -kernel zImage -serial mon:stdio
-append 'console=ttyAMA0 root=/dev/mmcblk0 rw' -snapshot -sd
OpenSuse13-1_arm.img -dtb vexpress-v2p-ca15-tc1.dtb -s -enable-kvm

> 
>>
>>
>> BTW, KVM tracing support on ARM seems like it requires some care. E.g.:
>> kvm_exit does not report an exit reason. The in-kernel vgic also seems
>> to lack instrumentation. Unfortunate. Tracing is usually the first stop
>> when KVM is stuck on a guest.
> 
> I know, the exit reason is on my todo list, and Alex B is sitting on
> trace patches for the gic.  Coming soon to a git repo near your.

Cool, looking forward.

Next thing I noticed is that guest debugging via qemu causes troubles in
kvm mode. For some reason, qemu is unable to write soft-breakpoints,
thus not even a single-step works. Also known?

Jan




