

[PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Wanpeng Li
This patch fixes bug https://bugzilla.kernel.org/show_bug.cgi?id=72381

If we didn't inject a still-pending event to L1 because of nested_run_pending,
KVM_REQ_EVENT should be requested after the vmexit in order to inject the
event to L1. However, the current logic blindly requests a KVM_REQ_EVENT even
if there is no still-pending event to L1 that was blocked by
nested_run_pending. This opens a race window: if L0 sends an interrupt to L1
while the window is open, the interrupt is injected into L2 even though it
belongs to L1.

               VCPU0                         another thread

L1 intr not blocked on L2 first entry
vmx_vcpu_run req event
kvm check request req event
check_nested_events don't have any intr
not nested exit
                                             intr occur (8254, lapic timer etc)
inject_pending_event now have intr
inject interrupt

This patch fixes the race by introducing an l1_events_blocked field in
nested_vmx, which indicates that a still-pending event was blocked by
nested_run_pending, and by requesting a KVM_REQ_EVENT only when there is such
a blocked still-pending event.

Signed-off-by: Wanpeng Li <wanpeng...@linux.intel.com>
---
 arch/x86/kvm/vmx.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f4e5aed..fe69c49 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -372,6 +372,7 @@ struct nested_vmx {
 	u64 vmcs01_tsc_offset;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
+	bool l1_events_blocked;
 	/*
 	 * Guest pages referred to in vmcs02 with host-physical pointers, so
 	 * we must keep them pinned while L2 runs.
@@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	 * we did not inject a still-pending event to L1 now because of
 	 * nested_run_pending, we need to re-enable this bit.
 	 */
-	if (vmx->nested.nested_run_pending)
+	if (to_vmx(vcpu)->nested.l1_events_blocked) {
+		to_vmx(vcpu)->nested.l1_events_blocked = false;
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
+	}
 
 	vmx->nested.nested_run_pending = 0;
 
@@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 
 	if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) &&
 	    vmx->nested.preemption_timer_expired) {
-		if (vmx->nested.nested_run_pending)
+		if (vmx->nested.nested_run_pending) {
+			vmx->nested.l1_events_blocked = true;
 			return -EBUSY;
+		}
 		nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0);
 		return 0;
 	}
 
 	if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) {
-		if (vmx->nested.nested_run_pending ||
-		    vcpu->arch.interrupt.pending)
+		if (vmx->nested.nested_run_pending) {
+			vmx->nested.l1_events_blocked = true;
+			return -EBUSY;
+		}
+		if (vcpu->arch.interrupt.pending)
 			return -EBUSY;
 		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
 				  NMI_VECTOR | INTR_TYPE_NMI_INTR |
@@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 
 	if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
 	    nested_exit_on_intr(vcpu)) {
-		if (vmx->nested.nested_run_pending)
+		if (vmx->nested.nested_run_pending) {
+			vmx->nested.l1_events_blocked = true;
 			return -EBUSY;
+		}
 		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
 	}
 
-- 
1.9.1
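
To see the shape of the fix in isolation, the logic can be reduced to a
self-contained toy model (plain C, not KVM code; only the two flag names
mirror the patch, everything else is simplified):

#include <stdbool.h>
#include <stdio.h>

static bool nested_run_pending;
static bool l1_events_blocked;
static bool req_event;

/* Stand-in for vmx_check_nested_events(); -1 models -EBUSY. */
static int check_nested_events(bool l1_event_pending)
{
	if (l1_event_pending && nested_run_pending) {
		l1_events_blocked = true;  /* new: remember *why* we deferred */
		return -1;
	}
	return 0;
}

/* Stand-in for the tail of vmx_vcpu_run(). */
static void vcpu_run_tail(void)
{
	/*
	 * Old code: if (nested_run_pending) req_event = true;
	 * New code: re-request only when an event was actually deferred.
	 */
	if (l1_events_blocked) {
		l1_events_blocked = false;
		req_event = true;
	}
	nested_run_pending = false;
}

int main(void)
{
	/* First entry to L2: nested_run_pending set, no L1 event pending. */
	nested_run_pending = true;
	check_nested_events(false);
	vcpu_run_tail();

	/*
	 * With the fix this prints "no": since nothing was deferred, a late
	 * L1 interrupt cannot ride a stale KVM_REQ_EVENT into L2.
	 */
	printf("KVM_REQ_EVENT requested: %s\n", req_event ? "yes" : "no");
	return 0;
}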



RE: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Hu, Robert
 -Original Message-
 From: Wanpeng Li [mailto:wanpeng...@linux.intel.com]
 Sent: Wednesday, July 2, 2014 2:54 PM
 To: Paolo Bonzini; Jan Kiszka; Gleb Natapov
 Cc: Hu, Robert; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Wanpeng Li
 Subject: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since 
 race
 
 This patch fixes bug https://bugzilla.kernel.org/show_bug.cgi?id=72381
 
 If we didn't inject a still-pending event to L1 because of nested_run_pending,
 KVM_REQ_EVENT should be requested after the vmexit in order to inject the
 event to L1. However, the current logic blindly requests a KVM_REQ_EVENT even
 if there is no still-pending event to L1 that was blocked by
 nested_run_pending. This opens a race window: if L0 sends an interrupt to L1
 while the window is open, the interrupt is injected into L2 even though it
 belongs to L1.
 
                VCPU0                         another thread
 
 L1 intr not blocked on L2 first entry
 vmx_vcpu_run req event
 kvm check request req event
 check_nested_events don't have any intr
 not nested exit
                                              intr occur (8254, lapic timer etc)
 inject_pending_event now have intr
 inject interrupt
 
 This patch fixes the race by introducing an l1_events_blocked field in
 nested_vmx, which indicates that a still-pending event was blocked by
 nested_run_pending, and by requesting a KVM_REQ_EVENT only when there is such
 a blocked still-pending event.
 
 Signed-off-by: Wanpeng Li <wanpeng...@linux.intel.com>
Tested-by: Robert Hu <robert...@intel.com>
 [...]



[PATCH 4/4] kvm, mem-hotplug: Update apic access page when it is migrated.

2014-07-02 Thread Tang Chen
The apic access page is pinned in memory, and as a result it cannot be
migrated/hot-removed.

Actually it doesn't need to be pinned in memory.

This patch introduces a new vcpu request: KVM_REQ_MIGRATE_APIC. This request
is made when kvm_mmu_notifier_invalidate_page() is called as the page is
unmapped from the qemu user space, to reset the APIC_ACCESS_ADDR pointer in
each online vcpu to 0. It is also made when an ept violation happens, to
reset APIC_ACCESS_ADDR to the new page's phys_addr (host phys_addr).
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.c              | 15 +++++++++++++++
 arch/x86/kvm/vmx.c              |  9 ++++++++-
 arch/x86/kvm/x86.c              | 20 ++++++++++++++++++++
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             | 15 +++++++++++++++
 6 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8771c0f..f104b87 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -575,6 +575,7 @@ struct kvm_arch {
 
 	unsigned int tss_addr;
 	struct page *apic_access_page;
+	bool apic_access_page_migrated;
 
 	gpa_t wall_clock;
 
@@ -739,6 +740,7 @@ struct kvm_x86_ops {
 	void (*hwapic_isr_update)(struct kvm *kvm, int isr);
 	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
 	void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
+	void (*set_apic_access_page_addr)(struct kvm *kvm, hpa_t hpa);
 	void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector);
 	void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c0d72f6..a655444 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3436,6 +3436,21 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 		kvm_make_request(KVM_REQ_MIGRATE_EPT, vcpu);
 	}
 
+	if (gpa == VMX_APIC_ACCESS_PAGE_ADDR &&
+	    vcpu->kvm->arch.apic_access_page_migrated) {
+		int i;
+
+		vcpu->kvm->arch.apic_access_page_migrated = false;
+
+		/*
+		 * We need to update the APIC_ACCESS_ADDR pointer in each
+		 * VMCS of all the online vcpus.
+		 */
+		for (i = 0; i < atomic_read(&vcpu->kvm->online_vcpus); i++)
+			kvm_make_request(KVM_REQ_MIGRATE_APIC,
+					 vcpu->kvm->vcpus[i]);
+	}
+
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	return r;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c336cb3..abc152f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3988,7 +3988,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
 	if (r)
 		goto out;
 
-	page = gfn_to_page(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
+	page = gfn_to_page_no_pin(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
 	if (is_error_page(page)) {
 		r = -EFAULT;
 		goto out;
@@ -7075,6 +7075,12 @@ static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
 	vmx_set_msr_bitmap(vcpu);
 }
 
+static void vmx_set_apic_access_page_addr(struct kvm *kvm, hpa_t hpa)
+{
+	if (vm_need_virtualize_apic_accesses(kvm))
+		vmcs_write64(APIC_ACCESS_ADDR, hpa);
+}
+
 static void vmx_hwapic_isr_update(struct kvm *kvm, int isr)
 {
 	u16 status;
@@ -8846,6 +8852,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.enable_irq_window = enable_irq_window,
 	.update_cr8_intercept = update_cr8_intercept,
 	.set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,
+	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
 	.vm_has_apicv = vmx_vm_has_apicv,
 	.load_eoi_exitmap = vmx_load_eoi_exitmap,
 	.hwapic_irr_update = vmx_hwapic_irr_update,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a26524f..14e7174 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5943,6 +5943,24 @@ static void vcpu_migrated_page_update_ept(struct kvm_vcpu *vcpu)
 	}
 }
 
+static void vcpu_migrated_page_update_apic(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	if (kvm->arch.apic_access_page_migrated) {
+		if (kvm->arch.apic_access_page)
+			kvm->arch.apic_access_page = pfn_to_page(0);
+		kvm_x86_ops->set_apic_access_page_addr(kvm, 0x0ull);
+	} else {
+		struct page *page;
+		page = gfn_to_page_no_pin(kvm,
+				VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
+		kvm->arch.apic_access_page = page;
+		kvm_x86_ops->set_apic_access_page_addr(kvm,
+						       page_to_phys(page));
+	}
+}
+
 /*
  * Returns 1 to let __vcpu_run() 

[PATCH 2/4] kvm: Add macro VMX_APIC_ACCESS_PAGE_ADDR

2014-07-02 Thread Tang Chen
Define the guest phys_addr of the apic access page.
---
 arch/x86/include/asm/vmx.h | 2 +-
 arch/x86/kvm/svm.c         | 3 ++-
 arch/x86/kvm/vmx.c         | 7 ++++---
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 7004d21..c4672d1 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -422,7 +422,7 @@ enum vmcs_field {
 #define VMX_EPT_DIRTY_BIT			(1ull << 9)
 
 #define VMX_EPT_IDENTITY_PAGETABLE_ADDR		0xfffbc000ul
-
+#define VMX_APIC_ACCESS_PAGE_ADDR		0xfee00000ull
 
 #define ASM_VMX_VMCLEAR_RAX		".byte 0x66, 0x0f, 0xc7, 0x30"
 #define ASM_VMX_VMLAUNCH		".byte 0x0f, 0x01, 0xc2"
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ec8366c..22aa2ae 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1257,7 +1257,8 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 	svm->asid_generation = 0;
 	init_vmcb(svm);
 
-	svm->vcpu.arch.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
+	svm->vcpu.arch.apic_base = VMX_APIC_ACCESS_PAGE_ADDR |
+				   MSR_IA32_APICBASE_ENABLE;
 	if (kvm_vcpu_is_bsp(&svm->vcpu))
 		svm->vcpu.arch.apic_base |= MSR_IA32_APICBASE_BSP;
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 801332e..366b5b3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3982,13 +3982,13 @@ static int alloc_apic_access_page(struct kvm *kvm)
 		goto out;
 	kvm_userspace_mem.slot = APIC_ACCESS_PAGE_PRIVATE_MEMSLOT;
 	kvm_userspace_mem.flags = 0;
-	kvm_userspace_mem.guest_phys_addr = 0xfee00000ULL;
+	kvm_userspace_mem.guest_phys_addr = VMX_APIC_ACCESS_PAGE_ADDR;
 	kvm_userspace_mem.memory_size = PAGE_SIZE;
 	r = __kvm_set_memory_region(kvm, &kvm_userspace_mem);
 	if (r)
 		goto out;
 
-	page = gfn_to_page(kvm, 0xfee00);
+	page = gfn_to_page(kvm, VMX_APIC_ACCESS_PAGE_ADDR >> PAGE_SHIFT);
 	if (is_error_page(page)) {
 		r = -EFAULT;
 		goto out;
@@ -4460,7 +4460,8 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 
 	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
 	kvm_set_cr8(&vmx->vcpu, 0);
-	apic_base_msr.data = 0xfee00000 | MSR_IA32_APICBASE_ENABLE;
+	apic_base_msr.data = VMX_APIC_ACCESS_PAGE_ADDR |
+			     MSR_IA32_APICBASE_ENABLE;
 	if (kvm_vcpu_is_bsp(&vmx->vcpu))
 		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
 	apic_base_msr.host_initiated = true;
-- 
1.8.3.1



[PATCH 3/4] kvm, memory-hotplug: Update ept identity pagetable when it is migrated.

2014-07-02 Thread Tang Chen
The ept identity pagetable is pinned in memory, and as a result it cannot be
migrated/hot-removed.

But actually it doesn't need to be pinned in memory.

This patch introduces a new vcpu request, KVM_REQ_MIGRATE_EPT, to reset the
ept identity pagetable related variables. This request is made when
kvm_mmu_notifier_invalidate_page() is called as the page is unmapped from
the qemu user space, to reset kvm->arch.ept_identity_pagetable to NULL. It
is also made when an ept violation happens, to reset
kvm->arch.ept_identity_pagetable to the new page.
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu.c              | 11 +++++++++++
 arch/x86/kvm/vmx.c              |  3 ++-
 arch/x86/kvm/x86.c              | 16 ++++++++++++++++
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             |  6 ++++++
 6 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4931415..8771c0f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -581,6 +581,7 @@ struct kvm_arch {
 	struct page *ept_identity_pagetable;
 	bool ept_identity_pagetable_done;
 	gpa_t ept_identity_map_addr;
+	bool ept_identity_pagetable_migrated;
 
 	unsigned long irq_sources_bitmap;
 	s64 kvmclock_offset;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..c0d72f6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3425,6 +3425,17 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
 	r = __direct_map(vcpu, gpa, write, map_writable,
 			 level, gfn, pfn, prefault);
+
+	/*
+	 * Update the ept identity pagetable page and the apic access page
+	 * if they have been migrated.
+	 */
+	if (gpa == vcpu->kvm->arch.ept_identity_map_addr &&
+	    vcpu->kvm->arch.ept_identity_pagetable_migrated) {
+		vcpu->kvm->arch.ept_identity_pagetable_migrated = false;
+		kvm_make_request(KVM_REQ_MIGRATE_EPT, vcpu);
+	}
+
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	return r;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 366b5b3..c336cb3 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4018,7 +4018,8 @@ static int alloc_identity_pagetable(struct kvm *kvm)
 	if (r)
 		goto out;
 
-	page = gfn_to_page(kvm, kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
+	page = gfn_to_page_no_pin(kvm,
+			kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
 	if (is_error_page(page)) {
 		r = -EFAULT;
 		goto out;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f32a025..a26524f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5929,6 +5929,20 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
 	kvm_apic_update_tmr(vcpu, tmr);
 }
 
+static void vcpu_migrated_page_update_ept(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	if (kvm->arch.ept_identity_pagetable_migrated)
+		kvm->arch.ept_identity_pagetable = NULL;
+	else {
+		struct page *page;
+		page = gfn_to_page_no_pin(kvm,
+				kvm->arch.ept_identity_map_addr >> PAGE_SHIFT);
+		kvm->arch.ept_identity_pagetable = page;
+	}
+}
+
 /*
  * Returns 1 to let __vcpu_run() continue the guest execution loop without
  * exiting to the userspace.  Otherwise, the value will be returned to the
@@ -5989,6 +6003,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			kvm_deliver_pmi(vcpu);
 		if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
 			vcpu_scan_ioapic(vcpu);
+		if (kvm_check_request(KVM_REQ_MIGRATE_EPT, vcpu))
+			vcpu_migrated_page_update_ept(vcpu);
 	}
 
 	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7c58d9d..4b7e51a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -136,6 +136,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_GLOBAL_CLOCK_UPDATE 22
 #define KVM_REQ_ENABLE_IBS        23
 #define KVM_REQ_DISABLE_IBS       24
+#define KVM_REQ_MIGRATE_EPT       25
 
 #define KVM_USERSPACE_IRQ_SOURCE_ID        0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID   1
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6091849..d271e89 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -294,6 +294,12 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 	if (need_tlb_flush)
 		kvm_flush_remote_tlbs(kvm);
 
+	if (address ==
+	    gfn_to_hva(kvm, kvm->arch.ept_identity_map_addr >> PAGE_SHIFT)) {
+		kvm->arch.ept_identity_pagetable_migrated = true;
+

[PATCH 0/4] kvm, mem-hotplug: Do not pin ept identity pagetable and apic access page.

2014-07-02 Thread Tang Chen
Hi Gleb, Marcelo,

Please help to review this patch-set.

NOTE: This patch-set doesn't work properly.


The ept identity pagetable and the apic access page in kvm are pinned in
memory. As a result, they cannot be migrated/hot-removed.

But actually they don't need to be pinned in memory.

This patch-set introduces two new vcpu requests: KVM_REQ_MIGRATE_EPT and
KVM_REQ_MIGRATE_APIC. These two requests are made when the two pages are
migrated by the mmu_notifier, to reset the related variables to unusable
values. They are also made when an ept violation happens, to pick up the
new pages.


[Known problem]
After this patch-set is applied, the two pages can be migrated/hot-removed,
but after the apic access page is migrated, the guest dies.

The host physical address of the apic access page is stored in the VMCS. I
reset it to 0 to stop the guest from accessing the page when it is unmapped
by kvm_mmu_notifier_invalidate_page(), and reset it to the new page's host
physical address in tdp_page_fault(). But it seems that the guest still
accesses the apic page directly through the old host physical address.


Tang Chen (4):
  kvm: Add gfn_to_page_no_pin()
  kvm: Add macro VMX_APIC_ACCESS_PAGE_ADDR
  kvm, memory-hotplug: Update ept identity pagetable when it is
migrated.
  kvm, mem-hotplug: Update apic access page when it is migrated.

 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/include/asm/vmx.h  |  2 +-
 arch/x86/kvm/mmu.c  | 26 ++
 arch/x86/kvm/svm.c  |  3 ++-
 arch/x86/kvm/vmx.c  | 17 +
 arch/x86/kvm/x86.c  | 36 
 include/linux/kvm_host.h|  3 +++
 virt/kvm/kvm_main.c | 38 +-
 8 files changed, 121 insertions(+), 7 deletions(-)

-- 
1.8.3.1



[PATCH 1/4] kvm: Add gfn_to_page_no_pin()

2014-07-02 Thread Tang Chen
Used by the following patches.
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 17 ++++++++++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ec4e3bd..7c58d9d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -541,6 +541,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages,
 			    int nr_pages);
 
 struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
+struct page *gfn_to_page_no_pin(struct kvm *kvm, gfn_t gfn);
 unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
 unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable);
 unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4b6c01b..6091849 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1371,9 +1371,24 @@ struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn)
 
 	return kvm_pfn_to_page(pfn);
 }
-
 EXPORT_SYMBOL_GPL(gfn_to_page);
 
+struct page *gfn_to_page_no_pin(struct kvm *kvm, gfn_t gfn)
+{
+	struct page *page = gfn_to_page(kvm, gfn);
+
+	/*
+	 * gfn_to_page() will finally call hva_to_pfn() to get the pfn, and pin
+	 * the page in memory by calling GUP functions. This function unpins
+	 * the page.
+	 */
+	if (!is_error_page(page))
+		put_page(page);
+
+	return page;
+}
+EXPORT_SYMBOL_GPL(gfn_to_page_no_pin);
+
 void kvm_release_page_clean(struct page *page)
 {
 	WARN_ON(is_error_page(page));
-- 
1.8.3.1
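
A hypothetical caller pattern (not part of the patch) may make the semantics
clearer: the returned struct page is usable immediately, but because it is
not pinned it can be migrated at any time, which is why the later patches in
this series re-establish the mapping from the mmu_notifier and ept-violation
paths:

/* Hypothetical user of gfn_to_page_no_pin(); illustration only. */
static int example_cache_page(struct kvm *kvm, gfn_t gfn,
			      struct page **cache)
{
	struct page *page = gfn_to_page_no_pin(kvm, gfn);

	if (is_error_page(page))
		return -EFAULT;

	/*
	 * Not pinned: the pointer is only trustworthy until the page is
	 * migrated, so the caller must be prepared to refresh it (e.g. on
	 * a KVM_REQ_MIGRATE_* request, as the rest of this series does).
	 */
	*cache = page;
	return 0;
}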





Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Jan Kiszka
On 2014-07-02 08:54, Wanpeng Li wrote:
 This patch fixes bug https://bugzilla.kernel.org/show_bug.cgi?id=72381
 
 If we didn't inject a still-pending event to L1 because of nested_run_pending,
 KVM_REQ_EVENT should be requested after the vmexit in order to inject the
 event to L1. However, the current logic blindly requests a KVM_REQ_EVENT even
 if there is no still-pending event to L1 that was blocked by
 nested_run_pending. This opens a race window: if L0 sends an interrupt to L1
 while the window is open, the interrupt is injected into L2 even though it
 belongs to L1.
 
                VCPU0                         another thread
 
 L1 intr not blocked on L2 first entry
 vmx_vcpu_run req event
 kvm check request req event
 check_nested_events don't have any intr
 not nested exit
                                              intr occur (8254, lapic timer etc)
 inject_pending_event now have intr
 inject interrupt
 
 This patch fixes the race by introducing an l1_events_blocked field in
 nested_vmx, which indicates that a still-pending event was blocked by
 nested_run_pending, and by requesting a KVM_REQ_EVENT only when there is
 such a blocked still-pending event.

There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why
aren't those able to trigger this scenario?

In any case, unconditionally setting KVM_REQ_EVENT seems strange and
should be changed.

Jan

 
 [...]

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Jan Kiszka
On 2014-07-02 09:20, Hu, Robert wrote:
 -Original Message-
 From: Wanpeng Li [mailto:wanpeng...@linux.intel.com]
 Sent: Wednesday, July 2, 2014 2:54 PM
 To: Paolo Bonzini; Jan Kiszka; Gleb Natapov
 Cc: Hu, Robert; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Wanpeng Li
 Subject: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since 
 race

 [...]

 Signed-off-by: Wanpeng Li <wanpeng...@linux.intel.com>
 Tested-by: Robert Hu <robert...@intel.com>

Do you happen to have a kvm-unit-test for this race? Or how did you
test? Just curious, and if there is a test, it would be good to
integrate it.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


RE: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Hu, Robert

 -Original Message-
 From: Jan Kiszka [mailto:jan.kis...@siemens.com]
 Sent: Wednesday, July 2, 2014 5:03 PM
 To: Hu, Robert; Wanpeng Li; Paolo Bonzini; Gleb Natapov
 Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since
 race
 
 On 2014-07-02 09:20, Hu, Robert wrote:
  -Original Message-
  From: Wanpeng Li [mailto:wanpeng...@linux.intel.com]
  Sent: Wednesday, July 2, 2014 2:54 PM
  To: Paolo Bonzini; Jan Kiszka; Gleb Natapov
  Cc: Hu, Robert; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Wanpeng
 Li
  Subject: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since
 race
 
  [...]
 
  Signed-off-by: Wanpeng Li <wanpeng...@linux.intel.com>
  Tested-by: Robert Hu <robert...@intel.com>
 
 Do you happen to have a kvm-unit-test for this race? Or how did you
 test? Just curious, and if there is a test, it would be good to
 integrate it.
I just applied the patch and tested the reported scenario.
 
 Jan
 
 --
 Siemens AG, Corporate Technology, CT RTC ITP SES-DE
 Corporate Competence Center Embedded Linux


Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Jan Kiszka
On 2014-07-02 11:13, Hu, Robert wrote:
 
 -Original Message-
 From: Jan Kiszka [mailto:jan.kis...@siemens.com]
 Sent: Wednesday, July 2, 2014 5:03 PM
 To: Hu, Robert; Wanpeng Li; Paolo Bonzini; Gleb Natapov
 Cc: kvm@vger.kernel.org; linux-ker...@vger.kernel.org
 Subject: Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 
 since
 race

 On 2014-07-02 09:20, Hu, Robert wrote:
 -Original Message-
 From: Wanpeng Li [mailto:wanpeng...@linux.intel.com]
 Sent: Wednesday, July 2, 2014 2:54 PM
 To: Paolo Bonzini; Jan Kiszka; Gleb Natapov
 Cc: Hu, Robert; kvm@vger.kernel.org; linux-ker...@vger.kernel.org; Wanpeng
 Li
 Subject: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since
 race

 [...]

 Signed-off-by: Wanpeng Li <wanpeng...@linux.intel.com>
 Tested-by: Robert Hu <robert...@intel.com>

 Do you happen to have a kvm-unit-test for this race? Or how did you
 test? Just curious, and if there is a test, it would be good to
 integrate it.
 I just apply the patch and test the reported scenario.

Ah, sorry, missed the referenced bug report.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH v2 1/4] KVM: device: add simple registration mechanism for kvm_device_ops

2014-07-02 Thread Cornelia Huck
On Tue,  1 Jul 2014 15:45:15 +0100
Will Deacon <will.dea...@arm.com> wrote:

 kvm_ioctl_create_device currently has knowledge of all the device types
 and their associated ops. This is fairly inflexible when adding support
 for new in-kernel device emulations, so move what we currently have out
 into a table, which can support dynamic registration of ops by new
 drivers for virtual hardware.
 
 I didn't try to port all current drivers over, as it's not always clear
 which initialisation hook the ops should be registered from.

I think that last paragraph should rather go into a cover letter :)

 
 Cc: Cornelia Huck <cornelia.h...@de.ibm.com>
 Cc: Alex Williamson <alex.william...@redhat.com>
 Cc: Alex Graf <ag...@suse.de>
 Cc: Gleb Natapov <g...@kernel.org>
 Cc: Paolo Bonzini <pbonz...@redhat.com>
 Cc: Marc Zyngier <marc.zyng...@arm.com>
 Cc: Christoffer Dall <christoffer.d...@linaro.org>
 Signed-off-by: Will Deacon <will.dea...@arm.com>
 ---
 
 v1 -> v2: Added enum for KVM_DEV_TYPE* IDs, changed limits to ARRAY_SIZE,
   removed stray semicolon, had a crack at porting VFIO, included
 Cornelia's s390 FLIC patch.

...and the changelog as well (or keep changelogs for individual
patches).

 
  include/linux/kvm_host.h |  1 +
  include/uapi/linux/kvm.h | 22 +++-
  virt/kvm/kvm_main.c  | 65 
 
  3 files changed, 55 insertions(+), 33 deletions(-)
 

 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 index e11d8f170a62..6875cc225dff 100644
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -940,15 +940,25 @@ struct kvm_device_attr {
  	__u64	addr;		/* userspace address of attr data */
  };
 
 -#define KVM_DEV_TYPE_FSL_MPIC_20	1
 -#define KVM_DEV_TYPE_FSL_MPIC_42	2
 -#define KVM_DEV_TYPE_XICS		3
 -#define KVM_DEV_TYPE_VFIO		4
  #define  KVM_DEV_VFIO_GROUP		1
  #define   KVM_DEV_VFIO_GROUP_ADD	1
  #define   KVM_DEV_VFIO_GROUP_DEL	2
 -#define KVM_DEV_TYPE_ARM_VGIC_V2	5
 -#define KVM_DEV_TYPE_FLIC		6
 +
 +enum kvm_device_type {
 +	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
 +#define KVM_DEV_TYPE_FSL_MPIC_20	KVM_DEV_TYPE_FSL_MPIC_20
 +	KVM_DEV_TYPE_FSL_MPIC_42,
 +#define KVM_DEV_TYPE_FSL_MPIC_42	KVM_DEV_TYPE_FSL_MPIC_42
 +	KVM_DEV_TYPE_XICS,
 +#define KVM_DEV_TYPE_XICS		KVM_DEV_TYPE_XICS
 +	KVM_DEV_TYPE_VFIO,
 +#define KVM_DEV_TYPE_VFIO		KVM_DEV_TYPE_VFIO
 +	KVM_DEV_TYPE_ARM_VGIC_V2,
 +#define KVM_DEV_TYPE_ARM_VGIC_V2	KVM_DEV_TYPE_ARM_VGIC_V2
 +	KVM_DEV_TYPE_FLIC,
 +#define KVM_DEV_TYPE_FLIC		KVM_DEV_TYPE_FLIC
 +	KVM_DEV_TYPE_MAX,

Do you want to add the individual values to the enum? A removal of a
type would be an API break (so we should be safe against renumbering),
but it's sometimes helpful if one can get the numeric value at a glance.

 +};
 
  /*
   * ioctls for VM fds



Re: [PATCH v2 1/4] KVM: device: add simple registration mechanism for kvm_device_ops

2014-07-02 Thread Will Deacon
On Wed, Jul 02, 2014 at 10:17:20AM +0100, Cornelia Huck wrote:
 On Tue,  1 Jul 2014 15:45:15 +0100
 Will Deacon <will.dea...@arm.com> wrote:
 
  kvm_ioctl_create_device currently has knowledge of all the device types
  and their associated ops. This is fairly inflexible when adding support
  for new in-kernel device emulations, so move what we currently have out
  into a table, which can support dynamic registration of ops by new
  drivers for virtual hardware.
  
  I didn't try to port all current drivers over, as it's not always clear
  which initialisation hook the ops should be registered from.
 
 I think that last paragraph should rather go into a cover letter :)
 
  
  Cc: Cornelia Huck cornelia.h...@de.ibm.com
  Cc: Alex Williamson alex.william...@redhat.com
  Cc: Alex Graf ag...@suse.de
  Cc: Gleb Natapov g...@kernel.org
  Cc: Paolo Bonzini pbonz...@redhat.com
  Cc: Marc Zyngier marc.zyng...@arm.com
  Cc: Christoffer Dall christoffer.d...@linaro.org
  Signed-off-by: Will Deacon will.dea...@arm.com
  ---
  
  v1 -> v2: Added enum for KVM_DEV_TYPE* IDs, changed limits to ARRAY_SIZE,
removed stray semicolon, had a crack at porting VFIO, included
Cornelia's s390 FLIC patch.
 
 ...and the changelog as well (or keep changelogs for individual
 patches).

Yeah... this has grown to be bigger than one patch now. I can include that
for v3.

  [...]
 
 Do you want to add the individual values to the enum? A removal of a
 type would be an API break (so we should be safe against renumbering),
 but it's sometimes helpful if one can get the numeric value at a glance.

Could do, but then I think the advantage of the enum is questionable over
the #defines, since you could just as easily have two entries in the enum
with the same ID value as forgetting to update a KVM_DEV_TYPE_MAX #define
(which was the reason for the enum in the first place).

So I'd be inclined to keep the patch as-is, unless you have really strong
objections?

Will


Re: [KVM] About releasing vcpu when closing vcpu fd

2014-07-02 Thread Igor Mammedov
On Mon, 30 Jun 2014 16:41:07 +0200
Anshul Makkar <anshul.mak...@profitbricks.com> wrote:

 Hi,
 
 Currently as per the specs for cpu_hot(un)plug, ACPI GPE Block:  IO
 ports 0xafe0-0xafe3  where each bit corresponds to each CPU.
 
 Currently, EJ0 method in acpi-dsdt-cpu-hotplu.dsl doesn't do anything.
 
 Method(CPEJ, 2, NotSerialized) {
 // _EJ0 method - eject callback
 Sleep(200)
 }
 
 I want to implement a notification mechanism for CPU hotunplug just
 like we have in memory hotunplug where in we write to particular IO
 port and this read/write is caught in the memory-hotplug.c.
 
 So, just want a suggestion as to whether I should expand the IO port
 range from 0xafe0 to 0xafe4 (addition of 1 byte), where last byte is
 for notification of EJ0 event.
I have it on my TODO list to rewrite the CPU hotplug IO interface to be
similar to the memory hotplug one. You can try that approach; it will allow
us to drop the CPU bitmask and make the interface scalable to more than
256 cpus. (A sketch of what that style of interface looks like follows.)
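
For reference, the memory-hotplug interface mentioned above exposes a small
fixed register window (a selector register plus per-object status/event
registers) instead of one I/O port bit per object, so the same few bytes
scale to any number of CPUs. A rough sketch of what the CPU equivalent could
look like on the QEMU side; the register layout and names here are
assumptions for illustration, not the interface that was eventually merged:

#include "qemu/osdep.h"
#include "exec/memory.h"

/* Hypothetical layout: 0x0 = CPU selector (write), 0x4 = status/event. */
typedef struct CPUHotplugRegs {
    uint32_t selector;      /* which CPU the status register refers to */
    uint32_t *status;       /* per-CPU event bits, e.g. an eject request */
} CPUHotplugRegs;

static uint64_t cpu_hotplug_read(void *opaque, hwaddr addr, unsigned size)
{
    CPUHotplugRegs *s = opaque;

    return addr == 4 ? s->status[s->selector] : 0;
}

static void cpu_hotplug_write(void *opaque, hwaddr addr, uint64_t val,
                              unsigned size)
{
    CPUHotplugRegs *s = opaque;

    if (addr == 0)
        s->selector = val;              /* scales past 256 CPUs */
    else if (addr == 4)
        s->status[s->selector] &= ~val; /* guest acks events, e.g. _EJ0 */
}

static const MemoryRegionOps cpu_hotplug_ops = {
    .read = cpu_hotplug_read,
    .write = cpu_hotplug_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};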

 
 Or if you have any other suggestion, please share.
 
 Thanks
 Anshul Makkar
 
 On Fri, Jun 6, 2014 at 3:41 PM, Anshul Makkar
 anshul.mak...@profitbricks.com wrote:
  Oh yes, sorry for the ambiguity.  I meant proposal to park unplugged 
  vcpus.
 
  Thanks for the suggesting the practical approach.
 
  Anshul Makkar
 
  On Fri, Jun 6, 2014 at 3:36 PM, Gleb Natapov g...@minantech.com wrote:
  On Fri, Jun 06, 2014 at 03:02:59PM +0200, Anshul Makkar wrote:
  IIRC, Igor was of the opinion that  patch for vcpu deletion will be
  incomplete till its handled properly in kvm i.e vcpus are destroyed
  completely. 
  http://comments.gmane.org/gmane.comp.emulators.kvm.devel/114347
  .
 
  So can the above proposal  where just vcpus can be  disabled and
  reused in qemu is an acceptable solution ?
 
  If by above proposal you mean the proposal in the email you linked,
  then no since it tries to destroy vcpu, but does it incorrectly. If you
  mean proposal to park unplugged vcpu, so that guest will not be able
  to use it, then yes, it is pragmatic path forward.
 
 
  Thanks
  Anshul Makkar
 
  On Thu, May 29, 2014 at 10:12 AM, Gleb Natapov g...@kernel.org wrote:
   On Thu, May 29, 2014 at 01:40:08PM +0800, Gu Zheng wrote:
There was a patch(from Chen Fan, last august) about releasing vcpu 
when
closing vcpu fd http://www.spinics.net/lists/kvm/msg95701.html, 
but
your comment said Attempt where made to make it possible to destroy
individual vcpus separately from destroying VM before, but they were
unsuccessful thus far.
So what is the pain here? If we want to achieve the goal, what 
should we do?
Looking forward to your further comments.:)
   
    CPU array is accessed locklessly in a lot of places, so it will have
    to be RCUified. There was an attempt to do so two years or so ago, but
    it didn't go anywhere. Adding locks is too big a price to pay for the
    ability to free a little bit of memory by destroying a vcpu.
  
   Yes, it's a pain here. But if we want to implement vcpu hot-remove, 
   this must be
   fixed sooner or later.
   Why?  vcpu hot-remove already works (or at least worked in the past
   for some value of work).  No need to destroy vcpu completely, just
   park it and tell a guest not to use it via ACPI hot unplug event.
  
   And any guys working on kvm vcpu hot-remove now?
  
An
alternative may be to make sure that stopped vcpu takes as little 
memory as possible.
  
   Yeah. But if we add a new vcpu with the old id that we stopped before, 
   it will fail.
  
   No need to create vcpu again, just unpark it and notify a guest via 
   ACPI hot plug event that
   vcpu can be used now.
  
   --
   Gleb.
 
  --
  Gleb.


-- 
Regards,
  Igor


Re: [RFC v2] ARM VM System Specification

2014-07-02 Thread Christoffer Dall
On Tue, Jul 01, 2014 at 06:10:19PM +0100, Peter Maydell wrote:
 On 1 July 2014 18:03, Stefano Stabellini
 <stefano.stabell...@eu.citrix.com> wrote:
  On Mon, 30 Jun 2014, Peter Maydell wrote:
  How about:
  =
  Guest OSes in the VM image should rely on the UEFI RTC API for
  real time clock services. (To provide that API, the VM system will
  likely need to implement some real time clock device, but the
  details of these are a private implementation issue between it
  and its associated UEFI implementation.)
 
  I don't see why we need to add the text within brackets: it is out of
  scope for this document.
 
 The intention is to be an informative note, not normative text
 (I'm happy if we want to format the text to make that clearer,
 with some sort of NOTE: markup). I'd like VM implementors
 reading this spec to make the correct decisions even if they
 don't happen to know inside-out the details of the UEFI
 specification and what exactly UEFI demands of the hardware,
 and I think it's worth adding the occasional clarifying sentence
 even if it doesn't strictly speaking add any extra rules to the
 specification.
 
That's also the approach we've taken so far elsewhere in the document,
so I think it's useful.

-Christoffer


Re: [PATCH v2 1/4] KVM: device: add simple registration mechanism for kvm_device_ops

2014-07-02 Thread Cornelia Huck
On Wed, 2 Jul 2014 10:32:41 +0100
Will Deacon <will.dea...@arm.com> wrote:

 On Wed, Jul 02, 2014 at 10:17:20AM +0100, Cornelia Huck wrote:
  On Tue,  1 Jul 2014 15:45:15 +0100
  Will Deacon <will.dea...@arm.com> wrote:

    [...]
  
  Do you want to add the individual values to the enum? A removal of a
  type would be an API break (so we should be safe against renumbering),
  but it's sometimes helpful if one can get the numeric value at a glance.
 
 Could do, but then I think the advantage of the enum is questionable over
 the #defines, since you could just as easily have two entries in the enum
 with the same ID value as forgetting to update a KVM_DEV_TYPE_MAX #define
 (which was the reason for the enum in the first place).
 
 So I'd be inclined to keep the patch as-is, unless you have really strong
 objections?

I don't really have a strong opinion on that, so

Acked-by: Cornelia Huck <cornelia.h...@de.ibm.com>



RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest

2014-07-02 Thread bharat.bhus...@freescale.com

 -Original Message-
 From: Wood Scott-B07421
 Sent: Tuesday, July 01, 2014 10:11 PM
 To: Alexander Graf
 Cc: Bhushan Bharat-R65777; kvm-...@vger.kernel.org; kvm@vger.kernel.org
 Subject: Re: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection 
 to
 guest
 
 On Tue, 2014-07-01 at 18:22 +0200, Alexander Graf wrote:
  On 01.07.14 17:35, Scott Wood wrote:
   On Tue, 2014-07-01 at 17:04 +0200, Alexander Graf wrote:
   On 01.07.14 16:58, Scott Wood wrote:
   On Tue, 2014-07-01 at 08:23 +0200, Alexander Graf wrote:
   I don't think QEMU should be aware of these limitations.
   OK, but we should at least have some idea of how the whole thing
   is supposed to work, in order to determine if this is the correct
   behavior for QEMU.  I thought the model was that debug resources
   are either owned by QEMU or by the guest, and in the latter case,
   QEMU would never see the debug exception to begin with.
   That's bad for a number of reasons. For starters it's different
   from how
   x86 handles debug registers - and I hate to be different just for
   the sake of being different.
   How does it work on x86?
 
  It overwrites more-or-less random breakpoints with its own ones, but
  leaves the others intact ;).
 
 Are you talking about software breakpoints or management of hardware debug
 registers?
 
   So if we do want to declare that debug registers are owned by
   either QEMU or the guest, we should change the semantics for all
   architectures.
   If we want to say that ownership of the registers is shared, we need
   a plan for how that would actually work.
 
  I think you're overengineering here :). When do people actually use
  gdbstub? Usually when they want to debug a broken guest. We can either
 
 * overengineer heavily and reduce the number of registers available
  to the guest to always have spares
 * overengineer a bit and turn off guest debugging completely when
  we use gdbstub
 * just leave as much alive as we can, hoping that it helps with the
  debugging
 
  Option 3 is what x86 does - and I think it's a reasonable approach.
  This is not an interface that needs to be 100% consistent and bullet
  proof, it's a best effort to enable you to debug as much as possible.
 
 I'm not insisting on 100% -- just hoping for some explanation/discussion about
 how it's intended to work for the cases where it can.
 
 How will MSR[DE] and MSRP[DEP] be handled?
 
 How would I go about telling QEMU/KVM that I don't want this shared mode,
 because I don't want guest to interfere with the debugging I'm trying to do 
 from
 QEMU?
 
 Will guest accesses to debug registers cause a userspace exit when guest_debug
 is enabled?
 
   I think we're in a path that is slow enough already to not worry
   about performance.
   It's not just about performance, but simplicity of use, and
   consistency of API.
  
   Oh, and it looks like there already exist one reg definitions and
   implementations for most of the debug registers.
 
  For BookE? Where?
 
 arch/powerpc/kvm/booke.c: KVM_REG_PPC_IACn, KVM_REG_PPC_DACn

I tried to quickly prototype what I think we want to do (this is not tested)

---
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index e8b3982..746b5c6 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -179,6 +179,9 @@ extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server,
 extern int kvmppc_xics_int_on(struct kvm *kvm, u32 irq);
 extern int kvmppc_xics_int_off(struct kvm *kvm, u32 irq);
 
+void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu);
+void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 9f13056..0b7e4e4 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -235,6 +235,16 @@ void kvmppc_core_dequeue_dec(struct kvm_vcpu *vcpu)
 	clear_bit(BOOKE_IRQPRIO_DECREMENTER, &vcpu->arch.pending_exceptions);
 }
 
+void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu)
+{
+	kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_DEBUG);
+}
+
+void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu)
+{
+	clear_bit(BOOKE_IRQPRIO_DEBUG, &vcpu->arch.pending_exceptions);
+}
+
 void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 				struct kvm_interrupt *irq)
 {
@@ -841,6 +851,20 @@ static int kvmppc_handle_debug(struct kvm_run *run, struct kvm_vcpu *vcpu)
 	struct debug_reg *dbg_reg = &(vcpu->arch.shadow_dbg_reg);
 	u32 dbsr = vcpu->arch.dbsr;
 
+	/* Userspace (QEMU) is not using the debug resource, so inject the
+	 * debug interrupt directly to the guest.
+	 */
+	if (vcpu->guest_debug == 0) {
+		if (dbsr && (vcpu->arch.shared->msr & MSR_DE))
+

Re: [PATCH 4/6] KVM: PPC: BOOK3S: HV: Use new functions for mapping/unmapping hpte in host

2014-07-02 Thread Aneesh Kumar K.V
Paul Mackerras <pau...@samba.org> writes:

 On Sun, Jun 29, 2014 at 04:47:33PM +0530, Aneesh Kumar K.V wrote:
 We want to use the virtual page class key protection mechanism for
 indicating an MMIO-mapped hpte entry or a guest hpte entry that is swapped
 out in the host. Those hptes will be marked valid, but have their virtual
 page class key set to 30 or 31. These virtual page class numbers are
 configured in the AMR to deny read/write. To accommodate such a change, add
 new functions that map, unmap and check whether a hpte is mapped in the
 host. This patch still uses HPTE_V_VALID and HPTE_V_ABSENT and doesn't use
 virtual page class keys. But we want to differentiate in the code the
 places where we explicitly check for HPTE_V_VALID from the places where we
 want to check whether the hpte is host-mapped. This patch enables a closer
 review of such a change.

 [...]

  	/* Check for pending invalidations under the rmap chain lock */
  	if (kvm->arch.using_mmu_notifiers &&
  	    mmu_notifier_retry(kvm, mmu_seq)) {
 -		/* inval in progress, write a non-present HPTE */
 -		pteh |= HPTE_V_ABSENT;
 -		pteh &= ~HPTE_V_VALID;
 +		/*
 +		 * inval in progress in host, write host unmapped pte.
 +		 */
 +		host_unmapped_hpte = 1;

 This isn't right.  We already have HPTE_V_VALID set here, and you now
 don't clear it here, and it doesn't get cleared by the
 __kvmppc_unmap_host_hpte() call below either.


OK, missed that. I will fix it in the next update. In the earlier version
I had kvmppc_unmap_host_hpte always clearing V_VALID.

-aneesh
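
As background for the virtual page class key mechanism described above: on
Power, the AMR holds two deny bits (read and write) per storage key, with
key 0 in the most significant bit pair. A small self-contained sketch of the
mask that denies all access for classes 30 and 31 follows; the bit layout is
taken from the Power ISA AMR definition, and this is an illustration, not
KVM code:

#include <stdio.h>
#include <stdint.h>

#define AMR_BITS_PER_KEY 2
/* Key 0 occupies the top bit pair, key 31 the bottom one. */
#define AMR_KEY_SHIFT(key)  ((31 - (key)) * AMR_BITS_PER_KEY)
#define AMR_DENY_RW 0x3ULL  /* both the read-deny and write-deny bit */

int main(void)
{
	uint64_t amr = 0;

	/* Deny read and write for virtual page classes 30 and 31. */
	amr |= AMR_DENY_RW << AMR_KEY_SHIFT(30);
	amr |= AMR_DENY_RW << AMR_KEY_SHIFT(31);

	/* Prints 0x000000000000000f: the low four bits of the AMR. */
	printf("AMR mask: 0x%016llx\n", (unsigned long long)amr);
	return 0;
}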



Re: [PATCH 5/6] KVM: PPC: BOOK3S: Use hpte_update_in_progress to track invalid hpte during an hpte update

2014-07-02 Thread Aneesh Kumar K.V
Paul Mackerras pau...@samba.org writes:

 On Sun, Jun 29, 2014 at 04:47:34PM +0530, Aneesh Kumar K.V wrote:
 As per ISA, we first need to mark hpte invalid (V=0) before we update
 the hpte lower half bits. With virtual page class key protection mechanism 
 we want
 to send any fault other than key fault to guest directly without
 searching the hash page table. But then we can get NO_HPTE fault while
 we are updating the hpte. To track that add a vm specific atomic
 variable that we check in the fault path to always send the fault
 to host.

 [...]

 @@ -750,13 +751,15 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	r &= rcbits | ~(HPTE_R_R | HPTE_R_C);
 
 	if (be64_to_cpu(hptep[0]) & HPTE_V_VALID) {
 -		/* HPTE was previously valid, so we need to invalidate it */
 +		/*
 +		 * If we had mapped this hpte before, we now need to
 +		 * invalidate that.
 +		 */
 		unlock_rmap(rmap);
 -		/* Always mark HPTE_V_ABSENT before invalidating */
 -		kvmppc_unmap_host_hpte(kvm, hptep);
 		kvmppc_invalidate_hpte(kvm, hptep, index);
 		/* don't lose previous R and C bits */
 		r |= be64_to_cpu(hptep[1]) & (HPTE_R_R | HPTE_R_C);
 +		hpte_invalidated = true;

 So now we're not setting the ABSENT bit before invalidating the HPTE.
 That means that another guest vcpu could do an H_ENTER which could
 think that this HPTE is free and use it for another unrelated guest
 HPTE, which would be bad...

But h_enter looks at HPTE_V_HVLOCK, and we keep that set throughout. But
I will double-check the code again to make sure it is safe in the above
scenario.


 @@ -1144,8 +1149,8 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 			npages_dirty = n;
 		eieio();
 	}
 -	kvmppc_map_host_hpte(kvm, &v, &r);
 -	hptep[0] = cpu_to_be64(v & ~HPTE_V_HVLOCK);
 +	hptep[0] = cpu_to_be64(v & ~HPTE_V_LOCK);
 +	atomic_dec(&kvm->arch.hpte_update_in_progress);

 Why are we using LOCK rather than HVLOCK now?  (And why didn't you
 mention this change and its rationale in the patch description?)

Sorry, that is a typo. I intend to use HPTE_V_HVLOCK.

-aneesh



Re: [PATCH 2/6] KVM: PPC: BOOK3S: HV: Deny virtual page class key update via h_protect

2014-07-02 Thread Aneesh Kumar K.V
Paul Mackerras pau...@samba.org writes:

 On Sun, Jun 29, 2014 at 04:47:31PM +0530, Aneesh Kumar K.V wrote:
 This makes it consistent with h_enter where we clear the key
 bits. We also want to use virtual page class key protection mechanism
 for indicating host page fault. For that we will be using key class
 index 30 and 31. So prevent the guest from updating key bits until
 we add proper support for virtual page class protection mechanism for
 the guest. This will not have any impact on PAPR Linux guests because the
 Linux guest currently doesn't use the virtual page class key protection model.

 As things stand, without this patch series, we do actually have
 everything we need in the kernel for guests to use virtual page class
 keys.  Arguably we should have a capability to tell userspace how many
 storage keys the guest can use, but that's the only missing piece as
 far as I can see.

yes.


 If we add such a capability, I can't see any reason why we should need
 to disable guest use of storage keys in this patchset.

With this patchset, we would need additional changes to find out whether the key
fault happened because of the guest's usage of the key. I was planning to do
that as an add-on series to keep the changes in this one minimal. Also, since
Linux didn't use keys, I was not sure whether guest support for keys is an
important item.

-aneesh



Re: [PATCH REPOST v2] ARM: virt: fix wrong HSCTLR.EE bit setting

2014-07-02 Thread Marc Zyngier
On 01/07/14 11:01, john.liuli wrote:
 From: Li Liu john.li...@huawei.com
 
 HSCTLR.EE is defined as bit[25] referring to arm manual
 DDI0606C.b(p1590).
 
 Reviewed-by: Marc Zyngier marc.zyng...@arm.com
 Signed-off-by: Li Liu john.li...@huawei.com
 ---
 changes v1 - v2:
  1) Use the canonical syntax for orr (orr r7, r7, #(1 << 25))
   2)Use the ARM_BE8 macro instead of the open-coded CONFIG_CPU_BIG_ENDIAN
 
  arch/arm/kernel/hyp-stub.S |4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)
 
 diff --git a/arch/arm/kernel/hyp-stub.S b/arch/arm/kernel/hyp-stub.S
 index 797b1a6..7e666cf 100644
 --- a/arch/arm/kernel/hyp-stub.S
 +++ b/arch/arm/kernel/hyp-stub.S
 @@ -134,9 +134,7 @@ ENTRY(__hyp_stub_install_secondary)
   mcr p15, 4, r7, c1, c1, 3   @ HSTR
  
 THUMB(	orr	r7, #(1 << 30)	)	@ HSCTLR.TE
-#ifdef CONFIG_CPU_BIG_ENDIAN
-	orr	r7, #(1 << 9)		@ HSCTLR.EE
-#endif
+ARM_BE8(orr	r7, r7, #(1 << 25))	@ HSCTLR.EE
   mcr p15, 4, r7, c1, c0, 0   @ HSCTLR
  
   mrc p15, 4, r7, c1, c1, 1   @ HDCR
 

Russell,

Assuming you're happy with this patch, would you agree to it going
through the KVM/ARM tree? We have a whole bunch of BE patches queued up,
and this looks like a better fix than what we have so far [1].

Thanks,

M.

[1]: https://lists.cs.columbia.edu/pipermail/kvmarm/2014-June/009950.html
-- 
Jazz is not dead. It just smells funny...




Re: [PATCH 2/4] kvm: Add macro VMX_APIC_ACCESS_PAGE_ADDR

2014-07-02 Thread Gleb Natapov
On Wed, Jul 02, 2014 at 05:00:35PM +0800, Tang Chen wrote:
 Define guest phys_addr of apic access page.
 ---
  arch/x86/include/asm/vmx.h | 2 +-
  arch/x86/kvm/svm.c | 3 ++-
  arch/x86/kvm/vmx.c | 7 ---
  3 files changed, 7 insertions(+), 5 deletions(-)
 
 diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
 index 7004d21..c4672d1 100644
 --- a/arch/x86/include/asm/vmx.h
 +++ b/arch/x86/include/asm/vmx.h
 @@ -422,7 +422,7 @@ enum vmcs_field {
  #define VMX_EPT_DIRTY_BIT			(1ull << 9)
  
  #define VMX_EPT_IDENTITY_PAGETABLE_ADDR	0xfffbc000ul
 -
 +#define VMX_APIC_ACCESS_PAGE_ADDR		0xfee00000ull
  
It has nothing to do with VMX, and there is already a define for that: 
APIC_DEFAULT_PHYS_BASE
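
For reference, that existing definition lives in arch/x86/include/asm/apicdef.h:

#define APIC_DEFAULT_PHYS_BASE	0xfee00000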

--
Gleb.


Re: [PATCH 3/4] kvm, memory-hotplug: Update ept identity pagetable when it is migrated.

2014-07-02 Thread Gleb Natapov
On Wed, Jul 02, 2014 at 05:00:36PM +0800, Tang Chen wrote:
 ept identity pagetable is pinned in memory, and as a result it cannot be
 migrated/hot-removed.
 
 But actually it doesn't need to be pinned in memory.
 
 This patch introduces a new vcpu request: KVM_REQ_MIGRATE_EPT to reset the ept
 identity pagetable related variable. This request will be made when
 kvm_mmu_notifier_invalidate_page() is called, when the page is unmapped
 from the qemu user space, to reset kvm->arch.ept_identity_pagetable to NULL.
 It will also be made when an ept violation happens, to reset
 kvm->arch.ept_identity_pagetable to the new page.

kvm->arch.ept_identity_pagetable is never used as a page address, just as a
boolean null/!null check to see if the identity pagetable is initialized. I do
not see why we would want to track its address at all. Changing it to bool
and assigning true during initialization should be enough.
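
A sketch of that suggestion (illustrative only; the field name below is an
assumption, not from an actual patch):

struct kvm_arch {
	/* ... */
	bool ept_identity_pagetable_done; /* replaces struct page *ept_identity_pagetable */
};

static int init_rmode_identity_map(struct kvm *kvm)
{
	if (kvm->arch.ept_identity_pagetable_done)
		return 1;
	/* ... allocate and populate the identity pagetable ... */
	kvm->arch.ept_identity_pagetable_done = true;
	return 1;
}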


--
Gleb.


Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Bandan Das
Wanpeng Li wanpeng...@linux.intel.com writes:

 This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 
I can also reproduce this easily with Linux as L1 by slowing it down
eg. running with ept = 0

I suggest changing the subject to -
KVM: nVMX: Fix race that incorrectly injects L1's irq to L2

 If we didn't inject a still-pending event to L1 since nested_run_pending,
 KVM_REQ_EVENT should be requested after the vmexit in order to inject the 
 event to L1. However, current log blindly request a KVM_REQ_EVENT even if 

What's current log ? Do you mean current code ?

 there is no still-pending event to L1 which blocked by nested_run_pending. 
 There is a race which lead to an interrupt will be injected to L2 which 
 belong to L1 if L0 send an interrupt to L1 during this window. 

VCPU0   another thread 

 L1 intr not blocked on L2 first entry
 vmx_vcpu_run req event 
 kvm check request req event 
 check_nested_events don't have any intr 
 not nested exit 
 intr occur (8254, lapic timer etc)
 inject_pending_event now have intr 
 inject interrupt 

 This patch fix this race by introduced a l1_events_blocked field in 
 nested_vmx 
 which indicates there is still-pending event which blocked by 
 nested_run_pending, 
 and smart request a KVM_REQ_EVENT if there is a still-pending event which 
 blocked 
 by nested_run_pending.

 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
 ---
  arch/x86/kvm/vmx.c | 20 +++-
  1 file changed, 15 insertions(+), 5 deletions(-)

 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index f4e5aed..fe69c49 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -372,6 +372,7 @@ struct nested_vmx {
   u64 vmcs01_tsc_offset;
   /* L2 must run next, and mustn't decide to exit to L1. */
   bool nested_run_pending;
 + bool l1_events_blocked;
   /*
* Guest pages referred to in vmcs02 with host-physical pointers, so
* we must keep them pinned while L2 runs.
 @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
* we did not inject a still-pending event to L1 now because of
* nested_run_pending, we need to re-enable this bit.
*/
 -	if (vmx->nested.nested_run_pending)
 +	if (to_vmx(vcpu)->nested.l1_events_blocked) {
Is to_vmx() necessary since we already have the vmx pointer ?

 +		to_vmx(vcpu)->nested.l1_events_blocked = false;
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 +	}
 
 	vmx->nested.nested_run_pending = 0;
  
 @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 
 	if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) &&
 	    vmx->nested.preemption_timer_expired) {
 -		if (vmx->nested.nested_run_pending)
 +		if (vmx->nested.nested_run_pending) {
 +			vmx->nested.l1_events_blocked = true;
 			return -EBUSY;
 +		}
 		nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0);
 		return 0;
 	}
 
 	if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) {
 -		if (vmx->nested.nested_run_pending ||
 -		    vcpu->arch.interrupt.pending)
 +		if (vmx->nested.nested_run_pending) {
 +			vmx->nested.l1_events_blocked = true;
 +			return -EBUSY;
 +		}
 +		if (vcpu->arch.interrupt.pending)
 			return -EBUSY;
 		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
 				  NMI_VECTOR | INTR_TYPE_NMI_INTR |
 @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 
 	if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
 	    nested_exit_on_intr(vcpu)) {
 -		if (vmx->nested.nested_run_pending)
 +		if (vmx->nested.nested_run_pending) {
 +			vmx->nested.l1_events_blocked = true;
 			return -EBUSY;
 +		}
 		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
 	}
Also, I am wondering: isn't it enough to just do this to avoid the race?

 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
-	return (!to_vmx(vcpu)->nested.nested_run_pending &&
+	return (!is_guest_mode(vcpu) &&
+		!to_vmx(vcpu)->nested.nested_run_pending &&
 		vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &

Thanks,
Bandan


RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest

2014-07-02 Thread bharat.bhus...@freescale.com


 -Original Message-
 From: Bhushan Bharat-R65777
 Sent: Wednesday, July 02, 2014 5:07 PM
 To: Wood Scott-B07421; Alexander Graf
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org
 Subject: RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection 
 to
 guest
 
 
  -Original Message-
  From: Wood Scott-B07421
  Sent: Tuesday, July 01, 2014 10:11 PM
  To: Alexander Graf
  Cc: Bhushan Bharat-R65777; kvm-...@vger.kernel.org;
  kvm@vger.kernel.org
  Subject: Re: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt
  injection to guest
 
  On Tue, 2014-07-01 at 18:22 +0200, Alexander Graf wrote:
   On 01.07.14 17:35, Scott Wood wrote:
On Tue, 2014-07-01 at 17:04 +0200, Alexander Graf wrote:
On 01.07.14 16:58, Scott Wood wrote:
On Tue, 2014-07-01 at 08:23 +0200, Alexander Graf wrote:
I don't think QEMU should be aware of these limitations.
OK, but we should at least have some idea of how the whole thing
is supposed to work, in order to determine if this is the
correct behavior for QEMU.  I thought the model was that debug
resources are either owned by QEMU or by the guest, and in the
latter case, QEMU would never see the debug exception to begin with.
That's bad for a number of reasons. For starters it's different
from how
x86 handles debug registers - and I hate to be different just for
the sake of being different.
How does it work on x86?
  
   It overwrites more-or-less random breakpoints with its own ones, but
   leaves the others intact ;).
 
  Are you talking about software breakpoints or management of hardware
  debug registers?
 
So if we do want to declare that debug registers are owned by
either QEMU or the guest, we should change the semantics for all
architectures.
If we want to say that ownership of the registers is shared, we
need a plan for how that would actually work.
  
   I think you're overengineering here :). When do people actually use
   gdbstub? Usually when they want to debug a broken guest. We can
   either
  
  * overengineer heavily and reduce the number of registers
   available to the guest to always have spares
  * overengineer a bit and turn off guest debugging completely when
   we use gdbstub
  * just leave as much alive as we can, hoping that it helps with
   the debugging
  
   Option 3 is what x86 does - and I think it's a reasonable approach.
   This is not an interface that needs to be 100% consistent and bullet
   proof, it's a best effort to enable you to debug as much as possible.
 
  I'm not insisting on 100% -- just hoping for some
  explanation/discussion about how it's intended to work for the cases where 
  it
 can.
 
  How will MSR[DE] and MSRP[DEP] be handled?
 
  How would I go about telling QEMU/KVM that I don't want this shared
  mode, because I don't want guest to interfere with the debugging I'm
  trying to do from QEMU?
 
  Will guest accesses to debug registers cause a userspace exit when
  guest_debug is enabled?
 
I think we're in a path that is slow enough already to not worry
about performance.
It's not just about performance, but simplicity of use, and
consistency of API.
   
Oh, and it looks like there already exist one reg definitions and
implementations for most of the debug registers.
  
   For BookE? Where?
 
  arch/powerpc/kvm/booke.c: KVM_REG_PPC_IACn, KVM_REG_PPC_DACn
 
 I tried to quickly prototype what I think we want to do (this is not tested)

Hi Scott,

There is one problem which is stopping us from sharing debug resources between
QEMU and the guest; looking for suggestions:
- As QEMU is also using debug resources, we have to set MSR_DE and set MSRP_DEP
(the guest will not be able to clear MSR_DE). So QEMU-set debug events will
always cause debug interrupts.
- Now the guest is also using debug resources, and for some reason the guest may
want to clear MSR_DE (disable debug interrupts). But it will not be able to do
so as MSRP_DEP is set, and KVM will not come to know the guest's willingness to
disable MSR_DE.
- If a debug interrupt occurs, then we will exit to QEMU, and this may not be a
QEMU-set event, so it will inject the interrupt to the guest (using one-reg or
set-sregs).
- Now KVM, when handling the one-reg/sregs request to inject a debug interrupt,
does not know whether the guest can handle the debug interrupt or not (as the
guest might have tried to set/clear MSR_DE).

Thanks
-Bharat

 
 ---
 diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
 index e8b3982..746b5c6 100644
 --- a/arch/powerpc/include/asm/kvm_ppc.h
 +++ b/arch/powerpc/include/asm/kvm_ppc.h
 @@ -179,6 +179,9 @@ extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server,
  extern int kvmppc_xics_int_on(struct kvm *kvm, u32 irq);
  extern int kvmppc_xics_int_off(struct kvm *kvm, u32 irq);
 
 +void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu);
 +void
 

Re: [PATCH 0/4] kvm, mem-hotplug: Do not pin ept identity pagetable and apic access page.

2014-07-02 Thread Tang Chen

Hi Gleb,

On 07/02/2014 05:00 PM, Tang Chen wrote:

Hi Gleb, Marcelo,

Please help to review this patch-set.

NOTE: This patch-set doesn't work properly.


ept identity pagetable and apic access page in kvm are pinned in memory.
As a result, they cannot be migrated/hot-removed.

But actually they don't need to be pinned in memory.

This patch-set introduces two new vcpu requests: KVM_REQ_MIGRATE_EPT and 
KVM_REQ_MIGRATE_APIC.
These two requests are made when the two pages are migrated by the mmu_notifier
to reset the related variable to unusable value. And will also be made when
ept violation happens to reset new pages.


[Known problem]
After this patch-set applied, the two pages can be migrated/hot-removed.
But after migrating apic access page, the guest died.

The host physical address of the apic access page is stored in the VMCS. I reset
it to 0 to stop the guest from accessing it when it is unmapped by
kvm_mmu_notifier_invalidate_page(), and reset it to the new page's host physical
address in tdp_page_fault(). But it seems that the guest will access the apic
page directly by the host physical address.


Would you please to give some advice about this problem ?

Thanks.


Re: [PATCH 2/4] kvm: Add macro VMX_APIC_ACCESS_PAGE_ADDR

2014-07-02 Thread Tang Chen

On 07/03/2014 12:24 AM, Gleb Natapov wrote:

On Wed, Jul 02, 2014 at 05:00:35PM +0800, Tang Chen wrote:

Define guest phys_addr of apic access page.
---
  arch/x86/include/asm/vmx.h | 2 +-
  arch/x86/kvm/svm.c | 3 ++-
  arch/x86/kvm/vmx.c | 7 ---
  3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 7004d21..c4672d1 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -422,7 +422,7 @@ enum vmcs_field {
  #define VMX_EPT_DIRTY_BIT			(1ull << 9)
 
  #define VMX_EPT_IDENTITY_PAGETABLE_ADDR	0xfffbc000ul
 -
 +#define VMX_APIC_ACCESS_PAGE_ADDR		0xfee00000ull


It has nothing to do with VMX, and there is already a define for that: 
APIC_DEFAULT_PHYS_BASE


OK, followed.



Re: [PATCH 3/4] kvm, memory-hotplug: Update ept identity pagetable when it is migrated.

2014-07-02 Thread Tang Chen

On 07/03/2014 12:34 AM, Gleb Natapov wrote:

On Wed, Jul 02, 2014 at 05:00:36PM +0800, Tang Chen wrote:

ept identity pagetable is pinned in memory, and as a result it cannot be
migrated/hot-removed.

But actually it doesn't need to be pinned in memory.

This patch introduces a new vcpu request: KVM_REQ_MIGRATE_EPT to reset the ept
identity pagetable related variable. This request will be made when
kvm_mmu_notifier_invalidate_page() is called, when the page is unmapped
from the qemu user space, to reset kvm->arch.ept_identity_pagetable to NULL.
It will also be made when an ept violation happens, to reset
kvm->arch.ept_identity_pagetable to the new page.


kvm->arch.ept_identity_pagetable is never used as a page address, just as a
boolean null/!null check to see if the identity pagetable is initialized. I do
not see why we would want to track its address at all. Changing it to bool
and assigning true during initialization should be enough.


OK, followed.


Re: [PATCH 00/10] RFC: userfault

2014-07-02 Thread Andy Lutomirski
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
 Hello everyone,
 
 There's a large CC list for this RFC because this adds two new
 syscalls (userfaultfd and remap_anon_pages) and
 MADV_USERFAULT/MADV_NOUSERFAULT, so suggestions on changes to the API
 or on a completely different API if somebody has better ideas are
 welcome now.

cc:linux-api -- this is certainly worthy of linux-api discussion.

 
 The combination of these features are what I would propose to
 implement postcopy live migration in qemu, and in general demand
 paging of remote memory, hosted in different cloud nodes.
 
 The MADV_USERFAULT feature should be generic enough that it can
 provide the userfaults to the Android volatile range feature too, on
 access of reclaimed volatile pages.
 
 If the access could ever happen in kernel context through syscalls
 (not just from userland context), then userfaultfd has to be used
 to make the userfault unnoticeable to the syscall (no error will be
 returned). This latter feature is more advanced than what volatile
 ranges alone could do with SIGBUS so far (but it's optional: if the
 process doesn't call userfaultfd, the regular SIGBUS will fire; if the
 fd is closed, SIGBUS will also fire for any blocked userfault that was
 waiting a userfaultfd_write ack).
 
 userfaultfd is also a generic enough feature, that it allows KVM to
 implement postcopy live migration without having to modify a single
 line of KVM kernel code. Guest async page faults, FOLL_NOWAIT and all
 other GUP features works just fine in combination with userfaults
 (userfaults trigger async page faults in the guest scheduler so those
 guest processes that aren't waiting for userfaults can keep running in
 the guest vcpus).
 
 remap_anon_pages is the syscall to use to resolve the userfaults (it's
 not mandatory, vmsplice will likely still be used in the case of local
 postcopy live migration just to upgrade the qemu binary, but
 remap_anon_pages is faster and ideal for transferring memory across
 the network, it's zerocopy and doesn't touch the vma: it only holds
 the mmap_sem for reading).
 
 The current behavior of remap_anon_pages is very strict to avoid any
 chance of memory corruption going unnoticed. mremap is not strict like
 that: if there's a synchronization bug it would drop the destination
 range silently resulting in subtle memory corruption for
 example. remap_anon_pages would return -EEXIST in that case. If there
 are holes in the source range remap_anon_pages will return -ENOENT.
 
 If remap_anon_pages is used always with 2M naturally aligned
 addresses, transparent hugepages will not be split. If there could
 be 4k (or any size) holes in the 2M (or any size) source range,
 remap_anon_pages should be used with the RAP_ALLOW_SRC_HOLES flag to
 relax some of its strict checks (-ENOENT won't be returned if
 RAP_ALLOW_SRC_HOLES is set; remap_anon_pages then will just behave as
 a noop on any hole in the source range). This flag is generally useful
 when implementing userfaults with THP granularity, but it shouldn't be
 set if doing the userfaults with PAGE_SIZE granularity if the
 developer wants to benefit from the strict -ENOENT behavior.
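
As an illustration, resolving a userfault for one THP-sized region could
look like this (hypothetical: the RFC never assigned a fixed syscall number
or uapi header, so both placeholders below are assumptions):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_remap_anon_pages	317	/* placeholder, not a real number */
#define RAP_ALLOW_SRC_HOLES	0x1UL	/* placeholder flag value */

/* Move 2M of anonymous memory from src to dst; both must be 2M-aligned
 * so transparent hugepages are not split, and 4k holes in the source
 * range are tolerated because of RAP_ALLOW_SRC_HOLES. */
static long resolve_userfault_2m(void *dst, void *src)
{
	return syscall(__NR_remap_anon_pages, dst, src,
		       2UL * 1024 * 1024, RAP_ALLOW_SRC_HOLES);
}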
 
 The remap_anon_pages syscall API is not vectored, as I expect it to be
 used mainly for demand paging (where there can be just one faulting
 range per userfault) or for large ranges (with the THP model as an
 alternative to zapping re-dirtied pages with MADV_DONTNEED with 4k
 granularity before starting the guest in the destination node) where
 vectoring isn't going to provide much performance advantages (thanks
 to the THP coarser granularity).
 
 On the rmap side remap_anon_pages doesn't add much complexity: there's
 no need of nonlinear anon vmas to support it because I added the
 constraint that it will fail if the mapcount is more than 1. So in
 general the source range of remap_anon_pages should be marked
 MADV_DONTFORK to prevent any risk of failure if the process ever
 forks (like qemu can in some case).
 
 One part that hasn't been tested is the poll() syscall on the
 userfaultfd because the postcopy migration thread currently is more
 efficient waiting on blocking read()s (I'll write some code to test
 poll() too). I also appended below a patch to trinity to exercise
 remap_anon_pages and userfaultfd and it completes trinity
 successfully.
 
 The code can be found here:
 
 git clone --reference linux 
 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git -b userfault 
 
 The branch is rebased so you can get updates for example with:
 
 git fetch  git checkout -f origin/userfault
 
 Comments welcome, thanks!
 Andrea
 
 From cbe940e13b4cead41e0f862b3abfa3814f235ec3 Mon Sep 17 00:00:00 2001
 From: Andrea Arcangeli aarca...@redhat.com
 Date: Wed, 2 Jul 2014 18:32:35 +0200
 Subject: [PATCH] add remap_anon_pages and userfaultfd
 
 Signed-off-by: Andrea Arcangeli aarca...@redhat.com
 ---
  include/syscalls-x86_64.h   |   2 +
  syscalls/remap_anon_pages.c | 100 
 

Re: [PATCH 08/10] userfaultfd: add new syscall to provide memory externalization

2014-07-02 Thread Andy Lutomirski
On 07/02/2014 09:50 AM, Andrea Arcangeli wrote:
 Once a userfaultfd is created, MADV_USERFAULT regions talk through
 the userfaultfd protocol with the thread responsible for doing the
 memory externalization of the process.
 
 The protocol starts by userland writing the requested/preferred
 USERFAULT_PROTOCOL version into the userfault fd (64bit write); if the
 kernel knows it, it will ack it by allowing userland to read 64 bits
 from the userfault fd that will contain the same 64bit
 USERFAULT_PROTOCOL version that userland asked for. Otherwise userland
 will read the __u64 value -1ULL (aka USERFAULTFD_UNKNOWN_PROTOCOL) and it
 will have to try again by writing an older protocol version if
 suitable for its usage too, and read it back again until it stops
 reading -1ULL. After that the userfaultfd protocol starts.
 
 The protocol consists of 64bit-sized reads from the userfault fd that
 provide userland the fault addresses. After a userfault address has
 been read and the fault is resolved by userland, the application must
 write back 128 bits in the form of a [ start, end ] range (64bit each)
 that will tell the kernel such a range has been mapped. Multiple read
 userfaults can be resolved in a single range write. poll() can be used
 to know when there are new userfaults to read (POLLIN) and when there
 are threads waiting a wakeup through a range write (POLLOUT).
 
 Signed-off-by: Andrea Arcangeli aarca...@redhat.com
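
A minimal userland sketch of the handshake and fault loop described above
(the protocol constant is a placeholder for this RFC, not a merged ABI, and
the fd is assumed to come from the proposed userfaultfd syscall):

#include <stdint.h>
#include <unistd.h>

#define USERFAULT_PROTOCOL	1ULL	/* assumed version value */

static int handle_one_userfault(int ufd)
{
	uint64_t proto = USERFAULT_PROTOCOL, addr, range[2];

	/* handshake: write the preferred version, read back the ack */
	if (write(ufd, &proto, 8) != 8 || read(ufd, &proto, 8) != 8)
		return -1;
	if (proto == (uint64_t)-1)	/* USERFAULTFD_UNKNOWN_PROTOCOL */
		return -1;

	/* read one faulting address */
	if (read(ufd, &addr, 8) != 8)
		return -1;

	/* ... resolve the fault, e.g. via remap_anon_pages ... */

	/* ack the now-mapped [start, end] range back to the kernel */
	range[0] = addr & ~0xfffULL;
	range[1] = range[0] + 4096;
	if (write(ufd, range, 16) != 16)
		return -1;
	return 0;
}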

 +#ifdef CONFIG_PROC_FS
 +static int userfaultfd_show_fdinfo(struct seq_file *m, struct file *f)
 +{
 +	struct userfaultfd_ctx *ctx = f->private_data;
 +	int ret;
 +	wait_queue_t *wq;
 +	struct userfaultfd_wait_queue *uwq;
 +	unsigned long pending = 0, total = 0;
 +
 +	spin_lock(&ctx->fault_wqh.lock);
 +	list_for_each_entry(wq, &ctx->fault_wqh.task_list, task_list) {
 +		uwq = container_of(wq, struct userfaultfd_wait_queue, wq);
 +		if (uwq->pending)
 +			pending++;
 +		total++;
 +	}
 +	spin_unlock(&ctx->fault_wqh.lock);
 +
 +	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\n", pending, total);

This should show the protocol version, too.
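
i.e., roughly (illustrative only; the field holding the negotiated version is
assumed here to be ctx->proto, which is not in the posted patch):

	ret = seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nprotocol:\t%llu\n",
			 pending, total,
			 (unsigned long long)ctx->proto); /* assumed field */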

 +
 +SYSCALL_DEFINE1(userfaultfd, int, flags)
 +{
 + int fd, error;
 + struct file *file;

This looks like it can't be used more than once in a process.  That will
be unfortunate for libraries.  Would it be feasible to either have
userfaultfd claim a range of addresses or for a vma to be explicitly
associated with a userfaultfd?  (In the latter case, giant PROT_NONE
MAP_NORESERVE mappings could be used.)



[PATCH V2 3/3] kvm: ignore LBR and offcore rsp

2014-07-02 Thread kan . liang
From: Kan Liang kan.li...@intel.com

With -cpu host, KVM reports LBR and offcore support, so the perf driver may
access the LBR and offcore MSRs.
However, there is no LBR and offcore virtualization support yet. This could
cause the guest to crash.
As a workaround, KVM simply ignores the LBR and offcore_rsp MSRs to lie to the
guest.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y.
And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n.
Start the guest with -cpu host.
Run perf record with --branch-any or --branch-filter in guest to trigger LBR 
#GP.
Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to 
trigger offcore_rsp #GP

Signed-off-by: Andi Kleen a...@linux.intel.com
Signed-off-by: Kan Liang kan.li...@intel.com
---
 arch/x86/kvm/pmu.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index cbecaa9..ad9b4fa 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -331,6 +331,15 @@ bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
		ret = pmu->version > 1;
break;
+   case MSR_OFFCORE_RSP_0:
+   case MSR_OFFCORE_RSP_1:
+   /* At most 8-deep LBR for core and atom */
+   case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+   case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+   /* 16-deep LBR for core i3/i5/i7 series processors */
+   case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+   case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+   return 1; /* to avoid crashes */
default:
ret = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0)
|| get_gp_pmc(pmu, msr, MSR_P6_EVNTSEL0)
@@ -358,6 +367,16 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data)
case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
		*data = pmu->global_ovf_ctrl;
return 0;
+   case MSR_OFFCORE_RSP_0:
+   case MSR_OFFCORE_RSP_1:
+   /* At most 8-deep LBR for core and atom */
+   case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+   case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+   /* 16-deep LBR for core i3/i5/i7 series processors */
+   case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+   case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+   *data = 0;
+   return 0;
default:
if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
(pmc = get_fixed_pmc(pmu, index))) {
@@ -409,6 +428,16 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
break;
+   case MSR_OFFCORE_RSP_0:
+   case MSR_OFFCORE_RSP_1:
+   /* At most 8-deep LBR for core and atom */
+   case MSR_LBR_CORE_FROM ... MSR_LBR_CORE_FROM + 7:
+   case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 7:
+   /* 16-deep LBR for core i3/i5/i7 series processors */
+   case MSR_LBR_NHM_FROM ... MSR_LBR_NHM_FROM + 15:
+   case MSR_LBR_NHM_TO ... MSR_LBR_NHM_TO + 15:
+   /* dummy for now */
+   break;
default:
if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
(pmc = get_fixed_pmc(pmu, index))) {
-- 
1.8.3.1



[PATCH V2 1/3] perf ignore LBR and offcore_rsp.

2014-07-02 Thread kan . liang
From: Kan Liang kan.li...@intel.com

x86, perf: Protect LBR and offcore rsp against KVM lying

With -cpu host, KVM reports LBR and offcore support, if the host has support.
When the guest perf driver tries to access LBR or offcore_rsp MSRs,
it #GPs on all MSR accesses, since KVM doesn't handle LBR and offcore support.
So check the access rights of the related MSRs once at initialization time to
avoid erroneous accesses at runtime.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y.
And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n.
Start the guest with -cpu host.
Run perf record with --branch-any or --branch-filter in guest to trigger LBR 
#GP.
Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to 
trigger offcore_rsp #GP

Signed-off-by: Andi Kleen a...@linux.intel.com
Signed-off-by: Kan Liang kan.li...@intel.com

V2: Move the check code to initialization time.
---
 arch/x86/kernel/cpu/perf_event.h   | 19 +--
 arch/x86/kernel/cpu/perf_event_intel.c |  7 +++
 arch/x86/kernel/cpu/perf_event_intel_lbr.c |  4 ++--
 3 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3b2f9bd..5d977b2 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -458,12 +458,13 @@ struct x86_pmu {
 	u64	lbr_sel_mask;		   /* LBR_SELECT valid bits */
 	const int	*lbr_sel_map;	   /* lbr_select mappings */
 	bool	lbr_double_abort;	   /* duplicated lbr aborts */
-
+	bool	lbr_msr_access;		   /* LBR MSR can be accessed */
 	/*
 	 * Extra registers for events
 	 */
 	struct extra_reg *extra_regs;
 	unsigned int er_flags;
+	bool	extra_msr_access;	   /* EXTRA REG MSR can be accessed */
 
/*
 * Intel host/guest support (KVM)
@@ -525,6 +526,20 @@ extern u64 __read_mostly hw_cache_extra_regs
[PERF_COUNT_HW_CACHE_OP_MAX]
[PERF_COUNT_HW_CACHE_RESULT_MAX];
 
+/*
+ * Under certain circumstances, accessing certain MSRs may cause a #GP.
+ * This function tests if the input MSR can be safely accessed.
+ */
+static inline bool test_msr_access(unsigned long msr)
+{
+	u64 value;
+	if (rdmsrl_safe(msr, &value) < 0)
+		return false;
+	if (wrmsrl_safe(msr, value) < 0)
+		return false;
+	return true;
+}
+
 u64 x86_perf_event_update(struct perf_event *event);
 
 static inline unsigned int x86_pmu_config_addr(int index)
@@ -555,7 +570,7 @@ static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc,
 {
	u64 disable_mask = __this_cpu_read(cpu_hw_events.perf_ctr_virt_mask);
 
-	if (hwc->extra_reg.reg)
+	if (hwc->extra_reg.reg && x86_pmu.extra_msr_access)
		wrmsrl(hwc->extra_reg.reg, hwc->extra_reg.config);
	wrmsrl(hwc->config_base, (hwc->config | enable_mask) & ~disable_mask);
 }
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index adb02aa..8011d42 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2565,6 +2565,13 @@ __init int intel_pmu_init(void)
}
}
 
+	/* Access LBR MSR may cause #GP under certain circumstances. E.g. KVM doesn't support LBR MSR */
+	if (x86_pmu.lbr_nr)
+		x86_pmu.lbr_msr_access = test_msr_access(x86_pmu.lbr_tos) &&
+					 test_msr_access(x86_pmu.lbr_from);
+	/* Access extra MSR may cause #GP under certain circumstances. E.g. KVM doesn't support offcore event */
+	if (x86_pmu.extra_regs)
+		x86_pmu.extra_msr_access = test_msr_access(x86_pmu.extra_regs->msr);
+
/* Support full width counters using alternative MSR range */
if (x86_pmu.intel_cap.full_width_write) {
x86_pmu.max_period = x86_pmu.cntval_mask;
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 9dd2459..9508d1e 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -172,7 +172,7 @@ static void intel_pmu_lbr_reset_64(void)
 
 void intel_pmu_lbr_reset(void)
 {
-	if (!x86_pmu.lbr_nr)
+	if (!x86_pmu.lbr_nr || !x86_pmu.lbr_msr_access)
return;
 
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
@@ -334,7 +334,7 @@ void intel_pmu_lbr_read(void)
 {
struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
 
-	if (!cpuc->lbr_users)
+	if (!cpuc->lbr_users || !x86_pmu.lbr_msr_access)
return;
 
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
-- 
1.8.3.1


[PATCH V2 2/3] perf protect LBR when Intel PT is enabled.

2014-07-02 Thread kan . liang
From: Kan Liang kan.li...@intel.com

If RTIT_CTL.TraceEn=1, any attempt to read or write the LBR or LER MSRs, 
including LBR_TOS, will result in a #GP.
Since Intel PT can be enabled/disabled at runtime, LBR MSRs have to be 
protected by _safe() at runtime.

Signed-off-by: Kan Liang kan.li...@intel.com
---
 arch/x86/kernel/cpu/perf_event.h   |  1 -
 arch/x86/kernel/cpu/perf_event_intel.c |  3 ---
 arch/x86/kernel/cpu/perf_event_intel_lbr.c | 38 +-
 3 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 5d977b2..fafb809 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -458,7 +458,6 @@ struct x86_pmu {
 	u64	lbr_sel_mask;		   /* LBR_SELECT valid bits */
 	const int	*lbr_sel_map;	   /* lbr_select mappings */
 	bool	lbr_double_abort;	   /* duplicated lbr aborts */
-	bool	lbr_msr_access;		   /* LBR MSR can be accessed */
/*
 * Extra registers for events
 */
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index 8011d42..ddd3590 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -2565,9 +2565,6 @@ __init int intel_pmu_init(void)
}
}
 
-	/* Access LBR MSR may cause #GP under certain circumstances. E.g. KVM doesn't support LBR MSR */
-	if (x86_pmu.lbr_nr)
-		x86_pmu.lbr_msr_access = test_msr_access(x86_pmu.lbr_tos) &&
-					 test_msr_access(x86_pmu.lbr_from);
 	/* Access extra MSR may cause #GP under certain circumstances. E.g. KVM doesn't support offcore event */
 	if (x86_pmu.extra_regs)
 		x86_pmu.extra_msr_access = test_msr_access(x86_pmu.extra_regs->msr);
diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index 9508d1e..980b8dc 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -157,7 +157,7 @@ static void intel_pmu_lbr_reset_32(void)
int i;
 
	for (i = 0; i < x86_pmu.lbr_nr; i++)
-		wrmsrl(x86_pmu.lbr_from + i, 0);
+		wrmsrl_safe(x86_pmu.lbr_from + i, 0ULL);
 }
 
 static void intel_pmu_lbr_reset_64(void)
@@ -165,14 +165,14 @@ static void intel_pmu_lbr_reset_64(void)
int i;
 
	for (i = 0; i < x86_pmu.lbr_nr; i++) {
-		wrmsrl(x86_pmu.lbr_from + i, 0);
-		wrmsrl(x86_pmu.lbr_to   + i, 0);
+		wrmsrl_safe(x86_pmu.lbr_from + i, 0ULL);
+		wrmsrl_safe(x86_pmu.lbr_to   + i, 0ULL);
}
 }
 
 void intel_pmu_lbr_reset(void)
 {
-	if (!x86_pmu.lbr_nr || !x86_pmu.lbr_msr_access)
+   if (!x86_pmu.lbr_nr)
return;
 
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
@@ -237,19 +237,14 @@ void intel_pmu_lbr_disable_all(void)
 /*
  * TOS = most recently recorded branch
  */
-static inline u64 intel_pmu_lbr_tos(void)
+static inline int intel_pmu_lbr_tos(u64 *tos)
 {
-   u64 tos;
-
-   rdmsrl(x86_pmu.lbr_tos, tos);
-
-   return tos;
+   return rdmsrl_safe(x86_pmu.lbr_tos, tos);
 }
 
-static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
+static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc, u64 tos)
 {
unsigned long mask = x86_pmu.lbr_nr - 1;
-   u64 tos = intel_pmu_lbr_tos();
int i;
 
	for (i = 0; i < x86_pmu.lbr_nr; i++) {
@@ -278,11 +273,10 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
  * is the same as the linear address, allowing us to merge the LIP and EIP
  * LBR formats.
  */
-static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
+static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc, u64 tos)
 {
unsigned long mask = x86_pmu.lbr_nr - 1;
int lbr_format = x86_pmu.intel_cap.lbr_format;
-   u64 tos = intel_pmu_lbr_tos();
int i;
int out = 0;
 
@@ -333,14 +327,24 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 void intel_pmu_lbr_read(void)
 {
struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events);
+   u64 tos;
 
-	if (!cpuc->lbr_users || !x86_pmu.lbr_msr_access)
+	if (!cpuc->lbr_users)
+   return;
+
+	/*
+	 * If KVM doesn't support the LBR MSRs, or Intel PT is enabled,
+	 * accessing the LBR MSRs causes a #GP.
+	 * Since Intel PT can be enabled/disabled at runtime,
+	 * check the LBR MSR access right here.
+	 */
+	if (intel_pmu_lbr_tos(&tos) < 0)
return;
 
if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_32)
-   intel_pmu_lbr_read_32(cpuc);
+   intel_pmu_lbr_read_32(cpuc, tos);
else
-   intel_pmu_lbr_read_64(cpuc);
+   intel_pmu_lbr_read_64(cpuc, tos);

Re: [PATCH V2 1/3] perf ignore LBR and offcore_rsp.

2014-07-02 Thread Andi Kleen
 Signed-off-by: Andi Kleen a...@linux.intel.com

I did not contribute to this patch, so please remove that SOB.

 Signed-off-by: Kan Liang kan.li...@intel.com

 	struct extra_reg *extra_regs;
 	unsigned int er_flags;
 +	bool	extra_msr_access;	/* EXTRA REG MSR can be accessed */
  

This doesn't look right, needs a flag for each extra register.

They are completely unrelated to each other.

BTW this will also cause KVM messages at each boot now.

 		wrmsrl(hwc->extra_reg.reg, hwc->extra_reg.config);
 	wrmsrl(hwc->config_base, (hwc->config | enable_mask) & ~disable_mask);
  }
 diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
 b/arch/x86/kernel/cpu/perf_event_intel.c
 index adb02aa..8011d42 100644
 --- a/arch/x86/kernel/cpu/perf_event_intel.c
 +++ b/arch/x86/kernel/cpu/perf_event_intel.c
 @@ -2565,6 +2565,13 @@ __init int intel_pmu_init(void)
   }
   }
  
 +	/* Access LBR MSR may cause #GP under certain circumstances. E.g. KVM doesn't support LBR MSR */
 +	if (x86_pmu.lbr_nr)
 +		x86_pmu.lbr_msr_access = test_msr_access(x86_pmu.lbr_tos) &&
 +					 test_msr_access(x86_pmu.lbr_from);

s///

And also this doesn't cover the case when someone takes over the LBRs and they 
start #GPing later.
So for LBR the test has to be still at each access.

-Andi


Re: [PATCH V2 1/3] perf ignore LBR and offcore_rsp.

2014-07-02 Thread Jidong Xiao
On Wed, Jul 2, 2014 at 2:14 PM,  kan.li...@intel.com wrote:
 From: Kan Liang kan.li...@intel.com

 x86, perf: Protect LBR and offcore rsp against KVM lying

 With -cpu host, KVM reports LBR and offcore support, if the host has support.
 When the guest perf driver tries to access LBR or offcore_rsp MSR,
 it #GPs on all MSR accesses, since KVM doesn't handle LBR and offcore support.
 So check the access rights of the related MSRs once at initialization time to
 avoid erroneous accesses at runtime.

 For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y.
This is for host kernel,
 And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n.
And this is for guest kernel, right?

-Jidong


Re: [PATCH V2 2/3] perf protect LBR when Intel PT is enabled.

2014-07-02 Thread Andi Kleen
On Wed, Jul 02, 2014 at 11:14:14AM -0700, kan.li...@intel.com wrote:
 From: Kan Liang kan.li...@intel.com
 
 If RTIT_CTL.TraceEn=1, any attempt to read or write the LBR or LER MSRs, 
 including LBR_TOS, will result in a #GP.
 Since Intel PT can be enabled/disabled at runtime, LBR MSRs have to be 
 protected by _safe() at runtime.
 
 Signed-off-by: Kan Liang kan.li...@intel.com

Patch looks good to me.

-Andi


Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Wanpeng Li
Hi Jan,
On Wed, Jul 02, 2014 at 11:01:30AM +0200, Jan Kiszka wrote:
On 2014-07-02 08:54, Wanpeng Li wrote:
 This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 
 
 If we didn't inject a still-pending event to L1 since nested_run_pending,
 KVM_REQ_EVENT should be requested after the vmexit in order to inject the 
 event to L1. However, current log blindly request a KVM_REQ_EVENT even if 
 there is no still-pending event to L1 which blocked by nested_run_pending. 
 There is a race which lead to an interrupt will be injected to L2 which 
 belong to L1 if L0 send an interrupt to L1 during this window. 
 
VCPU0   another thread 
 
 L1 intr not blocked on L2 first entry
 vmx_vcpu_run req event 
 kvm check request req event 
 check_nested_events don't have any intr 
 not nested exit 
 intr occur (8254, lapic timer 
 etc)
 inject_pending_event now have intr 
 inject interrupt 
 
 This patch fix this race by introduced a l1_events_blocked field in 
 nested_vmx 
 which indicates there is still-pending event which blocked by 
 nested_run_pending, 
 and smart request a KVM_REQ_EVENT if there is a still-pending event which 
 blocked 
 by nested_run_pending.

There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why
aren't those able to trigger this scenario?


Other KVM_REQ_EVENT requests have a specific intr or pending event; however,
the KVM_REQ_EVENT which is requested after a vmexit with nested_run_pending
may or may not have a specific intr or pending event.

Regards,
Wanpeng Li 

In any case, unconditionally setting KVM_REQ_EVENT seems strange and
should be changed.

Jan

 
 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
 ---
  arch/x86/kvm/vmx.c | 20 +++-
  1 file changed, 15 insertions(+), 5 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index f4e5aed..fe69c49 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -372,6 +372,7 @@ struct nested_vmx {
  	u64 vmcs01_tsc_offset;
  	/* L2 must run next, and mustn't decide to exit to L1. */
  	bool nested_run_pending;
 +	bool l1_events_blocked;
  	/*
  	 * Guest pages referred to in vmcs02 with host-physical pointers, so
  	 * we must keep them pinned while L2 runs.
 @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
  	 * we did not inject a still-pending event to L1 now because of
  	 * nested_run_pending, we need to re-enable this bit.
  	 */
 -	if (vmx->nested.nested_run_pending)
 +	if (to_vmx(vcpu)->nested.l1_events_blocked) {
 +		to_vmx(vcpu)->nested.l1_events_blocked = false;
  		kvm_make_request(KVM_REQ_EVENT, vcpu);
 +	}
  
  	vmx->nested.nested_run_pending = 0;
  
 @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
  
  	if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) &&
  	    vmx->nested.preemption_timer_expired) {
 -		if (vmx->nested.nested_run_pending)
 +		if (vmx->nested.nested_run_pending) {
 +			vmx->nested.l1_events_blocked = true;
  			return -EBUSY;
 +		}
  		nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0);
  		return 0;
  	}
  
  	if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) {
 -		if (vmx->nested.nested_run_pending ||
 -		    vcpu->arch.interrupt.pending)
 +		if (vmx->nested.nested_run_pending) {
 +			vmx->nested.l1_events_blocked = true;
 +			return -EBUSY;
 +		}
 +		if (vcpu->arch.interrupt.pending)
  			return -EBUSY;
  		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
  				  NMI_VECTOR | INTR_TYPE_NMI_INTR |
 @@ -8221,8 +8229,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
  
  	if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
  	    nested_exit_on_intr(vcpu)) {
 -		if (vmx->nested.nested_run_pending)
 +		if (vmx->nested.nested_run_pending) {
 +			vmx->nested.l1_events_blocked = true;
  			return -EBUSY;
 +		}
  		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
  	}
  
 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [RFC] COLO HA Project proposal

2014-07-02 Thread Hongyang Yang

Hi David,

On 07/01/2014 08:12 PM, Dr. David Alan Gilbert wrote:

* Hongyang Yang (yan...@cn.fujitsu.com) wrote:

Hi Yang,


Background:
   COLO HA project is a high availability solution. Both primary
VM (PVM) and secondary VM (SVM) run in parallel. They receive the
same request from client, and generate response in parallel too.
If the response packets from PVM and SVM are identical, they are
released immediately. Otherwise, a VM checkpoint (on demand) is
conducted. The idea is presented in Xen summit 2012, and 2013,
and academia paper in SOCC 2013. It's also presented in KVM forum
2013:
http://www.linux-kvm.org/wiki/images/1/1d/Kvm-forum-2013-COLO.pdf
Please refer to above document for detailed information.


Yes, I remember that talk - very interesting.

I didn't quite understand a couple of things though, perhaps you
can explain:
   1) If we ignore the TCP sequence number problem, in an SMP machine
don't we get other randomnesses - e.g. which core completes something
first, or who wins a lock contention, so the output stream might not
be identical - so do those normal bits of randomness cause the machines
to flag as out-of-sync?


It's about COLO agent, CCing Congyang, he can give the detailed
explanation.



   2) If the PVM has decided that the SVM is out of sync (due to 1) and
the PVM fails at about the same point - can we switch over to the SVM?


Yes, we can switch over; we have some mechanisms to ensure the SVM's state
is consistent:
- memory cache.
  The memory cache is initially the same as the PVM's memory. At a
checkpoint, we cache the dirty memory of the PVM while transporting the
memory, and write the cached memory to the SVM when we have received all
of the PVM's memory (we only need to write memory that was dirty on both
the PVM and the SVM since the last checkpoint; see the sketch after this
list). This solves problem 2) you've mentioned above: if the PVM fails
while checkpointing, the SVM will discard the cached memory and continue
to run and to provide service just as it is.

- COLO Disk manager
  Like the memory cache, the COLO disk manager caches the disk
modifications of the PVM, and writes them to the SVM's disk when
checkpointing. If the PVM fails while checkpointing, the SVM will
discard the cached disk modifications.
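
A schematic sketch of that memory-cache merge rule (illustrative C only,
not actual QEMU/COLO code; all names are made up):

#include <string.h>

#define PAGE_SIZE 4096UL

/* At a checkpoint, the cache holds the PVM's RAM content. Only pages
 * that were dirtied on both the PVM and the SVM since the last
 * checkpoint need to be copied into the SVM's RAM. */
static void apply_checkpoint(unsigned char *svm_ram,
			     const unsigned char *cache,
			     const unsigned char *pvm_dirty, /* one flag per page */
			     const unsigned char *svm_dirty,
			     unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		if (pvm_dirty[i] && svm_dirty[i])
			memcpy(svm_ram + i * PAGE_SIZE,
			       cache + i * PAGE_SIZE, PAGE_SIZE);
}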



I'm worried that due to (1) there are periods where the system
is out-of-sync and a failure of the PVM is not protected.  Does that happen?
If so how often?


The attached was the architecture of kvm-COLO we proposed.
   - COLO Manager: Requires modifications of qemu
 - COLO Controller
 COLO Controller includes modifications of save/restore
   flow just like MC(macrocheckpoint), a memory cache on
   secondary VM which cache the dirty pages of primary VM
   and a failover module which provides APIs to communicate
    with an external heartbeat module.
 - COLO Disk Manager
  When the PVM writes data into its image, the COLO disk manager
    captures this data and sends it to the SVM's COLO disk manager,
    which makes sure the content of the SVM's image is consistent
    with the content of the PVM's image.


I wonder if there is any way to coordinate this between COLO, Michael
Hines' microcheckpointing and the two separate reverse-execution
projects that also need to do some similar things.
Are there any standard APIs for the heartbeat thing we can already
tie into?


Sadly we have checked MC, it does not have heartbeat support for now.




   - COLO Agent(Proxy module in the arch picture)
   We need an agent to compare the packets returned by
 Primary VM and Secondary VM, and decide whether to start a
  checkpoint according to some rules. It is a Linux kernel
  module for the host.


Why is that a kernel module, and how does it communicate the state
to the QEMU instance?


The reason we made this a kernel module is to gain better performance.
We can easily hook the packets in a kernel module.
QEMU instance uses ioctl() to communicate with the COLO Agent.




   - Other minor modifications
   We may need other modifications for better performance.


Dave
P.S. I'm starting to look at fault-tolerance stuff, but haven't
got very far yet, so starting to try and understand the details
of COLO, microcheckpointing, etc


--
Thanks,
Yang.



--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
.



--
Thanks,
Yang.


RE: [PATCH V2 1/3] perf ignore LBR and offcore_rsp.

2014-07-02 Thread Liang, Kan


  Signed-off-by: Andi Kleen a...@linux.intel.com
 
 I did not contribute to this patch, so please remove that SOB.
 

OK

  Signed-off-by: Kan Liang kan.li...@intel.com
 
 	struct extra_reg *extra_regs;
 	unsigned int er_flags;
 +	bool	extra_msr_access;	/* EXTRA REG MSR can be accessed */
 
 
 This doesn't look right, needs a flag for each extra register.
 
 They are completely unrelated to each other.

The extra register is either MSR_OFFCORE_RSP_0 or MSR_OFFCORE_RSP_1.
I will add two variables to handle them.

 
 BTW this will also cause KVM messages at each boot now.


Do you mean the "unhandled wrmsr" or "unhandled rdmsr" messages?
You should not observe such error messages for the offcore MSRs and the LBR
from/to MSRs.

But I forgot to handle the LBR TOS in the KVM patch. You may observe "unhandled
rdmsr: 0x1c9" when doing LBR in the guest.
I will fix it in the next version.
 
 		wrmsrl(hwc->extra_reg.reg, hwc->extra_reg.config);
 	wrmsrl(hwc->config_base, (hwc->config | enable_mask) & ~disable_mask);
  }
 diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
 index adb02aa..8011d42 100644
 --- a/arch/x86/kernel/cpu/perf_event_intel.c
 +++ b/arch/x86/kernel/cpu/perf_event_intel.c
 @@ -2565,6 +2565,13 @@ __init int intel_pmu_init(void)
 	}
  }
 
 +	/* Access LBR MSR may cause #GP under certain circumstances. E.g. KVM doesn't support LBR MSR */
 +	if (x86_pmu.lbr_nr)
 +		x86_pmu.lbr_msr_access = test_msr_access(x86_pmu.lbr_tos) &&
 +					 test_msr_access(x86_pmu.lbr_from);
 
 s///
 
 And also this doesn't cover the case when someone takes over the LBRs and
 they start #GPing later.
 So for LBR the test has to be still at each access.

In the second patch, the LBR test has been moved to a runtime check.
That case has been handled.


Kan

 
 -Andi


RE: [PATCH V2 1/3] perf ignore LBR and offcore_rsp.

2014-07-02 Thread Liang, Kan


 
 On Wed, Jul 2, 2014 at 2:14 PM,  kan.li...@intel.com wrote:
  From: Kan Liang kan.li...@intel.com
 
  x86, perf: Protect LBR and offcore rsp against KVM lying
 
  With -cpu host, KVM reports LBR and offcore support, if the host has
 support.
  When the guest perf driver tries to access LBR or offcore_rsp MSR, it
  #GPs all MSR accesses,since KVM doesn't handle LBR and offcore support.
  So check the related MSRs access right once at initialization time to avoid
 the error access at runtime.
 
  For reproducing the issue, please build the kernel with
 CONFIG_KVM_INTEL = y.
 This is for host kernel,
  And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n.
 And this is for guest kernel, right?
 

Right.


 -Jidong


Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Wanpeng Li
Hi Bandan,
On Wed, Jul 02, 2014 at 12:27:59PM -0400, Bandan Das wrote:
Wanpeng Li wanpeng...@linux.intel.com writes:

 This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 
I can also reproduce this easily with Linux as L1 by slowing it down
eg. running with ept = 0

I suggest changing the subject to -
KVM: nVMX: Fix race that incorrectly injects L1's irq to L2


Ok, I will fold this to next version. ;-)

 If we didn't inject a still-pending event to L1 since nested_run_pending,
 KVM_REQ_EVENT should be requested after the vmexit in order to inject the 
 event to L1. However, current log blindly request a KVM_REQ_EVENT even if 

What's current log? Do you mean current code?


Yeah, it's a typo. I mean logic.

[...]
Also, I am wondering isn't it enough to just do this to avoid this race ?

 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
-	return (!to_vmx(vcpu)->nested.nested_run_pending &&
+	return (!is_guest_mode(vcpu) &&
+		!to_vmx(vcpu)->nested.nested_run_pending &&
 		vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &


I don't think this fixes the root cause of the race, and there are two cases in
your proposal that concern me:

- If there is a special L1 which doesn't ask to exit on external interrupts, you
  will lose the interrupts which L0 injects into L2.

- If inject_pending_event fails to inject an interrupt because of the change you
  made in vmx_interrupt_allowed, L0 will request an interrupt window for L2;
  however, the interrupt belongs to L1.

Regards,
Wanpeng Li 

Thanks,
Bandan


Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Bandan Das
Jan Kiszka jan.kis...@siemens.com writes:

 On 2014-07-02 08:54, Wanpeng Li wrote:
 This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 
 
  If we didn't inject a still-pending event to L1 because of nested_run_pending,
  KVM_REQ_EVENT should be requested after the vmexit in order to inject the
  event to L1. However, the current logic blindly requests a KVM_REQ_EVENT even
  if there is no still-pending event to L1 blocked by nested_run_pending.
  There is a race which leads to an interrupt that belongs to L1 being injected
  into L2 if L0 sends an interrupt to L1 during this window.
 
VCPU0   another thread 
 
 L1 intr not blocked on L2 first entry
 vmx_vcpu_run req event 
 kvm check request req event 
 check_nested_events don't have any intr 
 not nested exit 
                                          intr occur (8254, lapic timer etc)
 inject_pending_event now have intr 
 inject interrupt 
 
  This patch fixes this race by introducing an l1_events_blocked field in
  nested_vmx which indicates that there is a still-pending event blocked by
  nested_run_pending, and requests a KVM_REQ_EVENT only if there is such a
  blocked still-pending event.

 There are more, unrelated reasons why KVM_REQ_EVENT could be set. Why
 aren't those able to trigger this scenario?

 In any case, unconditionally setting KVM_REQ_EVENT seems strange and
 should be changed.


Ugh! I think I am hitting another one, but this one's probably because
we are not setting KVM_REQ_EVENT for something we should.

Before this patch, I was able to hit this bug every time with
modprobe kvm_intel ept=0 nested=1 enable_shadow_vmcs=0 and then booting
L2. I can verify that I was indeed hitting the race in inject_pending_event.

After this patch, I believe I am hitting another bug - this happens
after I boot L2, as above, and then start a Linux kernel compilation
and then wait and watch :) It's a pain to debug because it happens
only about once in three runs; it never happens if I run with ept=1. However,
I think that's only because the test completes sooner. But I can confirm
that I don't see it if I always set KVM_REQ_EVENT when nested_run_pending
is set, instead of the approach this patch takes.
(Any debugging hints appreciated!)

So, I am not sure if this is the right fix. Rather, I think the safer thing
to do is to have the pending-interrupt check for injection into L1 at
the same site as the call to kvm_queue_interrupt(), just like we had before
commit b6b8a1451fc40412c57d1. Is there any advantage to having all the
nested event checks together?

PS - Actually, a much easier fix (or rather hack) is to return 1 from
vmx_interrupt_allowed() (as I mentioned elsewhere) only if
!is_guest_mode(vcpu). That way, the pending interrupt
can be taken care of correctly during the next vmexit.
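
Spelled out, that hack would look roughly like this (an untested sketch for
illustration; everything below the is_guest_mode() check is the existing logic):

static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
{
	/*
	 * Refuse injection while running in guest mode, so a pending
	 * L1 interrupt is re-evaluated at the next vmexit instead of
	 * being injected into L2.
	 */
	if (is_guest_mode(vcpu))
		return 0;

	return (!to_vmx(vcpu)->nested.nested_run_pending &&
		vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
		  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
}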

Bandan

 Jan

 
 Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com
 ---
  arch/x86/kvm/vmx.c | 20 +++-
  1 file changed, 15 insertions(+), 5 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index f4e5aed..fe69c49 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -372,6 +372,7 @@ struct nested_vmx {
 	u64 vmcs01_tsc_offset;
 	/* L2 must run next, and mustn't decide to exit to L1. */
 	bool nested_run_pending;
 +	bool l1_events_blocked;
 	/*
 	 * Guest pages referred to in vmcs02 with host-physical pointers, so
 	 * we must keep them pinned while L2 runs.
 @@ -7380,8 +7381,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	 * we did not inject a still-pending event to L1 now because of
 	 * nested_run_pending, we need to re-enable this bit.
 	 */
 -	if (vmx->nested.nested_run_pending)
 +	if (to_vmx(vcpu)->nested.l1_events_blocked) {
 +		to_vmx(vcpu)->nested.l1_events_blocked = false;
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 +	}
 
 	vmx->nested.nested_run_pending = 0;
 
 @@ -8197,15 +8200,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 
 	if (nested_cpu_has_preemption_timer(get_vmcs12(vcpu)) &&
 	    vmx->nested.preemption_timer_expired) {
 -		if (vmx->nested.nested_run_pending)
 +		if (vmx->nested.nested_run_pending) {
 +			vmx->nested.l1_events_blocked = true;
 			return -EBUSY;
 +		}
 		nested_vmx_vmexit(vcpu, EXIT_REASON_PREEMPTION_TIMER, 0, 0);
 		return 0;
 	}
 
 	if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) {
 -		if (vmx->nested.nested_run_pending ||
 -		    vcpu->arch.interrupt.pending)
 +		if (vmx->nested.nested_run_pending) {
 +			vmx->nested.l1_events_blocked = true;
 +			return -EBUSY;
 +		}
 +		if (vcpu->arch.interrupt.pending)
 			return -EBUSY;
 		nested_vmx_vmexit(vcpu,

Re: [PATCH] KVM: nVMX: Fix IRQs inject to L2 which belong to L1 since race

2014-07-02 Thread Bandan Das
Wanpeng Li wanpeng...@linux.intel.com writes:

 Hi Bandan,
 On Wed, Jul 02, 2014 at 12:27:59PM -0400, Bandan Das wrote:
Wanpeng Li wanpeng...@linux.intel.com writes:

 This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=72381 
I can also reproduce this easily with Linux as L1 by slowing it down
e.g. running with ept = 0

I suggest changing the subject to -
KVM: nVMX: Fix race that incorrectly injects L1's irq to L2


 Ok, I will fold this into the next version. ;-)

 If we didn't inject a still-pending event to L1 since nested_run_pending,
 KVM_REQ_EVENT should be requested after the vmexit in order to inject the 
 event to L1. However, current log blindly request a KVM_REQ_EVENT even if 

What's current log? Do you mean current code?


 Yeah, it's a typo. I mean logic.

 [...]
Also, I am wondering isn't it enough to just do this to avoid this race ?

 static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
 {
-	return (!to_vmx(vcpu)->nested.nested_run_pending &&
+	return (!is_guest_mode(vcpu) &&
+		!to_vmx(vcpu)->nested.nested_run_pending &&
 		vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &


 I don't think this fixes the root cause of the race, and there are two cases
 in your proposal that concern me:

 - If there is a special L1 which doesn't ask to exit on external interrupts,
   you will lose the interrupts which L0 injects into L2.

Oh, didn't think about that case :), thanks for pointing this out.
It's easy to check this with Xen as L1, I suppose.

 - If inject_pending_event fails to inject an interrupt because of the change
   you made in vmx_interrupt_allowed, L0 will request an interrupt window for
   L2; however, the interrupt belongs to L1.
Oh, I thought inject_pending_event calls kvm_queue_interrupt only if
vmx_interrupt_allowed() returns true, so the interrupt will be injected
correctly on the next vmexit.

Anyway, I am hitting another bug with your patch! Please see my other mail to 
the list.

Thanks!

 Regards,
 Wanpeng Li 

Thanks,
Bandan


RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest

2014-07-02 Thread bharat.bhus...@freescale.com

 -Original Message-
 From: Wood Scott-B07421
 Sent: Tuesday, July 01, 2014 10:11 PM
 To: Alexander Graf
 Cc: Bhushan Bharat-R65777; kvm-ppc@vger.kernel.org; k...@vger.kernel.org
 Subject: Re: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection 
 to
 guest
 
 On Tue, 2014-07-01 at 18:22 +0200, Alexander Graf wrote:
  On 01.07.14 17:35, Scott Wood wrote:
   On Tue, 2014-07-01 at 17:04 +0200, Alexander Graf wrote:
   On 01.07.14 16:58, Scott Wood wrote:
   On Tue, 2014-07-01 at 08:23 +0200, Alexander Graf wrote:
   I don't think QEMU should be aware of these limitations.
   OK, but we should at least have some idea of how the whole thing
   is supposed to work, in order to determine if this is the correct
   behavior for QEMU.  I thought the model was that debug resources
   are either owned by QEMU or by the guest, and in the latter case,
   QEMU would never see the debug exception to begin with.
   That's bad for a number of reasons. For starters it's different
   from how
   x86 handles debug registers - and I hate to be different just for
   the sake of being different.
   How does it work on x86?
 
  It overwrites more-or-less random breakpoints with its own ones, but
  leaves the others intact ;).
 
 Are you talking about software breakpoints or management of hardware debug
 registers?
 
   So if we do want to declare that debug registers are owned by
   either QEMU or the guest, we should change the semantics for all
   architectures.
   If we want to say that ownership of the registers is shared, we need
   a plan for how that would actually work.
 
  I think you're overengineering here :). When do people actually use
  gdbstub? Usually when they want to debug a broken guest. We can either
 
 * overengineer heavily and reduce the number of registers available
  to the guest to always have spares
 * overengineer a bit and turn off guest debugging completely when
  we use gdbstub
 * just leave as much alive as we can, hoping that it helps with the
  debugging
 
  Option 3 is what x86 does - and I think it's a reasonable approach.
  This is not an interface that needs to be 100% consistent and bullet
  proof, it's a best effort to enable you to debug as much as possible.
 
 I'm not insisting on 100% -- just hoping for some explanation/discussion about
 how it's intended to work for the cases where it can.
 
 How will MSR[DE] and MSRP[DEP] be handled?
 
 How would I go about telling QEMU/KVM that I don't want this shared mode,
 because I don't want guest to interfere with the debugging I'm trying to do 
 from
 QEMU?
 
 Will guest accesses to debug registers cause a userspace exit when guest_debug
 is enabled?
 
   I think we're in a path that is slow enough already to not worry
   about performance.
   It's not just about performance, but simplicity of use, and
   consistency of API.
  
    Oh, and it looks like there already exist one-reg definitions and
    implementations for most of the debug registers.
 
  For BookE? Where?
 
 arch/powerpc/kvm/booke.c: KVM_REG_PPC_IACn, KVM_REG_PPC_DACn
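
For context, driving one of those existing one-reg definitions from userspace
looks roughly like this (a hedged sketch for a powerpc host, not QEMU code;
error handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* set a BookE instruction address breakpoint through the one-reg API */
static int set_iac1(int vcpu_fd, uint64_t addr)
{
	struct kvm_one_reg reg = {
		.id   = KVM_REG_PPC_IAC1,
		.addr = (uint64_t)(uintptr_t)&addr,
	};

	/* returns 0 on success, -1 with errno set on failure */
	return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
}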

I tried to quickly prototype what I think we want to do (this is not tested)

---
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index e8b3982..746b5c6 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -179,6 +179,9 @@ extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, 
u32 *server,
 extern int kvmppc_xics_int_on(struct kvm *kvm, u32 irq);
 extern int kvmppc_xics_int_off(struct kvm *kvm, u32 irq);
 
+void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu);
+void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 9f13056..0b7e4e4 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -235,6 +235,16 @@ void kvmppc_core_dequeue_dec(struct kvm_vcpu *vcpu)
	clear_bit(BOOKE_IRQPRIO_DECREMENTER, &vcpu->arch.pending_exceptions);
 }
 
+void kvmppc_core_queue_debug(struct kvm_vcpu *vcpu)
+{
+   kvmppc_booke_queue_irqprio(vcpu, BOOKE_IRQPRIO_DEBUG);
+}
+ 
+void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu)
+{
+	clear_bit(BOOKE_IRQPRIO_DEBUG, &vcpu->arch.pending_exceptions);
+}
+
 void kvmppc_core_queue_external(struct kvm_vcpu *vcpu,
 struct kvm_interrupt *irq)
 {
@@ -841,6 +851,20 @@ static int kvmppc_handle_debug(struct kvm_run *run, struct 
kvm_vcpu *vcpu)
	struct debug_reg *dbg_reg = &(vcpu->arch.shadow_dbg_reg);
	u32 dbsr = vcpu->arch.dbsr;
 
+	/* Userspace (QEMU) is not using the debug resources, so inject the
+	 * debug interrupt directly into the guest.
+	 */
+	if (vcpu->guest_debug == 0) {
+		if (dbsr && (vcpu->arch.shared->msr & MSR_DE))
+			

Re: [PATCH 4/6] KVM: PPC: BOOK3S: HV: Use new functions for mapping/unmapping hpte in host

2014-07-02 Thread Aneesh Kumar K.V
Paul Mackerras pau...@samba.org writes:

 On Sun, Jun 29, 2014 at 04:47:33PM +0530, Aneesh Kumar K.V wrote:
 We want to use the virtual page class key protection mechanism for
 indicating an MMIO-mapped hpte entry or a guest hpte entry that is swapped out
 in the host. Those hptes will be marked valid, but have the virtual page
 class key set to 30 or 31. These virtual page class numbers are
 configured in the AMR to deny read/write. To accommodate such a change, add
 new functions that map, unmap and check whether a hpte is mapped in the
 host. This patch still uses HPTE_V_VALID and HPTE_V_ABSENT and doesn't use
 virtual page class keys. But we want to differentiate in the code
 the places where we explicitly check for HPTE_V_VALID from the places where we
 want to check whether the hpte is host mapped. This patch enables a closer
 review of such a change.
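
To make the key arithmetic concrete, here is an illustrative (untested) sketch;
the helper below is mine, not from the series. Each key owns two adjacent AMR
bits (a read-deny bit and a write-deny bit), so setting both denies all access
to pages tagged with that key:

#define AMR_KEY_DENY_ALL	0x3ULL

static inline unsigned long amr_deny_key(unsigned long amr, int key)
{
	/* key 0 occupies the two most significant bits of the 64-bit AMR */
	return amr | (AMR_KEY_DENY_ALL << (62 - 2 * key));
}

/* e.g. reserving keys 30 and 31 for the host, as described above:
 * amr = amr_deny_key(amr_deny_key(amr, 30), 31);
 */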

 [...]

  /* Check for pending invalidations under the rmap chain lock */
 	if (kvm->arch.using_mmu_notifiers &&
 	    mmu_notifier_retry(kvm, mmu_seq)) {
 -		/* inval in progress, write a non-present HPTE */
 -		pteh |= HPTE_V_ABSENT;
 -		pteh &= ~HPTE_V_VALID;
 +		/*
 +		 * inval in progress in host, write host unmapped pte.
 +		 */
 +		host_unmapped_hpte = 1;

 This isn't right.  We already have HPTE_V_VALID set here, and you now
 don't clear it here, and it doesn't get cleared by the
 __kvmppc_unmap_host_hpte() call below either.


Ok, missed that. Will fix it in the next update. In the earlier version
I had kvmppc_unmap_host_hpte always clearing HPTE_V_VALID.

-aneesh



Re: [PATCH 5/6] KVM: PPC: BOOK3S: Use hpte_update_in_progress to track invalid hpte during an hpte update

2014-07-02 Thread Aneesh Kumar K.V
Paul Mackerras pau...@samba.org writes:

 On Sun, Jun 29, 2014 at 04:47:34PM +0530, Aneesh Kumar K.V wrote:
 As per the ISA, we first need to mark the hpte invalid (V=0) before we update
 the hpte lower half bits. With the virtual page class key protection mechanism
 we want to send any fault other than a key fault to the guest directly, without
 searching the hash page table. But then we can get a NO_HPTE fault while
 we are updating the hpte. To track that, add a vm-specific atomic
 variable that we check in the fault path, so such faults are always sent
 to the host.
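
Schematically, the protocol described above would be (a hedged sketch assuming
kernel context; the counter name is from the series, the helpers are
illustrative):

/* update side: bump the counter around the transient V=0 window */
static void hpte_update_begin(struct kvm *kvm, __be64 *hptep)
{
	atomic_inc(&kvm->arch.hpte_update_in_progress);
	/* mark the hpte invalid before touching the lower half, per the ISA */
	hptep[0] &= ~cpu_to_be64(HPTE_V_VALID);
}

static void hpte_update_end(struct kvm *kvm)
{
	atomic_dec(&kvm->arch.hpte_update_in_progress);
}

/* fault side: a NO_HPTE fault seen while an update is in flight may just
 * be the transient V=0 window, so send it to the host for resolution */
static bool hpte_fault_goes_to_host(struct kvm *kvm)
{
	return atomic_read(&kvm->arch.hpte_update_in_progress) != 0;
}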

 [...]

 @@ -750,13 +751,15 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	r = rcbits | ~(HPTE_R_R | HPTE_R_C);
 
 	if (be64_to_cpu(hptep[0]) & HPTE_V_VALID) {
 -		/* HPTE was previously valid, so we need to invalidate it */
 +		/*
 +		 * If we had mapped this hpte before, we now need to
 +		 * invalidate that.
 +		 */
 		unlock_rmap(rmap);
 -		/* Always mark HPTE_V_ABSENT before invalidating */
 -		kvmppc_unmap_host_hpte(kvm, hptep);
 		kvmppc_invalidate_hpte(kvm, hptep, index);
 		/* don't lose previous R and C bits */
 		r |= be64_to_cpu(hptep[1]) & (HPTE_R_R | HPTE_R_C);
 +		hpte_invalidated = true;

 So now we're not setting the ABSENT bit before invalidating the HPTE.
 That means that another guest vcpu could do an H_ENTER which could
 think that this HPTE is free and use it for another unrelated guest
 HPTE, which would be bad...

But H_ENTER looks at HPTE_V_HVLOCK, and we keep that set throughout. But
I will double-check the code again to make sure it is safe in the above
scenario.


 @@ -1144,8 +1149,8 @@ static int kvm_test_clear_dirty_npages(struct kvm *kvm, unsigned long *rmapp)
 		npages_dirty = n;
 		eieio();
 	}
 -	kvmppc_map_host_hpte(kvm, v, r);
 -	hptep[0] = cpu_to_be64(v & ~HPTE_V_HVLOCK);
 +	hptep[0] = cpu_to_be64(v & ~HPTE_V_LOCK);
 +	atomic_dec(&kvm->arch.hpte_update_in_progress);

 Why are we using LOCK rather than HVLOCK now?  (And why didn't you
 mention this change and its rationale in the patch description?)

Sorry, that is a typo. I intended to use HPTE_V_HVLOCK.

-aneesh



Re: [PATCH 2/6] KVM: PPC: BOOK3S: HV: Deny virtual page class key update via h_protect

2014-07-02 Thread Aneesh Kumar K.V
Paul Mackerras pau...@samba.org writes:

 On Sun, Jun 29, 2014 at 04:47:31PM +0530, Aneesh Kumar K.V wrote:
 This makes it consistent with h_enter, where we clear the key
 bits. We also want to use the virtual page class key protection mechanism
 for indicating a host page fault. For that we will be using key class
 indices 30 and 31. So prevent the guest from updating key bits until
 we add proper support for the virtual page class protection mechanism for
 the guest. This will not have any impact on PAPR Linux guests because
 Linux guests currently don't use the virtual page class key protection model.

 As things stand, without this patch series, we do actually have
 everything we need in the kernel for guests to use virtual page class
 keys.  Arguably we should have a capability to tell userspace how many
 storage keys the guest can use, but that's the only missing piece as
 far as I can see.

yes.


 If we add such a capability, I can't see any reason why we should need
 to disable guest use of storage keys in this patchset.

With this patchset, we would need additional changes to find out whether a key
fault happened because of the guest's own usage of keys. I was planning to do
that as an add-on series to keep the changes in this one minimal. Also, since
Linux guests didn't use keys, I was not sure whether guest support for keys is
an important item.

-aneesh



RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection to guest

2014-07-02 Thread bharat.bhus...@freescale.com


 -Original Message-
 From: Bhushan Bharat-R65777
 Sent: Wednesday, July 02, 2014 5:07 PM
 To: Wood Scott-B07421; Alexander Graf
 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org
 Subject: RE: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt injection 
 to
 guest
 
 
  -Original Message-
  From: Wood Scott-B07421
  Sent: Tuesday, July 01, 2014 10:11 PM
  To: Alexander Graf
  Cc: Bhushan Bharat-R65777; kvm-ppc@vger.kernel.org;
  k...@vger.kernel.org
  Subject: Re: [PATCH 2/2] KVM : powerpc/booke: Allow debug interrupt
  injection to guest
 
  On Tue, 2014-07-01 at 18:22 +0200, Alexander Graf wrote:
   On 01.07.14 17:35, Scott Wood wrote:
On Tue, 2014-07-01 at 17:04 +0200, Alexander Graf wrote:
On 01.07.14 16:58, Scott Wood wrote:
On Tue, 2014-07-01 at 08:23 +0200, Alexander Graf wrote:
I don't think QEMU should be aware of these limitations.
OK, but we should at least have some idea of how the whole thing
is supposed to work, in order to determine if this is the
correct behavior for QEMU.  I thought the model was that debug
resources are either owned by QEMU or by the guest, and in the
latter case, QEMU would never see the debug exception to begin with.
That's bad for a number of reasons. For starters it's different
from how
x86 handles debug registers - and I hate to be different just for
the sake of being different.
How does it work on x86?
  
   It overwrites more-or-less random breakpoints with its own ones, but
   leaves the others intact ;).
 
  Are you talking about software breakpoints or management of hardware
  debug registers?
 
So if we do want to declare that debug registers are owned by
either QEMU or the guest, we should change the semantics for all
architectures.
If we want to say that ownership of the registers is shared, we
need a plan for how that would actually work.
  
   I think you're overengineering here :). When do people actually use
   gdbstub? Usually when they want to debug a broken guest. We can
   either
  
  * overengineer heavily and reduce the number of registers
   available to the guest to always have spares
  * overengineer a bit and turn off guest debugging completely when
   we use gdbstub
  * just leave as much alive as we can, hoping that it helps with
   the debugging
  
   Option 3 is what x86 does - and I think it's a reasonable approach.
   This is not an interface that needs to be 100% consistent and bullet
   proof, it's a best effort to enable you to debug as much as possible.
 
  I'm not insisting on 100% -- just hoping for some
  explanation/discussion about how it's intended to work for the cases where 
  it
 can.
 
  How will MSR[DE] and MSRP[DEP] be handled?
 
  How would I go about telling QEMU/KVM that I don't want this shared
  mode, because I don't want guest to interfere with the debugging I'm
  trying to do from QEMU?
 
  Will guest accesses to debug registers cause a userspace exit when
  guest_debug is enabled?
 
I think we're in a path that is slow enough already to not worry
about performance.
It's not just about performance, but simplicity of use, and
consistency of API.
   
    Oh, and it looks like there already exist one-reg definitions and
    implementations for most of the debug registers.
  
   For BookE? Where?
 
  arch/powerpc/kvm/booke.c: KVM_REG_PPC_IACn, KVM_REG_PPC_DACn
 
 I tried to quickly prototype what I think we want to do (this is not tested)

Hi Scott,

There is one problem which is stopping us from sharing debug resources between
QEMU and the guest; looking for suggestions:
- As QEMU is also using the debug resources, we have to set MSR_DE and set
MSRP_DEP (so the guest will not be able to clear MSR_DE). Thus QEMU-set debug
events will always cause debug interrupts.
- The guest is also using the debug resources, and for some reason it may want
to clear MSR_DE (disable debug interrupts). But it will not be able to do so
because MSRP_DEP is set, and KVM will not come to know of the guest's wish to
disable MSR_DE.
- If a debug interrupt occurs, we will exit to QEMU; this may not be a QEMU-set
event, so QEMU will inject the interrupt into the guest (using one-reg or
set-sregs).
- Now KVM, when handling the one-reg/sregs request to inject a debug interrupt,
does not know whether the guest can handle the debug interrupt or not (as the
guest might have tried to set/clear MSR_DE).

Thanks
-Bharat

 
 ---
 diff --git a/arch/powerpc/include/asm/kvm_ppc.h
 b/arch/powerpc/include/asm/kvm_ppc.h
 index e8b3982..746b5c6 100644
 --- a/arch/powerpc/include/asm/kvm_ppc.h
 +++ b/arch/powerpc/include/asm/kvm_ppc.h
 @@ -179,6 +179,9 @@ extern int kvmppc_xics_get_xive(struct kvm *kvm, u32 irq, u32 *server,
  extern int kvmppc_xics_int_on(struct kvm *kvm, u32 irq);
  extern int kvmppc_xics_int_off(struct kvm *kvm, u32 irq);
 
 +void kvmppc_core_dequeue_debug(struct kvm_vcpu *vcpu); void