Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Gleb Natapov
On Wed, Feb 20, 2013 at 07:28:52PM -0600, Scott Wood wrote:
> On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:
> >On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >> >It is then not necessary to set device attributes on a live
> >guest and
> >> >deal with the complications associated with that.
> >>
> >> Which complications?
> >>
> >> -Scott
> >
> >Semantics of individual attribute writes, for one.
> 
> When the attribute is a device register, the hardware documentation
> takes care of that.  Otherwise, the semantics are documented in the
> device-specific documentation -- which can include restricting when
> the access is allowed.  Same as with any other interface
> documentation.
> 
> I suppose mpic.txt could use an additional statement that
> KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed
> by the guest.
> 
If access to an attribute has a guest visible side effect you cannot
use this interface for migration. This is exactly the point in favor
of distinguish accesses that have side effects (COMMAND or whatever)
and accesses that set/get attribute (SET|GET_ATTR).

--
Gleb.


Re: [PATCH v3 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-20 Thread Gleb Natapov
On Thu, Feb 21, 2013 at 06:04:52AM +, Zhang, Yang Z wrote:
> Hi Marcelo,
> 
> Can you help to review this patch? Many thanks if you can review it quickly.
> 
The patch is only 2 days on the list. Be patient.

> Zhang, Yang Z wrote on 2013-02-19:
> > From: Yang Zhang 
> > 
> > Posted Interrupt allows APIC interrupts to be injected into the guest
> > directly without any vmexit.
> > 
> > - When delivering an interrupt to the guest, if the target vcpu is running,
> >   update the Posted-interrupt requests bitmap and send a notification event
> >   to the vcpu. The vcpu will then handle this interrupt automatically,
> >   without any software involvement.
> > - If the target vcpu is not running, or a notification event is already
> >   pending in the vcpu, do nothing. The interrupt will be handled at the
> >   next vm entry.
> > Signed-off-by: Yang Zhang 
> > ---
> >  arch/x86/include/asm/entry_arch.h  |    4 +
> >  arch/x86/include/asm/hw_irq.h      |    1 +
> >  arch/x86/include/asm/irq_vectors.h |    5 +
> >  arch/x86/include/asm/kvm_host.h    |    3 +
> >  arch/x86/include/asm/vmx.h         |    4 +
> >  arch/x86/kernel/entry_64.S         |    5 +
> >  arch/x86/kernel/irq.c              |   20 +
> >  arch/x86/kernel/irqinit.c          |    4 +
> >  arch/x86/kvm/lapic.c               |   19 -
> >  arch/x86/kvm/lapic.h               |    1 +
> >  arch/x86/kvm/svm.c                 |   13 +++
> >  arch/x86/kvm/vmx.c                 |  157 +++-
> >  arch/x86/kvm/x86.c                 |    1 +
> >  13 files changed, 214 insertions(+), 23 deletions(-)
> > diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
> > index 40afa00..9bd4eca 100644
> > --- a/arch/x86/include/asm/entry_arch.h
> > +++ b/arch/x86/include/asm/entry_arch.h
> > @@ -19,6 +19,10 @@ BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR)
> > 
> >  BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
> > +#ifdef CONFIG_HAVE_KVM
> > +BUILD_INTERRUPT(kvm_posted_intr_ipi, POSTED_INTR_VECTOR)
> > +#endif
> > +
> >  /*
> >   * every pentium local APIC has two 'local interrupts', with a
> >   * soft-definable vector attached to both interrupts, one of
> > diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
> > index eb92a6e..cebef02 100644
> > --- a/arch/x86/include/asm/hw_irq.h
> > +++ b/arch/x86/include/asm/hw_irq.h
> > @@ -28,6 +28,7 @@
> >  /* Interrupt handlers registered during init_IRQ */ extern void
> >  apic_timer_interrupt(void); extern void x86_platform_ipi(void); +extern
> >  void kvm_posted_intr_ipi(void); extern void error_interrupt(void);
> >  extern void irq_work_interrupt(void);
> > diff --git a/arch/x86/include/asm/irq_vectors.h
> > b/arch/x86/include/asm/irq_vectors.h index 1508e51..774dc9f 100644 ---
> > a/arch/x86/include/asm/irq_vectors.h +++
> > b/arch/x86/include/asm/irq_vectors.h @@ -102,6 +102,11 @@
> >   */
> >  #define X86_PLATFORM_IPI_VECTOR0xf7
> > +/* Vector for KVM to deliver posted interrupt IPI */
> > +#ifdef CONFIG_HAVE_KVM
> > +#define POSTED_INTR_VECTOR 0xf2
> > +#endif
> > +
> >  /*
> >   * IRQ work vector:
> >   */
> > diff --git a/arch/x86/include/asm/kvm_host.h
> > b/arch/x86/include/asm/kvm_host.h index b8388e9..79da55e 100644 ---
> > a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -704,6 +704,9 @@ struct kvm_x86_ops {
> > void (*hwapic_isr_update)(struct kvm *kvm, int isr);
> > void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
> > void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> > +   bool (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector,
> > +   int *result, bool *delivered);
> > +   void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
> > int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
> > int (*get_tdp_level)(void);
> > u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> > index 5c9dbad..ce8ac80 100644
> > --- a/arch/x86/include/asm/vmx.h
> > +++ b/arch/x86/include/asm/vmx.h
> > @@ -158,6 +158,7 @@
> >  #define PIN_BASED_EXT_INTR_MASK 0x0001
> >  #define PIN_BASED_NMI_EXITING   0x0008
> >  #define PIN_BASED_VIRTUAL_NMIS  0x0020
> > +#define PIN_BASED_POSTED_INTR   0x0080
> > 
> >  #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x0002 #define
> >  VM_EXIT_HOST_ADDR_SPACE_SIZE0x0200 @@ -180,6 +181,7 @@
> >  /* VMCS Encodings */ enum vmcs_field { VIRTUAL_PROCESSOR_ID   
> >  = 0x, +POSTED_INTR_NV  = 0x0002,
> > GUEST_ES_SELECTOR   = 0x0800,   GUEST_CS_SELECTOR  
> >  = 0x0802,  GUEST_SS_SELECTOR   = 0x0804, @@
> >  -214,6 +216,8 @@ enum vmcs_field { VIRTUAL_APIC_PAGE_ADDR_HIGH 
> > =
> >  0x2013, 

RE: [PATCH v3 2/2] KVM: VMX: Add Posted Interrupt supporting

2013-02-20 Thread Zhang, Yang Z
Hi Marcelo,

Can you help to review this patch? Many thanks if you can review it quickly.

Zhang, Yang Z wrote on 2013-02-19:
> From: Yang Zhang 
> 
> Posted Interrupt allows APIC interrupts to be injected into the guest
> directly without any vmexit.
> 
> - When delivering an interrupt to the guest, if the target vcpu is running,
>   update the Posted-interrupt requests bitmap and send a notification event
>   to the vcpu. The vcpu will then handle this interrupt automatically,
>   without any software involvement.
> - If the target vcpu is not running, or a notification event is already
>   pending in the vcpu, do nothing. The interrupt will be handled at the
>   next vm entry.
> Signed-off-by: Yang Zhang 
> ---
>  arch/x86/include/asm/entry_arch.h  |    4 +
>  arch/x86/include/asm/hw_irq.h      |    1 +
>  arch/x86/include/asm/irq_vectors.h |    5 +
>  arch/x86/include/asm/kvm_host.h    |    3 +
>  arch/x86/include/asm/vmx.h         |    4 +
>  arch/x86/kernel/entry_64.S         |    5 +
>  arch/x86/kernel/irq.c              |   20 +
>  arch/x86/kernel/irqinit.c          |    4 +
>  arch/x86/kvm/lapic.c               |   19 -
>  arch/x86/kvm/lapic.h               |    1 +
>  arch/x86/kvm/svm.c                 |   13 +++
>  arch/x86/kvm/vmx.c                 |  157 +++-
>  arch/x86/kvm/x86.c                 |    1 +
>  13 files changed, 214 insertions(+), 23 deletions(-)
> diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
> index 40afa00..9bd4eca 100644
> --- a/arch/x86/include/asm/entry_arch.h
> +++ b/arch/x86/include/asm/entry_arch.h
> @@ -19,6 +19,10 @@ BUILD_INTERRUPT(reboot_interrupt,REBOOT_VECTOR)
> 
>  BUILD_INTERRUPT(x86_platform_ipi, X86_PLATFORM_IPI_VECTOR)
> +#ifdef CONFIG_HAVE_KVM
> +BUILD_INTERRUPT(kvm_posted_intr_ipi, POSTED_INTR_VECTOR)
> +#endif
> +
>  /*
>   * every pentium local APIC has two 'local interrupts', with a
>   * soft-definable vector attached to both interrupts, one of
> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
> index eb92a6e..cebef02 100644
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -28,6 +28,7 @@
>  /* Interrupt handlers registered during init_IRQ */ extern void
>  apic_timer_interrupt(void); extern void x86_platform_ipi(void); +extern
>  void kvm_posted_intr_ipi(void); extern void error_interrupt(void);
>  extern void irq_work_interrupt(void);
> diff --git a/arch/x86/include/asm/irq_vectors.h
> b/arch/x86/include/asm/irq_vectors.h index 1508e51..774dc9f 100644 ---
> a/arch/x86/include/asm/irq_vectors.h +++
> b/arch/x86/include/asm/irq_vectors.h @@ -102,6 +102,11 @@
>   */
>  #define X86_PLATFORM_IPI_VECTOR  0xf7
> +/* Vector for KVM to deliver posted interrupt IPI */
> +#ifdef CONFIG_HAVE_KVM
> +#define POSTED_INTR_VECTOR   0xf2
> +#endif
> +
>  /*
>   * IRQ work vector:
>   */
> diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h index b8388e9..79da55e 100644 ---
> a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h
> @@ -704,6 +704,9 @@ struct kvm_x86_ops {
>   void (*hwapic_isr_update)(struct kvm *kvm, int isr);
>   void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
>   void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set);
> + bool (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector,
> + int *result, bool *delivered);
> + void (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
>   int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
>   int (*get_tdp_level)(void);
>   u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 5c9dbad..ce8ac80 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -158,6 +158,7 @@
>  #define PIN_BASED_EXT_INTR_MASK 0x0001
>  #define PIN_BASED_NMI_EXITING   0x0008
>  #define PIN_BASED_VIRTUAL_NMIS  0x0020
> +#define PIN_BASED_POSTED_INTR   0x0080
> 
>  #define VM_EXIT_SAVE_DEBUG_CONTROLS 0x0002 #define
>  VM_EXIT_HOST_ADDR_SPACE_SIZE0x0200 @@ -180,6 +181,7 @@
>  /* VMCS Encodings */ enum vmcs_field {   VIRTUAL_PROCESSOR_ID   
>  = 0x, +  POSTED_INTR_NV  = 0x0002,
>   GUEST_ES_SELECTOR   = 0x0800,   GUEST_CS_SELECTOR  
>  = 0x0802,GUEST_SS_SELECTOR   = 0x0804, @@
>  -214,6 +216,8 @@ enum vmcs_field {   VIRTUAL_APIC_PAGE_ADDR_HIGH =
>  0x2013,  APIC_ACCESS_ADDR= 0x2014,   
> APIC_ACCESS_ADDR_HIGH   =
>  0x2015,
> + POSTED_INTR_DESC_ADDR   = 0x2016,
> + POSTED_INTR_DESC_ADDR_HIGH  = 0x2017,
>   EPT_POINTER = 0x201a,
>   EPT_POINTER_HIGH  

[PATCH v2 13/15] target-s390x: Refactor debug output macros

2013-02-20 Thread Andreas Färber
Make debug output compile-testable even if disabled.

Rename dprintf() in kvm.c to kvm_dprintf() due to a conflict with glibc.

Drop unused DEBUG_HELPER and LOG_HELPER() in fpu_helper.c.
Drop unused LOG_DISAS() in translate.c and inline S390X_DEBUG_DISAS.

Signed-off-by: Andreas Färber 
---
 target-s390x/cc_helper.c   |   14 ++--
 target-s390x/fpu_helper.c  |7 --
 target-s390x/helper.c  |   53 ++--
 target-s390x/int_helper.c  |   14 ++--
 target-s390x/kvm.c |   33 +--
 target-s390x/mem_helper.c  |   14 ++--
 target-s390x/misc_helper.c |   14 ++--
 target-s390x/translate.c   |   23 +--
 8 files changed, 120 insertions(+), 52 deletions(-)

diff --git a/target-s390x/cc_helper.c b/target-s390x/cc_helper.c
index a6d60bf..e45a19a 100644
--- a/target-s390x/cc_helper.c
+++ b/target-s390x/cc_helper.c
@@ -24,11 +24,21 @@
 
 /* #define DEBUG_HELPER */
 #ifdef DEBUG_HELPER
-#define HELPER_LOG(x...) qemu_log(x)
+static const bool debug_helper = true;
 #else
-#define HELPER_LOG(x...)
+static const bool debug_helper;
 #endif
 
+static void GCC_FMT_ATTR(1, 2) HELPER_LOG(const char *fmt, ...)
+{
+if (debug_helper) {
+va_list ap;
+va_start(ap, fmt);
+qemu_log_vprintf(fmt, ap);
+va_end(ap);
+}
+}
+
 static uint32_t cc_calc_ltgt_32(int32_t src, int32_t dst)
 {
 if (src == dst) {
diff --git a/target-s390x/fpu_helper.c b/target-s390x/fpu_helper.c
index 94375b6..937da59 100644
--- a/target-s390x/fpu_helper.c
+++ b/target-s390x/fpu_helper.c
@@ -25,13 +25,6 @@
 #include "exec/softmmu_exec.h"
 #endif
 
-/* #define DEBUG_HELPER */
-#ifdef DEBUG_HELPER
-#define HELPER_LOG(x...) qemu_log(x)
-#else
-#define HELPER_LOG(x...)
-#endif
-
 #define RET128(F) (env->retxl = F.low, F.high)
 
 #define convert_bit(mask, from, to) \
diff --git a/target-s390x/helper.c b/target-s390x/helper.c
index 1183b45..4a02251 100644
--- a/target-s390x/helper.c
+++ b/target-s390x/helper.c
@@ -30,24 +30,53 @@
 //#define DEBUG_S390_STDOUT
 
 #ifdef DEBUG_S390
-#ifdef DEBUG_S390_STDOUT
-#define DPRINTF(fmt, ...) \
-do { fprintf(stderr, fmt, ## __VA_ARGS__); \
- qemu_log(fmt, ##__VA_ARGS__); } while (0)
+static const bool debug_helper = true;
 #else
-#define DPRINTF(fmt, ...) \
-do { qemu_log(fmt, ## __VA_ARGS__); } while (0)
+static const bool debug_helper;
 #endif
+
+#ifdef DEBUG_S390_PTE
+static const bool debug_pte = true;
 #else
-#define DPRINTF(fmt, ...) \
-do { } while (0)
+static const bool debug_pte;
 #endif
 
-#ifdef DEBUG_S390_PTE
-#define PTE_DPRINTF DPRINTF
+#ifdef DEBUG_S390_STDOUT
+static const bool debug_to_stdout = true;
 #else
-#define PTE_DPRINTF(fmt, ...) \
-do { } while (0)
+static const bool debug_to_stdout;
+#endif
+
+#ifndef CONFIG_USER_ONLY
+static inline void GCC_FMT_ATTR(1, 0) vDPRINTF(const char *fmt, va_list ap)
+{
+if (debug_helper) {
+if (debug_to_stdout) {
+vfprintf(stderr, fmt, ap);
+}
+qemu_log_vprintf(fmt, ap);
+}
+}
+
+static void GCC_FMT_ATTR(1, 2) DPRINTF(const char *fmt, ...)
+{
+if (debug_helper) {
+va_list ap;
+va_start(ap, fmt);
+vDPRINTF(fmt, ap);
+va_end(ap);
+}
+}
+
+static void GCC_FMT_ATTR(1, 2) PTE_DPRINTF(const char *fmt, ...)
+{
+if (debug_pte) {
+va_list ap;
+va_start(ap, fmt);
+vDPRINTF(fmt, ap);
+va_end(ap);
+}
+}
 #endif
 
 #ifndef CONFIG_USER_ONLY
diff --git a/target-s390x/int_helper.c b/target-s390x/int_helper.c
index 6858301..c7da7e6 100644
--- a/target-s390x/int_helper.c
+++ b/target-s390x/int_helper.c
@@ -24,11 +24,21 @@
 
 /* #define DEBUG_HELPER */
 #ifdef DEBUG_HELPER
-#define HELPER_LOG(x...) qemu_log(x)
+static const bool debug_helper = true;
 #else
-#define HELPER_LOG(x...)
+static const bool debug_helper;
 #endif
 
+static void GCC_FMT_ATTR(1, 2) HELPER_LOG(const char *fmt, ...)
+{
+if (debug_helper) {
+va_list ap;
+va_start(ap, fmt);
+qemu_log_vprintf(fmt, ap);
+va_end(ap);
+}
+}
+
 /* 64/64 -> 128 unsigned multiplication */
 uint64_t HELPER(mul128)(CPUS390XState *env, uint64_t v1, uint64_t v2)
 {
diff --git a/target-s390x/kvm.c b/target-s390x/kvm.c
index 3929771..8eea7f1 100644
--- a/target-s390x/kvm.c
+++ b/target-s390x/kvm.c
@@ -38,13 +38,21 @@
 /* #define DEBUG_KVM */
 
 #ifdef DEBUG_KVM
-#define dprintf(fmt, ...) \
-do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
+static const bool debug_kvm = true;
 #else
-#define dprintf(fmt, ...) \
-do { } while (0)
+static const bool debug_kvm;
 #endif
 
+static void GCC_FMT_ATTR(1, 2) kvm_dprintf(const char *fmt, ...)
+{
+if (debug_kvm) {
+va_list ap;
+va_start(ap, fmt);
+vfprintf(stderr, fmt, ap);
+va_end(ap);
+}
+}
+
 #define IPA0_DIAG   0x8300
 #define IPA0_SIGP   0xae00
 #define IPA0_B

[PATCH v2 08/15] target-i386: Refactor debug output macros

2013-02-20 Thread Andreas Färber
Make debug output compile-testable even if disabled.

Signed-off-by: Andreas Färber 
Cc: Richard Henderson 
---
 target-i386/helper.c |   42 ++
 target-i386/kvm.c|   16 
 target-i386/seg_helper.c |   24 +++-
 3 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/target-i386/helper.c b/target-i386/helper.c
index 4bf9db7..696e4e1 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -26,6 +26,22 @@
 
 //#define DEBUG_MMU
 
+#ifdef DEBUG_MMU
+static const bool debug_mmu = true;
+#else
+static const bool debug_mmu;
+#endif
+
+static void GCC_FMT_ATTR(1, 2) mmu_dprintf(const char *fmt, ...)
+{
+if (debug_mmu) {
+va_list ap;
+va_start(ap, fmt);
+vprintf(fmt, ap);
+va_end(ap);
+}
+}
+
 static void cpu_x86_version(CPUX86State *env, int *family, int *model)
 {
 int cpuver = env->cpuid_version;
@@ -372,9 +388,8 @@ void x86_cpu_set_a20(X86CPU *cpu, int a20_state)
 
 a20_state = (a20_state != 0);
 if (a20_state != ((env->a20_mask >> 20) & 1)) {
-#if defined(DEBUG_MMU)
-printf("A20 update: a20=%d\n", a20_state);
-#endif
+mmu_dprintf("A20 update: a20=%d\n", a20_state);
+
 /* if the cpu is currently executing code, we must unlink it and
all the potentially executing TB */
 cpu_interrupt(env, CPU_INTERRUPT_EXITTB);
@@ -390,9 +405,8 @@ void cpu_x86_update_cr0(CPUX86State *env, uint32_t new_cr0)
 {
 int pe_state;
 
-#if defined(DEBUG_MMU)
-printf("CR0 update: CR0=0x%08x\n", new_cr0);
-#endif
+mmu_dprintf("CR0 update: CR0=0x%08x\n", new_cr0);
+
 if ((new_cr0 & (CR0_PG_MASK | CR0_WP_MASK | CR0_PE_MASK)) !=
 (env->cr[0] & (CR0_PG_MASK | CR0_WP_MASK | CR0_PE_MASK))) {
 tlb_flush(env, 1);
@@ -433,18 +447,15 @@ void cpu_x86_update_cr3(CPUX86State *env, target_ulong 
new_cr3)
 {
 env->cr[3] = new_cr3;
 if (env->cr[0] & CR0_PG_MASK) {
-#if defined(DEBUG_MMU)
-printf("CR3 update: CR3=" TARGET_FMT_lx "\n", new_cr3);
-#endif
+mmu_dprintf("CR3 update: CR3=" TARGET_FMT_lx "\n", new_cr3);
 tlb_flush(env, 0);
 }
 }
 
 void cpu_x86_update_cr4(CPUX86State *env, uint32_t new_cr4)
 {
-#if defined(DEBUG_MMU)
-printf("CR4 update: CR4=%08x\n", (uint32_t)env->cr[4]);
-#endif
+mmu_dprintf("CR4 update: CR4=%08x\n", (uint32_t)env->cr[4]);
+
 if ((new_cr4 ^ env->cr[4]) &
 (CR4_PGE_MASK | CR4_PAE_MASK | CR4_PSE_MASK |
  CR4_SMEP_MASK | CR4_SMAP_MASK)) {
@@ -510,10 +521,9 @@ int cpu_x86_handle_mmu_fault(CPUX86State *env, 
target_ulong addr,
 target_ulong vaddr, virt_addr;
 
 is_user = mmu_idx == MMU_USER_IDX;
-#if defined(DEBUG_MMU)
-printf("MMU fault: addr=" TARGET_FMT_lx " w=%d u=%d eip=" TARGET_FMT_lx 
"\n",
-   addr, is_write1, is_user, env->eip);
-#endif
+mmu_dprintf("MMU fault: addr=" TARGET_FMT_lx " w=%d u=%d"
+" eip=" TARGET_FMT_lx "\n",
+addr, is_write1, is_user, env->eip);
 is_write = is_write1 & 1;
 
 if (!(env->cr[0] & CR0_PG_MASK)) {
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 0cf413d..9cbef77 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -37,13 +37,21 @@
 //#define DEBUG_KVM
 
 #ifdef DEBUG_KVM
-#define DPRINTF(fmt, ...) \
-do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
+static const bool debug_kvm = true;
 #else
-#define DPRINTF(fmt, ...) \
-do { } while (0)
+static const bool debug_kvm;
 #endif
 
+static void GCC_FMT_ATTR(1, 2) DPRINTF(const char *fmt, ...)
+{
+if (debug_kvm) {
+va_list ap;
+va_start(ap, fmt);
+vfprintf(stderr, fmt, ap);
+va_end(ap);
+}
+}
+
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
diff --git a/target-i386/seg_helper.c b/target-i386/seg_helper.c
index 3247dee..2338d54 100644
--- a/target-i386/seg_helper.c
+++ b/target-i386/seg_helper.c
@@ -29,14 +29,28 @@
 #endif /* !defined(CONFIG_USER_ONLY) */
 
 #ifdef DEBUG_PCALL
-# define LOG_PCALL(...) qemu_log_mask(CPU_LOG_PCALL, ## __VA_ARGS__)
-# define LOG_PCALL_STATE(env)  \
-log_cpu_state_mask(CPU_LOG_PCALL, (env), CPU_DUMP_CCOP)
+static const bool debug_pcall = true;
 #else
-# define LOG_PCALL(...) do { } while (0)
-# define LOG_PCALL_STATE(env) do { } while (0)
+static const bool debug_pcall;
 #endif
 
+static void GCC_FMT_ATTR(1, 2) LOG_PCALL(const char *fmt, ...)
+{
+if (debug_pcall) {
+va_list ap;
+va_start(ap, fmt);
+qemu_log_mask(CPU_LOG_PCALL, fmt, ap);
+va_end(ap);
+}
+}
+
+static inline void LOG_PCALL_STATE(CPUX86State *env)
+{
+if (debug_pcall) {
+log_cpu_state_mask(CPU_LOG_PCALL, (env), CPU_DUMP_CCOP);
+}
+}
+
 /* return non zero if error */
 static inline int load_segment(CPUX86State *env, uint32_t *e1_ptr,
uint32_t *e2_ptr, int selector)
-- 
1.7.10.4


[PATCH v2 12/15] target-ppc: Refactor debug output macros

2013-02-20 Thread Andreas Färber
Make debug output compile-testable even if disabled.

Rename dprintf() in kvm.c to kvm_dprintf() to avoid conflict with glibc.

Inline DEBUG_OP check in excp_helper.c.
Inline LOG_MMU_STATE() in mmu_helper.c.
Inline PPC_{DEBUG_SPR,DUMP_SPR_ACCESSES} checks in translate_init.c.

Signed-off-by: Andreas Färber 
---
 target-ppc/excp_helper.c|   30 ++
 target-ppc/kvm.c|   28 +++--
 target-ppc/mem_helper.c |2 --
 target-ppc/mmu_helper.c |   71 ++-
 target-ppc/translate.c  |   15 +++--
 target-ppc/translate_init.c |   40 +++-
 6 files changed, 137 insertions(+), 49 deletions(-)

diff --git a/target-ppc/excp_helper.c b/target-ppc/excp_helper.c
index 0a1ac86..9a11e1f 100644
--- a/target-ppc/excp_helper.c
+++ b/target-ppc/excp_helper.c
@@ -24,10 +24,28 @@
 //#define DEBUG_OP
 //#define DEBUG_EXCEPTIONS
 
+#ifdef DEBUG_OP
+static const bool debug_op = true;
+#else
+static const bool debug_op;
+#endif
+
 #ifdef DEBUG_EXCEPTIONS
-#  define LOG_EXCP(...) qemu_log(__VA_ARGS__)
+static const bool debug_exceptions = true;
 #else
-#  define LOG_EXCP(...) do { } while (0)
+static const bool debug_exceptions;
+#endif
+
+#ifndef CONFIG_USER_ONLY
+static void GCC_FMT_ATTR(1, 2) LOG_EXCP(const char *fmt, ...)
+{
+if (debug_exceptions) {
+va_list ap;
+va_start(ap, fmt);
+qemu_log_vprintf(fmt, ap);
+va_end(ap);
+}
+}
 #endif
 
 /*/
@@ -777,7 +795,7 @@ void ppc_hw_interrupt(CPUPPCState *env)
 }
 #endif /* !CONFIG_USER_ONLY */
 
-#if defined(DEBUG_OP)
+#ifndef CONFIG_USER_ONLY
 static void cpu_dump_rfi(target_ulong RA, target_ulong msr)
 {
 qemu_log("Return from exception at " TARGET_FMT_lx " with flags "
@@ -835,9 +853,9 @@ static inline void do_rfi(CPUPPCState *env, target_ulong 
nip, target_ulong msr,
 /* XXX: beware: this is false if VLE is supported */
 env->nip = nip & ~((target_ulong)0x0003);
 hreg_store_msr(env, msr, 1);
-#if defined(DEBUG_OP)
-cpu_dump_rfi(env->nip, env->msr);
-#endif
+if (debug_op) {
+cpu_dump_rfi(env->nip, env->msr);
+}
 /* No need to raise an exception here,
  * as rfi is always the last insn of a TB
  */
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index 2c64c63..997d400 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -40,13 +40,21 @@
 //#define DEBUG_KVM
 
 #ifdef DEBUG_KVM
-#define dprintf(fmt, ...) \
-do { fprintf(stderr, fmt, ## __VA_ARGS__); } while (0)
+static const bool debug_kvm = true;
 #else
-#define dprintf(fmt, ...) \
-do { } while (0)
+static const bool debug_kvm;
 #endif
 
+static void GCC_FMT_ATTR(1, 2) kvm_dprintf(const char *fmt, ...)
+{
+if (debug_kvm) {
+va_list ap;
+va_start(ap, fmt);
+vfprintf(stderr, fmt, ap);
+va_end(ap);
+}
+}
+
 #define PROC_DEVTREE_CPU  "/proc/device-tree/cpus/"
 
 const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
@@ -769,7 +777,7 @@ void kvm_arch_pre_run(CPUState *cs, struct kvm_run *run)
  */
 irq = KVM_INTERRUPT_SET;
 
-dprintf("injected interrupt %d\n", irq);
+kvm_dprintf("injected interrupt %d\n", irq);
 r = kvm_vcpu_ioctl(cs, KVM_INTERRUPT, &irq);
 if (r < 0) {
 printf("cpu %d fail inject %x\n", cs->cpu_index, irq);
@@ -831,20 +839,20 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run 
*run)
 switch (run->exit_reason) {
 case KVM_EXIT_DCR:
 if (run->dcr.is_write) {
-dprintf("handle dcr write\n");
+kvm_dprintf("handle dcr write\n");
 ret = kvmppc_handle_dcr_write(env, run->dcr.dcrn, run->dcr.data);
 } else {
-dprintf("handle dcr read\n");
+kvm_dprintf("handle dcr read\n");
 ret = kvmppc_handle_dcr_read(env, run->dcr.dcrn, &run->dcr.data);
 }
 break;
 case KVM_EXIT_HLT:
-dprintf("handle halt\n");
+kvm_dprintf("handle halt\n");
 ret = kvmppc_handle_halt(env);
 break;
 #ifdef CONFIG_PSERIES
 case KVM_EXIT_PAPR_HCALL:
-dprintf("handle PAPR hypercall\n");
+kvm_dprintf("handle PAPR hypercall\n");
 run->papr_hcall.ret = spapr_hypercall(cpu,
   run->papr_hcall.nr,
   run->papr_hcall.args);
@@ -852,7 +860,7 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
 break;
 #endif
 case KVM_EXIT_EPR:
-dprintf("handle epr\n");
+kvm_dprintf("handle epr\n");
 run->epr.epr = ldl_phys(env->mpic_iack);
 ret = 0;
 break;
diff --git a/target-ppc/mem_helper.c b/target-ppc/mem_helper.c
index ba383c8..89c51d0 100644
--- a/target-ppc/mem_helper.c
+++ b/target-ppc/mem_helper.c
@@ -26,8 +26,6 @@
 #include "exec/softmmu_exec.

Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Scott Wood

 On 02/20/2013 07:09:49 AM, Gleb Natapov wrote:

On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> The ability to set/get attributes is needed.  Sorry, but "get or set
> >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> >> not good enough -- assuming you don't want us to start sticking
> >> pointers and commands in *that* data. :-)
> >>
> >Proposed interface sticks pointers into ioctl data, so why doing
> >the same
> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
>
> There's a difference between putting a pointer in an ioctl control
> structure that is specifically documented as being that way (as in
> ONE_REG), versus taking an ioctl that claims to be setting/getting a
> blob of state and embedding pointers in it.  It would be like
> sticking a pointer in the attribute payload of this API, which I
> think is something to be discouraged.
If documentation is what differentiates for you between silly and smart
then write documentation instead of new interfaces.


You mean like what we did with SREGS, that got deprecated and replaced  
with ONE_REG?


How is writing documentation not creating new interfaces, if the  
documentation is different from what the interface is currently  
understood to do?
Note that Marcelo seems to view KVM_SET_IRQCHIP as effectively being a  
device reset, which is rather different.


KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on a blob of data on
x86; nothing prevents you from adding MPIC specifics to the interface.
Add MPIC state into the kvm_irqchip structure, and if 512 bytes is not enough
for you to transfer the state, put pointers there and _document_ them.


So basically, you want me to keep this interface but share the ioctl  
number with an older interface? :-P



But with 512 bytes you can transfer the properties inline, so you probably
do not need a pointer there anyway. I see you have three properties, 2 of
them 32-bit and one 64-bit.


Three *groups* of properties.  One of the property groups is per  
source, and we can have hundreds of sources.  Another exposes the  
register space, which is 64 KiB (admittedly it's somewhat sparse, but  
there's more than 512 bytes of real data in there).  And we don't  
necessarily want to set *everything*.
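
A rough sketch of what per-register and per-source access could look like
from userspace under this kind of API.  The kvm_device_attr layout, field
names, ioctl name and MPIC group names below are assumed for illustration
only; they are not a settled ABI:

/* Illustrative only: assumes <sys/ioctl.h>, <linux/types.h> and a
 * KVM_GET_DEVICE_ATTR ioctl plus MPIC group IDs along the lines of the
 * proposal. */
struct kvm_device_attr {
    __u32 dev_id;               /* which in-kernel device */
    __u32 group;                /* e.g. KVM_DEV_MPIC_GRP_REGISTER */
    __u64 attr;                 /* register offset or IRQ number */
    __u64 addr;                 /* userspace pointer to the value */
};

/* Read one MPIC register, as if the guest had performed the MMIO read. */
static int mpic_get_reg(int vm_fd, __u32 dev_id, __u64 offset, __u32 *val)
{
    struct kvm_device_attr attr = {
        .dev_id = dev_id,
        .group  = KVM_DEV_MPIC_GRP_REGISTER,
        .attr   = offset,
        .addr   = (__u64)(unsigned long)val,
    };

    return ioctl(vm_fd, KVM_GET_DEVICE_ATTR, &attr);
}

/* Query the "active" state of a single interrupt source for debugging. */
static int mpic_irq_active(int vm_fd, __u32 dev_id, __u32 irq, __u32 *active)
{
    struct kvm_device_attr attr = {
        .dev_id = dev_id,
        .group  = KVM_DEV_MPIC_GRP_IRQ_ACTIVE,
        .attr   = irq,
        .addr   = (__u64)(unsigned long)active,
    };

    return ioctl(vm_fd, KVM_GET_DEVICE_ATTR, &attr);
}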



>It'd also be using
> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> to later on regarding KVM_IRQ_LINE_STATUS.
>
Do not see why.


It's either that, or have the data direction of the "chip" field in  
KVM_GET_IRQCHIP not be entirely in the "get" direction.



> Then there's the silliness of transporting 512 bytes just to read a
> descriptor for transporting something else.
>
Yes, agree. But is this enough of a reason to introduce an entirely new
interface? Is it on a performance critical path? Doubt it, unless you
abuse the interface to send interrupts, but then isn't it silly to
do copy_from_user() twice to inject an interrupt like the proposed
interface does?


It should probably be get_user() instead, which is pretty fast in the  
absence of a fault.



> >For signaling irqs (I think this is what you mean by "commands")
> >we have KVM_IRQ_LINE.
>
> It's one type of command.  Another is setting the address.  Another
> is writing to registers that have side effects (this is how MSI
> injection is done on MPIC, just as in real hardware).
>
Setting the address is setting an attribute. Sending an MSI is a command.
Things you set/get during init/migration are attributes. Things you do
to cause side-effects are commands.


What if I set the address at a time that isn't init/migration (the
hardware allows moving it, like a PCI BAR)?  Does it suddenly become not an
attribute due to how the caller uses it?



> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> (non-glue/wrapper) code can become common?
>
No new ioctl, with exactly the same result (well, actually even faster since
less copying is done).


Which ioctl would go away?


You need to show us the benefits of the new interface
vs existing one, not vice versa.


Well, as I said to Marcelo, the real reason why we probably need to
use the existing interface is irqfd.  That doesn't make the device
control stuff go away.



> And I really hope you don't want us to do MSIs the x86 way.
>
What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to glue it
to mpic.


We'll have to write extra code for it compared to the current way where  
it works with *zero* code beyond what is wanted for other purposes such  
as debug and (eventually) migration.  At least it's more direct than  
having to establish a GSI route...
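
For reference, KVM_SIGNAL_MSI takes roughly this structure (as found in
<linux/kvm.h> in this timeframe); the glue mentioned above would be
translating it into the corresponding MPIC register write:

struct kvm_msi {
    __u32 address_lo;           /* MSI address, low 32 bits */
    __u32 address_hi;           /* MSI address, high 32 bits */
    __u32 data;                 /* MSI data payload */
    __u32 flags;
    __u8  pad[16];
};

/* Usage: struct kvm_msi msi = { .address_lo = ..., .data = ... };
 *        ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);                       */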



> In the XICS thread, Paul brought up the possibility of cascaded
> MPICs.  It's not relevant to the systems we're trying to model, but
> if one did want to use the in-kernel irqchip interface for that, it
> would be really nice

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Scott Wood

On 02/20/2013 07:09:55 PM, Marcelo Tosatti wrote:

On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
> On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
> >This is probably a stupid question, but why the
> >KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
> >your purposes?
> >
> >x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
> >KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
>
> To start, the whole IRQ routing stuff is poorly documented.
>
> Am I supposed to make up GSI numbers and use the routing thing to
> associate them with real interrupts?

I have no idea. Is mapping from one integer linear space (GSIs)
to real interrupts suitable for you?


I can live with it.

> Where does the code live to manage this table, and how APICy is it (looks
> like the answer is "irq_comm.c, and very")?

Thinking faster than typing? Not sure what you mean.


Sorry...  The code to manage the table lives in virt/kvm/irq_comm.c.   
That code is very APIC-specific and not currently in a state that  
invites sharing.



> It looks like I'm going to have to do this anyway for irqfd, though
> that doesn't make the other uses of the device control api go away.
> Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading
> the status for debugging (reading device registers doesn't quite do
> it, since the "active" bit won't show up if the interrupt is
> masked).

> At that point, is it more offensive to make it read-only
> even though it would be trivial to make it read/write (which would
> allow users who don't need it to bypass the routing API), or to make
> it read/write and live with there being more than one way to do
> something?

Can't follow this sentence.


Suppose, for the sake of irqfd (and maillist tranquility) MPIC uses  
existing KVM_IRQ_LINE and such.  Will there be objection to being able  
to use KVM_GET_DEVICE_ATTR to *get* the irq line status for debugging  
purposes (maybe also useful for migration)?  If there's no objection to  
that, would there be any actual reason, beyond saving a few lines of  
glue code, to make it a read-only attribute?



> KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes
> of state, and because it doesn't allow debugging access to device
> registers (e.g. inspecting from the QEMU command line), and because
> it's hard to add new pieces of state if we realize we left something
> out.  It reminds me of GET/SET_SREGS.  With that, I did what you
> seem to want here, which is to adapt the existing interfaces, using
> feature flags to control optional state.  It ended up being a mess,
> and ONE_REG was introduced as a replacement.  The device control API
> is the equivalent of ONE_REG for things other than vcpus.
>
> -Scott

- ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?


Well, that's what KVM_SET_DEVICE_ATTR is.


- Agree on poor extensibility of interface. Adding a reserved amount
of space as padding and versioning such as has been done so far
is not acceptable?


That's exactly what we did with SREGS, and it got declared a mess and  
replaced with ONE_REG.  I'm trying to learn from my mistakes. :-)



- Debugging: why is reading the entire register state not acceptable? Yes,
  it's slow.


For one, it's more work.  The current way works by simulating a guest  
MMIO access.  No blob format to design, implement, and keep compatible.


-Scott


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Scott Wood

On 02/20/2013 06:14:37 PM, Marcelo Tosatti wrote:

On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >It is then not necessary to set device attributes on a live guest and
> >deal with the complications associated with that.
>
> Which complications?
>
> -Scott

Semantics of individual attribute writes, for one.


When the attribute is a device register, the hardware documentation  
takes care of that.  Otherwise, the semantics are documented in the  
device-specific documentation -- which can include restricting when the  
access is allowed.  Same as with any other interface documentation.


I suppose mpic.txt could use an additional statement that  
KVM_DEV_MPIC_GRP_REGISTER performs an access as if it were performed by  
the guest.



Locking versus currently executing VCPUs, for another (see how
KVM_SET_IRQ uses RCU, for instance; that is something that should be
shared).


If you mean kvm_set_irq() in irq_comm.c, that's only relevant when you  
have a GSI routing table, which this patchset doesn't.


Assuming we end up having a routing table to support irqfd, we still  
can't share the code as is, since it's APIC-specific.


-Scott


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 05:53:20PM -0600, Scott Wood wrote:
> >Why exactly do you need to set attributes, Scott?
> 
> That's best answered by looking at patch 6/6 and discussing the
> actual attributes that are defined so far.
> 
> The need to set the base address of the registers is
> straightforward.  When ARM added their special ioctl for this, we
> discussed it being eventually replaced with a more general API (in
> fact, that was the reason they put "ARM" in the name).
> 
> Access to device registers was originally intended for debugging,
> and eventually migration.  It turned out to also be very useful for
> injecting MSIs -- nothing special needed to be done.  It Just
> Works(tm) the same way it does in hardware, with an MMIO write from
> a PCI device to a specific MPIC register.  Again, irqfd may
> complicate this, but there's no need for QEMU-generated MSIs to have
> to take a circuitous path.
> 
> Finally, there's the interrupt source attribute.  Even if we get rid
> of that, we'll need MPIC-specific documentation on how to flatten
> the IRQ numberspace, and if we ever have a cascaded PIC situation
> things could get ugly since there's no way to identify a specific
> IRQ controller in KVM_IRQ_LINE.
> 
> >Also related to this question is the point of avoiding the
> >implementation of devices to be spread across userspace and the kernel
> >(this is one point Avi brought up often). If the device emulation
> >is implemented entirely in the kernel, all is necessary are initial
> >values of device registers (and retrieval of those values later for
> >savevm/migration).
> 
> MPIC emulation is entirely in the kernel with this patchset --
> though the register that lets you reset cores will likely need to be
> kicked back to QEMU.
> 
> >It is then not necessary to set device attributes on a live guest and
> >deal with the complications associated with that.
> 
> Which complications?
> 
> -Scott

Semantics of individual attribute writes, for one.

Locking versus currently executing VCPUs, for another (see how
KVM_SET_IRQ uses RCU, for instance; that is something that should be
shared).



Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 06:20:51PM -0600, Scott Wood wrote:
> On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:
> >This is probably a stupid question, but why the
> >KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
> >your purposes?
> >
> >x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
> >KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
> 
> To start, the whole IRQ routing stuff is poorly documented.
> 
> Am I supposed to make up GSI numbers and use the routing thing to
> associate them with real interrupts? 

I have no idea. Is mapping from one integer linear space (GSIs) 
to real interrupts suitable for you? 

> Are there constraints on what
> sort of GSI numbers I can choose (I now see in the code that
> KVM_MAX_IRQ_ROUTES is returned from the capability check, but where
> is that documented?  

Don't think it is.

> It looks like the APIC implementation has
> default routes, where is that documented?)? 

In the code.

> Where does the code live to manage this table, and how APICy is it (looks 
> like the
> answer is "irq_comm.c, and very")? 

Thinking faster than typing? Not sure what you mean.

> I suppose I could write another
> implementation of the table management code for MPIC, though the
> placement of "irqchip" inside the route entry, rather than as an
> argument to KVM_IRQ_LINE, suggests the table is supposed to be
> global, not in the individual interrupt controller.

Yes, the table is global. It maps a GSI ("Global System Interrupt", IIRC)
integer to an (irqchip, pin) pair.
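
The entries of that table look roughly like this in <linux/kvm.h> (trimmed
to the two arms relevant here); KVM_SET_GSI_ROUTING replaces the whole table
in one call:

struct kvm_irq_routing_irqchip {
    __u32 irqchip;
    __u32 pin;
};

struct kvm_irq_routing_msi {
    __u32 address_lo;
    __u32 address_hi;
    __u32 data;
    __u32 pad;
};

struct kvm_irq_routing_entry {
    __u32 gsi;                  /* the "Global System Interrupt" number */
    __u32 type;                 /* KVM_IRQ_ROUTING_IRQCHIP or _MSI */
    __u32 flags;
    __u32 pad;
    union {
        struct kvm_irq_routing_irqchip irqchip;  /* the (irqchip, pin) pair */
        struct kvm_irq_routing_msi msi;
        __u32 pad[8];
    } u;
};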

> It looks like I'm going to have to do this anyway for irqfd, though
> that doesn't make the other uses of the device control api go away.
> Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading
> the status for debugging (reading device registers doesn't quite do
> it, since the "active" bit won't show up if the interrupt is
> masked).  

> At that point, is it more offensive to make it read-only
> even though it would be trivial to make it read/write (which would
> allow users who don't need it to bypass the routing API), or to make
> it read/write and live with there being more than one way to do
> something?

Can't follow this sentence.

> KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes
> of state, and because it doesn't allow debugging access to device
> registers (e.g. inspecting from the QEMU command line), and because
> it's hard to add new pieces of state if we realize we left something
> out.  It reminds me of GET/SET_SREGS.  With that, I did what you
> seem to want here, which is to adapt the existing interfaces, using
> feature flags to control optional state.  It ended up being a mess,
> and ONE_REG was introduced as a replacement.  The device control API
> is the equivalent of ONE_REG for things other than vcpus.
> 
> -Scott

- ACK on 512 bytes not sufficient. Add another ioctl, SET_IRQCHIP2?
- Agree on poor extensibility of interface. Adding a reserved amount
of space as padding and versioning such as has been done so far 
is not acceptable? 
- Debugging: why is reading the entire register state not acceptable? Yes,
  it's slow.



[GIT PULL] KVM updates for the 3.9 merge window

2013-02-20 Thread Marcelo Tosatti


Linus,

Please pull from

git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.9-1

to receive the KVM updates for the 3.9 merge window, including x86 real
mode emulation fixes, stronger memory slot interface restrictions, 
mmu_lock spinlock hold time reduction, improved handling of large 
page faults on shadow, initial APICv HW acceleration support, 
s390 channel IO based virtio, amongst others.

--

Alex Williamson (13):
  KVM: Restrict non-existing slot state transitions
  KVM: Check userspace_addr when modifying a memory slot
  KVM: Fix iommu map/unmap to handle memory slot moves
  KVM: Minor memory slot optimization
  KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS
  KVM: Make KVM_PRIVATE_MEM_SLOTS optional
  KVM: struct kvm_memory_slot.user_alloc -> bool
  KVM: struct kvm_memory_slot.flags -> u32
  KVM: struct kvm_memory_slot.id -> short
  KVM: Increase user memory slots on x86 to 125
  kvm: Fix memory slot generation updates
  kvm: Force IOMMU remapping on memory slot read-only flag changes
  kvm: Obey read-only mappings in iommu

Alexander Graf (17):
  KVM: PPC: Only WARN on invalid emulation
  KVM: PPC: Book3S: PR: Enable alternative instruction for SC 1
  KVM: PPC: BookE: Allow irq deliveries to inject requests
  KVM: PPC: BookE: Emulate mfspr on EPR
  KVM: PPC: BookE: Implement EPR exit
  KVM: PPC: BookE: Add EPR ONE_REG sync
  KVM: PPC: E500: Move write_stlbe higher
  KVM: PPC: E500: Explicitly mark shadow maps invalid
  KVM: PPC: E500: Propagate errors when shadow mapping
  KVM: PPC: e500: Call kvmppc_mmu_map for initial mapping
  KVM: PPC: E500: Split host and guest MMU parts
  KVM: PPC: e500: Implement TLB1-in-TLB0 mapping
  KVM: PPC: E500: Make clear_tlb_refs and clear_tlb1_bitmap static
  KVM: PPC: E500: Remove kvmppc_e500_tlbil_all usage from guest TLB code
  Merge commit 'origin/next' into kvm-ppc-next
  KVM: PPC: BookE: Handle alignment interrupts
  Merge commit 'origin/next' into kvm-ppc-next

Avi Kivity (16):
  KVM: x86 emulator: framework for streamlining arithmetic opcodes
  KVM: x86 emulator: Support for declaring single operand fastops
  KVM: x86 emulator: introduce NoWrite flag
  KVM: x86 emulator: mark CMP, CMPS, SCAS, TEST as NoWrite
  KVM: x86 emulator: convert NOT, NEG to fastop
  KVM: x86 emulator: add macros for defining 2-operand fastop emulation
  KVM: x86 emulator: convert basic ALU ops to fastop
  KVM: x86 emulator: Convert SHLD, SHRD to fastop
  KVM: x86 emulator: convert shift/rotate instructions to fastop
  KVM: x86 emulator: covert SETCC to fastop
  KVM: x86 emulator: convert INC/DEC to fastop
  KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop
  KVM: x86 emulator: convert 2-operand IMUL to fastop
  KVM: x86 emulator: rearrange fastop definitions
  KVM: x86 emulator: convert a few freestanding emulations to fastop
  KVM: x86 emulator: fix test_cc() build failure on i386

Bharat Bhushan (3):
  KVM: PPC: booke: use vcpu reference from thread_struct
  KVM: PPC: booke: Allow multiple exception types
  booke: Added DBCR4 SPR number

Christian Borntraeger (3):
  KVM: s390: Gracefully handle busy conditions on ccw_device_start
  s390/kvm: Fix store status for ACRS/FPRS
  s390/kvm: Fix instruction decoding

Cong Ding (1):
  KVM: s390: kvm/sigp.c: fix memory leakage

Cornelia Huck (14):
  KVM: s390: Handle hosts not supporting s390-virtio.
  s390/ccwdev: Include asm/schid.h.
  KVM: s390: Add a channel I/O based virtio transport driver.
  KVM: s390: Constify intercept handler tables.
  KVM: s390: Decoding helper functions.
  KVM: s390: Support for I/O interrupts.
  KVM: s390: Add support for machine checks.
  KVM: s390: In-kernel handling of I/O instructions.
  KVM: s390: Base infrastructure for enabling capabilities.
  KVM: s390: Add support for channel I/O instructions.
  KVM: s390: Dynamic allocation of virtio-ccw I/O data.
  KVM: trace: Fix exit decoding.
  s390/virtio-ccw: Fix setup_vq error handling.
  KVM: s390: Fix handling of iscs.

Dongxiao Xu (1):
  KVM: VMX: disable SMEP feature when guest is in non-paging mode

Geoff Levand (1):
  KVM: Remove duplicate text in api.txt

Gleb Natapov (39):
  KVM: emulator: implement AAD instruction
  KVM: inject ExtINT interrupt before APIC interrupts
  KVM: remove unused variable.
  KVM: VMX: cleanup rmode_segment_valid()
  KVM: VMX: relax check for CS register in rmode_segment_valid()
  KVM: VMX: return correct segment limit and flags for CS/SS registers in 
real mode
  KVM: VMX: use fix_rmode_seg() to fix all code/data segments
  KVM: VMX: remove redundant code from vmx_set_segment()
  KVM: VMX: clean-up vmx_set_segment()
  KVM: VMX: remove unneeded temporary variable from vmx_set_segment()
  KVM

Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Scott Wood

On 02/20/2013 06:01:00 PM, Marcelo Tosatti wrote:

The main problem, to me, is that the interface allows flexibility but
the details of using such flexibility are not worked out.

That is, lack of definitions such as when setting attributes is allowed.


That is defined by the individual attributes.

There is one constraint that I did forget to indicate in mpic.txt --  
KVM_DEV_MPIC_GRP_REGISTER can only be used when the registers are  
mapped.


With a big blob, that information is implicit: a SET operation is a full
device reset.


So, for example, how would I handle the guest changing the location of  
the MPIC registers dynamically (similar to changing a PCI BAR)?  QEMU  
doesn't implement this currently, but the hardware does, so the kernel  
interface shouldn't preclude it.


What if we later discover that we need some extra bit of state that  
wasn't included in the initial definition?  Especially since we don't  
do migration at all yet on ppc, so we can't even use what QEMU  
currently saves as a reference.


What if a particular array within the state blob needs to be scaled up  
because we now have more sources?


Will we need to version the blob?  It just gets to be a mess, like  
SREGS.


And then we need special code for packing/unpacking the blob, whereas  
with this patchset we reuse the same emulation code guest MMIO uses.



With individual attributes:
- Which attributes can be set individually?
- Is there an order which must be respected?


If there are constraints, the device documentation should specify them.


- Locking.


Same locking I need for MMIO and interrupt injection via other paths.


- End result: more logic/code necessary.


That's a tradeoff for individual devices to make, in cases where it's  
actually true.



> It's not about how often we do it, but how much state we have,
> especially if we ever want to implement migration.

Migration reads/writes the device state once, which is supposedly much
smaller than VM's RAM, so can't see the logic here.


struct kvm_irqchip can only hold 512 bytes.  We have more state than  
that.
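
For reference, the 512-byte limit comes from the union in the existing
definition (x86, <linux/kvm.h>, lightly annotated):

struct kvm_irqchip {
    __u32 chip_id;              /* KVM_IRQCHIP_PIC_MASTER/_SLAVE/_IOAPIC */
    __u32 pad;
    union {
        char dummy[512];        /* caps the state blob at 512 bytes */
        struct kvm_pic_state pic;
        struct kvm_ioapic_state ioapic;
    } chip;
};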


-Scott


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Scott Wood

On 02/20/2013 01:58:54 PM, Marcelo Tosatti wrote:

This is probably a stupid question, but why the
KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface is not appropriate for
your purposes?

x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.


To start, the whole IRQ routing stuff is poorly documented.

Am I supposed to make up GSI numbers and use the routing thing to  
associate them with real interrupts?  Are there constraints on what  
sort of GSI numbers I can choose (I now see in the code that  
KVM_MAX_IRQ_ROUTES is returned from the capability check, but where is  
that documented?  It looks like the APIC implementation has default  
routes, where is that documented?)?  Where does the code live to manage  
this table, and how APICy is it (looks like the answer is "irq_comm.c,  
and very")?  I suppose I could write another implementation of the  
table management code for MPIC, though the placement of "irqchip"  
inside the route entry, rather than as an argument to KVM_IRQ_LINE,  
suggests the table is supposed to be global, not in the individual  
interrupt controller.


It looks like I'm going to have to do this anyway for irqfd, though  
that doesn't make the other uses of the device control api go away.   
Even KVM_DEV_MPIC_GRP_IRQ_ACTIVE would still be useful for reading the  
status for debugging (reading device registers doesn't quite do it,  
since the "active" bit won't show up if the interrupt is masked).  At  
that point, is it more offensive to make it read-only even though it  
would be trivial to make it read/write (which would allow users who  
don't need it to bypass the routing API), or to make it read/write and  
live with there being more than one way to do something?


KVM_SET_IRQCHIP is not suitable because we have more than 512 bytes of  
state, and because it doesn't allow debugging access to device  
registers (e.g. inspecting from the QEMU command line), and because  
it's hard to add new pieces of state if we realize we left something  
out.  It reminds me of GET/SET_SREGS.  With that, I did what you seem
to want here, which is to adapt the existing interfaces, using feature  
flags to control optional state.  It ended up being a mess, and ONE_REG  
was introduced as a replacement.  The device control API is the  
equivalent of ONE_REG for things other than vcpus.


-Scott


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 05:20:50PM -0600, Scott Wood wrote:
>  On 02/20/2013 03:17:27 PM, Marcelo Tosatti wrote:
> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> >> >Copying Christoffer since ARM has in kernel irq chip too.
> >> >
> >> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> >> >> Currently, devices that are emulated inside KVM are configured in a
> >> >> hardcoded manner based on an assumption that any given architecture
> >> >> only has one way to do it.  If there's any need to access device state,
> >> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> >> >> cumbersome and depletes a limited numberspace.
> >
> >Creation of ioctl has advantages. It makes explicit what the
> >data contains and how it should be interpreted.
> >This means that for example, policy control can be performed at ioctl
> >level (can group, by ioctl number, which ioctl calls are allowed after
> >VM creation, for example).
> >
> >It also makes it explicit that its a userspace interface which
> >applications depend.
> >
> >With a single 'set device attribute' ioctl it's more difficult
> >to do that.
> 
> I don't see why creating ioctls rather than device attributes (or
> whatever you want to call them) differs this way.  Device attributes
> are inherently userspace interfaces, just as ioctls are.  What the
> data contains and how it is interpreted are documented, albeit in a
> more lightweight format than KVM usually uses for ioctls.
> 
> An ioctl is no guarantee of good documentation.  KVM is far better
> than most of the kernel in that regard, but even there some ioctls
> are missing (e.g. KVM_IRQ_LINE_STATUS as mentioned elsewhere in this
> thread), and some others are inadequately explained (e.g. IRQ
> routing).
> 
> By "policy control", do you just mean that some ioctls act on a
> /dev/kvm fd, others on a vm fd, and others on a vcpu fd?  This is
> pretty similar to having a device fd, except for the mechanism used.

No, I mean policy control acting on ioctl numbers. It's essentially the
same issue as with 'strace' (in that it's necessary to parse the
ioctl data further).

> The main things that I dislike about adding new things at the ioctl
> level are:
> - limited numberspace, and more prone to merge conflicts than a
> device-specific numberspace

Can reserve per-architecture ranges to deal with that.

> - having to add a new pathway to get into the driver for each ioctl,
> rather than having all operations on a particular device go to the
>   right place, and the device implementation interprets further
> (this assumes that we're talking about vm ioctls, and not returning
> a
>   new fd for the device)

Agree with that point.

> - awkward way of negotiating capabilities with userspace (have to
> declare the capability id separately, and add code somewhere outside
> the
>   device implementation to respond to capability requests)

Agree with that point.

> - api.txt formatting that imposes yet another document-internal
> numberspace to conflict on :-)

The main problem, to me, is that the interface allows flexibility but
the details of using such flexibility are not worked out.

That is, lack of definitions such as when setting attributes is allowed.
With a big blob, that information is implicit: a SET operation is a full
device reset.

With individual attributes:
- Which attributes can be set individually?
- Is there an order which must be respected?
- Locking.
- End result: more logic/code necessary.

> >Abstracting the details also makes it cumbersome to read
> >strace output :-)
> 
> You'd have to update strace one way or another if you want
> pretty-printed output.  Having a more restricted API than arbitrary
> ioctls could actually improve the situation -- this could be a good
> reason for including the attribute length as an explicit parameter
> rather than being implicit in the attribute id.  Then you'd just
> teach strace about the device control API once, and not for each new
> attribute or device that gets added.
> 
> >> >> This API provides a mechanism to instantiate a device of a certain
> >> >> type, returning an ID that can be used to set/get attributes of the
> >> >> device.  Attributes may include configuration parameters (e.g.
> >> >> register base address), device state, operational commands, etc.  It
> >> >> is similar to the ONE_REG API, except that it acts on devices rather
> >> >> than vcpus.
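
A sketch of the flow described in the quoted text: create a device of a
given type, then get/set its attributes by the returned ID (using the same
assumed kvm_device_attr shape sketched earlier in this thread).  The struct
layout and constant names here are illustrative, not quoted from the patches:

struct kvm_create_device {
    __u32 type;                 /* e.g. an MPIC device type constant */
    __u32 id;                   /* out: device ID used in later calls */
    __u32 flags;
};

/* Usage sketch:
 *
 *   struct kvm_create_device cd = { .type = <mpic device type> };
 *   ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);
 *
 *   __u64 base = 0xf7fc0000;                     // example value only
 *   struct kvm_device_attr da = {
 *       .dev_id = cd.id,
 *       .group  = <base-address group>,
 *       .attr   = <base-address attribute>,
 *       .addr   = (__u64)(unsigned long)&base,
 *   };
 *   ioctl(vm_fd, KVM_SET_DEVICE_ATTR, &da);
 */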
> >> >You not only provide a different way to create an in-kernel irq chip,
> >> >you also use an alternate way to trigger interrupt lines. Before going
> >> >into interface specifics let's think about whether it is really worth it?
> >>
> >> Which "it" do you mean here?
> >>
> >> The ability to set/get attributes is needed.  Sorry, but "get or set
> >> one blob of data, up to 512 bytes, for the entire 

Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Scott Wood

On 02/20/2013 03:28:24 PM, Marcelo Tosatti wrote:

On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
> On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > >> The ability to set/get attributes is needed.  Sorry, but "get or set
> > >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> > >> not good enough -- assuming you don't want us to start sticking
> > >> pointers and commands in *that* data. :-)
> > >>
> > >Proposed interface sticks pointers into ioctl data, so why doing
> > >the same
> > >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> >
> > There's a difference between putting a pointer in an ioctl control
> > structure that is specifically documented as being that way (as in
> > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > blob of state and embedding pointers in it.  It would be like
> > sticking a pointer in the attribute payload of this API, which I
> > think is something to be discouraged.
> If documentation is what differentiates for you between silly and smart
> then write documentation instead of new interfaces.
>
> KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on a blob of data on
> x86; nothing prevents you from adding MPIC specifics to the interface.
> Add MPIC state into the kvm_irqchip structure, and if 512 bytes is not enough
> for you to transfer the state, put pointers there and _document_ them.
> But with 512 bytes you can transfer the properties inline, so you probably
> do not need a pointer there anyway. I see you have three properties, 2 of
> them 32-bit and one 64-bit.
>
> >It'd also be using
> > KVM_SET_IRQCHIP to read data, which is the sort of thing you  
object

> > to later on regarding KVM_IRQ_LINE_STATUS.
> >
> Do not see why.
>
> > Then there's the silliness of transporting 512 bytes just to read  
a

> > descriptor for transporting something else.
> >
> Yes, agree. But is this enough of a reason to introduce an entirely new
> interface? Is it on a performance-critical path? Doubt it, unless you
> abuse the interface to send interrupts, but then isn't it silly to
> do copy_from_user() twice to inject an interrupt like the proposed
> interface does?
>
> > >For signaling irqs (I think this is what you mean by "commands")
> > >we have KVM_IRQ_LINE.
> >
> > It's one type of command.  Another is setting the address.   
Another

> > is writing to registers that have side effects (this is how MSI
> > injection is done on MPIC, just as in real hardware).
> >
> Setting the address is setting an attribute. Sending MSI is a  
command.
> Things you set/get during init/migration are attributes. Things you  
do

> to cause side-effects are commands.

Yes, it would be good to restrict what can be done via the interface
(to avoid abstracting away problems). At first glance, I would say
it should allow for "initial configuration of device state", such as
registers etc.

Why exactly you need to set attributes Scott?


That's best answered by looking at patch 6/6 and discussing the actual  
attributes that are defined so far.


The need to set the base address of the registers is straightforward.   
When ARM added their special ioctl for this, we discussed it being  
eventually replaced with a more general API (in fact, that was the  
reason they put "ARM" in the name).


Access to device registers was originally intended for debugging, and  
eventually migration.  It turned out to also be very useful for  
injecting MSIs -- nothing special needed to be done.  It Just Works(tm)  
the same way it does in hardware, with an MMIO write from a PCI device  
to a specific MPIC register.  Again, irqfd may complicate this, but  
there's no need for QEMU-generated MSIs to have to take a circuitous  
path.
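
A sketch of that path from QEMU's side, assuming the attribute descriptor and
ioctl names sketched earlier in this digest end up in <linux/kvm.h>; the group
name comes from the RFC's mpic.txt, while the MSIIR offset, the .dev field and
the helper itself are placeholders:

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Inject an MSI by performing the same register write a PCI device would. */
static int mpic_inject_msi(int vmfd, __u32 mpic_id, __u64 msiir_offset,
			   __u32 msi_data)
{
	struct kvm_device_attr attr = {
		.dev   = mpic_id,
		.group = KVM_DEV_MPIC_GRP_REGISTER,  /* access "as the guest" */
		.attr  = msiir_offset,               /* offset of the MSIIR register */
		.addr  = (__u64)(unsigned long)&msi_data,
	};

	return ioctl(vmfd, KVM_SET_DEVICE_ATTR, &attr);
}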


Finally, there's the interrupt source attribute.  Even if we get rid of  
that, we'll need MPIC-specific documentation on how to flatten the IRQ  
numberspace, and if we ever have a cascaded PIC situation things could  
get ugly since there's no way to identify a specific IRQ controller in  
KVM_IRQ_LINE.
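
For reference, the KVM_IRQ_LINE argument (struct kvm_irq_level from
<linux/kvm.h>) carries only a flat irq number and a level; there is no field
that could name a second, cascaded controller:

struct kvm_irq_level {
	union {
		__u32 irq;	/* in: flat GSI number, no irqchip identifier */
		__s32 status;	/* out: used by KVM_IRQ_LINE_STATUS only */
	};
	__u32 level;		/* in: 0 or 1 */
};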



Also related to this question is the point of avoiding having the
implementation of devices spread across userspace and the kernel
(this is one point Avi brought up often). If the device emulation
is implemented entirely in the kernel, all that is necessary is the
initial values of device registers (and retrieval of those values
later for savevm/migration).


MPIC emulation is entirely in the kernel with this patchset -- though  
the register that lets you reset cores will likely need to be kicked  
back to QEMU.



It is then not necessary to set device attributes on a live guest and
deal with the complications associated with that.


Which complications?

-Scott

Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Scott Wood

 On 02/20/2013 03:17:27 PM, Marcelo Tosatti wrote:

On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> >Copying Christoffer since ARM has in kernel irq chip too.
> >
> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> >> Currently, devices that are emulated inside KVM are configured  
in a
> >> hardcoded manner based on an assumption that any given  
architecture

> >> only has one way to do it.  If there's any need to access device
> >state,
> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing  
is

> >> cumbersome and depletes a limited numberspace.

Creation of an ioctl has advantages. It makes explicit what the
data contains and how it should be interpreted.
This means that, for example, policy control can be performed at the ioctl
level (one can group, by ioctl number, which ioctl calls are allowed after
VM creation, for example).

It also makes it explicit that it's a userspace interface on which
applications depend.

With a single 'set device attribute' ioctl it's more difficult
to do that.


I don't see why creating ioctls rather than device attributes (or  
whatever you want to call them) differs this way.  Device attributes  
are inherently userspace interfaces, just as ioctls are.  What the data  
contains and how it is interpreted are documented, albeit in a more  
lightweight format than KVM usually uses for ioctls.


An ioctl is no guarantee of good documentation.  KVM is far better than  
most of the kernel in that regard, but even there some ioctls are  
missing (e.g. KVM_IRQ_LINE_STATUS as mentioned elsewhere in this  
thread), and some others are inadequately explained (e.g. IRQ routing).


By "policy control", do you just mean that some ioctls act on a  
/dev/kvm fd, others on a vm fd, and others on a vcpu fd?  This is  
pretty similar to having a device fd, except for the mechanism used.


The main things that I dislike about adding new things at the ioctl  
level are:
- limited numberspace, and more prone to merge conflicts than a  
device-specific numberspace
- having to add a new pathway to get into the driver for each ioctl,  
rather than having all operations on a particular device go to the
  right place, and the device implementation interprets further (this  
assumes that we're talking about vm ioctls, and not returning a

  new fd for the device)
- awkward way of negotiating capabilities with userspace (have to  
declare the capability id separately, and add code somewhere outside the

  device implementation to respond to capability requests)
- api.txt formatting that imposes yet another document-internal  
numberspace to conflict on :-)



Abstracting the details also makes it cumbersome to read
strace output :-)


You'd have to update strace one way or another if you want  
pretty-printed output.  Having a more restricted API than arbitrary  
ioctls could actually improve the situation -- this could be a good  
reason for including the attribute length as an explicit parameter  
rather than being implicit in the attribute id.  Then you'd just teach  
strace about the device control API once, and not for each new  
attribute or device that gets added.
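
An illustrative variant of the attribute descriptor with such an explicit
length (not what the RFC posted, just a sketch of the idea):

struct kvm_device_attr {
	__u32 dev;	/* device handle */
	__u32 group;	/* device-defined attribute group */
	__u64 attr;	/* group-defined attribute id */
	__u32 len;	/* explicit payload length in bytes */
	__u32 pad;
	__u64 addr;	/* userspace pointer to 'len' bytes of payload */
};

A tracer could then dump 'len' bytes at 'addr' for any device and attribute,
without needing a per-attribute table.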


> >> This API provides a mechanism to instantiate a device of a  
certain
> >> type, returning an ID that can be used to set/get attributes of  
the

> >> device.  Attributes may include configuration parameters (e.g.
> >> register base address), device state, operational commands,  
etc.  It
> >> is similar to the ONE_REG API, except that it acts on devices  
rather

> >> than vcpus.
> >You not only provide a different way to create an in-kernel irq chip,
> >you also use an alternate way to trigger interrupt lines. Before going
> >into interface specifics, let's think about whether it is really worth it?
>
> Which "it" do you mean here?
>
> The ability to set/get attributes is needed.  Sorry, but "get or set
> one blob of data, up to 512 bytes, for the entire irqchip" is just
> not good enough -- assuming you don't want us to start sticking
> pointers and commands in *that* data. :-)

Why not? Is it necessary to constantly read/write attributes?


It's not about how often we do it, but how much state we have,  
especially if we ever want to implement migration.



> If you mean the way to inject interrupts, it's simpler this way.

Are you injecting interrupts via this new SET_DEVICE_ATTRIBUTE ioctl?


Yes, but even if that gets shot down (the best objection IMHO is the  
one nobody is raising -- how to hook into irqfd), we still need the  
rest of it.  I think we'd even want to keep attributes for IRQ line  
status so that we have a way to read it, even if just for debugging.


-Scott


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 06:28:24PM -0300, Marcelo Tosatti wrote:
> On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
> > On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > > >> The ability to set/get attributes is needed.  Sorry, but "get or set
> > > >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> > > >> not good enough -- assuming you don't want us to start sticking
> > > >> pointers and commands in *that* data. :-)
> > > >>
> > > >Proposed interface sticks pointers into ioctl data, so why doing
> > > >the same
> > > >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> > > 
> > > There's a difference between putting a pointer in an ioctl control
> > > structure that is specifically documented as being that way (as in
> > > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > > blob of state and embedding pointers in it.  It would be like
> > > sticking a pointer in the attribute payload of this API, which I
> > > think is something to be discouraged.
> > If documentation is what differentiates for you between silly and smart
> > then write documentation instead of new interfaces.
> > 
> > KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on a blob of data on
> > x86; nothing prevents you from adding MPIC specifics to the interface.
> > Add the mpic state to the kvm_irqchip structure, and if 512 bytes is not enough
> > for you to transfer the state, put pointers there and _document_ them.
> > But with 512 bytes you can transfer the properties inline, so you probably
> > do not need a pointer there anyway. I see you have three properties:
> > two of them 32-bit and one 64-bit.
> > 
> > >It'd also be using
> > > KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > > to later on regarding KVM_IRQ_LINE_STATUS.
> > > 
> > Do not see why.
> > 
> > > Then there's the silliness of transporting 512 bytes just to read a
> > > descriptor for transporting something else.
> > > 
> > Yes, agree. But is this enough of a reason to introduce an entirely new
> > interface? Is it on a performance-critical path? Doubt it, unless you
> > abuse the interface to send interrupts, but then isn't it silly to
> > do copy_from_user() twice to inject an interrupt like the proposed interface
> > does?
> > 
> > > >For signaling irqs (I think this is what you mean by "commands")
> > > >we have KVM_IRQ_LINE.
> > > 
> > > It's one type of command.  Another is setting the address.  Another
> > > is writing to registers that have side effects (this is how MSI
> > > injection is done on MPIC, just as in real hardware).
> > > 
> > Setting the address is setting an attribute. Sending MSI is a command.
> > Things you set/get during init/migration are attributes. Things you do
> > to cause side-effects are commands.
> 
> Yes, it would be good to restrict what can be done via the interface
> (to avoid abstracting away problems). At first glance, I would say
> it should allow for "initial configuration of device state", such as
> registers etc.
> 
> Why exactly you need to set attributes Scott?
> 
> > > What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> > > (non-glue/wrapper) code can become common?
> > > 
> > No new ioctl with exactly same result (well actually even faster since
> > less copying is done). You need to show us the benefits of the new interface
> > vs existing one, not vice versa.
> 
> Also related to this question is the point of avoiding having the
> implementation of devices spread across userspace and the kernel
> (this is one point Avi brought up often). If the device emulation
> is implemented entirely in the kernel, all that is necessary is the
> initial values of device registers (and retrieval of those values
> later for savevm/migration).
> 
> It is then not necessary to set device attributes on a live guest and
> deal with the complications associated with that.
> 
> 

I suppose the atomic-test-and-set type operations introduced to ONE_REG ioctls
are one example of such complications.

Avi, can you remind us what are the drawbacks of having separate
userspace/kernel device emulation?



Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Wed, Feb 20, 2013 at 03:09:49PM +0200, Gleb Natapov wrote:
> On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> > On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> > >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> > >> The ability to set/get attributes is needed.  Sorry, but "get or set
> > >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> > >> not good enough -- assuming you don't want us to start sticking
> > >> pointers and commands in *that* data. :-)
> > >>
> > >Proposed interface sticks pointers into ioctl data, so why doing
> > >the same
> > >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> > 
> > There's a difference between putting a pointer in an ioctl control
> > structure that is specifically documented as being that way (as in
> > ONE_REG), versus taking an ioctl that claims to be setting/getting a
> > blob of state and embedding pointers in it.  It would be like
> > sticking a pointer in the attribute payload of this API, which I
> > think is something to be discouraged.
> If documentation is what differentiates for you between silly and smart
> then write documentation instead of new interfaces.
> 
> KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on a blob of data on
> x86; nothing prevents you from adding MPIC specifics to the interface.
> Add the mpic state to the kvm_irqchip structure, and if 512 bytes is not enough
> for you to transfer the state, put pointers there and _document_ them.
> But with 512 bytes you can transfer the properties inline, so you probably
> do not need a pointer there anyway. I see you have three properties:
> two of them 32-bit and one 64-bit.
> 
> >It'd also be using
> > KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> > to later on regarding KVM_IRQ_LINE_STATUS.
> > 
> Do not see why.
> 
> > Then there's the silliness of transporting 512 bytes just to read a
> > descriptor for transporting something else.
> > 
> Yes, agree. But is this enough of a reason to introduce an entirely new
> interface? Is it on a performance-critical path? Doubt it, unless you
> abuse the interface to send interrupts, but then isn't it silly to
> do copy_from_user() twice to inject an interrupt like the proposed interface
> does?
> 
> > >For signaling irqs (I think this is what you mean by "commands")
> > >we have KVM_IRQ_LINE.
> > 
> > It's one type of command.  Another is setting the address.  Another
> > is writing to registers that have side effects (this is how MSI
> > injection is done on MPIC, just as in real hardware).
> > 
> Setting the address is setting an attribute. Sending MSI is a command.
> Things you set/get during init/migration are attributes. Things you do
> to cause side-effects are commands.

Yes, it would be good to restrict what can be done via the interface
(to avoid abstracting away problems). At first glance, I would say
it should allow for "initial configuration of device state", such as
registers etc.

Why exactly you need to set attributes Scott?

> > What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> > (non-glue/wrapper) code can become common?
> > 
> No new ioctl with exactly same result (well actually even faster since
> less copying is done). You need to show us the benefits of the new interface
> vs existing one, not vice versa.

Also related to this question is the point of avoiding having the
implementation of devices spread across userspace and the kernel
(this is one point Avi brought up often). If the device emulation
is implemented entirely in the kernel, all that is necessary is the
initial values of device registers (and retrieval of those values
later for savevm/migration).

It is then not necessary to set device attributes on a live guest and
deal with the complications associated with that.





Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Marcelo Tosatti
On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:
> >Copying Christoffer since ARM has in kernel irq chip too.
> >
> >On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
> >> Currently, devices that are emulated inside KVM are configured in a
> >> hardcoded manner based on an assumption that any given architecture
> >> only has one way to do it.  If there's any need to access device
> >state,
> >> it is done through inflexible one-purpose-only IOCTLs (e.g.
> >> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> >> cumbersome and depletes a limited numberspace.

Creation of an ioctl has advantages. It makes explicit what the
data contains and how it should be interpreted.
This means that, for example, policy control can be performed at the ioctl
level (one can group, by ioctl number, which ioctl calls are allowed after
VM creation, for example).

It also makes it explicit that it's a userspace interface on which
applications depend.

With a single 'set device attribute' ioctl it's more difficult
to do that.

Abstracting the details also makes it cumbersome to read 
strace output :-)

> >> This API provides a mechanism to instantiate a device of a certain
> >> type, returning an ID that can be used to set/get attributes of the
> >> device.  Attributes may include configuration parameters (e.g.
> >> register base address), device state, operational commands, etc.  It
> >> is similar to the ONE_REG API, except that it acts on devices rather
> >> than vcpus.
> >You not only provide a different way to create an in-kernel irq chip,
> >you also use an alternate way to trigger interrupt lines. Before going
> >into interface specifics, let's think about whether it is really worth it?
> 
> Which "it" do you mean here?
> 
> The ability to set/get attributes is needed.  Sorry, but "get or set
> one blob of data, up to 512 bytes, for the entire irqchip" is just
> not good enough -- assuming you don't want us to start sticking
> pointers and commands in *that* data. :-)

Why not? Is it necessary to constantly read/write attributes?

> If you mean the way to inject interrupts, it's simpler this way.

Are you injecting interrupts via this new SET_DEVICE_ATTRIBUTE ioctl?

> Why go out of our way to inject common glue code into a
> communication path between hw/kvm/mpic.c in QEMU and
> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> glue be specific to this one function when we could reuse the same
> communication glue used for other things, such as device state?
> 
> And that's just for regular interrupts.  MSIs are vastly simpler on
> MPIC than what x86 does.
> 
> >x86 obviously support old way and will have to for some, very
> >long, time.
> 
> Sure.
> 
> >ARM vGIC code, that is ready to go upstream, uses old way too. So
> >it will
> >be 2 archs against one.
> 
> I wasn't aware that that's how it worked. :-P
> 
> I was trying to be considerate by not making the entire thing
> gratuitously PPC or MPIC specific, as some others seem inclined to
> do (e.g. see irqfd and APIC).  We already had a discussion on ARM's
> "set address" ioctl and rather than extend *that* interface, they
> preferred to just stick something ARM-specific in ASAP with the
> understanding that it would be replaced (or more accurately, kept
> around as a thin wrapper around the new stuff) later.
> 
> >Christoffer do you think the proposed way it
> >better for your needs. Are you willing to make vGIC use it?
> >
> >Scott, what other devices are you planning to support with this
> >interface?
> 
> At the moment I do not have plans for other devices, though what
> does it hurt for the capability to be there?
> 
> -Scott


Re: [PATCH v2][QEMU] vmxcap: Report APIC register emulation and RDTSCP control

2013-02-20 Thread Marcelo Tosatti
On Mon, Feb 18, 2013 at 07:56:54AM +0100, Jan Kiszka wrote:
> From: Jan Kiszka 
> 
> Signed-off-by: Jan Kiszka 
> ---
> 
> This time I've checked twice that I'm no longer missing a field.

Applied, thanks.



Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-20 Thread Marcelo Tosatti
On Sat, Feb 16, 2013 at 01:56:14PM +1100, Paul Mackerras wrote:
> On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote:
> > On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
> > >On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
> > >> On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
> > >> >From: Benjamin Herrenschmidt 
> > >> >
> > >> >This adds in-kernel emulation of the XICS (eXternal Interrupt
> > >> >Controller Specification) interrupt controller specified by
> > >PAPR, for
> > >> >both HV and PR KVM guests.
> > >> >
> > >> >This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
> > >> >KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
> > >> >should use in-kernel interrupt controller emulation, but also
> > >takes an
> > >> >argument struct that contains the type of interrupt controller
> > >> >architecture and an optional parameter.  Currently only one
> > >type value
> > >> >is defined, that which indicates the XICS architecture.
> > >>
> > >> Would the device config API I posted a couple days ago work for you?
> > >
> > >I suppose it could be made to work.  It doesn't feel like a natural
> > >fit though, because your API seems to assume (AFAICT) that a device is
> > >manipulated via registers at specific physical addresses, so I would
> > >have to invent an artificial set of registers with addresses and bit
> > >layouts, that aren't otherwise required.  The XICS is operated from
> > >the guest side via hcalls, not via emulated MMIO.
> > 
> > I don't think it makes such an assumption.  The MPIC device has
> > physical registers, so it exposes them, but it also exposes things
> > that are not physical registers (e.g. the per-IRQ input state).  The
> > generic device control layer leaves interpretation of attributes up
> > to the device.
> > 
> > I think it would be easier to fit XICS into the device control api
> > model than to fit MPIC into this model, not to mention what would
> > happen if we later want to emulate some other type of device -- x86
> > already has at least one non-irqchip emulated device (i8254).
> 
> I have no particular objection to the device control API per se, but
> I have two objections to using it as the primary interface to the XICS
> emulation.
> 
> First, I dislike the magical side-effect where creating a device of a
> particular type (e.g. MPIC or XICS) automatically attaches it to the
> interrupt lines of the vcpus.  I prefer an explicit request to do
> in-kernel interrupt control. 

This is probably a stupid question, but why is the
KVM_SET_IRQCHIP/KVM_SET_GSI_ROUTING interface not appropriate for
your purposes?

x86 sets up a default GSI->IRQCHIP PIN mapping on creation (during
KVM_SET_IRQCHIP), but it can be modified with KVM_SET_GSI_ROUTING.
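
For comparison, changing a route with the existing API looks like this from
userspace (a minimal sketch; note that KVM_SET_GSI_ROUTING replaces the whole
table, so a real caller passes every entry, not just the changed one):

#include <linux/kvm.h>
#include <stdlib.h>
#include <sys/ioctl.h>

/* Route GSI 5 to pin 5 of in-kernel irqchip 0. */
static int route_gsi5(int vmfd)
{
	struct kvm_irq_routing *r;
	struct kvm_irq_routing_entry *e;
	int ret;

	r = calloc(1, sizeof(*r) + sizeof(*e));
	if (!r)
		return -1;

	r->nr = 1;
	e = &r->entries[0];
	e->gsi = 5;
	e->type = KVM_IRQ_ROUTING_IRQCHIP;
	e->u.irqchip.irqchip = 0;	/* e.g. the master PIC on x86 */
	e->u.irqchip.pin = 5;

	ret = ioctl(vmfd, KVM_SET_GSI_ROUTING, r);
	free(r);
	return ret;
}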

>  Further, the magic means that you can
> only have one instance of the device, whereas you might want to model
> the interrupt controller architecture as several devices.  You could
> do that using several device types, but then the interconnections
> between them would also be magic.
> 
> Secondly, it means that we are completely abandoning any attempt to
> define an abstract or generic interface to in-kernel interrupt
> controller emulations.  Each device will have its own unique set of
> attribute groups and its own unique userspace code to drive it, with
> no commonality between them.





Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Jan Kiszka
On 2013-02-20 18:24, Jan Kiszka wrote:
> On 2013-02-20 18:01, Gleb Natapov wrote:
>> On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
>>> On 2013-02-20 15:14, Nadav Har'El wrote:
 Hi,

 By the way, if you haven't seen my description of why the current code
 did what it did, take a look at
 http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
 Another description might also come in handy:
 http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html

 On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework 
 event injection and recovery":
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow to inject interrupts directly from L0
> to L2.
>
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.

 Last time I checked, if I'm remembering correctly, the nested SVM code did
 something a bit different: After the exit from L2 to L1 and unnecessarily
 queuing the pending interrupt for injection, it skipped one entry into L1,
 and as usual after the entry the interrupt queue is cleared so next time
 around, when L1 is really entered, the wrong injection is not 
 attempted.

> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.

> To avoid that we incorrectly leak an event into the architectural VCPU
> state that L1 wants to inject, we skip cancellation on nested run.

 I didn't understand this last point.
>>>
>>> - prepare_vmcs02 sets event to be injected into L2
>>> - while trying to enter L2, a cancel condition is met
>>> - we call vmx_cancel_interrupts but should now avoid filling L1's event
>>>   into the arch event queues - it's kept in vmcs12
>>>
>> But what if we put it in arch event queue? It will be reinjected during
>> next entry attempt, so nothing bad happens and we have one less if() to 
>> explain,
>> or do I miss something terrible that will happen?
> 
> I started without that if but ran into troubles with KVM-on-KVM (L1
> locks up). Let me dig out the instrumentation and check the event flow
> again.

OK, got it again: If we transfer an IRQ that L1 wants to send to L2 into
the architectural VCPU state, we will also trigger enable_irq_window.
And that raises KVM_REQ_IMMEDIATE_EXIT again because it thinks L0 wants
to inject. That will send us into an endless loop.

Not sure if we can and should handle this scenario in enable_irq_window
in a nicer way. Open for suggestions.
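
For reference, the path in question looks roughly like this in the current
vmx.c (reproduced from memory, so treat it as a sketch rather than the exact
code):

static void enable_irq_window(struct kvm_vcpu *vcpu)
{
	u32 cpu_based_vm_exec_control;

	if (is_guest_mode(vcpu) && nested_exit_on_intr(vcpu)) {
		/*
		 * A pending interrupt is assumed to be for L1 while L2 must
		 * run first, so request an immediate exit after the upcoming
		 * vmentry.  If the event in the architectural queue is really
		 * L1's injection into L2, this branch fires again on every
		 * entry, which is the endless loop described above.
		 */
		kvm_make_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);
		return;
	}

	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
	cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
}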

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Jan Kiszka
On 2013-02-20 18:01, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
>> On 2013-02-20 15:14, Nadav Har'El wrote:
>>> Hi,
>>>
>>> By the way, if you haven't seen my description of why the current code
>>> did what it did, take a look at
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
>>> Another description might also come in handy:
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
>>>
>>> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework 
>>> event injection and recovery":
 This aligns VMX more with SVM regarding event injection and recovery for
 nested guests. The changes allow to inject interrupts directly from L0
 to L2.

 One difference to SVM is that we always transfer the pending event
 injection into the architectural state of the VCPU and then drop it from
 there if it turns out that we left L2 to enter L1.
>>>
>>> Last time I checked, if I'm remembering correctly, the nested SVM code did
>>> something a bit different: After the exit from L2 to L1 and unnecessarily
>>> queuing the pending interrupt for injection, it skipped one entry into L1,
>>> and as usual after the entry the interrupt queue is cleared so next time
>>> around, when L1 is really entered, the wrong injection is not attempted.
>>>
 VMX and SVM are now identical in how they recover event injections from
 unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
 still contains a valid event and, if yes, transfer the content into L1's
 idt_vectoring_info_field.
>>>
 To avoid that we incorrectly leak an event into the architectural VCPU
 state that L1 wants to inject, we skip cancellation on nested run.
>>>
>>> I didn't understand this last point.
>>
>> - prepare_vmcs02 sets event to be injected into L2
>> - while trying to enter L2, a cancel condition is met
>> - we call vmx_cancel_interrupts but should now avoid filling L1's event
>>   into the arch event queues - it's kept in vmcs12
>>
> But what if we put it in arch event queue? It will be reinjected during
> next entry attempt, so nothing bad happens and we have one less if() to 
> explain,
> or do I miss something terrible that will happen?

I started without that if but ran into troubles with KVM-on-KVM (L1
locks up). Let me dig out the instrumentation and check the event flow
again.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d

2013-02-20 Thread Peter Hurley
On Tue, 2013-02-19 at 10:26 +0200, Gleb Natapov wrote:
> On Mon, Feb 18, 2013 at 08:12:21PM -0500, Peter Hurley wrote:
> > On Mon, 2013-02-18 at 19:59 -0300, Marcelo Tosatti wrote:
> > > On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote:
> > > > On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote:
> > > > > On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote:
> > > > > > On 02/12/2013 04:26 PM, Peter Hurley wrote:
> > > > > > > With -next-20130204+ in ubuntu 12.10 VM (so the 80x25 VGA
> > > > > > > device/console):
> > > > > > > 
> > > > > > > [0.666410] udevd[97]: starting version 175
> > > > > > > [0.674043] udevd[97]: udevd:[97]: segfault at 
> > > > > > > ff5fd020 ip 7fff069e277f sp 7fff068c9ef8 error d
> > > > > > > 
> > > > > > > and boots to an initramfs prompt.
> > > > > > > 
> > > > > > > git bisect (log attached) blames:
> > > > > > > 
> > > > > > > commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f
> > > > > > > Merge: 3596f5b 949db15
> > > > > > > Author: H. Peter Anvin 
> > > > > > > Date:   Fri Jan 25 16:31:21 2013 -0800
> > > > > > > 
> > > > > > > Merge tag 'v3.8-rc5' into x86/mm
> > > > > > > 
> > > > > > > The __pa() fixup series that follows touches KVM code that is 
> > > > > > > not
> > > > > > > present in the existing branch based on v3.7-rc5, so merge in 
> > > > > > > the
> > > > > > > current upstream from Linus.
> > > > > > > 
> > > > > > > Signed-off-by: H. Peter Anvin 
> > > > > > > 
> > > > > > > 
> > > > > > > This only happens with the VGA device/console but that is the 
> > > > > > > default
> > > > > > > configuration for Ubuntu/KVM because it blacklists pretty much 
> > > > > > > every fb
> > > > > > > driver.
> > > > > > > 
> > > > > > 
> > > > > > I am guessing this is another bad use of __pa()... need to look 
> > > > > > into that.
> > > 
> > > Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible
> > > there?
> > 
> > This is in the linux-next repo (any git tag after 'next-20130204' will
> > reproduce this). It's a pretty large merge commit.
> > 
> > This doesn't happen on 3.8-rc7.
> > 
> > I'll try to repro this on kvm.git sometime this week. Otherwise, we can
> > wait for it to show up in 3.9.
> > 
> Can you also drop 5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d from linux-next 
> and reproduce?

Still reproduced on next-20130204 after reverting 5dfd486c.
Haven't tested kvm.git and newest next.




Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Gleb Natapov
On Wed, Feb 20, 2013 at 03:37:51PM +0100, Jan Kiszka wrote:
> On 2013-02-20 15:14, Nadav Har'El wrote:
> > Hi,
> > 
> > By the way, if you haven't seen my description of why the current code
> > did what it did, take a look at
> > http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
> > Another description might also come in handy:
> > http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
> > 
> > On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework 
> > event injection and recovery":
> >> This aligns VMX more with SVM regarding event injection and recovery for
> >> nested guests. The changes allow to inject interrupts directly from L0
> >> to L2.
> >>
> >> One difference to SVM is that we always transfer the pending event
> >> injection into the architectural state of the VCPU and then drop it from
> >> there if it turns out that we left L2 to enter L1.
> > 
> > Last time I checked, if I'm remembering correctly, the nested SVM code did
> > something a bit different: After the exit from L2 to L1 and unnecessarily
> > queuing the pending interrupt for injection, it skipped one entry into L1,
> > and as usual after the entry the interrupt queue is cleared so next time
> > around, when L1 is really entered, the wrong injection is not attempted.
> > 
> >> VMX and SVM are now identical in how they recover event injections from
> >> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> >> still contains a valid event and, if yes, transfer the content into L1's
> >> idt_vectoring_info_field.
> > 
> >> To avoid that we incorrectly leak an event into the architectural VCPU
> >> state that L1 wants to inject, we skip cancellation on nested run.
> > 
> > I didn't understand this last point.
> 
> - prepare_vmcs02 sets event to be injected into L2
> - while trying to enter L2, a cancel condition is met
> - we call vmx_cancel_interrupts but should now avoid filling L1's event
>   into the arch event queues - it's kept in vmcs12
> 
But what if we put it in arch event queue? It will be reinjected during
next entry attempt, so nothing bad happens and we have one less if() to explain,
or do I miss something terrible that will happen?

--
Gleb.


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Gleb Natapov
On Wed, Feb 20, 2013 at 05:48:40PM +0100, Jan Kiszka wrote:
> On 2013-02-20 17:46, Gleb Natapov wrote:
> > On Wed, Feb 20, 2013 at 02:01:47PM +0100, Jan Kiszka wrote:
> >> This aligns VMX more with SVM regarding event injection and recovery for
> >> nested guests. The changes allow to inject interrupts directly from L0
> >> to L2.
> >>
> >> One difference to SVM is that we always transfer the pending event
> >> injection into the architectural state of the VCPU and then drop it from
> >> there if it turns out that we left L2 to enter L1.
> >>
> >> VMX and SVM are now identical in how they recover event injections from
> >> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> >> still contains a valid event and, if yes, transfer the content into L1's
> >> idt_vectoring_info_field.
> >>
> >> To avoid that we incorrectly leak an event into the architectural VCPU
> >> state that L1 wants to inject, we skip cancellation on nested run.
> >>
> >> Signed-off-by: Jan Kiszka 
> >> ---
> >>
> >> Survived moderate testing here and (currently) makes sense to me, but
> >> please review very carefully. I wouldn't be surprised if I'm still
> >> missing some subtle corner case.
> >>
> >>  arch/x86/kvm/vmx.c |   57 
> >> +++
> >>  1 files changed, 26 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> index dd3a8a0..7d2fbd2 100644
> >> --- a/arch/x86/kvm/vmx.c
> >> +++ b/arch/x86/kvm/vmx.c
> >> @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct 
> >> vcpu_vmx *vmx,
> >>  
> >>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> >>  {
> >> -  if (is_guest_mode(&vmx->vcpu))
> >> -  return;
> >>__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
> >>  VM_EXIT_INSTRUCTION_LEN,
> >>  IDT_VECTORING_ERROR_CODE);
> >> @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx 
> >> *vmx)
> >>  
> >>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
> >>  {
> >> -  if (is_guest_mode(vcpu))
> >> +  if (to_vmx(vcpu)->nested.nested_run_pending)
> >>return;
> > Why is this needed here?
> 
> Please check if my reply to Nadav explains this sufficiently.
> 
Ah, sorry. Will follow up there if it is not.

--
Gleb.


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Jan Kiszka
On 2013-02-20 17:46, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 02:01:47PM +0100, Jan Kiszka wrote:
>> This aligns VMX more with SVM regarding event injection and recovery for
>> nested guests. The changes allow to inject interrupts directly from L0
>> to L2.
>>
>> One difference to SVM is that we always transfer the pending event
>> injection into the architectural state of the VCPU and then drop it from
>> there if it turns out that we left L2 to enter L1.
>>
>> VMX and SVM are now identical in how they recover event injections from
>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>> still contains a valid event and, if yes, transfer the content into L1's
>> idt_vectoring_info_field.
>>
>> To avoid that we incorrectly leak an event into the architectural VCPU
>> state that L1 wants to inject, we skip cancellation on nested run.
>>
>> Signed-off-by: Jan Kiszka 
>> ---
>>
>> Survived moderate testing here and (currently) makes sense to me, but
>> please review very carefully. I wouldn't be surprised if I'm still
>> missing some subtle corner case.
>>
>>  arch/x86/kvm/vmx.c |   57 
>> +++
>>  1 files changed, 26 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index dd3a8a0..7d2fbd2 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx 
>> *vmx,
>>  
>>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>>  {
>> -if (is_guest_mode(&vmx->vcpu))
>> -return;
>>  __vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
>>VM_EXIT_INSTRUCTION_LEN,
>>IDT_VECTORING_ERROR_CODE);
>> @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx 
>> *vmx)
>>  
>>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>>  {
>> -if (is_guest_mode(vcpu))
>> +if (to_vmx(vcpu)->nested.nested_run_pending)
>>  return;
> Why is this needed here?

Please check if my reply to Nadav explains this sufficiently.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Gleb Natapov
On Wed, Feb 20, 2013 at 02:01:47PM +0100, Jan Kiszka wrote:
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow to inject interrupts directly from L0
> to L2.
> 
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.
> 
> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.
> 
> To avoid that we incorrectly leak an event into the architectural VCPU
> state that L1 wants to inject, we skip cancellation on nested run.
> 
> Signed-off-by: Jan Kiszka 
> ---
> 
> Survived moderate testing here and (currently) makes sense to me, but
> please review very carefully. I wouldn't be surprised if I'm still
> missing some subtle corner case.
> 
>  arch/x86/kvm/vmx.c |   57 +++
>  1 files changed, 26 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index dd3a8a0..7d2fbd2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx 
> *vmx,
>  
>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  {
> - if (is_guest_mode(&vmx->vcpu))
> - return;
>   __vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
> VM_EXIT_INSTRUCTION_LEN,
> IDT_VECTORING_ERROR_CODE);
> @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx 
> *vmx)
>  
>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>  {
> - if (is_guest_mode(vcpu))
> + if (to_vmx(vcpu)->nested.nested_run_pending)
>   return;
Why is this needed here?

>   __vmx_complete_interrupts(to_vmx(vcpu),
> vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> @@ -6531,21 +6529,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>   unsigned long debugctlmsr;
>  
> - if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
> - struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> - if (vmcs12->idt_vectoring_info_field &
> - VECTORING_INFO_VALID_MASK) {
> - vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> - vmcs12->idt_vectoring_info_field);
> - vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> - vmcs12->vm_exit_instruction_len);
> - if (vmcs12->idt_vectoring_info_field &
> - VECTORING_INFO_DELIVER_CODE_MASK)
> - vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> - vmcs12->idt_vectoring_error_code);
> - }
> - }
> -
>   /* Record the guest's net vcpu time for enforced NMI injections. */
>   if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
>   vmx->entry_time = ktime_get();
> @@ -6704,17 +6687,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>  
>   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
>  
> - if (is_guest_mode(vcpu)) {
> - struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> - vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
> - if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
> - vmcs12->idt_vectoring_error_code =
> - vmcs_read32(IDT_VECTORING_ERROR_CODE);
> - vmcs12->vm_exit_instruction_len =
> - vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> - }
> - }
> -
>   vmx->loaded_vmcs->launched = 1;
>  
>   vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct 
> vmcs12 *vmcs12)
>   vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>   vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>  
> - /* clear vm-entry fields which are to be cleared on exit */
> - if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> + /* drop what we picked up for L0 via vmx_complete_interrupts */
> + vcpu->arch.nmi_injected = false;
> + kvm_clear_exception_queue(vcpu);
> + kvm_clear_interrupt_queue(vcpu);
> +
> + if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
> + vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
> + /*
> +  * Preserve the event that was supposed to be injected
> +  * by emulating it would have been re

Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Jan Kiszka
On 2013-02-20 16:57, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 04:51:39PM +0100, Jan Kiszka wrote:
>> On 2013-02-20 16:30, Gleb Natapov wrote:
>>> On Wed, Feb 20, 2013 at 03:53:53PM +0100, Jan Kiszka wrote:
 On 2013-02-20 14:01, Jan Kiszka wrote:
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow to inject interrupts directly from L0
> to L2.
>
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.
>
> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.
>
> To avoid that we incorrectly leak an event into the architectural VCPU
> state that L1 wants to inject, we skip cancellation on nested run.
>
> Signed-off-by: Jan Kiszka 
> ---
>
> Survived moderate testing here and (currently) makes sense to me, but
> please review very carefully. I wouldn't be surprised if I'm still
> missing some subtle corner case.

 Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
 of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
 injection work. So this patch logically depends on it.

>>> But this patch has hunks from that patch.
>>
>> Not mechanically.
>>
> What do you mean?

You can apply them in arbitrary order, just minor offset shifts will be
the result.

> 
>> If you prefer me merging them together, let me know.
>>
> For review not necessary, for applying preferably.

OK, will wait for review on this, then send out a combo patch.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Gleb Natapov
On Wed, Feb 20, 2013 at 04:51:39PM +0100, Jan Kiszka wrote:
> On 2013-02-20 16:30, Gleb Natapov wrote:
> > On Wed, Feb 20, 2013 at 03:53:53PM +0100, Jan Kiszka wrote:
> >> On 2013-02-20 14:01, Jan Kiszka wrote:
> >>> This aligns VMX more with SVM regarding event injection and recovery for
> >>> nested guests. The changes allow to inject interrupts directly from L0
> >>> to L2.
> >>>
> >>> One difference to SVM is that we always transfer the pending event
> >>> injection into the architectural state of the VCPU and then drop it from
> >>> there if it turns out that we left L2 to enter L1.
> >>>
> >>> VMX and SVM are now identical in how they recover event injections from
> >>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> >>> still contains a valid event and, if yes, transfer the content into L1's
> >>> idt_vectoring_info_field.
> >>>
> >>> To avoid that we incorrectly leak an event into the architectural VCPU
> >>> state that L1 wants to inject, we skip cancellation on nested run.
> >>>
> >>> Signed-off-by: Jan Kiszka 
> >>> ---
> >>>
> >>> Survived moderate testing here and (currently) makes sense to me, but
> >>> please review very carefully. I wouldn't be surprised if I'm still
> >>> missing some subtle corner case.
> >>
> >> Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
> >> of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
> >> injection work. So this patch logically depends on it.
> >>
> > But this patch has hunks from that patch.
> 
> Not mechanically.
> 
What do you mean?

> If you prefer me merging them together, let me know.
> 
For review not necessary, for applying preferably.

--
Gleb.


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Jan Kiszka
On 2013-02-20 16:30, Gleb Natapov wrote:
> On Wed, Feb 20, 2013 at 03:53:53PM +0100, Jan Kiszka wrote:
>> On 2013-02-20 14:01, Jan Kiszka wrote:
>>> This aligns VMX more with SVM regarding event injection and recovery for
>>> nested guests. The changes allow to inject interrupts directly from L0
>>> to L2.
>>>
>>> One difference to SVM is that we always transfer the pending event
>>> injection into the architectural state of the VCPU and then drop it from
>>> there if it turns out that we left L2 to enter L1.
>>>
>>> VMX and SVM are now identical in how they recover event injections from
>>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>>> still contains a valid event and, if yes, transfer the content into L1's
>>> idt_vectoring_info_field.
>>>
>>> To avoid that we incorrectly leak an event into the architectural VCPU
>>> state that L1 wants to inject, we skip cancellation on nested run.
>>>
>>> Signed-off-by: Jan Kiszka 
>>> ---
>>>
>>> Survived moderate testing here and (currently) makes sense to me, but
>>> please review very carefully. I wouldn't be surprised if I'm still
>>> missing some subtle corner case.
>>
>> Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
>> of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
>> injection work. So this patch logically depends on it.
>>
> But this patch has hunks from that patch.

Not mechanically.

If you prefer me merging them together, let me know.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Gleb Natapov
On Wed, Feb 20, 2013 at 03:53:53PM +0100, Jan Kiszka wrote:
> On 2013-02-20 14:01, Jan Kiszka wrote:
> > This aligns VMX more with SVM regarding event injection and recovery for
> > nested guests. The changes allow to inject interrupts directly from L0
> > to L2.
> > 
> > One difference to SVM is that we always transfer the pending event
> > injection into the architectural state of the VCPU and then drop it from
> > there if it turns out that we left L2 to enter L1.
> > 
> > VMX and SVM are now identical in how they recover event injections from
> > unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> > still contains a valid event and, if yes, transfer the content into L1's
> > idt_vectoring_info_field.
> > 
> > To avoid that we incorrectly leak an event into the architectural VCPU
> > state that L1 wants to inject, we skip cancellation on nested run.
> > 
> > Signed-off-by: Jan Kiszka 
> > ---
> > 
> > Survived moderate testing here and (currently) makes sense to me, but
> > please review very carefully. I wouldn't be surprised if I'm still
> > missing some subtle corner case.
> 
> Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
> of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
> injection work. So this patch logically depends on it.
> 
But this patch has hunks from that patch.

> Jan
> 
> > 
> >  arch/x86/kvm/vmx.c |   57 
> > +++
> >  1 files changed, 26 insertions(+), 31 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index dd3a8a0..7d2fbd2 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx 
> > *vmx,
> >  
> >  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> >  {
> > -   if (is_guest_mode(&vmx->vcpu))
> > -   return;
> > __vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
> >   VM_EXIT_INSTRUCTION_LEN,
> >   IDT_VECTORING_ERROR_CODE);
> > @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx 
> > *vmx)
> >  
> >  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
> >  {
> > -   if (is_guest_mode(vcpu))
> > +   if (to_vmx(vcpu)->nested.nested_run_pending)
> > return;
> > __vmx_complete_interrupts(to_vmx(vcpu),
> >   vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> > @@ -6531,21 +6529,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> > *vcpu)
> > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > unsigned long debugctlmsr;
> >  
> > -   if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
> > -   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > -   if (vmcs12->idt_vectoring_info_field &
> > -   VECTORING_INFO_VALID_MASK) {
> > -   vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> > -   vmcs12->idt_vectoring_info_field);
> > -   vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> > -   vmcs12->vm_exit_instruction_len);
> > -   if (vmcs12->idt_vectoring_info_field &
> > -   VECTORING_INFO_DELIVER_CODE_MASK)
> > -   vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> > -   vmcs12->idt_vectoring_error_code);
> > -   }
> > -   }
> > -
> > /* Record the guest's net vcpu time for enforced NMI injections. */
> > if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
> > vmx->entry_time = ktime_get();
> > @@ -6704,17 +6687,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> > *vcpu)
> >  
> > vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
> >  
> > -   if (is_guest_mode(vcpu)) {
> > -   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > -   vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
> > -   if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
> > -   vmcs12->idt_vectoring_error_code =
> > -   vmcs_read32(IDT_VECTORING_ERROR_CODE);
> > -   vmcs12->vm_exit_instruction_len =
> > -   vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> > -   }
> > -   }
> > -
> > vmx->loaded_vmcs->launched = 1;
> >  
> > vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
> > @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct 
> > vmcs12 *vmcs12)
> > vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> > vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
> >  
> > -   /* clear vm-entry fields which are to be cleared on exit */
> > -   if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> > +   /* drop what we picked up for L0 via vmx_complete_interrupts */
> > +   vcpu->arch.nmi_injected = 

[Bug 54161] nVMX: nested vpid

2013-02-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=54161


Nadav Har'El  changed:

           What      |Removed     |Added
-----------------------------------------
           Blocks    |            |53601






[Bug 53601] nVMX meta-bug

2013-02-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=53601


Nadav Har'El  changed:

           What      |Removed     |Added
-----------------------------------------
       Depends on    |            |54161






[Bug 54161] New: nVMX: nested vpid

2013-02-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=54161

   Summary: nVMX: nested vpid
   Product: Virtualization
   Version: unspecified
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: enhancement
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: n...@math.technion.ac.il
Regression: No


Currently, support for VPID in nested VMX is minimal: L1 doesn't see the vpid
feature, and doesn't use it, but L0 may use it (if available) when running the
guests: We use the same vpid to run L1 and all its L2 guests, but call
vmx_flush_tlb(vcpu) when going back and forth between L1 and L2 - this flushes
this particular vpid, but at least leaves the cache entries of other L1s intact.

A better-performing solution would be to support "nested vpid": L0 exposes the
VPID feature to L1, and maintains a mapping between the VPID numbers that L1
uses and the VPID numbers that L0 itself uses. L0 then runs each of its
guests - L1s and L2s - as separate guests.
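
As a very rough sketch (all of the names below - vpid_map, nested_get_vpid02,
nested_alloc_vpid - are made up, nothing like this exists in vmx.c today), the
mapping could be as simple as a small list hanging off the nested state:

struct nested_vpid_map {
	u16 vpid12;	/* the vpid L1 chose for one of its L2s */
	u16 vpid02;	/* the vpid L0 actually programs into vmcs02 */
	struct list_head list;
};

static u16 nested_get_vpid02(struct vcpu_vmx *vmx, u16 vpid12)
{
	struct nested_vpid_map *map;

	list_for_each_entry(map, &vmx->nested.vpid_map, list)
		if (map->vpid12 == vpid12)
			return map->vpid02;

	map = kzalloc(sizeof(*map), GFP_KERNEL);
	if (!map)
		return 0;	/* vpid 0: fall back to flushing on every switch */
	map->vpid12 = vpid12;
	map->vpid02 = nested_alloc_vpid();	/* e.g. from vmx_vpid_bitmap */
	list_add(&map->list, &vmx->nested.vpid_map);
	return map->vpid02;
}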

Some points to remember while doing this (this is only a partial list!):

1. Remember to set vmx->vpid on nested_vmx_run() and nested_vmx_vmexit() to
L1's and L2's vpid. This is important because vmx_flush_tlb() (called on
context switch between a guest's processes) flushes only the current vpid (if
possible) and we need to do it to the correct one.
2. In free_nested(), also free the vpids that have been allocated to all the
L2s. This can probably be done in nested_free_saved_vmcs() (see mention above).
However, note that free_nested() is preceded (in vmx_free_vcpu()) by a call to
free_vpid() which frees the current vmx->vpid, which might be L1's or L2's, and
we need to make sure we free all of them, and just once.
3. Also, of course, implement INVVPID emulation for L1.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interupt on vmexit

2013-02-20 Thread Avi Kivity
On Wed, Feb 20, 2013 at 3:10 PM, Zhang, Yang Z  wrote:

 push %%cs
>>> "push %%cs" is invalid in x86_64.
>>
>> Oops. 'push[lq] $__KERNEL_CS' then.
> Is this right? I just copied it from another file.
>
> #define __STR(X) #X
> #define STR(X) __STR(X)
>
> #ifdef CONFIG_X86_64
> "pushq $"STR(__KERNEL_CS)" \n\t"
> #else
> "pushl $"STR(__KERNEL_CS)" \n\t"
> #endif
>
> #undef STR
> #undef __STR
>

Use __ASM_SIZE and an immediate operand for __KERNEL_CS:

 asm ( ... : : [cs]"i"(__KERNEL_CS) );

and the code will be cleaner.
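
Something along these lines (just a sketch of the constraint usage, not the
full sequence from the patch; 'entry' is the gate offset already computed
there):

	asm volatile("pushf \n\t"
		     __ASM_SIZE(push) " $%c[cs] \n\t"
		     "call *%[entry] \n\t"
		     :
		     : [entry]"r"(entry), [cs]"i"(__KERNEL_CS)
		     : "cc", "memory");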
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Jan Kiszka
On 2013-02-20 14:01, Jan Kiszka wrote:
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow to inject interrupts directly from L0
> to L2.
> 
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.
> 
> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.
> 
> To avoid that we incorrectly leak an event into the architectural VCPU
> state that L1 wants to inject, we skip cancellation on nested run.
> 
> Signed-off-by: Jan Kiszka 
> ---
> 
> Survived moderate testing here and (currently) makes sense to me, but
> please review very carefully. I wouldn't be surprised if I'm still
> missing some subtle corner case.

Forgot to point this out again: It still takes "KVM: nVMX: Fix injection
of PENDING_INTERRUPT and NMI_WINDOW exits to L1" to make L0->L2
injection work. So this patch logically depends on it.

Jan

> 
>  arch/x86/kvm/vmx.c |   57 +++
>  1 files changed, 26 insertions(+), 31 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index dd3a8a0..7d2fbd2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx 
> *vmx,
>  
>  static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>  {
> - if (is_guest_mode(&vmx->vcpu))
> - return;
>   __vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
> VM_EXIT_INSTRUCTION_LEN,
> IDT_VECTORING_ERROR_CODE);
> @@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx 
> *vmx)
>  
>  static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>  {
> - if (is_guest_mode(vcpu))
> + if (to_vmx(vcpu)->nested.nested_run_pending)
>   return;
>   __vmx_complete_interrupts(to_vmx(vcpu),
> vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> @@ -6531,21 +6529,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>   struct vcpu_vmx *vmx = to_vmx(vcpu);
>   unsigned long debugctlmsr;
>  
> - if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
> - struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> - if (vmcs12->idt_vectoring_info_field &
> - VECTORING_INFO_VALID_MASK) {
> - vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
> - vmcs12->idt_vectoring_info_field);
> - vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> - vmcs12->vm_exit_instruction_len);
> - if (vmcs12->idt_vectoring_info_field &
> - VECTORING_INFO_DELIVER_CODE_MASK)
> - vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
> - vmcs12->idt_vectoring_error_code);
> - }
> - }
> -
>   /* Record the guest's net vcpu time for enforced NMI injections. */
>   if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
>   vmx->entry_time = ktime_get();
> @@ -6704,17 +6687,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu 
> *vcpu)
>  
>   vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
>  
> - if (is_guest_mode(vcpu)) {
> - struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> - vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
> - if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
> - vmcs12->idt_vectoring_error_code =
> - vmcs_read32(IDT_VECTORING_ERROR_CODE);
> - vmcs12->vm_exit_instruction_len =
> - vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
> - }
> - }
> -
>   vmx->loaded_vmcs->launched = 1;
>  
>   vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct 
> vmcs12 *vmcs12)
>   vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>   vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>  
> - /* clear vm-entry fields which are to be cleared on exit */
> - if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> + /* drop what we picked up for L0 via vmx_complete_interrupts */
> + vcpu->arch.nmi_injected = false;
> + kvm_clear_exception_queue(vcpu);
> + kvm_clear_interrupt_queue(vcpu);
> +
> + if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
> + vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID

Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Jan Kiszka
On 2013-02-20 15:14, Nadav Har'El wrote:
> Hi,
> 
> By the way, if you haven't seen my description of why the current code
> did what it did, take a look at
> http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
> Another description might also come in handy:
> http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html
> 
> On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event 
> injection and recovery":
>> This aligns VMX more with SVM regarding event injection and recovery for
>> nested guests. The changes allow to inject interrupts directly from L0
>> to L2.
>>
>> One difference to SVM is that we always transfer the pending event
>> injection into the architectural state of the VCPU and then drop it from
>> there if it turns out that we left L2 to enter L1.
> 
> Last time I checked, if I'm remembering correctly, the nested SVM code did
> something a bit different: After the exit from L2 to L1 and unnecessarily
> queuing the pending interrupt for injection, it skipped one entry into L1,
> and as usual after the entry the interrupt queue is cleared so next time
> around, when L1 is really entered, the wrong injection is not attempted.
> 
>> VMX and SVM are now identical in how they recover event injections from
>> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
>> still contains a valid event and, if yes, transfer the content into L1's
>> idt_vectoring_info_field.
> 
>> To avoid that we incorrectly leak an event into the architectural VCPU
>> state that L1 wants to inject, we skip cancellation on nested run.
> 
> I didn't understand this last point.

- prepare_vmcs02 sets the event to be injected into L2
- while trying to enter L2, a cancel condition is met
- we call vmx_cancel_injection but should now avoid filling L1's event
  into the arch event queues - it's kept in vmcs12

> 
>> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct 
>> vmcs12 *vmcs12)
>>  vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>>  vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>>  
>> -/* clear vm-entry fields which are to be cleared on exit */
>> -if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
>> +/* drop what we picked up for L0 via vmx_complete_interrupts */
>> +vcpu->arch.nmi_injected = false;
>> +kvm_clear_exception_queue(vcpu);
>> +kvm_clear_interrupt_queue(vcpu);
> 
> It would be nice to move these lines out of prepare_vmcs12(), since they
> don't really do anything with vmcs12, and move them into
> nested_vmx_vmexit() (which is the one which called prepare_vmcs12()).

OK.

> 
> Did you test this both with PIN_BASED_EXT_INTR_MASK (the usual case) and
> !PIN_BASED_EXT_INTR_MASK (the case which interests you)? We need to make
> sure that in the former case, this doesn't clear the interrupt queue after
> we put an interrupt to be injected in it (at first glance it seems fine,
> but these code paths are so convoluted, it's hard to be sure).

I tested both, but none of my tests came close to covering all potential
corner cases. But that unconditional queue clearing surely deserves
attention and critical review.

> 
>> +if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
>> +vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
>> +/*
>> + * Preserve the event that was supposed to be injected
>> + * by emulating it would have been returned in
>> + * IDT_VECTORING_INFO_FIELD.
>> + */
>> +if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
>> +INTR_INFO_VALID_MASK) {
>> +vmcs12->idt_vectoring_info_field =
>> +vmcs12->vm_entry_intr_info_field;
>> +vmcs12->idt_vectoring_error_code =
>> +vmcs12->vm_entry_exception_error_code;
>> +vmcs12->vm_exit_instruction_len =
>> +vmcs12->vm_entry_instruction_len;
>> +vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
> 
> I'm afraid I'm missing what you are trying to do here. Why would
> vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK ever be
> true? After all, the processor clears it after each successful exit so
> last if() will only succeed on failed entries - but this is NOT the
> case if we're in the enclosing if (note that vmcs12->vm_exit_reason  =
> vmcs_read32(VM_EXIT_REASON)). Maybe I'm missing something?

Canceled vmentry as indicated above. Look at vcpu_enter_guest:
kvm_mmu_reload may fail, or we need to handle some async event / perform
some reschedule. But those points are past prepare_vmcs02.
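
Schematically (heavily trimmed from vcpu_enter_guest() in arch/x86/kvm/x86.c),
both cancel points are reached after prepare_vmcs02 has already run:

	r = kvm_mmu_reload(vcpu);
	if (unlikely(r))
		goto cancel_injection;
	...
	if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests ||
	    need_resched() || signal_pending(current)) {
		...
		r = 1;
		goto cancel_injection;
	}
	...					/* the actual vmentry */
cancel_injection:
	kvm_x86_ops->cancel_injection(vcpu);	/* i.e. vmx_cancel_injection() */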

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Bug 54141] nVMX: Support TPR shadow

2013-02-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=54141


Nadav Har'El  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53601




-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 53601] nVMX meta-bug

2013-02-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=53601


Nadav Har'El  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |54141




-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 54141] New: nVMX: Support TPR shadow

2013-02-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=54141

   Summary: nVMX: Support TPR shadow
   Product: Virtualization
   Version: unspecified
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: enhancement
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: n...@math.technion.ac.il
Regression: No


Add support for TPR shadow, which may be important for performance of
Windows L2 guests (which currently don't work anyway - see bug 53641).
Note, however, that TPR shadow is an optional feature, and KVM (as L1) won't
use it if not available to it.

Some of the things we'll need to do (I think):

1. Advertise CPU_BASED_TPR_SHADOW in MSR_IA32_VMX_PROCBASED_CTLS

2. In nested_vmx_exit_handled_cr, in "mov to cr8", consider if we need an "else
if CPU_BASED_TPR_SHADOW" which sets the shadow and only exits if below the
tpr_threshold. This may be unnecessary, because the processor will already do
this if we put vmcs12 desires in vmcs02, and don't merge it with vmcs01. Also,
do we need to change anything in "mov from cr8" in the same function? I don't
think it will ever get called.

3. In prepare_vmcs02, set the TPR-shadow definitions from vmcs12, ignoring L0's
wishes (if I understand correctly, this is the right thing to do).

4. In prepare_vmcs02, if nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW), we need
to set VIRTUAL_APIC_PAGE_ADDR (see also the sketch after this list). The code
will look something like
struct page *page =
nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
if (!page)
return 1;
vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
but we need to save this "page" in vmx->nested and nested_release_page() it on
nested_vmx_vmexit() or free_nested().

5. In prepare_vmcs02, set TPR_THRESHOLD as requested by L1. We used to have
this code:
if (vm_need_tpr_shadow(vcpu->kvm) &&
nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW))
vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
But I need to consider if "vm_need_tpr_shadow" is the right thing to check.

6. In prepare_vmcs02, in setting up CPU_BASED_VM_EXEC_CONTROL, probably *leave*
the code which removes the CPU_BASED_TPR_SHADOW from L0's exec_control (we'll
get this bit from vmcs12's exec_control).

7. We used to have the following code in prepare_vmcs02, after exec_control |=
vmcs12->cpu_based_vm_exec_control, to remove the TPR_SHADOW feature even if L1
asked for it in certain cases. I don't see why this was needed:
if (!vm_need_tpr_shadow(vcpu->kvm) ||
vmcs12->virtual_apic_page_addr == 0) {
exec_control &= ~CPU_BASED_TPR_SHADOW;
#ifdef CONFIG_X86_64
exec_control |= CPU_BASED_CR8_STORE_EXITING |
CPU_BASED_CR8_LOAD_EXITING;
#endif
} else if (exec_control & CPU_BASED_TPR_SHADOW) {
#ifdef CONFIG_X86_64
exec_control &= ~CPU_BASED_CR8_STORE_EXITING;
exec_control &= ~CPU_BASED_CR8_LOAD_EXITING;
#endif
}
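
For point 4, a rough sketch of the save/release part (the virtual_apic_page
field does not exist yet and would have to be added to the nested state):

/* prepare_vmcs02(), sketch */
if (nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW)) {
	struct page *page =
		nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
	if (!page)
		return 1;
	vmx->nested.virtual_apic_page = page;		/* remember it */
	vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
}

/* nested_vmx_vmexit() or free_nested(), sketch */
if (vmx->nested.virtual_apic_page) {
	nested_release_page(vmx->nested.virtual_apic_page);
	vmx->nested.virtual_apic_page = NULL;
}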

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [uq/master PATCH] target-i386: kvm: save/restore steal time MSR

2013-02-20 Thread Gleb Natapov
On Tue, Feb 19, 2013 at 11:27:20PM -0300, Marcelo Tosatti wrote:
> 
> Read and write steal time MSR, so that reporting is functional across
> migration.
> 
> Signed-off-by: Marcelo Tosatti 
> 
Reviewed-by: Gleb Natapov 

> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index 7577e4f..17c7293 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -792,6 +792,7 @@ typedef struct CPUX86State {
>  #endif
>  uint64_t system_time_msr;
>  uint64_t wall_clock_msr;
> +uint64_t steal_time_msr;
>  uint64_t async_pf_en_msr;
>  uint64_t pv_eoi_en_msr;
>  
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 0cf413d..9ae9d74 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -68,6 +68,7 @@ static bool has_msr_tsc_deadline;
>  static bool has_msr_async_pf_en;
>  static bool has_msr_pv_eoi_en;
>  static bool has_msr_misc_enable;
> +static bool has_msr_kvm_steal_time;
>  static int lm_capable_kernel;
>  
>  bool kvm_allows_irq0_override(void)
> @@ -507,6 +508,8 @@ int kvm_arch_init_vcpu(CPUState *cs)
>  
>  has_msr_pv_eoi_en = c->eax & (1 << KVM_FEATURE_PV_EOI);
>  
> +has_msr_kvm_steal_time = c->eax & (1 << KVM_FEATURE_STEAL_TIME);
> +
>  cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
>  
>  for (i = 0; i <= limit; i++) {
> @@ -1107,6 +1110,10 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>  kvm_msr_entry_set(&msrs[n++], MSR_KVM_PV_EOI_EN,
>env->pv_eoi_en_msr);
>  }
> +if (has_msr_kvm_steal_time) {
> +kvm_msr_entry_set(&msrs[n++], MSR_KVM_STEAL_TIME,
> +  env->steal_time_msr);
> +}
>  if (hyperv_hypercall_available()) {
>  kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_GUEST_OS_ID, 0);
>  kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_HYPERCALL, 0);
> @@ -1360,6 +1367,9 @@ static int kvm_get_msrs(X86CPU *cpu)
>  if (has_msr_pv_eoi_en) {
>  msrs[n++].index = MSR_KVM_PV_EOI_EN;
>  }
> +if (has_msr_kvm_steal_time) {
> +msrs[n++].index = MSR_KVM_STEAL_TIME;
> +}
>  
>  if (env->mcg_cap) {
>  msrs[n++].index = MSR_MCG_STATUS;
> @@ -1445,6 +1455,9 @@ static int kvm_get_msrs(X86CPU *cpu)
>  case MSR_KVM_PV_EOI_EN:
>  env->pv_eoi_en_msr = msrs[i].data;
>  break;
> +case MSR_KVM_STEAL_TIME:
> +env->steal_time_msr = msrs[i].data;
> +break;
>  }
>  }
>  
> diff --git a/target-i386/machine.c b/target-i386/machine.c
> index 8df6a6b..1feb9ca 100644
> --- a/target-i386/machine.c
> +++ b/target-i386/machine.c
> @@ -287,6 +287,24 @@ static bool pv_eoi_msr_needed(void *opaque)
>  return cpu->pv_eoi_en_msr != 0;
>  }
>  
> +static bool steal_time_msr_needed(void *opaque)
> +{
> +CPUX86State *cpu = opaque;
> +
> +return cpu->steal_time_msr != 0;
> +}
> +
> +static const VMStateDescription vmstate_steal_time_msr = {
> +.name = "cpu/steal_time_msr",
> +.version_id = 1,
> +.minimum_version_id = 1,
> +.minimum_version_id_old = 1,
> +.fields  = (VMStateField []) {
> +VMSTATE_UINT64(steal_time_msr, CPUX86State),
> +VMSTATE_END_OF_LIST()
> +}
> +};
> +
>  static const VMStateDescription vmstate_async_pf_msr = {
>  .name = "cpu/async_pf_msr",
>  .version_id = 1,
> @@ -494,6 +512,9 @@ static const VMStateDescription vmstate_cpu = {
>  .vmsd = &vmstate_pv_eoi_msr,
>  .needed = pv_eoi_msr_needed,
>  } , {
> +.vmsd = &vmstate_steal_time_msr,
> +.needed = steal_time_msr_needed,
> +} , {
>  .vmsd = &vmstate_fpop_ip_dp,
>  .needed = fpop_ip_dp_needed,
>  }, {

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Nadav Har'El
Hi,

By the way, if you haven't seen my description of why the current code
did what it did, take a look at
http://www.mail-archive.com/kvm@vger.kernel.org/msg54478.html
Another description might also come in handy:
http://www.mail-archive.com/kvm@vger.kernel.org/msg54476.html

On Wed, Feb 20, 2013, Jan Kiszka wrote about "[PATCH] KVM: nVMX: Rework event 
injection and recovery":
> This aligns VMX more with SVM regarding event injection and recovery for
> nested guests. The changes allow to inject interrupts directly from L0
> to L2.
> 
> One difference to SVM is that we always transfer the pending event
> injection into the architectural state of the VCPU and then drop it from
> there if it turns out that we left L2 to enter L1.

Last time I checked, if I'm remembering correctly, the nested SVM code did
something a bit different: After the exit from L2 to L1 and unnecessarily
queuing the pending interrupt for injection, it skipped one entry into L1,
and as usual after the entry the interrupt queue is cleared so next time
around, when L1 is really entered, the wrong injection is not attempted.

> VMX and SVM are now identical in how they recover event injections from
> unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
> still contains a valid event and, if yes, transfer the content into L1's
> idt_vectoring_info_field.

> To avoid that we incorrectly leak an event into the architectural VCPU
> state that L1 wants to inject, we skip cancellation on nested run.

I didn't understand this last point.

> @@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct 
> vmcs12 *vmcs12)
>   vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
>   vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
>  
> - /* clear vm-entry fields which are to be cleared on exit */
> - if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
> + /* drop what we picked up for L0 via vmx_complete_interrupts */
> + vcpu->arch.nmi_injected = false;
> + kvm_clear_exception_queue(vcpu);
> + kvm_clear_interrupt_queue(vcpu);

It would be nice to move these lines out of prepare_vmcs12(), since they
don't really do anything with vmcs12, and move them into
nested_vmx_vmexit() (which is the one which called prepare_vmcs12()).
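
I.e., roughly (sketch only):

static void nested_vmx_vmexit(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
	...
	/* drop what we picked up for L0 via vmx_complete_interrupts */
	vcpu->arch.nmi_injected = false;
	kvm_clear_exception_queue(vcpu);
	kvm_clear_interrupt_queue(vcpu);

	prepare_vmcs12(vcpu, vmcs12);
	...
}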

Did you test this both with PIN_BASED_EXT_INTR_MASK (the usual case) and
!PIN_BASED_EXT_INTR_MASK (the case which interests you)? We need to make
sure that in the former case, this doesn't clear the interrupt queue after
we put an interrupt to be injected in it (at first glance it seems fine,
but these code paths are so convoluted, it's hard to be sure).

> + if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
> + vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
> + /*
> +  * Preserve the event that was supposed to be injected
> +  * by emulating it would have been returned in
> +  * IDT_VECTORING_INFO_FIELD.
> +  */
> + if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
> + INTR_INFO_VALID_MASK) {
> + vmcs12->idt_vectoring_info_field =
> + vmcs12->vm_entry_intr_info_field;
> + vmcs12->idt_vectoring_error_code =
> + vmcs12->vm_entry_exception_error_code;
> + vmcs12->vm_exit_instruction_len =
> + vmcs12->vm_entry_instruction_len;
> + vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);

I'm afraid I'm missing what you are trying to do here. Why would
vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) & INTR_INFO_VALID_MASK ever be
true? After all, the processor clears it after each successful exit so
last if() will only succeed on failed entries - but this is NOT the
case if we're in the enclosing if (note that vmcs12->vm_exit_reason  =
vmcs_read32(VM_EXIT_REASON)). Maybe I'm missing something?

Nadav.

-- 
Nadav Har'El| Wednesday, Feb 20 2013, 10 Adar 5773
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |Change is inevitable, except from a
http://nadav.harel.org.il   |vending machine.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-20 Thread Gleb Natapov
On Tue, Feb 19, 2013 at 03:16:37PM -0600, Scott Wood wrote:
> On 02/19/2013 06:24:18 AM, Gleb Natapov wrote:
> >On Mon, Feb 18, 2013 at 05:01:40PM -0600, Scott Wood wrote:
> >> The ability to set/get attributes is needed.  Sorry, but "get or set
> >> one blob of data, up to 512 bytes, for the entire irqchip" is just
> >> not good enough -- assuming you don't want us to start sticking
> >> pointers and commands in *that* data. :-)
> >>
> >Proposed interface sticks pointers into ioctl data, so why doing
> >the same
> >for KVM_SET_IRQCHIP/KVM_GET_IRQCHIP makes you smile.
> 
> There's a difference between putting a pointer in an ioctl control
> structure that is specifically documented as being that way (as in
> ONE_REG), versus taking an ioctl that claims to be setting/getting a
> blob of state and embedding pointers in it.  It would be like
> sticking a pointer in the attribute payload of this API, which I
> think is something to be discouraged.
If documentation is what differentiates silly from smart for you,
then write documentation instead of new interfaces.

KVM_SET_IRQCHIP/KVM_GET_IRQCHIP is defined to operate on a blob of data on
x86; nothing prevents you from adding MPIC specifics to the interface.
Add the MPIC state to the kvm_irqchip structure, and if 512 bytes is not
enough for you to transfer the state, put pointers there and _document_ them.
But with 512 bytes you can transfer the properties inline, so you probably do
not need a pointer there anyway. I see you have three properties, two of
them 32-bit and one 64-bit.
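
Something like this (sketch only; kvm_mpic_state is whatever layout you pick
for those three properties, and the chip_id value is made up):

/* include/uapi/linux/kvm.h, sketch */
struct kvm_mpic_state {
	__u64 base_addr;	/* the 64-bit property */
	__u32 reg1;		/* the two 32-bit properties, whatever they are */
	__u32 reg2;
};

struct kvm_irqchip {
	__u32 chip_id;		/* e.g. a new KVM_IRQCHIP_MPIC */
	__u32 pad;
	union {
		char dummy[512];	/* reserving space */
		struct kvm_pic_state pic;
		struct kvm_ioapic_state ioapic;
		struct kvm_mpic_state mpic;	/* new */
	} chip;
};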

>It'd also be using
> KVM_SET_IRQCHIP to read data, which is the sort of thing you object
> to later on regarding KVM_IRQ_LINE_STATUS.
> 
Do not see why.

> Then there's the silliness of transporting 512 bytes just to read a
> descriptor for transporting something else.
> 
Yes, agreed. But is this enough of a reason to introduce an entirely new
interface? Is it on a performance-critical path? I doubt it, unless you
abuse the interface to send interrupts, but then isn't it silly to
do copy_from_user() twice to inject an interrupt like the proposed interface
does?

> >For signaling irqs (I think this is what you mean by "commands")
> >we have KVM_IRQ_LINE.
> 
> It's one type of command.  Another is setting the address.  Another
> is writing to registers that have side effects (this is how MSI
> injection is done on MPIC, just as in real hardware).
> 
Setting the address is setting an attribute. Sending MSI is a command.
Things you set/get during init/migration are attributes. Things you do
to cause side-effects are commands.
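
To make the distinction concrete (userspace sketch; the group/attr names are
taken from the proposed MPIC patches, and the KVM_CREATE_DEVICE step and error
handling are elided):

	uint64_t base = 0xf8040000ULL;			/* example value */
	struct kvm_device_attr attr = {
		.group = KVM_DEV_MPIC_GRP_MISC,
		.attr  = KVM_DEV_MPIC_BASE_ADDR,	/* attribute: no side effects */
		.addr  = (uint64_t)(unsigned long)&base,
	};
	ioctl(dev_fd, KVM_SET_DEVICE_ATTR, &attr);	/* init/migration time */

	uint32_t openpic_pin = 16;			/* example pin */
	struct kvm_irq_level irq = { .irq = openpic_pin, .level = 1 };
	ioctl(vm_fd, KVM_IRQ_LINE, &irq);		/* command: has side effects */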

> What is the benefit of KVM_IRQ_LINE over what MPIC does?  What real
> (non-glue/wrapper) code can become common?
> 
No new ioctl, with exactly the same result (well, actually even faster, since
less copying is done). You need to show us the benefits of the new interface
vs. the existing one, not vice versa.

> And I really hope you don't want us to do MSIs the x86 way.
> 
What is wrong with KVM_SIGNAL_MSI? Except that you'll need code to glue it
to mpic.
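
From userspace the whole thing is (sketch, error handling omitted; msi_addr
and msi_data are whatever the guest programmed into the MSI registers):

	struct kvm_msi msi = {
		.address_lo = msi_addr & 0xffffffff,
		.address_hi = msi_addr >> 32,
		.data       = msi_data,
	};
	ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);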

> In the XICS thread, Paul brought up the possibliity of cascaded
> MPICs.  It's not relevant to the systems we're trying to model, but
> if one did want to use the in-kernel irqchip interface for that, it
> would be really nice to be able to operate on a specific MPIC for
> injection rather than have to come up with some sort of global
> identifier (above and beyond the minor flattening we'd need to do to
> represent a single MPIC's interrupts in a flat numberspace).
> 
ARM encodes the information in the irq field of KVM_IRQ_LINE like this:
  bits:  | 31 ... 24 | 23 ... 16  | 15 ... 0   |
  field: | irq_type  | vcpu_index | irq_number |
Why would a similar approach not work?
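
E.g. (userspace sketch; the MPIC type value is made up, the middle byte here
selects the MPIC instance instead of a vcpu, and kvm_irq_level is the existing
structure):

	#define KVM_PPC_IRQ_TYPE_MPIC	1	/* hypothetical */

	struct kvm_irq_level irq_level = {
		.irq   = (KVM_PPC_IRQ_TYPE_MPIC << 24) |
			 (mpic_index << 16) | (pin & 0xffff),
		.level = 1,
	};
	ioctl(vm_fd, KVM_IRQ_LINE, &irq_level);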

> >> If you mean the way to inject interrupts, it's simpler this way.
> >> Why go out of our way to inject common glue code into a
> >> communication path between hw/kvm/mpic.c in QEMU and
> >> arch/powerpc/kvm/mpic.c in KVM?  Or rather, why make that common
> >> glue be specific to this one function when we could reuse the same
> >> communication glue used for other things, such as device state?
> >You will need glue anyway and I do not see how the amount of it is much
> >different one way or the other.
> 
> It uses glue that we need to be present for other things anyway.  If
> it weren't for XICS we wouldn't need a KVM_IRQ_LINE implementation
> at all on PPC.  It may not be a major difference, but it doesn't
> affect anything but MPIC and it seems more straightforward this way.
> 
We are talking about something like 4 lines of userspace code, including
the bracket. I do not think this is a strong point in favor of a different
interface.

> >Gluing qemu_set_irq() to
> >ioctl(KVM_IRQ_LINE) or ioctl(KVM_SET_DEVICE_ATTR) is not much
> >different.
> 
> qemu_set_irq() is not glued to either of those.  It's glued to
> kvm_openpic_set_irq(), kvm_ioapic_set_irq(), etc.  It's already not
> generic code.
> 
OK, this does not invalidate my argument though.

> >O

RE: [PATCH v3 1/2] KVM: VMX: enable acknowledge interupt on vmexit

2013-02-20 Thread Zhang, Yang Z
Avi Kivity wrote on 2013-02-20:
> On Wed, Feb 20, 2013 at 4:46 AM, Zhang, Yang Z 
> wrote:
 
 +static void vmx_handle_external_intr(struct kvm_vcpu *vcpu) +{ +
 u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); + +   /* +
* If external interrupt exists, IF bit is set in rflags/eflags on
 the +* interrupt stack frame, and interrupt will be enabled on
 a return +* from interrupt handler. +*/ +   if
 ((exit_intr_info & (INTR_INFO_VALID_MASK |
> INTR_INFO_INTR_TYPE_MASK)) +
   == (INTR_INFO_VALID_MASK |
> INTR_TYPE_EXT_INTR)) {
 +   unsigned int vector; +   unsigned long
 entry; +   gate_desc *desc; +   struct
 vcpu_vmx *vmx = to_vmx(vcpu); + +   vector = 
 exit_intr_info & INTR_INFO_VECTOR_MASK; +#ifdef CONFIG_X86_64 + desc
 = (void *)vmx->host_idt_base + vector * 16; +#else +  
 desc = (void *)vmx->host_idt_base + vector * 8; +#endif + + entry =
 gate_offset(*desc); +   asm( +
  "mov %0, %%" _ASM_DX " \n\t" +#ifdef CONFIG_X86_64 +
 "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t" +
 "and $0xfff0, %%" _ASM_SP " \n\t" +
 "mov %%ss, %%" _ASM_AX " \n\t" +   "push %%"
 _ASM_AX " \n\t" +   "push %%" _ASM_BX " \n\t"
 +#endif
>>> 
>>> Are we sure no interrupts are using the IST feature?  I guess it's unlikely.
>> Linux uses IST for NMI, stack fault, machine-check, double fault and
>> debug interrupt . No external interrupt will use it. This feature is
>> only for external interrupt. So we don't need to check it here.
> 
> Ok, thanks for checking.
> 
>> 
>>> 
 +   "pushf \n\t"
 +   "pop %%" _ASM_AX " \n\t"
 +   "or $0x200, %%" _ASM_AX " \n\t"
 +   "push %%" _ASM_AX " \n\t"
>>> 
>>> Can simplify to pushf; orl $0x200, %%rsp.
>> Sure.
>> 
 +   "mov %%cs, %%" _ASM_AX " \n\t"
 +   "push %%" _ASM_AX " \n\t"
>>> 
>>> push %%cs
>> "push %%cs" is invalid in x86_64.
> 
> Oops. 'push[lq] $__KERNEL_CS' then.
Is this right? I just copied it from another file.

#define __STR(X) #X
#define STR(X) __STR(X)

#ifdef CONFIG_X86_64
"pushq $"STR(__KERNEL_CS)" \n\t"
#else
"pushl $"STR(__KERNEL_CS)" \n\t"
#endif

#undef STR
#undef __STR
 
>> 
 +   "push intr_return \n\t"
>>> 
>>> push $1f.  Or even combine with the next instruction, and call %rdx.
>> Which is faster? jmp or call?
> 
> Actually it doesn't matter, the processor is clever enough to minimize
> the difference.  But the code is simpler and shorter with 'call'. -- To
Yes, 'call' is better.

Best regards,
Yang


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: VMX: Pass vcpu to __vmx_complete_interrupts

2013-02-20 Thread Jan Kiszka
Cleanup: __vmx_complete_interrupts has no use for the vmx structure.

Signed-off-by: Jan Kiszka 
---

Note: this applies on top of "Rework event injection and recovery"

 arch/x86/kvm/vmx.c |   31 ++-
 1 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d99a519..d6ea4a7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6430,7 +6430,7 @@ static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
ktime_to_ns(ktime_sub(ktime_get(), vmx->entry_time));
 }
 
-static void __vmx_complete_interrupts(struct vcpu_vmx *vmx,
+static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
  u32 idt_vectoring_info,
  int instr_len_field,
  int error_code_field)
@@ -6441,46 +6441,43 @@ static void __vmx_complete_interrupts(struct vcpu_vmx 
*vmx,
 
idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
 
-   vmx->vcpu.arch.nmi_injected = false;
-   kvm_clear_exception_queue(&vmx->vcpu);
-   kvm_clear_interrupt_queue(&vmx->vcpu);
+   vcpu->arch.nmi_injected = false;
+   kvm_clear_exception_queue(vcpu);
+   kvm_clear_interrupt_queue(vcpu);
 
if (!idtv_info_valid)
return;
 
-   kvm_make_request(KVM_REQ_EVENT, &vmx->vcpu);
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
 
vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
 
switch (type) {
case INTR_TYPE_NMI_INTR:
-   vmx->vcpu.arch.nmi_injected = true;
+   vcpu->arch.nmi_injected = true;
/*
 * SDM 3: 27.7.1.2 (September 2008)
 * Clear bit "block by NMI" before VM entry if a NMI
 * delivery faulted.
 */
-   vmx_set_nmi_mask(&vmx->vcpu, false);
+   vmx_set_nmi_mask(vcpu, false);
break;
case INTR_TYPE_SOFT_EXCEPTION:
-   vmx->vcpu.arch.event_exit_inst_len =
-   vmcs_read32(instr_len_field);
+   vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
/* fall through */
case INTR_TYPE_HARD_EXCEPTION:
if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK) {
u32 err = vmcs_read32(error_code_field);
-   kvm_queue_exception_e(&vmx->vcpu, vector, err);
+   kvm_queue_exception_e(vcpu, vector, err);
} else
-   kvm_queue_exception(&vmx->vcpu, vector);
+   kvm_queue_exception(vcpu, vector);
break;
case INTR_TYPE_SOFT_INTR:
-   vmx->vcpu.arch.event_exit_inst_len =
-   vmcs_read32(instr_len_field);
+   vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
/* fall through */
case INTR_TYPE_EXT_INTR:
-   kvm_queue_interrupt(&vmx->vcpu, vector,
-   type == INTR_TYPE_SOFT_INTR);
+   kvm_queue_interrupt(vcpu, vector, type == INTR_TYPE_SOFT_INTR);
break;
default:
break;
@@ -6489,7 +6486,7 @@ static void __vmx_complete_interrupts(struct vcpu_vmx 
*vmx,
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
-   __vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
+   __vmx_complete_interrupts(&vmx->vcpu, vmx->idt_vectoring_info,
  VM_EXIT_INSTRUCTION_LEN,
  IDT_VECTORING_ERROR_CODE);
 }
@@ -6498,7 +6495,7 @@ static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
if (to_vmx(vcpu)->nested.nested_run_pending)
return;
-   __vmx_complete_interrupts(to_vmx(vcpu),
+   __vmx_complete_interrupts(vcpu,
  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
  VM_ENTRY_INSTRUCTION_LEN,
  VM_ENTRY_EXCEPTION_ERROR_CODE);
-- 
1.7.3.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: nVMX: Avoid one redundant vmcs_read in prepare_vmcs12

2013-02-20 Thread Jan Kiszka
IDT_VECTORING_INFO_FIELD was already read right after vmexit.

Signed-off-by: Jan Kiszka 
---
 arch/x86/kvm/vmx.c |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7d2fbd2..d99a519 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7368,8 +7368,7 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 
*vmcs12)
 
vmcs12->vm_exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
vmcs12->vm_exit_intr_error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
-   vmcs12->idt_vectoring_info_field =
-   vmcs_read32(IDT_VECTORING_INFO_FIELD);
+   vmcs12->idt_vectoring_info_field = to_vmx(vcpu)->idt_vectoring_info;
vmcs12->idt_vectoring_error_code =
vmcs_read32(IDT_VECTORING_ERROR_CODE);
vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
-- 
1.7.3.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: nVMX: Rework event injection and recovery

2013-02-20 Thread Jan Kiszka
This aligns VMX more with SVM regarding event injection and recovery for
nested guests. The changes allow to inject interrupts directly from L0
to L2.

One difference to SVM is that we always transfer the pending event
injection into the architectural state of the VCPU and then drop it from
there if it turns out that we left L2 to enter L1.

VMX and SVM are now identical in how they recover event injections from
unperformed vmlaunch/vmresume: We detect that VM_ENTRY_INTR_INFO_FIELD
still contains a valid event and, if yes, transfer the content into L1's
idt_vectoring_info_field.

To avoid that we incorrectly leak an event into the architectural VCPU
state that L1 wants to inject, we skip cancellation on nested run.

Signed-off-by: Jan Kiszka 
---

Survived moderate testing here and (currently) makes sense to me, but
please review very carefully. I wouldn't be surprised if I'm still
missing some subtle corner case.

 arch/x86/kvm/vmx.c |   57 +++
 1 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index dd3a8a0..7d2fbd2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6489,8 +6489,6 @@ static void __vmx_complete_interrupts(struct vcpu_vmx 
*vmx,
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
-   if (is_guest_mode(&vmx->vcpu))
-   return;
__vmx_complete_interrupts(vmx, vmx->idt_vectoring_info,
  VM_EXIT_INSTRUCTION_LEN,
  IDT_VECTORING_ERROR_CODE);
@@ -6498,7 +6496,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
-   if (is_guest_mode(vcpu))
+   if (to_vmx(vcpu)->nested.nested_run_pending)
return;
__vmx_complete_interrupts(to_vmx(vcpu),
  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -6531,21 +6529,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long debugctlmsr;
 
-   if (is_guest_mode(vcpu) && !vmx->nested.nested_run_pending) {
-   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-   if (vmcs12->idt_vectoring_info_field &
-   VECTORING_INFO_VALID_MASK) {
-   vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
-   vmcs12->idt_vectoring_info_field);
-   vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
-   vmcs12->vm_exit_instruction_len);
-   if (vmcs12->idt_vectoring_info_field &
-   VECTORING_INFO_DELIVER_CODE_MASK)
-   vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
-   vmcs12->idt_vectoring_error_code);
-   }
-   }
-
/* Record the guest's net vcpu time for enforced NMI injections. */
if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))
vmx->entry_time = ktime_get();
@@ -6704,17 +6687,6 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 
vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);
 
-   if (is_guest_mode(vcpu)) {
-   struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-   vmcs12->idt_vectoring_info_field = vmx->idt_vectoring_info;
-   if (vmx->idt_vectoring_info & VECTORING_INFO_VALID_MASK) {
-   vmcs12->idt_vectoring_error_code =
-   vmcs_read32(IDT_VECTORING_ERROR_CODE);
-   vmcs12->vm_exit_instruction_len =
-   vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
-   }
-   }
-
vmx->loaded_vmcs->launched = 1;
 
vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);
@@ -7403,9 +7375,32 @@ void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 
*vmcs12)
vmcs12->vm_exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
vmcs12->vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
 
-   /* clear vm-entry fields which are to be cleared on exit */
-   if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY))
+   /* drop what we picked up for L0 via vmx_complete_interrupts */
+   vcpu->arch.nmi_injected = false;
+   kvm_clear_exception_queue(vcpu);
+   kvm_clear_interrupt_queue(vcpu);
+
+   if (!(vmcs12->vm_exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) &&
+   vmcs12->vm_entry_intr_info_field & INTR_INFO_VALID_MASK) {
+   /*
+* Preserve the event that was supposed to be injected
+* by emulating it would have been returned in
+* IDT_VECTORING_INFO_FIELD.
+*/
+   if (vmcs_read32(VM_ENTRY_INTR_INFO_FIELD) &
+   INTR_INFO_VALID_MASK) {
+ 

Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interupt on vmexit

2013-02-20 Thread Avi Kivity
On Wed, Feb 20, 2013 at 4:46 AM, Zhang, Yang Z  wrote:
>>>
>>> +static void vmx_handle_external_intr(struct kvm_vcpu *vcpu) +{ +
>>> u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); + +   /* +
>>>* If external interrupt exists, IF bit is set in rflags/eflags on
>>> the +* interrupt stack frame, and interrupt will be enabled on
>>> a return +* from interrupt handler. +*/ +   if
>>> ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK)) +
>>>   == (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) {
>>> +   unsigned int vector; +   unsigned long
>>> entry; +   gate_desc *desc; +   struct vcpu_vmx
>>> *vmx = to_vmx(vcpu); + +   vector =  exit_intr_info &
>>> INTR_INFO_VECTOR_MASK; +#ifdef CONFIG_X86_64 +   desc =
>>> (void *)vmx->host_idt_base + vector * 16; +#else +   desc =
>>> (void *)vmx->host_idt_base + vector * 8; +#endif + +
>>> entry = gate_offset(*desc); +   asm( +
>>>  "mov %0, %%" _ASM_DX " \n\t" +#ifdef CONFIG_X86_64 +
>>> "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t" +
>>> "and $0xfff0, %%" _ASM_SP " \n\t" +
>>> "mov %%ss, %%" _ASM_AX " \n\t" +   "push %%"
>>> _ASM_AX " \n\t" +   "push %%" _ASM_BX " \n\t"
>>> +#endif
>>
>> Are we sure no interrupts are using the IST feature?  I guess it's unlikely.
> Linux uses IST for NMI, stack fault, machine-check, double fault and debug 
> interrupt . No external interrupt will use it.
> This feature is only for external interrupt. So we don't need to check it 
> here.

Ok, thanks for checking.

>
>>
>>> +   "pushf \n\t"
>>> +   "pop %%" _ASM_AX " \n\t"
>>> +   "or $0x200, %%" _ASM_AX " \n\t"
>>> +   "push %%" _ASM_AX " \n\t"
>>
>> Can simplify to pushf; orl $0x200, %%rsp.
> Sure.
>
>>> +   "mov %%cs, %%" _ASM_AX " \n\t"
>>> +   "push %%" _ASM_AX " \n\t"
>>
>> push %%cs
> "push %%cs" is invalid in x86_64.

Oops. 'push[lq] $__KERNEL_CS' then.

>
>>> +   "push intr_return \n\t"
>>
>> push $1f.  Or even combine with the next instruction, and call %rdx.
> Which is faster? jmp or call?

Actually it doesn't matter, the processor is clever enough to minimize
the difference.  But the code is simpler and shorter with 'call'.
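
I.e. (sketch of the two variants inside the asm):

	/* push+jmp variant */
	"push $1f \n\t"
	"jmp *%%" _ASM_DX " \n\t"
	"1: \n\t"

	/* equivalent and shorter - the call pushes the return address itself */
	"call *%%" _ASM_DX " \n\t"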
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v3 1/2] KVM: VMX: enable acknowledge interupt on vmexit

2013-02-20 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-02-20:
> On Wed, Feb 20, 2013 at 02:46:05AM +, Zhang, Yang Z wrote:
>> Avi Kivity wrote on 2013-02-20:
>>> On Tue, Feb 19, 2013 at 3:39 PM, Yang Zhang 
> wrote:
 From: Yang Zhang 
 
 The "acknowledge interrupt on exit" feature controls processor behavior
 for external interrupt acknowledgement. When this control is set, the
 processor acknowledges the interrupt controller to acquire the
 interrupt vector on VM exit.
 
 After enabling this feature, an interrupt which arrives when the target cpu
 is running in vmx non-root mode will be handled by the vmx handler instead
 of the handler in the idt. Currently, the vmx handler only fakes an interrupt
 stack and jumps to the idt table to let the real handler handle it. Further,
 we will recognize the interrupt and only deliver, through the idt table, the
 interrupts which do not belong to the current vcpu. Interrupts which belong
 to the current vcpu will be handled inside the vmx handler. This will reduce
 the interrupt handling cost of KVM.
 
 Also, the interrupt enable logic is changed if this feature is turned
 on: before this patch, the hypervisor called local_irq_enable() to enable
 interrupts directly. Now the IF bit is set on the interrupt stack frame, and
 interrupts will be enabled on a return from the interrupt handler if an
 external interrupt exists. If there is no external interrupt, we still call
 local_irq_enable() to enable them.
 
 Refer to Intel SDM volume 3, chapter 33.2.
 
 
 +static void vmx_handle_external_intr(struct kvm_vcpu *vcpu) +{ +
 u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); + +   /* +
* If external interrupt exists, IF bit is set in rflags/eflags on
 the +* interrupt stack frame, and interrupt will be enabled on
 a return +* from interrupt handler. +*/ +   if
 ((exit_intr_info & (INTR_INFO_VALID_MASK |
> INTR_INFO_INTR_TYPE_MASK)) +
   == (INTR_INFO_VALID_MASK |
> INTR_TYPE_EXT_INTR)) {
 +   unsigned int vector; +   unsigned long
 entry; +   gate_desc *desc; +   struct
 vcpu_vmx *vmx = to_vmx(vcpu); + +   vector = 
 exit_intr_info & INTR_INFO_VECTOR_MASK; +#ifdef CONFIG_X86_64 + desc
 = (void *)vmx->host_idt_base + vector * 16; +#else +  
 desc = (void *)vmx->host_idt_base + vector * 8; +#endif + + entry =
 gate_offset(*desc); +   asm( +
  "mov %0, %%" _ASM_DX " \n\t" +#ifdef CONFIG_X86_64 +
 "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t" +
 "and $0xfff0, %%" _ASM_SP " \n\t" +
 "mov %%ss, %%" _ASM_AX " \n\t" +   "push %%"
 _ASM_AX " \n\t" +   "push %%" _ASM_BX " \n\t"
 +#endif
>>> 
>>> Are we sure no interrupts are using the IST feature?  I guess it's unlikely.
>> Linux uses IST for NMI, stack fault, machine-check, double fault and
>> debug interrupt . No external interrupt will use it. This feature is
>> only for external interrupt. So we don't need to check it here.
>> 
>>> 
 +   "pushf \n\t"
 +   "pop %%" _ASM_AX " \n\t"
 +   "or $0x200, %%" _ASM_AX " \n\t"
 +   "push %%" _ASM_AX " \n\t"
>>> 
>>> Can simplify to pushf; orl $0x200, %%rsp.
>> Sure.
>> 
 +   "mov %%cs, %%" _ASM_AX " \n\t"
 +   "push %%" _ASM_AX " \n\t"
>>> 
>>> push %%cs
>> "push %%cs" is invalid in x86_64.
>> 
 +   "push intr_return \n\t"
>>> 
>>> push $1f.  Or even combine with the next instruction, and call %rdx.
>> Which is faster? jmp or call?
>> 
> Wrong question. You need to compare push+jmp with call. I do not see why
Sorry, I didn't express it clearly.  Yes, I want to compare push+jmp with call.

> the latter would be slower.
I think so. If push+jmp is not faster than call, I will use the latter.

 +   "jmp *%% " _ASM_DX " \n\t"
 +   "1: \n\t"
 +   ".pushsection .rodata \n\t"
 +   ".global intr_return \n\t"
 +   "intr_return: " _ASM_PTR " 1b \n\t"
 +   ".popsection \n\t"
 +   : : "m"(entry) :
 +#ifdef CONFIG_X86_64
 +   "rax", "rbx", "rdx"
 +#else
 +   "eax", "edx"
 +#endif
 +   );
 +   } else
 +   local_irq_enable();
 +}
 +
>> 
>> 
>> Best regards,
>> Yang
>> 
> 
> --
>   Gleb.


Best regards,
Yang

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/2] KVM: VMX: enable acknowledge interupt on vmexit

2013-02-20 Thread Gleb Natapov
On Wed, Feb 20, 2013 at 02:46:05AM +, Zhang, Yang Z wrote:
> Avi Kivity wrote on 2013-02-20:
> > On Tue, Feb 19, 2013 at 3:39 PM, Yang Zhang  wrote:
> >> From: Yang Zhang 
> >> 
> >> The "acknowledge interrupt on exit" feature controls processor behavior
> >> for external interrupt acknowledgement. When this control is set, the
> >> processor acknowledges the interrupt controller to acquire the
> >> interrupt vector on VM exit.
> >> 
> >> After enabling this feature, an interrupt which arrives when the target cpu
> >> is running in vmx non-root mode will be handled by the vmx handler instead
> >> of the handler in the idt. Currently, the vmx handler only fakes an interrupt
> >> stack and jumps to the idt table to let the real handler handle it. Further,
> >> we will recognize the interrupt and only deliver, through the idt table, the
> >> interrupts which do not belong to the current vcpu. Interrupts which belong
> >> to the current vcpu will be handled inside the vmx handler. This will reduce
> >> the interrupt handling cost of KVM.
> >> 
> >> Also, the interrupt enable logic is changed if this feature is turned on:
> >> before this patch, the hypervisor called local_irq_enable() to enable
> >> interrupts directly. Now the IF bit is set on the interrupt stack frame, and
> >> interrupts will be enabled on a return from the interrupt handler if an
> >> external interrupt exists. If there is no external interrupt, we still
> >> call local_irq_enable() to enable them.
> >> 
> >> Refer to Intel SDM volume 3, chapter 33.2.
> >> 
> >> 
> >> +static void vmx_handle_external_intr(struct kvm_vcpu *vcpu) +{ +  
> >> u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO); + +   /* +
> >>* If external interrupt exists, IF bit is set in rflags/eflags on
> >> the +* interrupt stack frame, and interrupt will be enabled on
> >> a return +* from interrupt handler. +*/ +   if
> >> ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK)) +
> >>   == (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR)) {
> >> +   unsigned int vector; +   unsigned long
> >> entry; +   gate_desc *desc; +   struct vcpu_vmx
> >> *vmx = to_vmx(vcpu); + +   vector =  exit_intr_info &
> >> INTR_INFO_VECTOR_MASK; +#ifdef CONFIG_X86_64 +   desc =
> >> (void *)vmx->host_idt_base + vector * 16; +#else +   desc =
> >> (void *)vmx->host_idt_base + vector * 8; +#endif + +  
> >> entry = gate_offset(*desc); +   asm( + 
> >>  "mov %0, %%" _ASM_DX " \n\t" +#ifdef CONFIG_X86_64 +  
> >> "mov %%" _ASM_SP ", %%" _ASM_BX " \n\t" +  
> >> "and $0xfff0, %%" _ASM_SP " \n\t" +  
> >> "mov %%ss, %%" _ASM_AX " \n\t" +   "push %%"
> >> _ASM_AX " \n\t" +   "push %%" _ASM_BX " \n\t"
> >> +#endif
> > 
> > Are we sure no interrupts are using the IST feature?  I guess it's unlikely.
> Linux uses IST for NMI, stack fault, machine-check, double fault and debug 
> interrupt . No external interrupt will use it.
> This feature is only for external interrupt. So we don't need to check it 
> here.
> 
> > 
> >> +   "pushf \n\t"
> >> +   "pop %%" _ASM_AX " \n\t"
> >> +   "or $0x200, %%" _ASM_AX " \n\t"
> >> +   "push %%" _ASM_AX " \n\t"
> > 
> > Can simplify to pushf; orl $0x200, %%rsp.
> Sure.
> 
> >> +   "mov %%cs, %%" _ASM_AX " \n\t"
> >> +   "push %%" _ASM_AX " \n\t"
> > 
> > push %%cs
> "push %%cs" is invalid in x86_64.
> 
> >> +   "push intr_return \n\t"
> > 
> > push $1f.  Or even combine with the next instruction, and call %rdx.
> Which is faster? jmp or call?
> 
Wrong question. You need to compare push+jmp with call. I do not see why
the latter would be slower.

> >> +   "jmp *%% " _ASM_DX " \n\t"
> >> +   "1: \n\t"
> >> +   ".pushsection .rodata \n\t"
> >> +   ".global intr_return \n\t"
> >> +   "intr_return: " _ASM_PTR " 1b \n\t"
> >> +   ".popsection \n\t"
> >> +   : : "m"(entry) :
> >> +#ifdef CONFIG_X86_64
> >> +   "rax", "rbx", "rdx"
> >> +#else
> >> +   "eax", "edx"
> >> +#endif
> >> +   );
> >> +   } else
> >> +   local_irq_enable();
> >> +}
> >> +
> 
> 
> Best regards,
> Yang
> 

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH V4 RESEND 15/22] tap: multiqueue support

2013-02-20 Thread Jason Wang
On 02/11/2013 06:28 PM, Markus Armbruster wrote:
> Commit 264986e2 extended NetdevTapOptions without updating the
> documentation.  Hasn't been addressed since.  Must fix for 1.4, in my
> opinion.

Will send a patch to fix this.

Thanks
>
> This is the offending patch:
>
> Jason Wang  writes:
>
>> Recently, Linux supports multiqueue tap, which lets userspace call TUNSETIFF
>> for a single device many times to create multiple file descriptors as
>> independent queues. Userspace can also enable/disable a specific queue through
>> TUNSETQUEUE.
>>
>> The patch adds the generic infrastructure to create multiqueue taps. To
>> achieve this, a new parameter "queues" was introduced to specify how many
>> queues are expected to be created for the tap by qemu itself. Alternatively,
>> management could also pass multiple pre-created tap file descriptors separated
>> with ':' through a new parameter fds, like -netdev tap,id=hn0,fds="X:Y:..:Z".
>> Multiple vhost file descriptors can also be passed in this way.
>>
>> Each TAPState is still associated with a tap fd, which means multiple
>> TAPStates are created when the user needs multiqueue taps. Since each TAPState
>> contains one NetClientState, with the multiqueue nic support, N peers of
>> NetClientState are built up.
>>
>> A new parameter, mq_required, was introduced in tap_open() to create
>> multiqueue tap fds.
> [...]
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 3a4817b..cdd8384 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -2533,6 +2533,7 @@
>>'data': {
>>  '*ifname': 'str',
>>  '*fd': 'str',
>> +'*fds':'str',
>>  '*script': 'str',
>>  '*downscript': 'str',
>>  '*helper': 'str',
>> @@ -2540,7 +2541,9 @@
>>  '*vnet_hdr':   'bool',
>>  '*vhost':  'bool',
>>  '*vhostfd':'str',
>> -'*vhostforce': 'bool' } }
>> +'*vhostfds':   'str',
>> +'*vhostforce': 'bool',
>> +'*queues': 'uint32'} }
>>  
>>  ##
>>  # @NetdevSocketOptions

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v12 rebased 2/8] start vm after resetting it

2013-02-20 Thread Hu Tao
On Thu, Feb 07, 2013 at 11:50:28PM -0200, Marcelo Tosatti wrote:
> On Wed, Jan 23, 2013 at 03:19:23PM +0800, Hu Tao wrote:
> > From: Wen Congyang 
> > 
> > The guest should run after resetting it, but it does not run if its
> > old state is RUN_STATE_INTERNAL_ERROR or RUN_STATE_PAUSED.
> > 
> > We don't set runstate to RUN_STATE_PAUSED when resetting the guest,
> > so the runstate will be changed from RUN_STATE_INTERNAL_ERROR or
> > RUN_STATE_PAUSED to RUN_STATE_RUNNING(not RUN_STATE_PAUSED).
> 
> It appears the last hunk will automatically reset state from 
> RUN_STATE_INTERNAL_ERROR to RUN_STATE_RUNNING ?

Yes.

> 
> I suppose the transition table allows, from RUN_STATE_INTERNAL_ERROR:
> 
>  system_reset
>  cont
> 
> To resume the machine?

True.

I think the purpose of this patch is to always reset and _run_ the guest
by `system_reset', avoiding an additional `cont' following `system_reset'.

> 
> > 
> > Signed-off-by: Wen Congyang 
> > ---
> >  include/block/block.h | 2 ++
> >  qmp.c | 2 +-
> >  vl.c  | 7 ---
> >  3 files changed, 7 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/block/block.h b/include/block/block.h
> > index ffd1936..5e82ccb 100644
> > --- a/include/block/block.h
> > +++ b/include/block/block.h
> > @@ -366,6 +366,8 @@ void bdrv_disable_copy_on_read(BlockDriverState *bs);
> >  void bdrv_set_in_use(BlockDriverState *bs, int in_use);
> >  int bdrv_in_use(BlockDriverState *bs);
> >  
> > +void iostatus_bdrv_it(void *opaque, BlockDriverState *bs);
> > +
> >  #ifdef CONFIG_LINUX_AIO
> >  int raw_get_aio_fd(BlockDriverState *bs);
> >  #else
> > diff --git a/qmp.c b/qmp.c
> > index 55b056b..5f1bed1 100644
> > --- a/qmp.c
> > +++ b/qmp.c
> > @@ -130,7 +130,7 @@ SpiceInfo *qmp_query_spice(Error **errp)
> >  };
> >  #endif
> >  
> > -static void iostatus_bdrv_it(void *opaque, BlockDriverState *bs)
> > +void iostatus_bdrv_it(void *opaque, BlockDriverState *bs)
> >  {
> >  bdrv_iostatus_reset(bs);
> >  }
> > diff --git a/vl.c b/vl.c
> > index b0bcf1e..1d2edaa 100644
> > --- a/vl.c
> > +++ b/vl.c
> > @@ -534,7 +534,7 @@ static const RunStateTransition 
> > runstate_transitions_def[] = {
> >  { RUN_STATE_INMIGRATE, RUN_STATE_RUNNING },
> >  { RUN_STATE_INMIGRATE, RUN_STATE_PAUSED },
> >  
> > -{ RUN_STATE_INTERNAL_ERROR, RUN_STATE_PAUSED },
> > +{ RUN_STATE_INTERNAL_ERROR, RUN_STATE_RUNNING },
> >  { RUN_STATE_INTERNAL_ERROR, RUN_STATE_FINISH_MIGRATE },
> >  
> >  { RUN_STATE_IO_ERROR, RUN_STATE_RUNNING },
> > @@ -569,7 +569,7 @@ static const RunStateTransition 
> > runstate_transitions_def[] = {
> >  
> >  { RUN_STATE_SAVE_VM, RUN_STATE_RUNNING },
> >  
> > -{ RUN_STATE_SHUTDOWN, RUN_STATE_PAUSED },
> > +{ RUN_STATE_SHUTDOWN, RUN_STATE_RUNNING },
> >  { RUN_STATE_SHUTDOWN, RUN_STATE_FINISH_MIGRATE },
> >  
> >  { RUN_STATE_DEBUG, RUN_STATE_SUSPENDED },
> > @@ -1951,7 +1951,8 @@ static bool main_loop_should_exit(void)
> >  resume_all_vcpus();
> >  if (runstate_check(RUN_STATE_INTERNAL_ERROR) ||
> >  runstate_check(RUN_STATE_SHUTDOWN)) {
> > -runstate_set(RUN_STATE_PAUSED);
> > +bdrv_iterate(iostatus_bdrv_it, NULL);
> > +vm_start();
> >  }
> >  }
> >  if (qemu_wakeup_requested()) {
> > -- 
> > 1.8.0.1.240.ge8a1f5a
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Regards,
Hu Tao
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 54061] guest panic after live migration

2013-02-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=54061





--- Comment #2 from Jay Ren   2013-02-20 08:06:16 ---
Marcelo,

Yes, after reverting that commit ("caf6900f2d8"), live migration works fine.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html