[RFC PATCH v5] add support for Hyper-V reference time counter

2014-01-16 Thread Vadim Rozenfeld
Signed-off-by: Peter Lieven p...@kamp.de
Signed-off-by: Gleb Natapov
Signed-off-by: Vadim Rozenfeld vroze...@redhat.com
 
After some consideration I decided to submit only Hyper-V reference
counters support this time. I will submit iTSC support as a separate
patch as soon as it is ready. 

v1 -> v2
1. mark TSC page dirty as suggested by 
Eric Northup digitale...@google.com and Gleb
2. disable local irq when calling get_kernel_ns, 
as it was done by Peter Lieven p...@amp.de
3. move check for TSC page enable from second patch
to this one.

v3 -> v4
Get rid of ref counter offset.

v4 -> v5
replace __copy_to_user with kvm_write_guest
when updating the iTSC page.

---
 arch/x86/include/asm/kvm_host.h|  1 +
 arch/x86/include/uapi/asm/hyperv.h | 13 +
 arch/x86/kvm/x86.c | 28 +++-
 include/uapi/linux/kvm.h   |  1 +
 4 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ae5d783..33fef07 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -605,6 +605,7 @@ struct kvm_arch {
/* fields used by HYPER-V emulation */
u64 hv_guest_os_id;
u64 hv_hypercall;
+   u64 hv_tsc_page;
 
#ifdef CONFIG_KVM_MMU_AUDIT
int audit_point;
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index b8f1c01..462efe7 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -28,6 +28,9 @@
 /* Partition Reference Counter (HV_X64_MSR_TIME_REF_COUNT) available */
 #define HV_X64_MSR_TIME_REF_COUNT_AVAILABLE	(1 << 1)
 
+/* A partition's reference time stamp counter (TSC) page */
+#define HV_X64_MSR_REFERENCE_TSC		0x40000021
+
 /*
  * There is a single feature flag that signifies the presence of the MSR
  * that can be used to retrieve both the local APIC Timer frequency as
@@ -198,6 +201,9 @@
 #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK	\
	(~((1ull << HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1))
 
+#define HV_X64_MSR_TSC_REFERENCE_ENABLE		0x00000001
+#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT 12
+
 #define HV_PROCESSOR_POWER_STATE_C00
 #define HV_PROCESSOR_POWER_STATE_C11
 #define HV_PROCESSOR_POWER_STATE_C22
@@ -210,4 +216,11 @@
 #define HV_STATUS_INVALID_ALIGNMENT4
 #define HV_STATUS_INSUFFICIENT_BUFFERS 19
 
+typedef struct _HV_REFERENCE_TSC_PAGE {
+   __u32 tsc_sequence;
+   __u32 res1;
+   __u64 tsc_scale;
+   __s64 tsc_offset;
+} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
+
 #endif
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5d004da..8e685b8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -836,11 +836,12 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
  * kvm-specific. Those are put in the beginning of the list.
  */
 
-#define KVM_SAVE_MSRS_BEGIN	10
+#define KVM_SAVE_MSRS_BEGIN	12
 static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
+   HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_REFERENCE_TSC,
HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_KVM_PV_EOI_EN,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
@@ -1826,6 +1827,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
switch (msr) {
case HV_X64_MSR_GUEST_OS_ID:
case HV_X64_MSR_HYPERCALL:
+   case HV_X64_MSR_REFERENCE_TSC:
+   case HV_X64_MSR_TIME_REF_COUNT:
r = true;
break;
}
@@ -1867,6 +1870,20 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
		kvm->arch.hv_hypercall = data;
break;
}
+	case HV_X64_MSR_REFERENCE_TSC: {
+		u64 gfn;
+		HV_REFERENCE_TSC_PAGE tsc_ref;
+		memset(&tsc_ref, 0, sizeof(tsc_ref));
+		kvm->arch.hv_tsc_page = data;
+		if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE))
+			break;
+		gfn = data >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
+		if (kvm_write_guest(kvm, data,
+		    &tsc_ref, sizeof(tsc_ref)))
+			return 1;
+		mark_page_dirty(kvm, gfn);
+		break;
+	}
default:
		vcpu_unimpl(vcpu, "HYPER-V unimplemented wrmsr: 0x%x "
			    "data 0x%llx\n", msr, data);
@@ -2291,6 +2308,14 @@ static int get_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case HV_X64_MSR_HYPERCALL:
		data = kvm->arch.hv_hypercall;
break;
+	case HV_X64_MSR_TIME_REF_COUNT: {
+		data =
+		     div_u64(get_kernel_ns() + kvm->arch.kvmclock_offset, 100);
+		break;
+	}

[PATCH] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Christian Borntraeger
When starting lots of dataplane devices the bootup takes very long on my
s390 system (prototype irqfd code). With larger setups we are even able to
trigger some timeouts in some components.
Turns out that the KVM_SET_GSI_ROUTING ioctl takes very
long (strace claims up to 0.1 sec) when having multiple CPUs.
This is caused by the synchronize_rcu and the HZ=100 of s390.
We can defer the freeing outside of the ioctl path by using kfree_rcu.

Please note that we now have to check for a NULL pointer, since
the underlying call_rcu in kfree_rcu cannot handle broken rcu
callbacks.

This patch reduces the boot time till mounting root from 8 to 2 seconds
on my s390 guest with 100 disks.

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/irqchip.c   | 5 ++---
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3d1b0e6..2f19079 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -327,6 +327,7 @@ struct kvm_kernel_irq_routing_entry {
 #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
 
 struct kvm_irq_routing_table {
+   struct rcu_head rcu;
int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
struct kvm_kernel_irq_routing_entry *rt_entries;
u32 nr_rt_entries;
diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 20dc9e4..3e2ebed 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -226,12 +226,11 @@ int kvm_set_irq_routing(struct kvm *kvm,
kvm_irq_routing_update(kvm, new);
	mutex_unlock(&kvm->irq_lock);
 
-   synchronize_rcu();
-
new = old;
r = 0;
 
 out:
-   kfree(new);
+   if (new)
+   kfree_rcu(new, rcu);
return r;
 }
-- 
1.8.4.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Paolo Bonzini
On 16/01/2014 10:23, Christian Borntraeger wrote:
 When starting lots of dataplane devices the bootup takes very long on my
 s390 system(prototype irqfd code). With larger setups we are even able to
 trigger some timeouts in some components.
 Turns out that the KVM_SET_GSI_ROUTING ioctl takes very
 long (strace claims up to 0.1 sec) when having multiple CPUs.
 This is caused by the  synchronize_rcu and the HZ=100 of s390.
 We can defer the freeing outside of the ioctl path by using kfree_rcu.
 
 Please note that we now have to check for a NULL pointer, since
 the underlying call_rcu in kfree_rcu cannot handle broken rcu
 callbacks.
 
 This patch reduces the boot time till mounting root from 8 to 2 seconds
 on my s390 guest with 100 disks.
 
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com

Hi Christian,

please take a look at the thread at
http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/116933 - uses
of call_rcu need to be rate limited, otherwise the guest can allocate
arbitrarily large amounts of memory.

Can you check if using a dedicated SRCU (possibly with
synchronize_srcu_expedited in kvm_set_irq_routing) speeds up
KVM_SET_GSI_ROUTING enough?  Back-of-the-envelope, the latency should be
on the order of tens of microseconds.

Thanks,

Paolo

 ---
  include/linux/kvm_host.h | 1 +
  virt/kvm/irqchip.c   | 5 ++---
  2 files changed, 3 insertions(+), 3 deletions(-)
 
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 3d1b0e6..2f19079 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -327,6 +327,7 @@ struct kvm_kernel_irq_routing_entry {
  #ifdef CONFIG_HAVE_KVM_IRQ_ROUTING
  
  struct kvm_irq_routing_table {
 + struct rcu_head rcu;
   int chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS];
   struct kvm_kernel_irq_routing_entry *rt_entries;
   u32 nr_rt_entries;
 diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
 index 20dc9e4..3e2ebed 100644
 --- a/virt/kvm/irqchip.c
 +++ b/virt/kvm/irqchip.c
 @@ -226,12 +226,11 @@ int kvm_set_irq_routing(struct kvm *kvm,
   kvm_irq_routing_update(kvm, new);
  	mutex_unlock(&kvm->irq_lock);
  
 - synchronize_rcu();
 -
   new = old;
   r = 0;
  
  out:
 - kfree(new);
 + if (new)
 + kfree_rcu(new, rcu);
   return r;
  }
 



[GIT PULL] KVM changes for 3.13 final

2014-01-16 Thread Paolo Bonzini
Linus,

The following changes since commit 7e22e91102c6b9df7c4ae2168910e19d2bb14cd6:

  Linux 3.13-rc8 (2014-01-12 17:04:18 +0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to 0dce7cd67fd9055c4a2ff278f8af1431e646d346:

  kvm: x86: fix apic_base enable check (2014-01-15 13:42:14 +0100)


Fix for a brown paper bag bug.  Thanks to Drew Jones for noticing.


Andrew Jones (1):
  kvm: x86: fix apic_base enable check

 arch/x86/kvm/lapic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


[PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Christian Borntraeger
When starting lots of dataplane devices the bootup takes very long on my
s390 system (prototype irqfd code). With larger setups we are even able to
trigger some timeouts in some components.
Turns out that the KVM_SET_GSI_ROUTING ioctl takes very
long (strace claims up to 0.1 sec) when having multiple CPUs.
This is caused by the synchronize_rcu and the HZ=100 of s390.
By changing the code to use a private srcu we can speed things up.

This patch reduces the boot time till mounting root from 8 to 2
seconds on my s390 guest with 100 disks.

I converted most of the rcu routines to srcu. Review for the unconverted
use of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
is necessary, though. They look fine to me since they are protected by
outer functions.

In addition, we should also discuss if a global srcu (for all guests) is
fine.

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 virt/kvm/irqchip.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 20dc9e4..5283eb8 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -26,17 +26,20 @@
 
 #include <linux/kvm_host.h>
 #include <linux/slab.h>
+#include <linux/srcu.h>
 #include <linux/export.h>
 #include <trace/events/kvm.h>
 #include "irq.h"
 
+DEFINE_STATIC_SRCU(irq_srcu);
+
 bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
struct kvm_irq_ack_notifier *kian;
-   int gsi;
+   int gsi, idx;
 
-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&irq_srcu);
+	gsi = srcu_dereference(kvm->irq_routing, &irq_srcu)->chip[irqchip][pin];
if (gsi != -1)
		hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list,
					 link)
@@ -45,7 +48,7 @@ bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
return true;
}
 
-   rcu_read_unlock();
+	srcu_read_unlock(&irq_srcu, idx);
 
return false;
 }
@@ -54,18 +57,18 @@ EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);
 void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
 {
struct kvm_irq_ack_notifier *kian;
-   int gsi;
+   int gsi, idx;
 
trace_kvm_ack_irq(irqchip, pin);
 
-	rcu_read_lock();
-	gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
+	idx = srcu_read_lock(&irq_srcu);
+	gsi = srcu_dereference(kvm->irq_routing, &irq_srcu)->chip[irqchip][pin];
if (gsi != -1)
		hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list,
					 link)
			if (kian->gsi == gsi)
				kian->irq_acked(kian);
-	rcu_read_unlock();
+	srcu_read_unlock(&irq_srcu, idx);
 }
 
 void kvm_register_irq_ack_notifier(struct kvm *kvm,
@@ -85,7 +88,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
	mutex_lock(&kvm->irq_lock);
	hlist_del_init_rcu(&kian->link);
	mutex_unlock(&kvm->irq_lock);
-	synchronize_rcu();
+	synchronize_srcu_expedited(&irq_srcu);
 #ifdef __KVM_HAVE_IOAPIC
kvm_vcpu_request_scan_ioapic(kvm);
 #endif
@@ -115,7 +118,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
		bool line_status)
 {
struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
-   int ret = -1, i = 0;
+   int ret = -1, i = 0, idx;
struct kvm_irq_routing_table *irq_rt;
 
trace_kvm_set_irq(irq, level, irq_source_id);
@@ -124,12 +127,12 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level,
 * IOAPIC.  So set the bit in both. The guest will ignore
 * writes to the unused one.
 */
-	rcu_read_lock();
-	irq_rt = rcu_dereference(kvm->irq_routing);
+	idx = srcu_read_lock(&irq_srcu);
+	irq_rt = srcu_dereference(kvm->irq_routing, &irq_srcu);
	if (irq < irq_rt->nr_rt_entries)
		hlist_for_each_entry(e, &irq_rt->map[irq], link)
			irq_set[i++] = *e;
-	rcu_read_unlock();
+	srcu_read_unlock(&irq_srcu, idx);
 
while(i--) {
int r;
@@ -226,7 +229,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
kvm_irq_routing_update(kvm, new);
mutex_unlock(kvm-irq_lock);
 
-   synchronize_rcu();
+	synchronize_srcu_expedited(&irq_srcu);
 
new = old;
r = 0;
-- 
1.8.4.2



Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Paolo Bonzini
On 16/01/2014 13:44, Christian Borntraeger wrote:
 I converted most of the rcu routines to srcu. Review for the unconverted
 use of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
 is necessary, though. They look fine to me since they are protected by
 outer functions.

They are fine because they do not have lockdep checks
(hlist_for_each_entry_rcu uses rcu_dereference_raw rather than
rcu_dereference, and write-sides do not do rcu lockdep at all).

 In addition, we should also discuss if a global srcu (for all guests) is
 fine.

I think it is.  It is already way cheaper than it used to be, and we're
hardly relying on the sleepable part of srcu.  We just want its faster
detection of grace periods.  One instance should be fine because our
read sides are so small and mostly they are not even preemptable.

Thanks for writing the patch!

Paolo


Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Christian Borntraeger
On 16/01/14 13:59, Paolo Bonzini wrote:
 On 16/01/2014 13:44, Christian Borntraeger wrote:
 I converted most of the rcu routines to srcu. Review for the unconverted
 use of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
 is necessary, though. They look fine to me since they are protected by
 outer functions.
 
 They are fine because they do not have lockdep checks
 (hlist_for_each_entry_rcu uses rcu_dereference_raw rather than
 rcu_dereference, and write-sides do not do rcu lockdep at all).
 
 In addition, we should also discuss if a global srcu (for all guests) is
 fine.
 
 I think it is.  It is already way cheaper than it used to be, and we're
 hardly relying on the sleepable part of srcu.  We just want its faster
 detection of grace periods.  One instance should be fine because our
 read sides are so small and mostly they are not even preemptable.
 
 Thanks for writing the patch!
 
 Paolo
 

Will you edit the patch description or shall I resend the patch?




Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Paolo Bonzini
On 16/01/2014 14:06, Christian Borntraeger wrote:
 Will you edit the patch description or shall I resend the patch?

I can edit the commit message.

Paolo




Re: [RFC PATCH v4 2/2] add support for Hyper-V partition reference time enlightenment

2014-01-16 Thread Paolo Bonzini
On 14/01/2014 10:22, Vadim Rozenfeld wrote:
 @@ -1883,6 +1884,13 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data)
   addr = gfn_to_hva(kvm, gfn);
   if (kvm_is_error_hva(addr))
   return 1;
 + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 +     ka->use_master_clock) {
 + tsc_ref.tsc_sequence = 1;

I think that after the first migration the sequence number will always
be 2: set_msr_hyperv_pw will set it to 1, and KVM_SET_KVMCLOCK will
increment it.  Can you check this?

The solution would be to propagate the struct msr_data * argument from
kvm_set_msr_common to kvm_set_msr_hyper_pw.  Then, here in this if you
can test msr_info-host_initiated, and increment the tsc_sequence
instead of resetting.  Like

if ((!msr_info->host_initiated || tsc_ref.tsc_sequence) &&
    boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
    ka->use_master_clock) {
	tsc_ref.tsc_sequence = msr_info->host_initiated
		? tsc_ref.tsc_sequence + 1
		: 1;

 + tsc_ref.tsc_scale = ((1LL << 32) /
 +		       vcpu->arch.virtual_tsc_khz) << 32;
 + tsc_ref.tsc_offset = kvm->arch.kvmclock_offset;
 + }
 	if (__copy_to_user((void __user *)addr, &tsc_ref,
 			   sizeof(tsc_ref)))
   return 1;
   mark_page_dirty(kvm, gfn);
 @@ -3871,6 +3879,27 @@ long kvm_arch_vm_ioctl(struct file *filp,
   local_irq_enable();
   kvm-arch.kvmclock_offset = delta;
   kvm_gen_update_masterclock(kvm);
 + if (kvm->arch.hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE) {
 + HV_REFERENCE_TSC_PAGE tsc_ref;
 + struct kvm_arch *ka = &kvm->arch;
 + r = kvm_read_guest(kvm, kvm->arch.hv_tsc_page,
 +		      &tsc_ref, sizeof(tsc_ref));
 + if (r)
 + goto out;
 + if (tsc_ref.tsc_sequence &&
 +     boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 +     ka->use_master_clock) {
 + tsc_ref.tsc_sequence++;
 + tsc_ref.tsc_scale = ((1LL << 32) /
 +			__get_cpu_var(cpu_tsc_khz)) << 32;

Since on migration kvm_set_hyperv_pw will always be called, do you need
to write tsc_ref.scale at all here?

Paolo

 + tsc_ref.tsc_offset = kvm->arch.kvmclock_offset;
 + } else
 + tsc_ref.tsc_sequence = 0;
 + r = kvm_write_guest(kvm, kvm->arch.hv_tsc_page,
 +		       &tsc_ref, sizeof(tsc_ref));
 + if (r)
 + goto out;
 + }
   break;
   }
   case KVM_GET_CLOCK: {
 



Re: [PATCH 04/12] KVM: x86: Validate guest writes to MSR_IA32_APICBASE

2014-01-16 Thread Paolo Bonzini
On 04/01/2014 18:47, Jan Kiszka wrote:
 + u64 old_state = vcpu->arch.apic_base &
 +	(MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
 + u64 new_state = msr_info->data &
 +	(MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
 + u64 reserved_bits = ((~0ULL) << boot_cpu_data.x86_phys_bits) | 0x2ff |
 +	(guest_cpuid_has_x2apic(vcpu) ? 0 : X2APIC_ENABLE);
 +

Should this use the guest CPUID instead?

 + if (!msr_info->host_initiated &&

Is this check on host_initiated just for backwards compatibility, or is
there another case that I am missing?

Paolo


Re: [PATCH 04/12] KVM: x86: Validate guest writes to MSR_IA32_APICBASE

2014-01-16 Thread Jan Kiszka
On 2014-01-16 15:07, Paolo Bonzini wrote:
 On 04/01/2014 18:47, Jan Kiszka wrote:
 +u64 old_state = vcpu->arch.apic_base &
 +(MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
 +u64 new_state = msr_info->data &
 +(MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE);
 +u64 reserved_bits = ((~0ULL) << boot_cpu_data.x86_phys_bits) | 0x2ff |
 +(guest_cpuid_has_x2apic(vcpu) ? 0 : X2APIC_ENABLE);
 +
 
 Should this use the guest CPUID instead?

Hmm, they may differ... Then yes.

 
 +if (!msr_info->host_initiated &&
 
 Is this check on host_initiated just for backwards compatibility, or is
 there another case that I am missing?

The path is taken for both host-initiated and guest-initiated APICBASE
updates. Host-initiated ones are allowed to perform architecturally
invalid state transitions. And the MSR is emulated, so if they like to
set a reserved bit...

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


[PATCH] KVM, XEN: Fix potential race in pvclock code

2014-01-16 Thread Julian Stecklina
The paravirtualized clock used in KVM and Xen uses a version field to
allow the guest to see when the shared data structure is inconsistent.
The code reads the version field twice (before and after the data
structure is copied) and checks whether they haven't
changed and that the lowest bit is not set. As the second access is not
synchronized, the compiler could generate code that accesses version
more than two times and you end up with inconsistent data.

An example using pvclock_get_time_values:

host starts updating data, sets src-version to 1
guest reads src-version (1) and stores it into dst-version.
guest copies inconsistent data
guest reads src-version (1) and computes xor with dst-version.
host finishes updating data and sets src-version to 2
guest reads src-version (2) and checks whether lower bit is not set.
while loop exits with inconsistent data!

AFAICS the compiler is allowed to optimize the given code this way.

Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de
---
 arch/x86/kernel/pvclock.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 42eb330..f62b41c 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -55,6 +55,8 @@ static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
 static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
struct pvclock_vcpu_time_info *src)
 {
+   u32 nversion;
+
do {
		dst->version = src->version;
rmb();  /* fetch version before data */
@@ -64,7 +66,8 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
		dst->tsc_shift = src->tsc_shift;
		dst->flags = src->flags;
rmb();  /* test version after fetching data */
-	} while ((src->version & 1) || (dst->version != src->version));
+	nversion = ACCESS_ONCE(src->version);
+	} while ((nversion & 1) || (dst->version != nversion));
 
return dst-version;
 }
@@ -135,7 +138,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
struct pvclock_vcpu_time_info *vcpu_time,
struct timespec *ts)
 {
-   u32 version;
+   u32 version, nversion;
u64 delta;
struct timespec now;
 
@@ -146,7 +149,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
		now.tv_sec  = wall_clock->sec;
		now.tv_nsec = wall_clock->nsec;
		rmb();	/* fetch time before checking version */
-	} while ((wall_clock->version & 1) || (version != wall_clock->version));
+	nversion = ACCESS_ONCE(wall_clock->version);
+	} while ((nversion & 1) || (version != nversion));
 
	delta = pvclock_clocksource_read(vcpu_time);	/* time since system boot */
delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
-- 
1.8.4.2



Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code

2014-01-16 Thread Jan Beulich
 On 16.01.14 at 15:13, Julian Stecklina jstec...@os.inf.tu-dresden.de 
 wrote:
 The paravirtualized clock used in KVM and Xen uses a version field to
 allow the guest to see when the shared data structure is inconsistent.
 The code reads the version field twice (before and after the data
 structure is copied) and checks whether they haven't
 changed and that the lowest bit is not set. As the second access is not
 synchronized, the compiler could generate code that accesses version
 more than two times and you end up with inconsistent data.
 
 An example using pvclock_get_time_values:
 
 host starts updating data, sets src-version to 1
 guest reads src-version (1) and stores it into dst-version.
 guest copies inconsistent data
 guest reads src-version (1) and computes xor with dst-version.
 host finishes updating data and sets src-version to 2
 guest reads src-version (2) and checks whether lower bit is not set.
 while loop exits with inconsistent data!
 
 AFAICS the compiler is allowed to optimize the given code this way.

I don't think so - this would only be an issue if the conditions used
| instead of ||. || implies a sequence point between evaluating the
left and right sides, and the standard says: The presence of a
sequence point between the evaluation of expressions A and B
implies that every value computation and side effect associated
with A is sequenced before every value computation and side
effect associated with B.

And even if there was a problem (i.e. my interpretation of the
above being incorrect), I don't think you'd need ACCESS_ONCE()
here: The same local variable can't have two different values in
two different use sites when there was no intermediate
assignment to it.

Interestingly the old XenoLinux code uses | instead of || in
both cases, yet only one of the two also entertains the use
of a local variable. I shall fix this (read: Thanks for pointing
out even if in the upstream incarnation this is not a problem).

Jan

 Signed-off-by: Julian Stecklina jstec...@os.inf.tu-dresden.de
 ---
  arch/x86/kernel/pvclock.c | 10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)
 
 diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
 index 42eb330..f62b41c 100644
 --- a/arch/x86/kernel/pvclock.c
 +++ b/arch/x86/kernel/pvclock.c
 @@ -55,6 +55,8 @@ static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
  static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
   struct pvclock_vcpu_time_info *src)
  {
 + u32 nversion;
 +
   do {
 	dst->version = src->version;
   rmb();  /* fetch version before data */
 @@ -64,7 +66,8 @@ static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst,
 	dst->tsc_shift = src->tsc_shift;
 	dst->flags = src->flags;
 	rmb();	/* test version after fetching data */
 - } while ((src->version & 1) || (dst->version != src->version));
 + nversion = ACCESS_ONCE(src->version);
 + } while ((nversion & 1) || (dst->version != nversion));
  
   return dst-version;
  }
 @@ -135,7 +138,7 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
   struct pvclock_vcpu_time_info *vcpu_time,
   struct timespec *ts)
  {
 - u32 version;
 + u32 version, nversion;
   u64 delta;
   struct timespec now;
  
 @@ -146,7 +149,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock,
 	now.tv_sec  = wall_clock->sec;
 	now.tv_nsec = wall_clock->nsec;
 	rmb();	/* fetch time before checking version */
 - } while ((wall_clock->version & 1) || (version != wall_clock->version));
 + nversion = ACCESS_ONCE(wall_clock->version);
 + } while ((nversion & 1) || (version != nversion));
  
 	delta = pvclock_clocksource_read(vcpu_time);	/* time since system boot */
   delta += now.tv_sec * (u64)NSEC_PER_SEC + now.tv_nsec;
 -- 
 1.8.4.2
 
 
 ___
 Xen-devel mailing list
 xen-de...@lists.xen.org 
 http://lists.xen.org/xen-devel 




Re: [PATCH 00/12] KVM: x86: Fixes for debug registers, IA32_APIC_BASE, and nVMX

2014-01-16 Thread Jan Kiszka
On 2014-01-16 16:08, Paolo Bonzini wrote:
 On 04/01/2014 18:47, Jan Kiszka wrote:
 This is on top of next after merging in the two patches of mine that are
 only present in master ATM.

 Highlights:
  - reworked fix of DR6 reading on SVM
  - full check for invalid writes to IA32_APIC_BASE
  - fixed support for halting in L2 (nVMX)
  - fully emulated preemption timer (nVMX)
  - tracing of nested vmexits (nVMX)

 The patch KVM: nVMX: Leave VMX mode on clearing of feature control MSR
 is included again, unchanged from previous posting.

 Most fixes are backed by KVM unit tests, to be posted soon as well.
 
 I'm applying patches 1-10 (for now to kvm/queue).
 
 For the last two, I prefer to wait for 3.15.

Should we disable the broken features in 3.14?

 
 Also, for patch 11 I would really prefer to use check_nested_events for
 both VMX and SVM.  I will look at SVM next week.

OK, thanks. I will send an updated patch 11 and also a patch on top of 4
to read the physical bit width from the guest cpuid.

Jan

 
 Thanks,
 
 Paolo
 
 Jan Kiszka (12):
   KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
   KVM: SVM: Fix reading of DR6
   KVM: VMX: Fix DR6 update on #DB exception
   KVM: x86: Validate guest writes to MSR_IA32_APICBASE
   KVM: nVMX: Leave VMX mode on clearing of feature control MSR
   KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
   KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
   KVM: nVMX: Clean up handling of VMX-related MSRs
   KVM: nVMX: Fix nested_run_pending on activity state HLT
   KVM: nVMX: Update guest activity state field on L2 exits
   KVM: nVMX: Rework interception of IRQs and NMIs
   KVM: nVMX: Fully emulate preemption timer

  arch/x86/include/asm/kvm_host.h   |   4 +
  arch/x86/include/uapi/asm/msr-index.h |   1 +
  arch/x86/kvm/cpuid.h  |   8 +
  arch/x86/kvm/lapic.h  |   2 +-
  arch/x86/kvm/svm.c|  15 ++
  arch/x86/kvm/vmx.c| 399 --
  arch/x86/kvm/x86.c|  67 +-
  7 files changed, 318 insertions(+), 178 deletions(-)

 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH 11/12] KVM: nVMX: Rework interception of IRQs and NMIs

2014-01-16 Thread Paolo Bonzini
On 04/01/2014 18:47, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 Move the check for leaving L2 on pending and intercepted IRQs or NMIs
 from the *_allowed handler into a dedicated callback. Invoke this
 callback at the relevant points before KVM checks if IRQs/NMIs can be
 injected. The callback has the task to switch from L2 to L1 if needed
 and inject the proper vmexit events.
 
 The rework fixes L2 wakeups from HLT and provides the foundation for
 preemption timer emulation.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/vmx.c  | 67 +++--
  arch/x86/kvm/x86.c  | 15 +++--
  3 files changed, 53 insertions(+), 31 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index e73651b..d195421 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -764,6 +764,8 @@ struct kvm_x86_ops {
  struct x86_instruction_info *info,
  enum x86_intercept_stage stage);
   void (*handle_external_intr)(struct kvm_vcpu *vcpu);
 +
 + int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
  };
  
  struct kvm_arch_async_pf {
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 1245ff1..ec8a976 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -4620,22 +4620,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
 
  static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
  {
 -	if (is_guest_mode(vcpu)) {
 -		if (to_vmx(vcpu)->nested.nested_run_pending)
 -			return 0;
 -		if (nested_exit_on_nmi(vcpu)) {
 -			nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
 -					  NMI_VECTOR | INTR_TYPE_NMI_INTR |
 -					  INTR_INFO_VALID_MASK, 0);
 -			/*
 -			 * The NMI-triggered VM exit counts as injection:
 -			 * clear this one and block further NMIs.
 -			 */
 -			vcpu->arch.nmi_pending = 0;
 -			vmx_set_nmi_mask(vcpu, true);
 -			return 0;
 -		}
 -	}
 +	if (to_vmx(vcpu)->nested.nested_run_pending)
 +		return 0;
 
 	if (!cpu_has_virtual_nmis() && to_vmx(vcpu)->soft_vnmi_blocked)
 		return 0;
 @@ -4647,19 +4633,8 @@ static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
 
  static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
  {
 -	if (is_guest_mode(vcpu)) {
 -		if (to_vmx(vcpu)->nested.nested_run_pending)
 -			return 0;
 -		if (nested_exit_on_intr(vcpu)) {
 -			nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT,
 -					  0, 0);
 -			/*
 -			 * fall through to normal code, but now in L1, not L2
 -			 */
 -		}
 -	}
 -
 -	return (vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 +	return (!to_vmx(vcpu)->nested.nested_run_pending &&
 +		vmcs_readl(GUEST_RFLAGS) & X86_EFLAGS_IF) &&
 		!(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) &
 		  (GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
  }
 @@ -8158,6 +8133,35 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
 	}
  }
 
 +static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 +{
 +	struct vcpu_vmx *vmx = to_vmx(vcpu);
 +
 +	if (vcpu->arch.nmi_pending && nested_exit_on_nmi(vcpu)) {
 +		if (vmx->nested.nested_run_pending)
 +			return -EBUSY;
 +		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
 +				  NMI_VECTOR | INTR_TYPE_NMI_INTR |
 +				  INTR_INFO_VALID_MASK, 0);
 +		/*
 +		 * The NMI-triggered VM exit counts as injection:
 +		 * clear this one and block further NMIs.
 +		 */
 +		vcpu->arch.nmi_pending = 0;
 +		vmx_set_nmi_mask(vcpu, true);
 +		return 0;
 +	}
 +
 +	if ((kvm_cpu_has_interrupt(vcpu) || external_intr) &&
 +	    nested_exit_on_intr(vcpu)) {
 +		if (vmx->nested.nested_run_pending)
 +			return -EBUSY;
 +		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
 +	}
 +
 +	return 0;
 +}
 +
  /*
   * prepare_vmcs12 is part of what we need to do when the nested L2 guest exits
   * and we want to prepare to run its L1 parent. L1 keeps a vmcs for L2 (vmcs12),
 @@ -8498,6 +8502,9 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 	nested_vmx_succeed(vcpu);
 	if (enable_shadow_vmcs)
 		vmx->nested.sync_shadow_vmcs = true;
 +
 +

Re: [PATCH 00/12] KVM: x86: Fixes for debug registers, IA32_APIC_BASE, and nVMX

2014-01-16 Thread Paolo Bonzini
Il 04/01/2014 18:47, Jan Kiszka ha scritto:
 This is on top of next after merging in the two patches of mine that are
 only present in master ATM.
 
 Highlights:
  - reworked fix of DR6 reading on SVM
  - full check for invalid writes to IA32_APIC_BASE
  - fixed support for halting in L2 (nVMX)
  - fully emulated preemption timer (nVMX)
  - tracing of nested vmexits (nVMX)
 
 The patch KVM: nVMX: Leave VMX mode on clearing of feature control MSR
 is included again, unchanged from previous posting.
 
 Most fixes are backed by KVM unit tests, to be posted soon as well.

I'm applying patches 1-10 (for now to kvm/queue).

For the last two, I prefer to wait for 3.15.

Also, for patch 11 I would really prefer to use check_nested_events for
both VMX and SVM.  I will look at SVM next week.

Thanks,

Paolo

 Jan Kiszka (12):
   KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
   KVM: SVM: Fix reading of DR6
   KVM: VMX: Fix DR6 update on #DB exception
   KVM: x86: Validate guest writes to MSR_IA32_APICBASE
   KVM: nVMX: Leave VMX mode on clearing of feature control MSR
   KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
   KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
   KVM: nVMX: Clean up handling of VMX-related MSRs
   KVM: nVMX: Fix nested_run_pending on activity state HLT
   KVM: nVMX: Update guest activity state field on L2 exits
   KVM: nVMX: Rework interception of IRQs and NMIs
   KVM: nVMX: Fully emulate preemption timer
 
  arch/x86/include/asm/kvm_host.h   |   4 +
  arch/x86/include/uapi/asm/msr-index.h |   1 +
  arch/x86/kvm/cpuid.h  |   8 +
  arch/x86/kvm/lapic.h  |   2 +-
  arch/x86/kvm/svm.c|  15 ++
 arch/x86/kvm/vmx.c| 399 --
  arch/x86/kvm/x86.c|  67 +-
  7 files changed, 318 insertions(+), 178 deletions(-)
 



Re: [PATCH 00/12] KVM: x86: Fixes for debug registers, IA32_APIC_BASE, and nVMX

2014-01-16 Thread Paolo Bonzini
Il 16/01/2014 16:12, Jan Kiszka ha scritto:
  I'm applying patches 1-10 (for now to kvm/queue).
  
  For the last two, I prefer to wait for 3.15.
 Should we disable the broken features in 3.14?

I was thinking about it---which would have meant applying patches 1-8 only.

In the end I decided not to do it because the patch doesn't affect
KVM-on-KVM emulation.  If the patch already helped you in developing
Jailhouse, the feature must not be _completely_ broken.  So the churn
of reverting the feature now and reapplying it later is not warranted, IMO.

Thanks,

Paolo

  
  Also, for patch 11 I would really prefer to use check_nested_events for
  both VMX and SVM.  I will look at SVM next week.
 OK, thanks. I will send an updated patch 11 and also a patch on top of 4
 to read the physical bit width from the guest cpuid.



Re: [PATCH 00/12] KVM: x86: Fixes for debug registers, IA32_APIC_BASE, and nVMX

2014-01-16 Thread Jan Kiszka
On 2014-01-16 16:20, Paolo Bonzini wrote:
 Il 16/01/2014 16:12, Jan Kiszka ha scritto:
 I'm applying patches 1-10 (for now to kvm/queue).

 For the last two, I prefer to wait for 3.15.
 Should we disable the broken features in 3.14?
 
 I was thinking about it---which would have meant applying patches 1-8 only.

And we would have to disable the option to turn off HLT interception.

 
 In the end I decided not to do it because the patch doesn't affect
 KVM-on-KVM emulation.  If the patch already helped you in developing
 Jailhouse, so the feature must not be _completely_ broken.  So the churn
 of reverting the feature now and reapplying it later is not warranted, IMO.

Well, in the end nested=1 remains the sign that this feature is still
experimental.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [Xen-devel] [PATCH] KVM, XEN: Fix potential race in pvclock code

2014-01-16 Thread Julian Stecklina
On 01/16/2014 04:04 PM, Jan Beulich wrote:
 I don't think so - this would only be an issue if the conditions used
 | instead of ||. || implies a sequence point between evaluating the
 left and right sides, and the standard says: The presence of a
 sequence point between the evaluation of expressions A and B
 implies that every value computation and side effect associated
 with A is sequenced before every value computation and side
 effect associated with B.

This only applies to single-threaded code. Multithreaded code must be
data-race free for that to be true. See

https://lwn.net/Articles/508991/

 And even if there was a problem (i.e. my interpretation of the
 above being incorrect), I don't think you'd need ACCESS_ONCE()
 here: The same local variable can't have two different values in
 two different use sites when there was no intermediate
 assignment to it.

Same comment as above.

Julian





nested EPT

2014-01-16 Thread duy hai nguyen
Dear All,

I am having a problem with using nested EPT in my system: In L0
hypervisor CPUs support vmx and ept; however, L1 hypervisor's CPUs do
not have ept capability. Flag 'ept' appears in /proc/cpuinfo of L0 but
does not show in that of L1.

- 'Nested' and 'EPT' are enabled in L0:

$cat /sys/module/kvm_intel/parameters/nested
Y

$cat /sys/module/kvm_intel/parameters/ept
Y

- The libvirt xml file used in L0 has this cpu configuration:

<cpu mode='host-passthrough'/>

- The kernel version I am using for both L0 and L1 is 3.9.11



Thank you very much.

Best,
Hai


[RFC PATCH 0/4] virtio_balloon: add pressure notification via a new virtqueue

2014-01-16 Thread Luiz Capitulino
The pressure notification added to the virtio-balloon driver by this series
is going to be used by the host (QEMU, in this case) to implement automatic
balloning support. More details in patch 3/4 and patch 4/4.

Patch 1/4 adds in-kernel vmpressure support and is not really part of this
series, it's added here for the convenience of someone who wants to try
automatic ballooning. Patch 2/4 is a hack to make in-kernel vmpressure work
for something not related to cgroups, I'll improve it in later versions.

Glauber Costa (1):
  vmpressure: in-kernel notifications

Luiz capitulino (3):
  vmpressure in-kernel: hack
  virtio_balloon: add pressure notification via a new virtqueue
  virtio_balloon: skip inflation if guest is under pressure

 drivers/virtio/virtio_balloon.c | 100 
 include/linux/vmpressure.h  |   6 +++
 include/uapi/linux/virtio_balloon.h |   1 +
 mm/vmpressure.c |  58 +++--
 4 files changed, 151 insertions(+), 14 deletions(-)

-- 
1.8.1.4


[RFC PATCH 2/4] vmpressure in-kernel: hack

2014-01-16 Thread Luiz Capitulino
From: Luiz capitulino lcapitul...@redhat.com

1. Allow drivers to register private data
2. Allow drivers to pass css=NULL
3. Pass level to the callback

Signed-off-by: Luiz capitulino lcapitul...@redhat.com
---
 include/linux/vmpressure.h |  3 ++-
 mm/vmpressure.c| 13 +
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 9102e53..de416b6 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -42,7 +42,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
 extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
-					    void (*fn)(void));
+					    void (*fn)(void *data, int level),
+					    void *data);
 extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
 					struct cftype *cft,
 					struct eventfd_ctx *eventfd);
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 730e7c1..4ed0e85 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -132,9 +132,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
 	union {
 		struct eventfd_ctx *efd;
-		void (*fn)(void);
+		void (*fn)(void *data, int level);
 	};
 	enum vmpressure_levels level;
+	void *data;
 	bool kernel_event;
 	struct list_head node;
 };
@@ -152,7 +153,7 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (ev->kernel_event) {
-			ev->fn();
+			ev->fn(ev->data, level);
 		} else if (vmpr->notify_userspace && level >= ev->level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
@@ -352,21 +353,25 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
  * well-defined cgroup aware interface.
  */
 int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
-				     void (*fn)(void))
+				     void (*fn)(void *data, int level), void *data)
 {
-	struct vmpressure *vmpr = css_to_vmpressure(css);
+	struct vmpressure *vmpr;
 	struct vmpressure_event *ev;
 
+	vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL);
+
 	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
 	if (!ev)
 		return -ENOMEM;
 
 	ev->kernel_event = true;
+	ev->data = data;
 	ev->fn = fn;
 
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
 	mutex_unlock(&vmpr->events_lock);
+
 	return 0;
 }
 
-- 
1.8.1.4



[RFC PATCH 1/4] vmpressure: in-kernel notifications

2014-01-16 Thread Luiz Capitulino
From: Glauber Costa glom...@openvz.org

During the past weeks, it became clear to us that the shrinker interface
we have right now works very well for some particular types of users,
but not that well for others. The latter are usually people interested
in one-shot notifications, who were forced to adapt themselves to the
count+scan behavior of shrinkers. To do so, they had no choice but to
greatly abuse the shrinker interface producing little monsters all over.

During LSF/MM, one of the proposals that popped out during our session
was to reuse Anton Vorontsov's vmpressure for this. They are designed
for userspace consumption, but also provide a well-established,
cgroup-aware entry point for notifications.

This patch extends that to also support in-kernel users. Events that
should be generated for in-kernel consumption will be marked as such,
and for those, we will call a registered function instead of triggering
an eventfd notification.

Please note that due to my lack of understanding of each shrinker user,
I will stay away from converting the actual users, you are all welcome
to do so.

Signed-off-by: Glauber Costa glom...@openvz.org
Signed-off-by: Vladimir Davydov vdavy...@parallels.com
Acked-by: Anton Vorontsov an...@enomsg.org
Acked-by: Pekka Enberg penb...@kernel.org
Reviewed-by: Greg Thelen gthe...@google.com
Cc: Dave Chinner dchin...@redhat.com
Cc: John Stultz john.stu...@linaro.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Joonsoo Kim iamjoonsoo@lge.com
Cc: Michal Hocko mho...@suse.cz
Cc: Kamezawa Hiroyuki kamezawa.hir...@jp.fujitsu.com
Cc: Johannes Weiner han...@cmpxchg.org
Signed-off-by: Luiz capitulino lcapitul...@redhat.com
---
 include/linux/vmpressure.h |  5 +
 mm/vmpressure.c| 53 +++---
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 3f3788d..9102e53 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -19,6 +19,9 @@ struct vmpressure {
/* Have to grab the lock on events traversal or modifications. */
struct mutex events_lock;
 
+   /* False if only kernel users want to be notified, true otherwise. */
+   bool notify_userspace;
+
struct work_struct work;
 };
 
@@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
 				     struct cftype *cft,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
+extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
+					    void (*fn)(void));
 extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
 					struct cftype *cft,
 					struct eventfd_ctx *eventfd);
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index e0f6283..730e7c1 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 }
 
 struct vmpressure_event {
-	struct eventfd_ctx *efd;
+	union {
+		struct eventfd_ctx *efd;
+		void (*fn)(void);
+	};
 	enum vmpressure_levels level;
+	bool kernel_event;
 	struct list_head node;
 };
 
@@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 	mutex_lock(&vmpr->events_lock);
 
 	list_for_each_entry(ev, &vmpr->events, node) {
-		if (level >= ev->level) {
+		if (ev->kernel_event) {
+			ev->fn();
+		} else if (vmpr->notify_userspace && level >= ev->level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
 	}
 
+	vmpr->notify_userspace = false;
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	 * we account it too.
 	 */
 	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
-		return;
+		goto schedule;
 
 	/*
 	 * If we got here with no pages scanned, then that is an indicator
@@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	vmpr->scanned += scanned;
 	vmpr->reclaimed += reclaimed;
 	scanned = vmpr->scanned;
+	/*
+	 * If we didn't reach this point, only kernel events will be triggered.
+	 * It is the job of the worker thread to clean this up once the
+	 * notifications are all delivered.
+	 */
+	vmpr->notify_userspace = true;
 	spin_unlock(&vmpr->sr_lock);
 
+schedule:
 	if (scanned < vmpressure_win)
 		return;
 	schedule_work(&vmpr->work);
@@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
 }
 
 

[RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue

2014-01-16 Thread Luiz Capitulino
From: Luiz capitulino lcapitul...@redhat.com

This commit adds support to a new virtqueue called message virtqueue.

The message virtqueue can be used by guests to notify the host about
important memory-related state changes in the guest. Currently, the
only implemented notification is the guest is under pressure one,
which informs the host that the guest is under memory pressure.

This notification can be used to implement automatic memory ballooning
in the host. For example, once learning that the guest is under pressure,
the host could cancel an on-going inflate and/or start a deflate
operation.

Doing this through a virtqueue might seem like overkill, as all
we're doing is transferring an integer between guest and host. However,
using a virtqueue offers the following advantages:

 1. We can reliably synchronize host and guest. That is, the guest
will only continue once the host acknowledges the notification.
This is important, because if the guest gets under pressure while
inflating the balloon, it has to stop to give the host a chance
to reset num_pages (see next commit)

 2. It's extensible. We may (or may not) want to tell the host
which pressure level the guest finds itself in (ie. low,
medium or critical)

The lightweight alternative is to use a configuration space parameter.
For this to work though, the guest would have to wait for the host
to acknowledge the receipt of a configuration change update. I could
try this if the virtqueue is too overkill.

Finally, the guest learns it's under pressure by registering a
callback with the in-kernel vmpressure API.

FIXMEs:

 - vmpressure API is missing an de-registration routine
 - not completely sure my changes in virtballoon_probe() are correct

Signed-off-by: Luiz capitulino lcapitul...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 93 +
 include/uapi/linux/virtio_balloon.h |  1 +
 2 files changed, 84 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 5c4a95b..1c3ee71 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -29,6 +29,9 @@
 #include <linux/module.h>
 #include <linux/balloon_compaction.h>
 
+#include <linux/cgroup.h>
+#include <linux/vmpressure.h>
+
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
  * multiple balloon pages.  All memory counters in this driver are in balloon
@@ -37,10 +40,12 @@
 #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
 #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
 
+#define VIRTIO_BALLOON_MSG_PRESSURE 1
+
 struct virtio_balloon
 {
 	struct virtio_device *vdev;
-	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
+	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *message_vq;
 
 	/* Where the ballooning thread waits for config to change. */
 	wait_queue_head_t config_change;
@@ -51,6 +56,8 @@ struct virtio_balloon
 	/* Waiting for host to ack the pages we released. */
 	wait_queue_head_t acked;
 
+	wait_queue_head_t message_acked;
+
 	/* Number of balloon pages we've told the Host we're not using. */
 	unsigned int num_pages;
 	/*
@@ -71,6 +78,9 @@ struct virtio_balloon
 	/* Memory statistics */
 	int need_stats_update;
 	struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
+
+	/* Message virtqueue */
+	atomic_t guest_pressure;
 };
 
 static struct virtio_device_id id_table[] = {
@@ -78,6 +88,41 @@ static struct virtio_device_id id_table[] = {
 	{ 0 },
 };
 
+static inline bool guest_under_pressure(const struct virtio_balloon *vb)
+{
+	return atomic_read(&vb->guest_pressure) == 1;
+}
+
+static void vmpressure_event_handler(void *data, int level)
+{
+	struct virtio_balloon *vb = data;
+
+	atomic_set(&vb->guest_pressure, 1);
+	wake_up(&vb->config_change);
+}
+
+static void tell_host_pressure(struct virtio_balloon *vb)
+{
+	const uint32_t msg = VIRTIO_BALLOON_MSG_PRESSURE;
+	struct scatterlist sg;
+	unsigned int len;
+	int err;
+
+	sg_init_one(&sg, &msg, sizeof(msg));
+
+	err = virtqueue_add_outbuf(vb->message_vq, &sg, 1, vb, GFP_KERNEL);
+	if (err < 0) {
+		printk(KERN_WARNING "virtio-balloon: failed to send host message (%d)\n", err);
+		goto out;
+	}
+	virtqueue_kick(vb->message_vq);
+
+	wait_event(vb->message_acked, virtqueue_get_buf(vb->message_vq, &len));
+
+out:
+	atomic_set(&vb->guest_pressure, 0);
+}
+
 static u32 page_to_balloon_pfn(struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page);
@@ -100,6 +145,13 @@ static void balloon_ack(struct virtqueue *vq)
 	wake_up(&vb->acked);
 }
 
+static void message_ack(struct virtqueue *vq)
+{
+	struct virtio_balloon *vb = vq->vdev->priv;
+
+	wake_up(&vb->message_acked);
+}
+
 static void tell_host(struct virtio_balloon 

[RFC PATCH 4/4] virtio_balloon: skip inflation if guest is under pressure

2014-01-16 Thread Luiz Capitulino
From: Luiz capitulino lcapitul...@redhat.com

This is necessary for automatic ballooning. If the guest gets
under pressure while there's an on-going inflation operation,
we want the guest to do the following:

 1. Stop on-going inflation
 2. Notify the host we're under pressure
 3. Wait for host's acknowledge

While the guest is waiting the host has the opportunity to
reset num_pages to a value before the guest got into pressure.
This will cancel current inflation.

Signed-off-by: Luiz capitulino lcapitul...@redhat.com
---
 drivers/virtio/virtio_balloon.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1c3ee71..7f5b7d2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -188,8 +188,13 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 	mutex_lock(&vb->balloon_lock);
 	for (vb->num_pfns = 0; vb->num_pfns < num;
 	     vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
-		struct page *page = balloon_page_enqueue(vb_dev_info);
+		struct page *page;
 
+		if (guest_under_pressure(vb)) {
+			break;
+		}
+
+		page = balloon_page_enqueue(vb_dev_info);
 		if (!page) {
 			dev_info_ratelimited(&vb->vdev->dev,
 					     "Out of puff! Can't get %u pages\n",
-- 
1.8.1.4



[RFC PATCH] balloon: add automatic ballooning support

2014-01-16 Thread Luiz Capitulino
The current balloon device has an important drawback: it's
entirely manual. This largely limits the feature when used
to manage memory on a memory over-committed host.

In order to make the balloon device really useful for
memory-overcommit setups, we have to make it automatic.
This is what this patch is about.

The general idea is simple. When the host is under pressure
it notifies QEMU. QEMU then asks the guest to inflate its
balloon by some amount of pages. When the guest is under
pressure, it also notifies QEMU. QEMU then asks the guest
to deflate its balloon by some amount of pages.

The rest of this commit is divided into the following
sections: Usage, Algorithm, FIXMEs/TODOs, Testing and some
code in the end.

Usage
=

This series depends on a kernel series for the virtio-balloon
driver called add pressure notification via a new virtqueue.
You have to apply that series in your guest kernel to play
with automatic ballooning.

Then, on the QEMU side you can enable automatic ballooning with
the following command-line:

 $ qemu [...] -device virtio-balloon,automatic=true

Algorithm
=

On host pressure:

 1. On boot QEMU registers for Linux kernel's vmpressure
event low. This event is sent by the kernel when it
has started reclaiming memory. For more details, please
read Documentation/cgroups/memory.txt in the kernel's
source

 2. When QEMU is notified on host pressure, it first checks
if the guest is currently in pressure, if it is then
the event is skipped. If the guest is not in pressure
QEMU asks the guest to inflate its balloon (32MB by
default)

NOTE: QEMU will update num_pages whenever an event
  is received and the guest is not in pressure.
  This means that if QEMU receives 10 events in
  a row, num_pages will be updated to 320MB.

On guest pressure:

 1. QEMU is notified by the virtio-balloon driver in the
guest (via message virtqueue) that the guest is under
pressure

 2. QEMU checks if there's an inflate going on. If true,
QEMU resets num_pages to the current balloon value so
that the guest stops inflating (IOW, QEMU cancels
current inflation). QEMU returns

 3. If there's no on-going inflate, QEMU asks the guest
to deflate (32MB by default)

 4. Every time a guest pressure notification is received,
QEMU sets a hysteresis period of 60 seconds. During
this period the guest is defined to be under pressure
(and inflates will be ignored)

FIXMEs/TODOs


 - The number of pages to inflate/deflate and the memcg path
   are hardcoded. Will add command-line options for them

 - The default value of 32MB for inflates/deflates is what
   worked for me in my very specific test-case. This is
   probably not a good default, but I don't know how to define
   a good one

 - QEMU registers for vmpressure's level low notification.
   The guest too will notify QEMU on low pressure in the
   guest. The low notification is sent whenever the kernel
   has started reclaiming memory. On the guest side this means
   that it will only give free memory to the host. On the host
   side this means that a host with lots of large freeable
   caches will be considered as being in pressure.

   There are two ways to solve this:

1. Register for medium pressure instead of low. This
   solves the problem above but it adds a different
   one: medium is sent when the kernel has started to
   swap, so it's a bit too late

2. Add a new event to vmpressure which is between
   low and medium. The perfect event would be triggered
   before waking up kswapd

 - It would be nice (required?) to be able to dynamically
   enable/disable automatic ballooning. With this patch you
   enable it for the lifetime of the VM

 - I think manual ballooning should be disabled when
   automatic ballooning is enabled, but this is not done
   yet

 - This patch probably doesn't build on windows

Testing
===

Testing is by far the most difficult aspect of this project to
me. It's been hard to find a good way to measure this work. So
take this with a grain of salt.

This is my test-case: a 2G host runs two VMs (guest A and
guest B), each with 1.3G of memory. When the VMs are fully
booted (but idle) the host has around 1.2G of free memory.
Then the VMs do the following:

 1. Guest A runs ebizzy five times in a row, with a chunk
size of 1MB and the following number of chunks:
1024, 824, 624, 424, 224. IOW, the memory usage of
this VM is going down. Let's call it vm-down

 2. Guest B runs ebizzy in similar manner, but it runs
ebizzy with the following number of chunks:
224, 424, 624, 824, 1024. IOW, the memory usage of
this VM is going up. Let's call it vm-up

Also, each ebizzy run takes 60 seconds. And the vm-up one
waits 60 seconds before running ebizzy for the first time.
This gives vm-down time to consume most of the host's pressure
and release it.

Here are the results. This is an average 

Re: nested EPT

2014-01-16 Thread Jan Kiszka
On 2014-01-16 17:10, duy hai nguyen wrote:
 Dear All,
 
 I am having a problem with using nested EPT in my system: In L0
 hypervisor CPUs support vmx and ept; however, L1 hypervisor's CPUs do
 not have ept capability. Flag 'ept' appears in /proc/cpuinfo of L0 but
 does not show in that of L1.
 
 - 'Nested' and 'EPT' are enabled in L0:
 
 $cat /sys/module/kvm_intel/parameters/nested
 Y
 
 $cat /sys/module/kvm_intel/parameters/ept
 Y
 
 - The libvirt xml file used in L0 has this cpu configuration:
 
 <cpu mode='host-passthrough'/>
 
 - The kernel version I am using for both L0 and L1 is 3.9.11
 

Update your host kernel (L0), nEPT got merged in 3.12.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: nested EPT

2014-01-16 Thread Paolo Bonzini
Il 16/01/2014 17:10, duy hai nguyen ha scritto:
 Dear All,
 
 I am having a problem with using nested EPT in my system: In L0
 hypervisor CPUs support vmx and ept; however, L1 hypervisor's CPUs do
 not have ept capability. Flag 'ept' appears in /proc/cpuinfo of L0 but
 does not show in that of L1.
 
 - 'Nested' and 'EPT' are enabled in L0:
 
 $cat /sys/module/kvm_intel/parameters/nested
 Y
 
 $cat /sys/module/kvm_intel/parameters/ept
 Y
 
 - The libvirt xml file used in L0 has this cpu configuration:
 
 <cpu mode='host-passthrough'/>
 
 - The kernel version I am using for both L0 and L1 is 3.9.11

Nested EPT was added in 3.12.  You need that version in L0.

Paolo



Re: troubleshoot live migration

2014-01-16 Thread Marcus Sorensen
Found a workaround.

I don't know why I disabled hpet when the issue was lapic, late
nights. I can't reproduce the issue if I disable kvmclock on the qemu
cmdline, or if I leave the vm at the grub prompt rather than loading
the guest OS.  It kind of sucks to disable kvmclock but at least
migration works again now. I have no idea whether it's the guest KVM
drivers, something in qemu, or the host (probably an incompatibility
in the combination)  but it happens with CentOS 6.5 guests and fully
patched Ubuntu 12.04 guests.

Here's our test environment in case anyone cares to try to reproduce.

2 x intel E5-2650 KVM servers running linux 3.10.26 and qemu 1.6.x or 1.7.0
Start a virtual machine with CentOS 6.5 (or perhaps any common,
current distro) without disabling kvmclock. I tried single, dual, quad
vcpus.
Let the vm run for a few hours (migrations immediately after start or
even 30 minutes later seem to have no problem).
Migrate, and the vm should become unusable. You may be lucky enough to
get console/serial output complaining about lapic, or it may just
consume cpu.
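For reference, disabling kvmclock here is done by masking the CPUID feature on the qemu command line (fragment only; adapt to your own invocation — whether your libvirt version lets you express the same thing via a `<feature>` element depends on the cpu mode in use):

```shell
# qemu: mask the kvmclock paravirt feature for the guest
-cpu host,-kvmclock
```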

On Wed, Jan 15, 2014 at 7:23 AM, Marcus Sorensen shadow...@gmail.com wrote:
 Just an update, I found that with different tools I was able to see a
 repeating 'lapic increasing min_delta_ns' scrolling furiously. I've
 added -no-hpet to the cmdline, but was still able to replicate it.

 On Tue, Jan 14, 2014 at 1:36 PM, Marcus Sorensen shadow...@gmail.com wrote:
 Does anyone have tips on troubleshooting live migration? I'm not sure
 if this should be a qemu question or a kvm one. I've got several
 E5-2650 servers running in test environment, kernel 3.10.26 and qemu
 1.7.0. If I start a VM guest (say ubuntu, debian, or centos), I can
 migrate it around from host to host to host just fine, but if I wait
 awhile (say 1 hour), I try to migrate and it succeeds but the guest is
 hosed. No longer pings, cpu is thrashing. I've tried to strace it and
 don't see anything that other working hosts aren't doing, and I've
 tried gdb but I'm not entirely sure what I'm doing. I tried
 downgrading to qemu 1.6.1. I've found dozens of reports of such
 behavior, but they're all due to other things (migrating between
 different host CPUs, someone thinking it's virtio or memballoon only
 to later find a fix like changing machine type, etc). I'm at a loss.
 I've tried replacing the virtio disk/network with sata/e1000, no
 difference. I'm just trying to figure out where to go from here.

 I've got a core dump of the qemu process that was spinning, and I can
 reproduce it fairly easily. Here's an example qemu cmdline:

 /usr/bin/qemu-system-x86_64 -machine accel=kvm -name VM12 -S -machine
 pc-i440fx-1.7,accel=kvm,usb=off -m 512 -realtime mlock=off -smp
 1,sockets=1,cores=1,threads=1 -uuid
 dd10a210-ab41-4cc6-a8f2-51113dd39515 -no-user-config -nodefaults
 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/VM12.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
 -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
 file=/dev/sdd,if=none,id=drive-virtio-disk0,format=raw,cache=none
 -device 
 virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2
 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw,cache=none
 -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1
 -netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=31 -device
 virtio-net-pci,netdev=hostnet0,id=net0,mac=02:00:39:28:00:01,bus=pci.0,addr=0x3
 -chardev pty,id=charserial0 -device
 isa-serial,chardev=charserial0,id=serial0 -chardev
 socket,id=charchannel0,path=/var/lib/libvirt/qemu/VM12.agent,server,nowait
 -device 
 virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=VM12.vport
 -device usb-tablet,id=input0 -vnc 0.0.0.0:1 -device
 cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6


Re: nested EPT

2014-01-16 Thread duy hai nguyen
Thanks, Jan and Paolo!

Great! It helps solve the problem.

Sincerely,
Hai

On Thu, Jan 16, 2014 at 12:09 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 Il 16/01/2014 17:10, duy hai nguyen ha scritto:
 Dear All,

 I am having a problem with using nested EPT in my system: In L0
 hypervisor CPUs support vmx and ept; however, L1 hypervisor's CPUs do
 not have ept capability. Flag 'ept' appears in /proc/cpuinfo of L0 but
 does not show in that of L1.

 - 'Nested' and 'EPT' are enabled in L0:

 $cat /sys/module/kvm_intel/parameters/nested
 Y

 $cat /sys/module/kvm_intel/parameters/ept
 Y

 - The libvirt xml file used in L0 has this cpu configuration:

 cpu mode='host-passthrough'/

 - The kernel version I am using for both L0 and L1 is 3.9.11

 Nested EPT was added in 3.12.  You need that version in L0.

 Paolo



Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Michael S. Tsirkin
On Thu, Jan 16, 2014 at 01:44:20PM +0100, Christian Borntraeger wrote:
 When starting lots of dataplane devices the bootup takes very long on my
 s390 system (prototype irqfd code). With larger setups we are even able
 to
 trigger some timeouts in some components.
 Turns out that the KVM_SET_GSI_ROUTING ioctl takes very
 long (strace claims up to 0.1 sec) when having multiple CPUs.
 This is caused by the synchronize_rcu and the HZ=100 of s390.
 By changing the code to use a private srcu we can speed things up.
 
 This patch reduces the boot time till mounting root from 8 to 2
 seconds on my s390 guest with 100 disks.
 
 I converted most of the rcu routines to srcu. Review for the unconverted
 use of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
 is necessary, though. They look fine to me since they are protected by
 outer functions.
 
 In addition, we should also discuss if a global srcu (for all guests) is
 fine.
 
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com


That's nice but did you try to measure the overhead
on some interrupt-intensive workloads, such as RX with 10G ethernet?
srcu locks aren't free like rcu ones.

 ---
  virt/kvm/irqchip.c | 31 +--
  1 file changed, 17 insertions(+), 14 deletions(-)
 
 diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
 index 20dc9e4..5283eb8 100644
 --- a/virt/kvm/irqchip.c
 +++ b/virt/kvm/irqchip.c
 @@ -26,17 +26,20 @@
  
   #include <linux/kvm_host.h>
   #include <linux/slab.h>
  +#include <linux/srcu.h>
   #include <linux/export.h>
   #include <trace/events/kvm.h>
   #include "irq.h"
  
 +DEFINE_STATIC_SRCU(irq_srcu);
 +
  bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned pin)
  {
   struct kvm_irq_ack_notifier *kian;
 - int gsi;
 + int gsi, idx;
  
  - rcu_read_lock();
  - gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
  + idx = srcu_read_lock(irq_srcu);
  + gsi = srcu_dereference(kvm->irq_routing, irq_srcu)->chip[irqchip][pin];
    if (gsi != -1)
    hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list,
 link)
 @@ -45,7 +48,7 @@ bool kvm_irq_has_notifier(struct kvm *kvm, unsigned 
 irqchip, unsigned pin)
   return true;
   }
  
 - rcu_read_unlock();
 + srcu_read_unlock(irq_srcu, idx);
  
   return false;
  }
 @@ -54,18 +57,18 @@ EXPORT_SYMBOL_GPL(kvm_irq_has_notifier);
  void kvm_notify_acked_irq(struct kvm *kvm, unsigned irqchip, unsigned pin)
  {
   struct kvm_irq_ack_notifier *kian;
 - int gsi;
 + int gsi, idx;
  
   trace_kvm_ack_irq(irqchip, pin);
  
 - rcu_read_lock();
  - gsi = rcu_dereference(kvm->irq_routing)->chip[irqchip][pin];
  + idx = srcu_read_lock(irq_srcu);
  + gsi = srcu_dereference(kvm->irq_routing, irq_srcu)->chip[irqchip][pin];
    if (gsi != -1)
    hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list,
 link)
    if (kian->gsi == gsi)
    kian->irq_acked(kian);
 - rcu_read_unlock();
 + srcu_read_unlock(irq_srcu, idx);
  }
  
  void kvm_register_irq_ack_notifier(struct kvm *kvm,
 @@ -85,7 +88,7 @@ void kvm_unregister_irq_ack_notifier(struct kvm *kvm,
    mutex_lock(&kvm->irq_lock);
    hlist_del_init_rcu(&kian->link);
    mutex_unlock(&kvm->irq_lock);
 - synchronize_rcu();
 + synchronize_srcu_expedited(irq_srcu);
  #ifdef __KVM_HAVE_IOAPIC
   kvm_vcpu_request_scan_ioapic(kvm);
  #endif
 @@ -115,7 +118,7 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 
 irq, int level,
   bool line_status)
  {
   struct kvm_kernel_irq_routing_entry *e, irq_set[KVM_NR_IRQCHIPS];
 - int ret = -1, i = 0;
 + int ret = -1, i = 0, idx;
   struct kvm_irq_routing_table *irq_rt;
  
   trace_kvm_set_irq(irq, level, irq_source_id);
 @@ -124,12 +127,12 @@ int kvm_set_irq(struct kvm *kvm, int irq_source_id, u32 
 irq, int level,
* IOAPIC.  So set the bit in both. The guest will ignore
* writes to the unused one.
*/
 - rcu_read_lock();
  - irq_rt = rcu_dereference(kvm->irq_routing);
  + idx = srcu_read_lock(irq_srcu);
  + irq_rt = srcu_dereference(kvm->irq_routing, irq_srcu);
    if (irq < irq_rt->nr_rt_entries)
    hlist_for_each_entry(e, &irq_rt->map[irq], link)
   irq_set[i++] = *e;
 - rcu_read_unlock();
 + srcu_read_unlock(irq_srcu, idx);
  
   while(i--) {
   int r;
 @@ -226,7 +229,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
   kvm_irq_routing_update(kvm, new);
    mutex_unlock(&kvm->irq_lock);
  
 - synchronize_rcu();
 + synchronize_srcu_expedited(irq_srcu);

Hmm, it's a bit strange that you also do _expedited here.
What if this synchronize_rcu is replaced by synchronize_rcu_expedited
and no other changes are made?
Maybe that's enough?

  
   new = old;
 

Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Michael S. Tsirkin
On Thu, Jan 16, 2014 at 02:07:19PM +0100, Paolo Bonzini wrote:
 Il 16/01/2014 14:06, Christian Borntraeger ha scritto:
  Will you edit the patch description or shall I resend the patch?
 
 I can edit the commit message.
 
 Paolo

I think we really need to see the effect adding srcu has on interrupt
injection.


[PATCHv3] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Christian Borntraeger
On 16/01/14 19:55, Michael S. Tsirkin wrote:
 On Thu, Jan 16, 2014 at 01:44:20PM +0100, Christian Borntraeger wrote:
[...]
 I converted most of the rcu routines to srcu. Review for the unconverted
[...]

 That's nice but did you try to measure the overhead
 on some interrupt-intensive workloads, such as RX with 10G ethernet?
 srcu locks aren't free like rcu ones.

Right, but the read side is only acting on cpu-local data structures, so the
overhead is probably very small in contrast to the hlist work and the
injection itself.
You have a more compelling review comment, though:

[...]
 -synchronize_rcu();
 +synchronize_srcu_expedited(irq_srcu);
 
 Hmm, it's a bit strange that you also do _expedited here.

Well, I just did what the original mail thread suggested :-)

 What if this synchronize_rcu is replaced by synchronize_rcu_expedited
 and no other changes are made?
 Maybe that's enough?

Yes, it's enough. (seems slightly slower than v2, but fast enough)

Patch below:


[PATCHv3] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

When starting lots of dataplane devices the bootup takes very long
on my s390 system (prototype irqfd code). With larger setups we are even
able to trigger some timeouts in some components.
Turns out that the KVM_SET_GSI_ROUTING ioctl takes very
long (strace claims up to 0.1 sec) when having multiple CPUs.
This is caused by the synchronize_rcu and the HZ=100 of s390.

Lets use the expedited variant to speed things up as suggested by
Michael S. Tsirkin

Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
---
 virt/kvm/irqchip.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
index 20dc9e4..dbcfde7 100644
--- a/virt/kvm/irqchip.c
+++ b/virt/kvm/irqchip.c
@@ -226,7 +226,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
kvm_irq_routing_update(kvm, new);
mutex_unlock(&kvm->irq_lock);
 
-   synchronize_rcu();
+   synchronize_rcu_expedited();
 
new = old;
r = 0;



Re: [PATCHv3] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-01-16 Thread Michael S. Tsirkin
On Thu, Jan 16, 2014 at 09:07:14PM +0100, Christian Borntraeger wrote:
 On 16/01/14 19:55, Michael S. Tsirkin wrote:
  On Thu, Jan 16, 2014 at 01:44:20PM +0100, Christian Borntraeger wrote:
 [...]
  I converted most of the rcu routines to srcu. Review for the unconverted
 [...]
 
  That's nice but did you try to measure the overhead
  on some interrupt-intensive workloads, such as RX with 10G ethernet?
  srcu locks aren't free like rcu ones.
 
 Right, but the read side is only acting on cpu-local data structures, so the
 overhead is probably very small in contrast to the hlist work and the
 injection itself.

My testing of VM exit paths shows the overhead of read-side srcu
is not negligible; IIRC it involves memory barriers, which
are expensive operations.

 You have a more compelling review comment, though:
 
 [...]
  -  synchronize_rcu();
  +  synchronize_srcu_expedited(irq_srcu);
  
  Hmm, it's a bit strange that you also do _expedited here.
 
 Well, I just did what the original mail thread suggested :-)
 
  What if this synchronize_rcu is replaced by synchronize_rcu_expedited
  and no other changes are made?
  Maybe that's enough?
 
 Yes, it's enough. (seems slightly slower than v2, but fast enough)
 
 Patch below:
 
 
 [PATCHv3] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
 
 When starting lots of dataplane devices the bootup takes very long
 on my s390 system (prototype irqfd code). With larger setups we are even
 able to trigger some timeouts in some components.
 Turns out that the KVM_SET_GSI_ROUTING ioctl takes very
 long (strace claims up to 0.1 sec) when having multiple CPUs.
 This is caused by the synchronize_rcu and the HZ=100 of s390.
 
 Lets use the expedited variant to speed things up as suggested by
 Michael S. Tsirkin
 
 Signed-off-by: Christian Borntraeger borntrae...@de.ibm.com
 ---
  virt/kvm/irqchip.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c
 index 20dc9e4..dbcfde7 100644
 --- a/virt/kvm/irqchip.c
 +++ b/virt/kvm/irqchip.c
 @@ -226,7 +226,7 @@ int kvm_set_irq_routing(struct kvm *kvm,
   kvm_irq_routing_update(kvm, new);
   mutex_unlock(&kvm->irq_lock);
  
 - synchronize_rcu();
 + synchronize_rcu_expedited();
  
   new = old;
   r = 0;


Re: [RFC PATCH v5] add support for Hyper-V reference time counter

2014-01-16 Thread Marcelo Tosatti
On Thu, Jan 16, 2014 at 08:18:37PM +1100, Vadim Rozenfeld wrote:
 Signed-off: Peter Lieven p...@kamp.de
 Signed-off: Gleb Natapov
 Signed-off: Vadim Rozenfeld vroze...@redhat.com
  
 After some consideration I decided to submit only Hyper-V reference
 counters support this time. I will submit iTSC support as a separate
 patch as soon as it is ready. 
 
 v1 - v2
 1. mark TSC page dirty as suggested by 
 Eric Northup digitale...@google.com and Gleb
 2. disable local irq when calling get_kernel_ns, 
 as it was done by Peter Lieven p...@amp.de
 3. move check for TSC page enable from second patch
 to this one.
 
 v3 - v4
 Get rid of ref counter offset.
 
 v4 - v5
 replace __copy_to_user with kvm_write_guest
 when updating iTSC page.
 
 ---
  arch/x86/include/asm/kvm_host.h|  1 +
  arch/x86/include/uapi/asm/hyperv.h | 13 +
  arch/x86/kvm/x86.c | 28 +++-
  include/uapi/linux/kvm.h   |  1 +
  4 files changed, 42 insertions(+), 1 deletion(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index ae5d783..33fef07 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -605,6 +605,7 @@ struct kvm_arch {
   /* fields used by HYPER-V emulation */
   u64 hv_guest_os_id;
   u64 hv_hypercall;
 + u64 hv_tsc_page;
  
   #ifdef CONFIG_KVM_MMU_AUDIT
   int audit_point;
 diff --git a/arch/x86/include/uapi/asm/hyperv.h 
 b/arch/x86/include/uapi/asm/hyperv.h
 index b8f1c01..462efe7 100644
 --- a/arch/x86/include/uapi/asm/hyperv.h
 +++ b/arch/x86/include/uapi/asm/hyperv.h
 @@ -28,6 +28,9 @@
  /* Partition Reference Counter (HV_X64_MSR_TIME_REF_COUNT) available*/
   #define HV_X64_MSR_TIME_REF_COUNT_AVAILABLE  (1 << 1)
  
 +/* A partition's reference time stamp counter (TSC) page */
  +#define HV_X64_MSR_REFERENCE_TSC 0x40000021
 +
  /*
   * There is a single feature flag that signifies the presence of the MSR
   * that can be used to retrieve both the local APIC Timer frequency as
 @@ -198,6 +201,9 @@
  #define HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_MASK \
    (~((1ull << HV_X64_MSR_APIC_ASSIST_PAGE_ADDRESS_SHIFT) - 1))
  
 +#define HV_X64_MSR_TSC_REFERENCE_ENABLE  0x0001
 +#define HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT   12
 +
  #define HV_PROCESSOR_POWER_STATE_C0  0
  #define HV_PROCESSOR_POWER_STATE_C1  1
  #define HV_PROCESSOR_POWER_STATE_C2  2
 @@ -210,4 +216,11 @@
  #define HV_STATUS_INVALID_ALIGNMENT  4
  #define HV_STATUS_INSUFFICIENT_BUFFERS   19
  
 +typedef struct _HV_REFERENCE_TSC_PAGE {
 + __u32 tsc_sequence;
 + __u32 res1;
 + __u64 tsc_scale;
 + __s64 tsc_offset;
 +} HV_REFERENCE_TSC_PAGE, *PHV_REFERENCE_TSC_PAGE;
 +
  #endif
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 5d004da..8e685b8 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -836,11 +836,12 @@ EXPORT_SYMBOL_GPL(kvm_rdpmc);
   * kvm-specific. Those are put in the beginning of the list.
   */
  
 -#define KVM_SAVE_MSRS_BEGIN  10
 +#define KVM_SAVE_MSRS_BEGIN  12
  static u32 msrs_to_save[] = {
   MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
   MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
   HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
 + HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_REFERENCE_TSC,
   HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
   MSR_KVM_PV_EOI_EN,
   MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
 @@ -1826,6 +1827,8 @@ static bool kvm_hv_msr_partition_wide(u32 msr)
   switch (msr) {
   case HV_X64_MSR_GUEST_OS_ID:
   case HV_X64_MSR_HYPERCALL:
 + case HV_X64_MSR_REFERENCE_TSC:
 + case HV_X64_MSR_TIME_REF_COUNT:
   r = true;
   break;
   }
 @@ -1867,6 +1870,20 @@ static int set_msr_hyperv_pw(struct kvm_vcpu *vcpu, 
 u32 msr, u64 data)
    kvm->arch.hv_hypercall = data;
   break;
   }
 + case HV_X64_MSR_REFERENCE_TSC: {
 + u64 gfn;
 + HV_REFERENCE_TSC_PAGE tsc_ref;
  + memset(&tsc_ref, 0, sizeof(tsc_ref));
  + kvm->arch.hv_tsc_page = data;

Comment 1)

Is there a reason (that is compliance with spec) to maintain
value, for HV_X64_MSR_REFERENCE_TSC wrmsr operation, in case
HV_X64_MSR_TSC_REFERENCE_ENABLE is not set?

If not, should only assign to kvm->arch.hv_tsc_page after proper checks.

  + if (!(data & HV_X64_MSR_TSC_REFERENCE_ENABLE))
  + break;
  + gfn = data >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
  + if (kvm_write_guest(kvm, data,
  + &tsc_ref, sizeof(tsc_ref)))
 + return 1;
 + mark_page_dirty(kvm, gfn);
 + break;
 + }
   default:
   vcpu_unimpl(vcpu, HYPER-V unimplemented wrmsr: 0x%x 
   

Re: [RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue

2014-01-16 Thread Rusty Russell
Luiz Capitulino lcapitul...@redhat.com writes:
 From: Luiz capitulino lcapitul...@redhat.com

 This commit adds support to a new virtqueue called message virtqueue.

OK, this needs a lot of thought (especially since reworking the virtio
balloon is on the TODO list for the OASIS virtio technical committee...)

But AFAICT this is a way of explicitly saying no to the host's target
(vs the implicit method of just not meeting the target).  I'm not sure
that gives enough information to the host.  On the other hand, I'm not
sure what information we *can* give.

Should we instead be counter-proposing a target?

What does qemu do with this information?

Thanks,
Rusty.

 The message virtqueue can be used by guests to notify the host about
 important memory-related state changes in the guest. Currently, the
 only implemented notification is the guest is under pressure one,
 which informs the host that the guest is under memory pressure.

 This notification can be used to implement automatic memory ballooning
 in the host. For example, once learning that the guest is under pressure,
 the host could cancel an on-going inflate and/or start a deflate
 operation.

 Doing this through a virtqueue might seem like overkill, as all
 we're doing is to transfer an integer between guest and host. However,
 using a virtqueue offers the following advantages:

  1. We can reliably synchronize host and guest. That is, the guest
 will only continue once the host acknowledges the notification.
 This is important, because if the guest gets under pressure while
 inflating the balloon, it has to stop to give the host a chance
 to reset num_pages (see next commit)

  2. It's extensible. We may (or may not) want to tell the host
 which pressure level the guest finds itself in (ie. low,
 medium or critical)

 The lightweight alternative is to use a configuration space parameter.
 For this to work though, the guest would have to wait for the host
 to acknowledge the receipt of a configuration change update. I could
 try this if the virtqueue is overkill.

 Finally, the guest learns it's under pressure by registering a
 callback with the in-kernel vmpressure API.

 FIXMEs:

  - vmpressure API is missing a de-registration routine
  - not completely sure my changes in virtballoon_probe() are correct

 Signed-off-by: Luiz capitulino lcapitul...@redhat.com
 ---
  drivers/virtio/virtio_balloon.c | 93 
 +
  include/uapi/linux/virtio_balloon.h |  1 +
  2 files changed, 84 insertions(+), 10 deletions(-)

 diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
 index 5c4a95b..1c3ee71 100644
 --- a/drivers/virtio/virtio_balloon.c
 +++ b/drivers/virtio/virtio_balloon.c
 @@ -29,6 +29,9 @@
   #include <linux/module.h>
   #include <linux/balloon_compaction.h>
   
  +#include <linux/cgroup.h>
  +#include <linux/vmpressure.h>
 +
  /*
   * Balloon device works in 4K page units.  So each page is pointed to by
   * multiple balloon pages.  All memory counters in this driver are in balloon
 @@ -37,10 +40,12 @@
   #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> 
  VIRTIO_BALLOON_PFN_SHIFT)
  #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
  
 +#define VIRTIO_BALLOON_MSG_PRESSURE 1
 +
  struct virtio_balloon
  {
   struct virtio_device *vdev;
 - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
 + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *message_vq;
  
   /* Where the ballooning thread waits for config to change. */
   wait_queue_head_t config_change;
 @@ -51,6 +56,8 @@ struct virtio_balloon
   /* Waiting for host to ack the pages we released. */
   wait_queue_head_t acked;
  
 + wait_queue_head_t message_acked;
 +
   /* Number of balloon pages we've told the Host we're not using. */
   unsigned int num_pages;
   /*
 @@ -71,6 +78,9 @@ struct virtio_balloon
   /* Memory statistics */
   int need_stats_update;
   struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 +
 + /* Message virtqueue */
 + atomic_t guest_pressure;
  };
  
  static struct virtio_device_id id_table[] = {
 @@ -78,6 +88,41 @@ static struct virtio_device_id id_table[] = {
   { 0 },
  };
  
 +static inline bool guest_under_pressure(const struct virtio_balloon *vb)
 +{
  + return atomic_read(&vb->guest_pressure) == 1;
 +}
 +
 +static void vmpressure_event_handler(void *data, int level)
 +{
 + struct virtio_balloon *vb = data;
 +
  + atomic_set(&vb->guest_pressure, 1);
  + wake_up(&vb->config_change);
 +}
 +
 +static void tell_host_pressure(struct virtio_balloon *vb)
 +{
 + const uint32_t msg = VIRTIO_BALLOON_MSG_PRESSURE;
 + struct scatterlist sg;
 + unsigned int len;
 + int err;
 +
  + sg_init_one(&sg, &msg, sizeof(msg));
  +
  + err = virtqueue_add_outbuf(vb->message_vq, &sg, 1, vb, GFP_KERNEL);
  + if (err < 0) {
 + printk(KERN_WARNING virtio-balloon: failed to send host 
 message 

Re: KVM: MMU: handle invalid root_hpa at __direct_map

2014-01-16 Thread Rom Freiman
Hi everybody,
Marcelo, your suggestion above should work together with the same
patch to __direct_mapping.

After running some test on different kernels, the !VALID_PAGE happens
as following:

Linux localhost.localdomain 3.12.5-200.fc19.x86_64 #1 SMP Tue Dec 17
22:21:14 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux:
INVALID PAGE kvm/arch/x86/kvm/paging_tmpl.h:FNAME(fetch):573: 

Linux localhost.localdomain
3.11.8-200.strato0002.fc19.strato.44671928e2e2.x86_64 #1 SMP Sun Jan 5
18:30:38 IST 2014 x86_64 x86_64 x86_64 GNU/Linux:
INVALID PAGE arch/x86/kvm/mmu.c:__direct_map:2695: 

So you probably should add both patches to kvm.

Here is my complete patch for 3.11, if you like:

diff --git a/kvm/arch/x86/kvm/mmu.c b/kvm/arch/x86/kvm/mmu.c
index 9e9285a..c38e480 100644
--- a/kvm/arch/x86/kvm/mmu.c
+++ b/kvm/arch/x86/kvm/mmu.c
@@ -2691,6 +2691,11 @@ static int __direct_map(struct kvm_vcpu *vcpu,
gpa_t v, int write,
int emulate = 0;
gfn_t pseudo_gfn;

+   if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+       pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+                __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+       return 0;
+   }
+
for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
if (iterator.level == level) {
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL,
@@ -2861,6 +2866,11 @@ static bool fast_page_fault(struct kvm_vcpu
*vcpu, gva_t gva, int level,
bool ret = false;
u64 spte = 0ull;

+   if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+       pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+                __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+       return false;
+   }
+
if (!page_fault_can_be_fast(vcpu, error_code))
return false;

@@ -3255,6 +3265,11 @@ static u64
walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
struct kvm_shadow_walk_iterator iterator;
u64 spte = 0ull;

+   if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+       pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+                __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+       return spte;
+   }
+
walk_shadow_page_lockless_begin(vcpu);
for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
if (!is_shadow_present_pte(spte))
@@ -4525,6 +4540,11 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu
*vcpu, u64 addr, u64 sptes[4])
u64 spte;
int nr_sptes = 0;

+   if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+       pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+                __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+       return nr_sptes;
+   }
+
walk_shadow_page_lockless_begin(vcpu);
for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
sptes[iterator.level-1] = spte;

diff --git a/kvm/arch/x86/kvm/paging_tmpl.h b/kvm/arch/x86/kvm/paging_tmpl.h
index 7769699..202a1fc 100644
--- a/kvm/arch/x86/kvm/paging_tmpl.h
+++ b/kvm/arch/x86/kvm/paging_tmpl.h
@@ -423,6 +423,11 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
if (FNAME(gpte_changed)(vcpu, gw, top_level))
goto out_gpte_changed;

+   if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+       pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+                __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+       goto out_gpte_changed;
+   }
+
for (shadow_walk_init(&it, vcpu, addr);
     shadow_walk_okay(&it) && it.level > gw->level;
     shadow_walk_next(&it)) {
@@ -674,6 +679,12 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 */
mmu_topup_memory_caches(vcpu);

+   if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) {
+       pgprintk(KERN_WARNING "invalid page access %s:%s:%d: %llx\n",
+                __FILE__, __func__, __LINE__, vcpu->arch.mmu.root_hpa);
+       WARN_ON(1);
+       return;
+   }


Another issue I'm struggling with is the next bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1052861

The only thing I'm trying to do over there is to run virt-v2v in a nested
environment. Maybe you have an idea?
It's also related to memory pressure on L2. Of course L0 does not
crash (using it with the above patch) but L2 is crashing during the
conversion process.

Thanks,
Rom

On Fri, Dec 27, 2013 at 9:43 PM, Marcelo Tosatti mtosa...@redhat.com wrote:
 On Sun, Dec 22, 2013 at 12:56:49PM -0200, Marcelo Tosatti wrote:
 On Sun, Dec 22, 2013 at 11:17:21AM +0200, Rom Freiman wrote:
  Hello everyone,
 
  I've been chasing this bug for a while.
 
  According to my research, this bug fix is works fine for
  3.11.9-200.fc19.x86_64 kernel version (and I also came to almost similar
  solution and really solved the crash).
 
  But, the problem is, that it seems that this patch does not work on 
  3.13.0-rc2+
  - it looks 

Re: internal error End of file from monitor with shutdown vms

2014-01-16 Thread Stefan Hajnoczi
On Thu, Jan 16, 2014 at 8:57 PM, Amr ElGebaly a...@vision-as.com wrote:
 I didn't modify anything in cgconfig
 (didn't define any limits).
 I'm using the default config:
 [root@Node1 ~]# cat /etc/cgconfig.conf
 #
 #  Copyright IBM Corporation. 2007
 #
 #  Authors: Balbir Singh bal...@linux.vnet.ibm.com
 #  This program is free software; you can redistribute it and/or modify it
 #  under the terms of version 2.1 of the GNU Lesser General Public License
 #  as published by the Free Software Foundation.
 #
 #  This program is distributed in the hope that it would be useful, but
 #  WITHOUT ANY WARRANTY; without even the implied warranty of
 #  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 #
 # See man cgconfig.conf for further details.
 #
 # By default, mount all controllers to /cgroup/controller

 mount {
 cpuset = /cgroup/cpuset;
 cpu = /cgroup/cpu;
 cpuacct = /cgroup/cpuacct;
 memory = /cgroup/memory;
 devices = /cgroup/devices;
 freezer = /cgroup/freezer;
 net_cls = /cgroup/net_cls;
 blkio = /cgroup/blkio;
 }

 [root@Node1 ~]# cat /etc/cg
 cgconfig.conf  cgrules.conf
 cgsnapshot_blacklist.conf
 [root@Node1 ~]# cat /etc/cgrules.conf
 # /etc/cgrules.conf
 #The format of this file is described in cgrules.conf(5)
 #manual page.
 #
 # Example:
 #user controllers destination
 #@student cpu,memory usergroup/student/
 #peter cpu test1/
 #% memory test

 How can I fix that?

I'm not sure where the 1 GB limit is coming from.  Perhaps libvirt set it?

Stefan


Re: [RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue

2014-01-16 Thread Luiz Capitulino
On Fri, 17 Jan 2014 09:10:47 +1030
Rusty Russell ru...@rustcorp.com.au wrote:

 Luiz Capitulino lcapitul...@redhat.com writes:
  From: Luiz capitulino lcapitul...@redhat.com
 
  This commit adds support to a new virtqueue called message virtqueue.
 
 OK, this needs a lot of thought (especially since reworking the virtio
 balloon is on the TODO list for the OASIS virtio technical committee...)
 
 But AFAICT this is a way of explicitly saying no to the host's target
 (vs the implicit method of just not meeting the target).  I'm not sure
 that gives enough information to the host.  On the other hand, I'm not
 sure what information we *can* give.
 
 Should we instead be counter-proposing a target?

The problem is how to estimate a target value. I found it simpler
to just try to obey what the host is asking for (and fail if not
possible) than trying to make the guest negotiate with the host.

 What does qemu do with this information?

There are two possible scenarios:

 1. The balloon driver is currently inflating when it gets under
pressure

QEMU resets num_pages to the current balloon size. This
cancels the on-going inflate

 2. The balloon driver is not inflating, eg. it's possibly sleeping

   QEMU issues a deflate

But note that those scenarios are not supposed to be used with the
current device, they are part of the automatic ballooning feature.
I CC'ed you on the QEMU patch; you can find it here in case you didn't
see it:

 http://marc.info/?l=kvmm=138988966315690w=2
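
The two scenarios above boil down to one small host-side decision. Here is a
minimal userspace sketch of that logic; the `balloon_state` struct and
`balloon_on_pressure` name are illustrative stand-ins, not QEMU's actual code:

```c
#include <stdbool.h>

/* Illustrative stand-in for the host's view of the balloon:
 * num_pages is the target the host asked for, actual is the number of
 * pages the guest has ballooned so far. */
struct balloon_state {
    unsigned int num_pages; /* host-requested target size */
    unsigned int actual;    /* current balloon size reported by guest */
};

/* Host reaction to a "guest is under pressure" notification:
 * scenario 1: an inflate is in flight (actual < num_pages), so cancel
 *             it by resetting the target to the current balloon size;
 * scenario 2: no inflate is in flight, so request a deflate instead
 *             (here modeled as shrinking the target by 'step' pages). */
static void balloon_on_pressure(struct balloon_state *s, unsigned int step)
{
    if (s->actual < s->num_pages)
        s->num_pages = s->actual;  /* cancels the on-going inflate */
    else
        s->num_pages -= (step < s->num_pages) ? step : s->num_pages;
}
```

The point of the sketch is only that a single integer notification is enough
for the host to pick between the two actions.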

 
 Thanks,
 Rusty.
 
  The message virtqueue can be used by guests to notify the host about
  important memory-related state changes in the guest. Currently, the
  only implemented notification is the "guest is under pressure" one,
  which informs the host that the guest is under memory pressure.
 
  This notification can be used to implement automatic memory ballooning
  in the host. For example, once learning that the guest is under pressure,
  the host could cancel an on-going inflate and/or start a deflate
  operation.
 
  Doing this through a virtqueue might seem like overkill, as all
  we're doing is to transfer an integer between guest and host. However,
  using a virtqueue offers the following advantages:
 
   1. We can reliably synchronize host and guest. That is, the guest
  will only continue once the host acknowledges the notification.
  This is important, because if the guest gets under pressure while
  inflating the balloon, it has to stop to give the host a chance
  to reset num_pages (see next commit)
 
   2. It's extensible. We may (or may not) want to tell the host
  which pressure level the guest finds itself in (ie. low,
  medium or critical)
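
The post-and-block pattern from point 1 can be sketched in plain userspace C.
This is a simulation only: a mutex, two condition variables, and a flag stand
in for the message virtqueue, its kick, and its ack interrupt; all names here
(`msg_channel`, `MSG_PRESSURE`, both functions) are made up for illustration:

```c
#include <pthread.h>
#include <stdbool.h>

#define MSG_PRESSURE 1  /* stand-in for a pressure message id */

struct msg_channel {
    pthread_mutex_t lock;
    pthread_cond_t  posted;  /* host waits here for a message */
    pthread_cond_t  acked;   /* guest waits here for the ack */
    bool            pending; /* message posted but not yet acked */
    int             msg;
};

/* Guest side: post the message, then sleep until the host has consumed
 * it -- analogous to kicking the virtqueue and waiting on an ack. */
static void guest_notify_pressure(struct msg_channel *c)
{
    pthread_mutex_lock(&c->lock);
    c->msg = MSG_PRESSURE;
    c->pending = true;
    pthread_cond_signal(&c->posted);
    while (c->pending)
        pthread_cond_wait(&c->acked, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Host side: wait for a message, consume it, then ack so the guest
 * can continue (e.g. stop inflating while the host resets the target). */
static int host_handle_message(struct msg_channel *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->pending)
        pthread_cond_wait(&c->posted, &c->lock);
    int m = c->msg;
    c->pending = false;
    pthread_cond_signal(&c->acked);
    pthread_mutex_unlock(&c->lock);
    return m;
}
```

The key property mirrored here is that the guest cannot race ahead of the
host: it stays blocked until the notification is explicitly consumed.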
 
  The lightweight alternative is to use a configuration space parameter.
  For this to work though, the guest would have to wait for the host
  to acknowledge the receipt of a configuration change update. I could
  try this if the virtqueue is overkill.
 
  Finally, the guest learns it's under pressure by registering a
  callback with the in-kernel vmpressure API.
 
  FIXMEs:
 
   - vmpressure API is missing a de-registration routine
   - not completely sure my changes in virtballoon_probe() are correct
 
  Signed-off-by: Luiz capitulino lcapitul...@redhat.com
  ---
   drivers/virtio/virtio_balloon.c | 93 +++++++++++++++++++++++++++++-----
   include/uapi/linux/virtio_balloon.h |  1 +
   2 files changed, 84 insertions(+), 10 deletions(-)
 
  diff --git a/drivers/virtio/virtio_balloon.c 
  b/drivers/virtio/virtio_balloon.c
  index 5c4a95b..1c3ee71 100644
  --- a/drivers/virtio/virtio_balloon.c
  +++ b/drivers/virtio/virtio_balloon.c
  @@ -29,6 +29,9 @@
   #include <linux/module.h>
   #include <linux/balloon_compaction.h>
   
  +#include <linux/cgroup.h>
  +#include <linux/vmpressure.h>
  +
   /*
* Balloon device works in 4K page units.  So each page is pointed to by
* multiple balloon pages.  All memory counters in this driver are in 
  balloon
  @@ -37,10 +40,12 @@
   #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE >> VIRTIO_BALLOON_PFN_SHIFT)
   #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
   
  +#define VIRTIO_BALLOON_MSG_PRESSURE 1
  +
   struct virtio_balloon
   {
  struct virtio_device *vdev;
  -   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
  +   struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *message_vq;
   
  /* Where the ballooning thread waits for config to change. */
  wait_queue_head_t config_change;
  @@ -51,6 +56,8 @@ struct virtio_balloon
  /* Waiting for host to ack the pages we released. */
  wait_queue_head_t acked;
   
  +   wait_queue_head_t message_acked;
  +
  /* Number of balloon pages we've told the Host we're not using. */
  unsigned int num_pages;
  /*
  @@ -71,6 +78,9 @@ struct virtio_balloon
  /* Memory statistics */
  int need_stats_update;
  struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
  +
  +   /* Message virtqueue */
  +   

Re: [RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue

2014-01-16 Thread Luiz Capitulino
On Thu, 16 Jan 2014 20:38:19 -0500
Luiz Capitulino lcapitul...@redhat.com wrote:

  What does qemu do with this information?
 
 There are two possible scenarios:
 
  1. The balloon driver is currently inflating when it gets under
 pressure
 
 QEMU resets num_pages to the current balloon size. This
 cancels the on-going inflate
 
  2. The balloon driver is not inflating, eg. it's possibly sleeping
 
QEMU issues a deflate
 
 But note that those scenarios are not supposed to be used with the
 current device, they are part of the automatic ballooning feature.
 I CC'ed you on the QEMU patch; you can find it here in case you didn't
 see it:
 
  http://marc.info/?l=kvmm=138988966315690w=2

By current device I meant outside of automatic ballooning scope.


Re: nested EPT

2014-01-16 Thread duy hai nguyen
Now I can run an L2 guest (nested guest) using the kvm kernel module
of kernel 3.12.

However, I am facing a new problem when trying to build and use kvm
kernel module from git://git.kiszka.org/kvm-kmod.git: L1 (nested
hypervisor) cannot boot L2  and the graphic console of virt-manager
hangs displaying 'Booting from Hard Disk...'. L1 still runs fine.

Loading kvm_intel with 'emulate_invalid_guest_state=0' in L0 does not
solve the problem. I have also tried with different kernel versions:
3.12.0, 3.12.8 and 3.13.0 without success.

Can you give me some suggestions?

Thank you very much

Best,
Hai

On Thu, Jan 16, 2014 at 1:17 PM, duy hai nguyen hain...@gmail.com wrote:
 Thanks, Jan and Paolo!

 Great! It helps solve the problem.

 Sincerely,
 Hai

 On Thu, Jan 16, 2014 at 12:09 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 16/01/2014 17:10, duy hai nguyen wrote:
 Dear All,

 I am having a problem with using nested EPT in my system: In L0
 hypervisor CPUs support vmx and ept; however, L1 hypervisor's CPUs do
 not have ept capability. Flag 'ept' appears in /proc/cpuinfo of L0 but
 does not show in that of L1.

 - 'Nested' and 'EPT' are enabled in L0:

 $cat /sys/module/kvm_intel/parameters/nested
 Y

 $cat /sys/module/kvm_intel/parameters/ept
 Y

 - The libvirt xml file used in L0 has this cpu configuration:

  <cpu mode='host-passthrough'/>

 - The kernel version I am using for both L0 and L1 is 3.9.11

 Nested EPT was added in 3.12.  You need that version in L0.

 Paolo
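
Paolo's cutoff can be checked mechanically before debugging further. A small
shell sketch for the L0 host; the function name is made up, only the 3.12
requirement comes from the reply above:

```shell
# Nested EPT landed in Linux 3.12, so the L0 kernel must be at least that.
kernel_has_nested_ept() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 12 ]; }
}

# Strip any local suffix (e.g. "3.13.0-generic") before comparing.
if kernel_has_nested_ept "$(uname -r | cut -d- -f1)"; then
    echo "L0 kernel is new enough for nested EPT"
else
    echo "L0 kernel too old: nested EPT needs >= 3.12"
fi
```

With this check in place, /sys/module/kvm_intel/parameters/{nested,ept}
showing Y is only meaningful once the L0 kernel itself is >= 3.12.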
